Matched via arXiv identifier search · Partial overlap with paper title keywords
- Stars
- 21
- Last push
- Jun 9, 2026 (11d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- Low confidence match
Seungone Kim, Dongkeun Yoon, Kiril Gashteovski, Juyoung Suk, Jinheon Baek, Pranjal Aggarwal, Ian Wu, Viktor Zaverkin, Spase Petkoski, Daniel R. Schrider, Ilija Dukovski, Francesco Santini, Biljana Mitreska, Yong Jeong, Kyeongha Kwon, Young Min Sim, Dragana Manasova, Arthur Porto, Biljana Mojsoska, Makoto Takamoto, Marko Shuntov, Ruoqi Liu, Hyunjoo Jenny Lee, Niyazi Ulas Dinç, Yehhyun Jo, Sunkyu Han, Chungwoo Lee, Huishan Li, Esther H. R. Tsai, Ergun Simsek, Khushboo Shafi, Yeonseung Chung, Jihye Park, Aleksandar Shulevski, Henrik Christiansen, Yoosang Son, Elly Knight, Amanda Montoya, Jeongyoun Ahn, Christian Langkammer, Heera Moon, Changwon Yoon, Nikola Stikov, Mooseok Jang, Edward Choi, Junhan Kim, Yeon Sik Jung, Woo Youn Kim, Jae Kyoung Kim, Ishraq Md Anjum, Hyun Uk Kim, Drew Bridges, Carolin Lawrence, Xiang Yue, Alice Oh, Akari Asai, Sean Welleck, Graham Neubig
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence. Understanding what AI reviewers do well, where they fall s ...
hort, and what challenges remain is essential. However, existing evaluations of AI reviewers have focused on whether their verdicts match human verdicts (e.g., score alignment, acceptance prediction), which is insufficient to characterize their capabilities and limits. In this paper, we close this gap through a large-scale expert annotation study, in which 45 domain scientists in Physical, Biological, and Health Sciences spent 469 hours rating 2,960 individual criticisms (each targeting one specific aspect of a paper) from human-written and AI-generated reviews of 82 Nature-family papers on correctness, significance, and sufficiency of evidence. On a composite of all three dimensions, a reviewing agent powered by GPT-5.2 scores above each paper's top-rated human reviewer (60.0% vs. 48.2%, p = 0.009), while all three AI reviewers (including Gemini 3.0 Pro and Claude Opus 4.5) exceed the lowest-rated human across every dimension. AI reviewers' accurate criticisms are also more often rated significant and well-evidenced, and surface a distinct 26% of issues no human raises. However, AI reviewers overlap far more than humans do (21% vs. 3% for cross-reviewer pairs), and exhibit 16 recurring weaknesses humans do not share, such as limited subfield knowledge, lack of long context management over multiple files, and overly critical stance on minor issues. Overall, our results position current AI reviewers as complements to, not substitutes for, human reviewers.
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
With the advancement of AI capabilities, AI reviewers are beginning to be deployed in scientific peer review, yet their capability and credibility remain in question: many scientists simply view them as probabilistic systems without the expertise to evaluate research, while other researchers are more optimistic about their readiness without concrete evidence.
Recommendation evidence is currently too limited for a maintained-repo choice. Use Implementation Status and Reproduction Path for a practical baseline plan.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence graph: 2 refs, 1 links.
Utility signals: depth 60/100, grounding 58/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Partial overlap with paper title keywords
Risk flags
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
Hardware requirements
No verified implementation available
No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Models
Datasets
Spaces
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Tasks
Agentic tool use
Methods
Agentic systems
Domains
AI Agents
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.