Researcher Tools
Human Feedback and Eval Paper Explorer
A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.
Anna Feldman, Libby Barak, Jing Peng · Feb 28, 2026
Zewei Yu, Lirong Gao, Yuke Zhu, Bo Zheng, Junbo Zhao, Sheng Guo · Feb 12, 2026
Noam Razin, Zixuan Wang, Hubert Strauss, Stanley Wei, Jason D. Lee, Sanjeev Arora · Mar 19, 2025
- The success of Reinforcement Learning from Human Feedback (RLHF) critically depends on the quality of the reward model.
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026
- Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- GR achieves a higher GPT-judged win rate in RLHF, avoids over-focusing on format in rule-based math rewards, and prevents hacking of the judge in LLM-as-a-Judge math tasks.
Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao, Danning Ke · May 19, 2025
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026
- In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning with one another in the form of Chain-of-Thought (CoT).
- Current CoT evaluation narrowly focuses on target task accuracy.
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao · Feb 20, 2026
- We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model that reliably computes complex, task-specific LLM-as-a-Judge (LLMAJ) metrics (e.g. …
- Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x.
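The entry doesn't spell out Luna-2's mechanics, but the general pattern it points at, running a small decoder-only model greedily so the same input always yields the same verdict, can be sketched as below. The checkpoint name, prompt template, and pass/fail label set here are assumptions for illustration, not Luna-2's actual design.

```python
# Illustrative sketch only: a decoder-only SLM used as a deterministic
# evaluator for one task-specific judge metric. All names are hypothetical.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "some-org/small-judge-3b"  # hypothetical SLM checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def judge(context: str, response: str, criterion: str) -> str:
    """Return 'pass' or 'fail' for one task-specific metric."""
    prompt = (
        f"Criterion: {criterion}\n"
        f"Context: {context}\n"
        f"Response: {response}\n"
        "Verdict (pass/fail):"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=2,
            do_sample=False,  # greedy decoding -> same verdict every run
        )
    # Decode only the newly generated tokens after the prompt.
    verdict = tokenizer.decode(
        out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return "pass" if "pass" in verdict.lower() else "fail"
```

Greedy decoding is what makes the evaluator deterministic; the cost and latency savings the abstract reports come from the model being small, not from any change to this calling pattern.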
Yuhao Liu, Salim Ullah, Akash Kumar · Feb 26, 2026
Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui · Feb 27, 2026
Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo · Feb 19, 2026
- However, such errors have rarely been captured by existing benchmarks.
- Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored.
Saleha Muzammil, Rahul Reddy, Vishal Kamalakrishnan, Hadi Ahmadi, Wajih Ul Hassan · Dec 7, 2025
Xuying Ning, Dongqi Fu, Tianxin Wei, Mengting Ai, Jiaru Zou, Ting-Wei Li · Mar 1, 2026
Aish Albladi, Md Kaosar Uddin, Minarul Islam, Cheryl Seals · Apr 14, 2025
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026
- Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
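For readers unfamiliar with the notation, maj@8 is majority-vote accuracy over 8 sampled answers per problem. A minimal sketch of how such a metric is typically computed (the sample answers below are made up):

```python
from collections import Counter

def maj_at_k(samples: list[str], reference: str) -> bool:
    """One problem counts as correct if the most frequent of the k
    sampled answers matches the reference answer."""
    most_common, _ = Counter(samples).most_common(1)[0]
    return most_common == reference

# maj@8 on one problem: 8 sampled answers, the majority answer is "42",
# which matches the reference, so this problem scores as correct.
answers = ["42", "41", "42", "42", "7", "42", "41", "42"]
assert maj_at_k(answers, "42")
```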
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026
- Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
Hao Wu, Xudong Wang, Jialiang Zhang, Junlong Tong, Xinghao Chen, Junyan Lin · Feb 27, 2026
Noah Dasanaike · Jan 29, 2026
Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu · Jul 21, 2025
- Extensive experiments on five benchmark datasets show the superiority of EviOmni, which provides compact and high-quality evidence, enhances the accuracy of downstream tasks, and supports both traditional and agentic RAG systems.
Fan Shu, Yite Wang, Ruofan Wu, Boyi Liu, Zhewei Yao, Yuxiong He · Feb 27, 2026
- The fast-growing demand for Large Language Models (LLMs) to tackle complex multi-step data science tasks creates a pressing need for accurate benchmarking.
- To bridge these gaps, we introduce DARE-bench, a benchmark designed for machine learning modeling and data science instruction following.
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Kun Xu · Feb 24, 2026
- Building on this insight, we propose CAMEL, a confidence-gated reflection framework that performs a lightweight single-token preference decision first and selectively invokes reflection only for low-confidence instances (a sketch of this gating pattern follows this entry).
- Empirically, CAMEL achieves state-of-the-art performance on three widely used reward-model benchmarks with 82.9% average accuracy, surpassing the best prior model by 3.2% and outperforming 70B-parameter models using only 14B parameters,…
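A rough, self-contained sketch of the confidence-gated pattern the abstract describes. The threshold value and both stage functions are placeholder assumptions, since the entry gives neither CAMEL's actual gate nor its prompts.

```python
import random

CONF_THRESHOLD = 0.9  # illustrative value; the paper's gate is not given here

def quick_preference(prompt: str, a: str, b: str) -> tuple[str, float]:
    """Stand-in for the single-token decision: in the real system this
    would read P('A') vs P('B') from one forward pass of the judge."""
    p_a = random.random()  # placeholder probability for demonstration
    return ("A", p_a) if p_a >= 0.5 else ("B", 1.0 - p_a)

def reflect_and_decide(prompt: str, a: str, b: str) -> str:
    """Stand-in for the expensive chain-of-thought reflection pass."""
    return "A"  # placeholder verdict

def judge_pair(prompt: str, a: str, b: str) -> str:
    choice, confidence = quick_preference(prompt, a, b)
    if confidence >= CONF_THRESHOLD:
        return choice  # confident: answer with the cheap decision
    return reflect_and_decide(prompt, a, b)  # uncertain: invoke reflection
```

The claimed efficiency comes from the gate: most comparisons resolve in stage one at the cost of a single token, and the expensive reflection pass runs only on the hard residue.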
Protocol Hubs
Benchmark Hubs
Metric Hubs
- Accuracy & Pass Rate Metric Papers (88)
- Accuracy Metric Papers (82)
- Accuracy & Pass Rate Metric Papers In CS.CL (63)
- Accuracy & Pass Rate Metric Papers + Automatic Metrics (74)
- Accuracy In CS.CL Papers (58)
- Accuracy & Pass Rate Metric Papers In CS.AI (58)
- Accuracy + Automatic Metrics Metric Papers (70)
- Accuracy + Automatic Metrics Metric Papers (Last 120 Days) (53)
- Accuracy + Automatic Metrics Metric Papers (Last 90 Days) (51)
- Accuracy + Automatic Metrics Metric Papers (Last 30 Days) (47)
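Functionally, each hub above is just a tag filter over the same feed, optionally restricted to a recency window. A minimal sketch of that triage pattern, with an invented Paper record and invented tag names:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class Paper:
    title: str
    published: date
    tags: set[str] = field(default_factory=set)

def hub_view(papers: list[Paper], required_tags: set[str],
             last_days: int | None = None) -> list[Paper]:
    """Filter a feed the way the hub pages do: keep papers that carry
    every required tag and, optionally, fall inside a recency window."""
    cutoff = date.today() - timedelta(days=last_days) if last_days else None
    return [p for p in papers
            if required_tags <= p.tags
            and (cutoff is None or p.published >= cutoff)]

# e.g. the "Accuracy + Automatic Metrics (Last 30 Days)" view:
# hub_view(feed, {"accuracy", "automatic-metrics"}, last_days=30)
```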