Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

From Variance to Invariance: Qualitative Content Analysis for Narrative Graph Annotation

Junbo Huang, Max Weinig, Ulrich Fritsche, Ricardo Usbeck · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • To evaluate annotation quality, we employed a 6\times3 factorial experimental design to examine the effects of narrative representation (six levels) and distance metric type (three levels) on inter-annotator agreement (Krippendorrf's α),…
  • The annotation framework and evaluation results provide practical guidance for NLP research on graph-based narrative annotation under HLV.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Evaluating across the GLUE benchmark, we demonstrate that LoRA-based adaptation consistently achieves calibration parity with (and in specific tasks exceeds) full fine-tuning, while maintaining significantly higher parameter efficiency.
Open paper
GLoRIA: Gated Low-Rank Interpretable Adaptation for Dialectal ASR

Pouya Mehralian, Melissa Farasyn, Anne Breitbarth, Anne-Sophie Ghyselen, Hugo Van hamme · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

Jiyuan Shen, Peiyue Yuan, Atin Ghosh, Yifan Mai, Daniel Dahlmeier · Mar 3, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • In this paper, we conduct a large-scale benchmarking study that evaluates various out-of-the-box MLLMs on business-document information extraction.
Open paper
DARS: Dysarthria-Aware Rhythm-Style Synthesis for ASR Enhancement

Minghui Wu, Xueling Liu, Jiahuan Fan, Haitao Tang, Yanyong Zhang, Yue Zhang · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Pairwise Preference Simulation Env General
  • DARS incorporates a multi-stage rhythm predictor optimized by contrastive preferences between normal and dysarthric speech, along with a dysarthric-style conditional flow matching mechanism, jointly enhancing temporal rhythm reconstruction…
Open paper
Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin · Feb 28, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Fallback
Rubric Rating Automatic Metrics Medicine
  • Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.
Open paper
CoME: Empowering Channel-of-Mobile-Experts with Informative Hybrid-Capabilities Reasoning

Yuxuan Liu, Weikai Xu, Kun Huang, Changyu Chen, Jiankun Zhao, Pengzhi Gao · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Mobile Agents can autonomously execute user instructions, which requires hybrid-capabilities reasoning, including screen summary, subtask planning, action decision and action function.
  • To address these challenges, we propose Channel-of-Mobile-Experts (CoME), a novel agent architecture consisting of four distinct experts, each aligned with a specific reasoning stage, CoME activates the corresponding expert to generate…
Open paper
From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Seungdong Yoa, Sanghyu Yoon, Suhee Yoon, Dongmin Kim, Ye Seul Sim, Junhyun Lee · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 77% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference General
  • To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems.
  • Adopting text anomaly detection as our primary evaluation format, which demands cross-sentence logical inference and resists pattern-matching shortcuts, we demonstrate that this protocol systematically exposes corner-case reasoning errors…
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics MathCoding
  • While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover--and validate both theoretically and empirically--an overlooked yet critical mechanism: the implicit regularization inherent in Direct…
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready
General
  • Every model shows the same pattern: proactive interference (PI) dominates retroactive interference (RI) universally (Cohen's d = 1.73, p < 0.0001), meaning early encodings are protected at the cost of recent information -- the opposite of…
Open paper
CiteAudit: You Cited It, But Did You Read It? A Benchmark for Verifying Scientific References in the LLM Era

Zhengqing Yuan, Kaiwen Shi, Zheyuan Zhang, Lichao Sun, Nitesh V. Chawla, Yanfang Ye · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent General
  • Meanwhile, rapidly growing reference lists make manual verification impractical, and existing automated tools remain fragile to noisy and heterogeneous citation formats and lack standardized evaluation.
  • We present the first comprehensive benchmark and detection framework for hallucinated citations in scientific writing.
Open paper
Recursive Think-Answer Process for LLMs and VLMs

Byung-Kwan Lee, Youngchae Chee, Yong Man Ro · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training

Jinluan Yang, Yuxin Liu, Zhengyu Chen, Chengcheng Han, Yueqing Sun, Qi Gu · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready
Coding
  • Training tool-use agents typically relies on outcome-based filtering: Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.
  • Evaluations on BFCLv3 and Tau2 Bench show that TopoCurate achieves consistent gains of 4.2\% (SFT) and 6.9\% (RL) over state-of-the-art baselines.
Open paper
Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering

Xufei Lv, Jiahui Yang, Haoyuan Sun, Xialin Su, Zhiliang Tian, Yifu Gao · Mar 2, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Fallback
Demonstrations Coding
  • Based on this insight, we propose AT2QA, an Autonomous and Training-free Agent for TKG Question Answering.
  • Experiments on three challenging benchmarks -- MultiTQ, Timeline-CronQuestion, and Timeline-ICEWS-Actor -- show that AT2QA achieves new state-of-the-art performance, surpassing the strongest baselines by 10.7, 4.9, and 11.2 absolute points,…
Open paper
How Large Language Models Get Stuck: Early structure with persistent errors

Alokesh Manna, William Snyder, Whitney Tabor · Feb 27, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 54% Sparse protocol signal Freshness: Warm Status: Fallback
Pairwise Preference General
  • We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation.
  • We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types.
Open paper
Legal RAG Bench: an end-to-end benchmark for legal RAG

Abdur-Rahman Butler, Umar Butler · Mar 2, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics LawCoding
  • We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems.
  • As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure.
Open paper
PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% High protocol signal Freshness: Warm Status: Fallback
Rubric RatingExpert Verification Llm As JudgeAutomatic Metrics Medicine
  • Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
  • We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
Open paper
From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Weikang Shi, Houxing Ren, Junting Pan, Aojun Zhou, Ke Wang, Zimu Lu · Mar 3, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready
Math
  • Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching…
  • In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives.
Open paper
Model Agreement via Anchoring

Eric Eaton, Surbhi Goel, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta · Feb 26, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.