
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.


Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • We present Kwame 2.0, a bilingual (English-French) generative AI teaching assistant built using retrieval-augmented generation and deployed in a human-in-the-loop forum within SuaCode, an introductory mobile-based coding course for learners…
  • Evaluation using community feedback and expert ratings shows that Kwame 2.0 provided high-quality and timely support, achieving high accuracy on curriculum-related questions, while human facilitators and peers effectively mitigated errors,…
Open paper
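The "Match reason" and "Score" fields shown on each card are not documented by the site; the following is a hypothetical sketch of how a keyword-overlap count such as "2/2" could be computed. The function name, keyword list, and field values are all assumptions, not the site's actual algorithm:

```python
def match_score(query_keywords, title, protocol_fields):
    """Toy keyword-overlap match: count how many query keywords
    appear in the title or the protocol fields (case-insensitive)."""
    haystack = (title + " " + " ".join(protocol_fields)).lower()
    hits = sum(1 for kw in query_keywords if kw.lower() in haystack)
    return hits, len(query_keywords)

hits, total = match_score(
    ["feedback", "evaluation"],
    "Kwame 2.0: A Bilingual Teaching Assistant",  # hypothetical title field
    ["community feedback evaluation", "expert ratings"],
)
print(f"Keyword overlap {hits}/{total}")  # → Keyword overlap 2/2
```

A real implementation would presumably also tokenize, stem, and weight fields before mapping the overlap onto a percentage score.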
On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR

Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar · Mar 30, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Marco DeepResearch: Unlocking Efficient Deep Research Agents via Verification-Centric Design

Bin Zhu, Qianghuai Jia, Tian Lan, Junyang Ren, Feng Gu, Feihu Jiang · Mar 30, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Long Horizon General
  • Deep research agents autonomously conduct open-ended investigations, integrating complex information retrieval with multi-step reasoning across diverse sources to solve real-world problems.
  • To address this, we present Marco DeepResearch, a deep research agent optimized with a verification-centric framework design at three levels: (1) QA Data Synthesis: We introduce verification mechanisms to graph-based and agent-based QA…
Open paper
PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Rubric Rating Expert Verification Automatic Metrics Simulation Env Coding
  • We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
  • Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.
Open paper
Multi-Agent Dialectical Refinement for Enhanced Argument Classification

Jakub Bąba, Jarosław A. Chudziak · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
LLM As Judge Automatic Metrics Multi Agent General
  • We introduce MAD-ACC (Multi-Agent Debate for Argument Component Classification), a framework that leverages dialectical refinement to resolve classification uncertainty.
  • Evaluation on the UKP Student Essays corpus demonstrates that MAD-ACC achieves a Macro F1 score of 85.7%, significantly outperforming single-agent reasoning baselines, without requiring domain-specific training.
Open paper
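Macro F1, the metric MAD-ACC reports, averages per-class F1 with equal weight, so rare argument components count as much as frequent ones. A minimal sketch of the computation (the labels below are illustrative, not drawn from the UKP Student Essays corpus):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the
    unweighted mean across all classes."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t, but it was missed
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Illustrative argument-component labels
gold = ["claim", "premise", "premise", "major", "claim"]
pred = ["claim", "premise", "claim",   "major", "claim"]
print(round(macro_f1(gold, pred), 3))  # → 0.822
```

Because each class contributes equally, a single badly-classified minority class pulls the score down sharply, which is why it is the usual choice for imbalanced argument-mining corpora.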
Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% High protocol signal Freshness: Hot Status: Ready
Expert Verification Automatic Metrics Medicine
  • Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients.
  • Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained…
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% Moderate protocol signal Freshness: Hot Status: Ready
Human Eval General
  • These systems commonly achieve performance levels comparable to or superior to trained human raters, but have frequently been demonstrated to be vulnerable to the influence of construct-irrelevant factors (i.e., features of responses that…
Open paper
Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching

Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Ready
Long Horizon Math
  • Finally, we empirically validate the advantages of our contributions across multiple continuous control benchmarks.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before.
  • We evaluate on BigCodeBench, KGQAGen-10k, and Humanity's Last Exam using Claude Sonnet 4.5 and Opus 4.5.
Open paper
Retromorphic Testing with Hierarchical Verification for Hallucination Detection in RAG

Boxi Yu, Yuzhong Zhang, Liting Lin, Lionel Briand, Emir Muñoz · Mar 29, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • We evaluate RT4CHART on RAGTruth++ (408 samples) and RAGTruth-Enhance (2,675 samples), a newly re-annotated benchmark.
  • Finally, our re-annotation reveals 1.68x more hallucination cases than the original labels, suggesting that existing benchmarks substantially underestimate the prevalence of hallucinations.
Open paper
Long-Document QA with Chain-of-Structured-Thought and Fine-Tuned SLMs

Zhuowen Liang, Xiaotian Lin, Zhengxuan Zhang, Yuyu Luo, Haixun Wang, Nan Tang · Mar 31, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Adaptive Block-Scaled Data Types

Jack Cook, Hyemin S. Lee, Kathryn Le, Junxian Guo, Giovanni Traverso, Anantha P. Chandrakasan · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
GraphWalker: Agentic Knowledge Graph Question Answering via Synthetic Trajectory Curriculum

Shuwen Xu, Yao Xu, Jiaxiang Liu, Chenhao Yuan, Wenshuo Peng, Jun Zhao · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Long Horizon Coding
  • Agentic knowledge graph question answering (KGQA) requires an agent to iteratively interact with knowledge graphs (KGs), posing challenges in both training data scarcity and reasoning generalization.
  • Specifically, existing approaches often restrict agent exploration: prompting-based methods lack autonomous navigation training, while current training pipelines usually confine reasoning to predefined trajectories.
Open paper
An Empirical Recipe for Universal Phone Recognition

Shikhar Bharadwaj, Chin-Jou Li, Kwanghee Choi, Eunjung Yeo, William Chen, Shinji Watanabe · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
Coding Multilingual
  • Through controlled ablations with evaluations across 100+ languages under a unified scheme, we empirically establish our training recipe and quantify the impact of SSL representations, data scale, and loss objectives.
Open paper
Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Comparison and evaluation of graph-based representations of sentence meaning is a challenge because competing representations of the same sentence may have a different number of nodes, and it is not obvious which nodes should be compared to…
  • We compare it with previously used approaches, in particular smatch (the de-facto standard in AMR evaluation), and argue that sensitivity to word alignment makes the comparison of meaning representations more intuitive and interpretable,…
Open paper
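The alignment problem described in the entry above is what smatch addresses: it searches for a node mapping that maximizes the overlap of relation triples between two graphs and reports F1 over the matched triples. A minimal sketch of that triple-overlap F1, assuming the node alignment has already been fixed (the triples below are illustrative, not from the paper):

```python
def triple_f1(gold_triples, pred_triples):
    """Simplified smatch-style score: with node names already aligned,
    compare two graphs as sets of (source, relation, target) triples."""
    gold, pred = set(gold_triples), set(pred_triples)
    matched = len(gold & pred)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("want", ":ARG0", "boy"), ("want", ":ARG1", "go"), ("go", ":ARG0", "boy")}
pred = {("want", ":ARG0", "boy"), ("want", ":ARG1", "go")}
print(round(triple_f1(gold, pred), 3))  # precision 1.0, recall 2/3 → 0.8
```

The hard part that this sketch omits is the alignment search itself: smatch must try many node mappings (via hill-climbing) because, as the entry notes, it is not obvious which nodes of one graph correspond to which nodes of the other.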
GlowQ: Group-Shared LOw-Rank Approximation for Quantized LLMs

Selim An, Il hong Suh, Yeseong Kim · Mar 26, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
