Skip to content

Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 397 Search mode: keyword Shortlist (0) RSS

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.

No spam. Unsubscribe anytime.

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% High protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • On ELT-Bench, the first benchmark for end-to-end ELT pipeline construction, AI agents initially showed low success rates, suggesting they lacked practical utility.
  • Second, we develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' kappa = 0.85) to audit benchmark quality.
Open paper
PRBench: End-to-end Paper Reproduction in Physics Research

Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu, Mao Li · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Rubric RatingExpert Verification Automatic MetricsSimulation Env Coding
  • We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
  • Using an agentified assessment pipeline, we evaluate a set of coding agents on PRBench and analyze their capabilities across key dimensions of scientific reasoning and execution.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow.
  • The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into…
Open paper
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 90% High protocol signal Freshness: Hot Status: Ready
Rubric RatingExpert Verification Automatic Metrics LawMedicine
  • To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
  • To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Multilingual
  • Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments.
  • Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior…
Open paper
Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero · Mar 26, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics LawCoding
  • Despite its central role, chunking lacks a dedicated evaluation framework, making it difficult to assess and compare strategies independently of downstream performance.
Open paper
On the Role of Encoder Depth: Pruning Whisper and LoRA Fine-Tuning in SLAM-ASR

Ganesh Pavan Kartikeya Bharadwaj Kolluri, Michael Kampouridis, Ravi Shekhar · Mar 30, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • No constrained agent outperforms the control individually, yet a 3-agent ensemble achieves 100% ground-truth coverage versus 88.2% for the control.
  • A permutation test confirms only 8% of random 3-agent subsets achieve full coverage, and every successful subset contains the counterfactual agent.
Open paper
Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana · Mar 27, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
EffiSkill: Agent Skill Based Automated Code Efficiency Optimization

Zimu Wang, Yuling Shi, Mengfan Li, Zijun Liu, Jie M. Zhang, Chengcheng Wan · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Ready
MedicineCoding
  • In this paper, we present EffiSkill, a framework for code-efficiency optimization that builds a portable optimization toolbox for LLM-based agents.
  • The key idea is to model recurring slow-to-fast transformations as reusable agent skills that capture both concrete transformation mechanisms and higher-level optimization strategies.
Open paper
AgentSwing: Adaptive Parallel Context Management Routing for Long-Horizon Web Agents

Zhaopeng Feng, Liangcai Su, Zhen Zhang, Xinyu Wang, Xiaotian Zhang, Xiaobin Wang · Mar 29, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • As large language models (LLMs) evolve into autonomous agents for long-horizon information-seeking, managing finite context capacity has become a critical bottleneck.
  • Building on this perspective, we propose AgentSwing, a state-aware adaptive parallel context management routing framework.
Open paper

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • LLM-based autonomous agents lack persistent procedural memory: they re-derive solutions from scratch even when structurally identical tasks have been solved before.
  • We evaluate on BigCodeBench~zhuo2025bigcodebench, KGQAGen-10k~zhang2025kgqagen, and Humanity's Last Exam~phan2025hle using Claude Sonnet 4.5 and Opus 4.5.
Open paper
LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Meituan LongCat Team, Bin Xiao, Chao Wang, Chengjiang Li, Chi Zhang, Chong Peng · Mar 29, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
General
  • As an industrial-strength foundation model, it excels at seeing, painting, and talking within a single framework, achieving strong performance across a wide range of multimodal benchmarks.
Open paper
Training data generation for context-dependent rubric-based short answer grading

Pavel Šindelář, Dávid Slivka, Christopher Bouma, Filip Prášil, Ondřej Bojar · Mar 30, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Fallback
Rubric Rating General
  • However, having to avoid language differences and annotator bias makes the grading of student answers challenging.
Open paper

Protocol Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.
Self-Service
Post a Job
Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.
Managed Service
For Large Projects
Done-for-You
We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.
For Freelancers
Join as an AI Trainer
Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.