
Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411





Evaluation of Automatic Speech Recognition Using Generative Large Language Models

Thibault Bañeras-Roux, Shashi Kumar, Driss Khalil, Sergio Burdisso, Petr Motlicek, Shiran Liu · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
  • On the HATS dataset, the best LLMs achieve 92–94% agreement with human annotators for hypothesis selection, compared to 63% for WER, also outperforming semantic metrics.
Open paper
TEC: A Collection of Human Trial-and-error Trajectories for Problem Solving

Xinkai Zhang, Jingtao Zhan, Yiqun Liu, Qingyao Ai · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 87% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Trial-and-error is a fundamental strategy for humans to solve complex problems and a necessary capability for Artificial Intelligence (AI) systems operating in real-world environments.
  • With this dataset, we observe that humans achieve substantially higher accuracy compared to LLMs, which demonstrates that humans are more effective in trial-and-error than LLMs.
Open paper
Closing the Speech-Text Gap with Limited Audio for Effective Domain Adaptation in LLM-Based ASR

Thibault Bañeras-Roux, Sergio Burdisso, Esaú Villatoro-Tello, Dairazalia Sánchez-Cortés, Shiran Liu, Severin Baroudi · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • The quality of these edits is typically evaluated against human annotations.
  • However, a sentence may admit multiple valid corrections, and existing evaluation settings do not fully accommodate diverse application scenarios.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 64% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Fine-tuned checkpoints and evaluation scripts are released on HuggingFace.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 68% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon Math
  • Inspired by human cognitive processes, we introduce a backward verification mechanism at each hierarchical layer.
  • Experiments on four mathematical benchmarks demonstrate the effectiveness of our method.
Open paper
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Nicolae Filat, Ahmed Hussain, Konstantinos Kalogiannis, Elena Burceanu · Apr 23, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
General
  • We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark…
  • Across 9-, 30-, and 44-day splits, we observe substantial changes in forecasting error, forgetting, and backward transfer, showing that taskification alone can materially affect CL evaluation.
Open paper
Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 58% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Sparse protocol signal Freshness: Hot Status: Fallback
Critique Edit Coding
  • While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
  • This observation helps explain why the agent achieves near-perfect superficial syntactic alignment yet fails to detect or resolve deeper semantic errors.
Open paper
A Multi-Stage Validation Framework for Trustworthy Large-scale Clinical Information Extraction using Large Language Models

Maria Mahbub, Gregory M. Dams, Josh Arnold, Caitlin Rizy, Sudarshan Srinivasan, Elliot M. Fielstein · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% High protocol signal Freshness: Hot Status: Ready
Expert Verification Automatic Metrics Medicine Multilingual
  • Conventional evaluation methods rely heavily on annotation-intensive reference standards or incomplete structured data, limiting feasibility at population scale.
  • Using judge-evaluated outputs as references, the primary LLM achieved an F1 score of 0.80 under relaxed matching criteria.
Open paper
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models

Pengfeng Li, Chen Huang, Chaoqun Hao, Hongyao Chen, Xiao-Yong Wei, Wenqiang Lei · Apr 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
  • To address this, we pioneer METER to systematically benchmark LLMs across all three levels of the causal ladder under a unified context setting.
Open paper
Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 42% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45–87% tokens on…
Open paper
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

Xiaomeng Hu, Yinger Zhang, Fei Huang, Jianhong Tu, Yang Su, Lianghao Deng · Apr 13, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 45% Moderate protocol signal Freshness: Hot Status: Fallback
Simulation Env Multi Agent General
  • We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
  • We evaluate 15 frontier models across 8 model families and find that: (1) no single model dominates all industries, as each has a distinct occupational capability profile; (2) implicit faults (truncated data, missing fields) are harder than…
Open paper
Latent-Condensed Transformer for Efficient Long Context Modeling

Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li · Apr 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
On the Price of Privacy for Language Identification and Generation

Xiaoyu Li, Andi Han, Jiaojiao Jiang, Junbin Gao · Apr 8, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Sparse protocol signal Freshness: Hot Status: Ready
General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Severity-Aware Weighted Loss for Arabic Medical Text Generation

Ahmed Alansary, Molham Mohamed, Ali Hamdi · Apr 7, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Sparse protocol signal Freshness: Hot Status: Ready
Medicine
  • Experiments are conducted using the MAQA dataset, which provides Arabic medical complaints and trusted human responses.
Open paper
