Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 411 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,651) General (540) Long Horizon (322) Pairwise Preference (294) Coding (222) Simulation Env (191) Multi Agent (185) Medicine (117) Llm As Judge (110) Expert Verification (98) Human Eval (90) Rubric Rating (83) Math (79) Web Browsing (79) Demonstrations (67) Critique Edit (65)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Seeing Fast and Slow: Learning the Flow of Time in Videos
Apr 23, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability
Apr 23, 2026 · Citations: 0

We argue that this temporal taskification step is not a neutral preprocessing choice, but a structural component of evaluation: different valid splits of the same stream can induce different CL regimes and therefore different benchmark…
Evaluation of Automatic Speech Recognition Using Generative Large Language Models
Apr 23, 2026 · Citations: 0

Embedding-based semantic metrics are better correlated with human perception, but decoder-based Large Language Models (LLMs) remain underexplored for this task.
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
Apr 16, 2026 · Citations: 0

We propose MM-WebAgent, a hierarchical agentic framework for multimodal webpage generation that coordinates AIGC-based element generation through hierarchical planning and iterative self-reflection.
Latent-Condensed Transformer for Efficient Long Context Modeling
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Beyond Prompt: Fine-grained Simulation of Cognitively Impaired Standardized Patients via Stochastic Steering
Apr 14, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
METER: Evaluating Multi-Level Contextual Causal Reasoning in Large Language Models
Apr 13, 2026 · Citations: 0

Existing benchmarks, however, often evaluate this skill in fragmented settings, failing to ensure context consistency or cover the full causal hierarchy.
Revisiting Compositionality in Dual-Encoder Vision-Language Models: The Role of Inference
Apr 13, 2026 · Citations: 0

Dual-encoder Vision-Language Models (VLMs) such as CLIP are often characterized as bag-of-words systems due to their poor performance on compositional benchmarks.
METRO: Towards Strategy Induction from Expert Dialogue Transcripts for Non-collaborative Dialogues
Apr 13, 2026 · Citations: 0

Developing non-collaborative dialogue agents traditionally requires the manual, unscalable codification of expert strategies.
Towards Proactive Information Probing: Customer Service Chatbots Harvesting Value from Conversation
Apr 13, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
Apr 13, 2026 · Citations: 0

We introduce OccuBench, a benchmark covering 100 real-world professional task scenarios across 10 industry categories and 65 specialized domains, enabled by Language Environment Simulators (LESs) that simulate domain-specific environments…
Who Wrote This Line? Evaluating the Detection of LLM-Generated Classical Chinese Poetry
Apr 11, 2026 · Citations: 0

To address these issues, we introduce ChangAn, a benchmark for detecting LLM-generated classical Chinese poetry that containing total 30,664 poems, 10,276 are human-written poems and 20,388 poems are generated by four popular LLMs.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Anterior's Approach to Fairness Evaluation of Automated Prior Authorization System

Sai P. Selvaraj, Khadija Mahmoud, Anuj Iravane · Mar 15, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

We propose a fairness evaluation framework for prior authorization models based on model error rates rather than approval outcomes.
Using 7,166 human-reviewed cases spanning 27 medical necessity guidelines, we assessed consistency in sex, age, race/ethnicity, and socioeconomic status.

Open paper

Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction

Shidong He, Haoyu Wang, Wenjie Luo · Mar 14, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 73% Sparse protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Jon-Paul Cacioli · Mar 16, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics.
All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80).

Open paper

Truth as a Compression Artifact in Language Model Training

Konstantin Krestnikov · Mar 12, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise Preference Automatic Metrics MathCoding

In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus.
Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy).

Open paper

CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Teng Pan, Yuchen Yan, Zixuan Wang, Ruiqing Zhang, Guiyang Hou, Wenqi Zhang · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Math

Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks.

Open paper

Attention-guided Evidence Grounding for Spoken Question Answering

Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He · Mar 17, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready

General

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Tae-Eun Song · Mar 12, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware…

Open paper

Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi · Mar 12, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 61% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Web Browsing General

To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off.
Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning.

Open paper

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen · Mar 16, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 51% Sparse protocol signal Freshness: Warm Status: Ready

Coding

Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable.
To mitigate this threat, we propose the first guardrail that operates outside the agent's perceptual loop.

Open paper

LED: A Benchmark for Evaluating Layout Error Detection in Document Analysis

Inbum Heo, Taewook Hwang, Jeesu Jung, Sangkeun Jung · Mar 18, 2026

Citations: 0

Match reason: Keyword overlap 1/2 across title and protocol fields.

Score: 57% Moderate protocol signal Freshness: Warm Status: Fallback

Automatic MetricsSimulation Env General

To overcome this limitation, we propose Layout Error Detection (LED), a benchmark that evaluates structural reasoning in DLA predictions beyond surface-level accuracy.
Using these definitions, we construct LED-Dataset and design three evaluation tasks: document-level error detection, document-level error-type classification, and element-level error-type classification.

Open paper

The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QA

Yasaman Zarrinkia, Venkatesh Srinivasan, Alex Thomo · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and…
Surprisingly, we show that, with question-type routing, a fully augmented budget open-weight Llama-8B model matches or exceeds the unaugmented Llama-70B baseline on all three benchmarks at ~12x lower cost.

Open paper

PCodeTrans: Translate Decompiled Pseudocode to Compilable and Executable Equivalent

Yuxin Cui, Zeyu Gao, Shuxian He, Siliang Qin, Chao Zhang · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Decompilation is foundational to binary analysis, yet conventional tools prioritize human readability over strict recompilability and verifiable runtime correctness.
While recent LLM-based approaches attempt to refine decompiled pseudocode, they typically either optimize solely for readability or rely on static analysis for evaluation.

Open paper

Universe Routing: Why Self-Evolving Agents Need Epistemic Control

Zhaohui Geoffrey Wang · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics General

A critical failure mode of current lifelong agents is not lack of knowledge, but the inability to decide how to reason.
When an agent encounters "Is this coin fair?" it must recognize whether to invoke frequentist hypothesis testing or Bayesian posterior inference - frameworks that are epistemologically incompatible.

Open paper

Gauge-Equivariant Intrinsic Neural Operators for Geometry-Consistent Learning of Elliptic PDE Maps

Pengcheng Cheng · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 35% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Math

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Criterion-referenceability determines LLM-as-a-judge validity across physics assessment formats

Will Yeadon, Tom Hardy, Paul Mackay, Elise Agra · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 38% Moderate protocol signal Freshness: Warm Status: Fallback

Llm As JudgeAutomatic Metrics Coding

We evaluate LLM-as-a-judge marking across three physics assessment formats - structured questions, written essays, and scientific plots - comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations…
Across n=55 scripts (n=275 essays), blind AI marking is harsher and more variable than human marking, with discriminative validity already poor (ρ\approx 0.1).

Open paper

From Words to Worlds: Benchmarking Cross-Cultural Cultural Understanding in Machine Translation

Bangju Han, Yingqi Wang, Huang Qing, Tiyuan Li, Fengyi Yang, Ahtamjan Ahmat · Mar 18, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready

CodingMultilingual

To address this gap, we introduce CulT-Eval, a benchmark designed to evaluate how models handle different types of culturally grounded expressions.
Accordingly, we propose a complementary evaluation metric that targets culturally induced meaning deviations overlooked by standard MT metrics.

Open paper

GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev · Mar 14, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready

General

We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.

Open paper

TopoBench: Benchmarking LLMs on Hard Topological Reasoning

Mayug Maniparambil, Nils Hoehing, Janak Kapuriya, Arjun Karuvally, Ellen Rushe, Anthony Ventresque · Mar 12, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 32% Sparse protocol signal Freshness: Warm Status: Ready

Coding

To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels.
Code and data are available at github.com/mayug/topobench-benchmark.

Open paper

HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang · Mar 17, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain's multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text…
Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains.

Open paper

Towards Next-Generation LLM Training: From the Data-Centric Perspective

Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng · Mar 16, 2026

Citations: 0

Match reason: Matched by broad semantic/index fallback.

Score: 28% Sparse protocol signal Freshness: Warm Status: Ready

General

In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing…
First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent