A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research.
Every paper includes structured metadata for quick triage.
Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term "world model" carries different meanings across research communities.
Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
These models achieve remarkably high accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark, establishing new state-of-the-art (SOTA) results across the interactive Pedagogy Benchmark Leaderboard and significantly surpassing…
Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager.
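The abstract does not specify how the two components interact; purely as an illustration, one minimal reading of a web-explorer plus cache-manager design is a lookup loop that consults the cache before exploring live. The sketch below uses invented names (CacheManager, explore_web) and is not the paper's implementation.

```python
class CacheManager:
    """Toy cache manager: remembers results of earlier explorations."""

    def __init__(self):
        self._store = {}

    def get(self, query):
        return self._store.get(query)

    def put(self, query, result):
        self._store[query] = result


def explore_web(query):
    # Placeholder for the web-explorer agent (search, browse, extract).
    return f"fresh result for {query!r}"


def answer(query, cache):
    cached = cache.get(query)
    if cached is not None:
        return cached            # reuse prior exploration
    result = explore_web(query)
    cache.put(query, result)     # persist for later turns
    return result


cache = CacheManager()
print(answer("sodium levels", cache))  # triggers exploration
print(answer("sodium levels", cache))  # served from cache
```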
Current evaluation of LLM reliability relies primarily on automated metrics that prioritize efficiency and scalability but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues.
By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree.
This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually?
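To make the contrast concrete, here is a minimal sketch (with invented evaluator IDs and scores, not the paper's data) of the two target constructions: one consensus label per idea versus one label stream per evaluator.

```python
import statistics

ratings = {  # idea_id -> {evaluator_id -> score on one criterion}
    "idea_1": {"e1": 4, "e2": 2, "e3": 5},
    "idea_2": {"e1": 3, "e2": 3, "e3": 2},
}

# Option A: a single judge trained against an aggregate consensus label.
consensus = {idea: statistics.mean(r.values()) for idea, r in ratings.items()}

# Option B: per-evaluator targets, preserving the disagreement that
# averaging erases.
evaluators = sorted({e for r in ratings.values() for e in r})
per_rater = {
    e: {idea: r[e] for idea, r in ratings.items()} for e in evaluators
}

print(consensus)        # {'idea_1': 3.666..., 'idea_2': 2.666...}
print(per_rater["e2"])  # {'idea_1': 2, 'idea_2': 3}
```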
Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions.
Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions.
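That criterion translates directly into a small check. The sketch below is a hypothetical rendering, not the paper's code: states are concept-value tuples, the optimal-action map is assumed known, and a concept is flagged whenever dropping it merges states whose optimal actions differ.

```python
from collections import defaultdict

def decision_relevant(concept_idx, states, optimal_action):
    """Flag a concept if removing it merges states needing different actions."""
    groups = defaultdict(set)
    for s in states:
        abstract = s[:concept_idx] + s[concept_idx + 1:]  # drop one concept
        groups[abstract].add(optimal_action[s])
    # A conflict means the dropped concept was needed to separate states.
    return any(len(actions) > 1 for actions in groups.values())

# Toy example: only concept 0 separates states with different actions.
states = [(0, 1), (1, 1)]
optimal_action = {(0, 1): "left", (1, 1): "right"}
print(decision_relevant(0, states, optimal_action))  # True
print(decision_relevant(1, states, optimal_action))  # False
```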
Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer…
Tags: Pairwise Preference, Expert Verification, LLM-as-Judge, General
This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key…
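As a concrete illustration of the system-level use of pairwise preferences (not the paper's actual pipeline), the following sketch ranks systems by win rate over a set of paired judgments; the system names and judgments are invented.

```python
from collections import Counter

judgments = [  # (system_a, system_b, winner) per paired comparison
    ("sysA", "sysB", "sysA"),
    ("sysA", "sysC", "sysC"),
    ("sysB", "sysC", "sysC"),
    ("sysA", "sysB", "sysA"),
]

wins, games = Counter(), Counter()
for a, b, winner in judgments:
    games[a] += 1
    games[b] += 1
    wins[winner] += 1

# System-level ranking by empirical win rate.
ranking = sorted(games, key=lambda s: wins[s] / games[s], reverse=True)
print([(s, round(wins[s] / games[s], 2)) for s in ranking])
# [('sysC', 1.0), ('sysA', 0.67), ('sysB', 0.0)]
```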
Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users.
Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional…
Evaluating the expert assessments makes it possible to estimate which similarity aspects (lexical, semantic, or nutritional) are most influential in expert decision-making.
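One standard way to estimate such influence, shown here as an assumed approach rather than the paper's method, is to regress the expert's accept/reject decision on per-aspect similarity scores and compare coefficient magnitudes; the feature values below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: lexical, semantic, nutritional similarity in [0, 1].
X = np.array([
    [0.9, 0.8, 0.2],
    [0.1, 0.9, 0.9],
    [0.8, 0.2, 0.1],
    [0.2, 0.3, 0.8],
    [0.7, 0.9, 0.6],
    [0.3, 0.1, 0.2],
])
y = np.array([1, 1, 0, 1, 1, 0])  # did the expert accept the match?

model = LogisticRegression().fit(X, y)
for name, coef in zip(["lexical", "semantic", "nutritional"], model.coef_[0]):
    print(f"{name}: {coef:+.2f}")  # larger magnitude = more influential
```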
A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent.
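As an illustration of the bilateral-position bookkeeping (field names assumed, not taken from the paper), a minimal data structure tracks commitments and denials and checks that no claim appears in both.

```python
from dataclasses import dataclass, field

@dataclass
class BilateralPosition:
    topic: str
    commitments: set[str] = field(default_factory=set)  # claims asserted
    denials: set[str] = field(default_factory=set)      # claims rejected

    def consistent(self) -> bool:
        # A claim cannot be both committed to and denied.
        return not (self.commitments & self.denials)

pos = BilateralPosition("sodium intake")
pos.commitments.add("excess sodium raises blood pressure")
pos.denials.add("sodium intake is irrelevant to hypertension")
print(pos.consistent())  # True
```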
However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports.
GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy while producing superior symbolic representations relative to baselines.