

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 22 · Search mode: keyword

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

The top RLHF, evaluation, and human feedback papers — curated and summarized every Friday.


Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.


Application-Driven Pedagogical Knowledge Optimization of Open-Source LLMs via Reinforcement Learning and Supervised Fine-Tuning

Navan Preet Singh, Xiaokun Wang, Anurag Garikipati, Madalina Ciobanu, Qingqing Mao, Ritankar Das · Apr 7, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready
Expert Verification Automatic Metrics General
  • Remarkably, these models achieve accuracy on the Cross-Domain Pedagogical Knowledge (CDPK) Benchmark high enough to establish new state-of-the-art (SOTA) results on the interactive Pedagogy Benchmark Leaderboard, significantly surpassing…
Open paper
SODIUM: From Open Web Data to Queryable Databases

Chuxuan Hu, Philip Li, Maxwell Yang, Daniel Kang · Mar 19, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% High protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Multi Agent General
  • Existing systems struggle with SODIUM tasks: we evaluate 6 advanced AI agents on SODIUM-Bench, with the strongest baseline achieving only 46.5% accuracy.
  • To bridge this gap, we develop SODIUM-Agent, a multi-agent system composed of a web explorer and a cache manager.
Open paper
Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Rlaif Or Synthetic Feedback Automatic Metrics General
  • Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples).
Open paper
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems

Anna Martin-Boyle, William Humphreys, Martha Brown, Cara Leckey, Harmanpreet Kaur · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics General
  • Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in…
  • We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues.
Open paper
LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li, Lingyong Yan · Feb 15, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics General
  • By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art…
Open paper
Aggregate vs. Personalized Judges in Business Idea Evaluation: Evidence from Expert Disagreement

Wataru Hirota, Tomoki Taniguchi, Tomoko Ohkuma, Kosuke Takahashi, Takahiro Omi, Kosuke Arima · Apr 24, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Rubric Rating Expert Verification General
  • Unlike standard NLP benchmarks, business idea evaluation relies on multi-dimensional criteria such as feasibility, novelty, differentiation, user need, and market size, and expert judgments often disagree.
  • This paper studies a methodological question raised by such disagreement: should an automatic judge approximate an aggregate consensus, or model evaluators individually?
Open paper
Seeing but Not Thinking: Routing Distraction in Multimodal Mixture-of-Experts

Haolei Xu, Haiwen Hong, Hongxing Li, Rui Zhou, Yang Zhang, Longtao Huang · Apr 9, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Expert Verification General
  • Experiments on three multimodal MoE models across six benchmarks demonstrate consistent improvements, with gains of up to 3.17% on complex visual reasoning tasks.
Open paper
Selecting Decision-Relevant Concepts in Reinforcement Learning

Naveen Raman, Stephanie Milani, Fei Fang · Apr 6, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Expert Verification General
  • Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions.
  • Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions.
Open paper
FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models

Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang · Apr 2, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 58% Sparse protocol signal Freshness: Hot Status: Fallback
Expert Verification General
  • Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer…
Open paper
Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada · Mar 6, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Expert Verification Llm As Judge General
  • This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods.
  • We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key…
Open paper
"Are You Sure?": An Empirical Study of Human Perception Vulnerability in LLM-Driven Agentic Systems

Xinfeng Li, Shenyu Dai, Kelong Zheng, Yue Xiao, Gelei Deng, Wei Dong · Feb 24, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics General
  • Large language model (LLM) agents are rapidly becoming trusted copilots in high-stakes domains like software development and healthcare.
  • However, this deepening trust introduces a novel attack surface: Agent-Mediated Deception (AMD), where compromised agents are weaponized against their human users.
Open paper
Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu · Feb 16, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready
Pairwise Preference Rubric Rating Multi Agent General
  • Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined.
  • Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional…
Open paper
Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation

Denica Kjorvezir, Danilo Najkov, Eva Valenčič, Erika Jesenko, Barbara Koroušić Seljak, Tome Eftimov · Mar 10, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Expert Verification General
  • The evaluation of expert assessments enables the estimation of which similarity aspects--lexical, semantic, or nutritional--are most influential in expert decision-making.
Open paper

Match reason: Matches selected tags (General, Expert Verification).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Expert Verification General
  • A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent.
Open paper
Hyper-KGGen: A Skill-Driven Knowledge Extractor for High-Quality Knowledge Hypergraph Generation

Rizhuo Huang, Yifan Feng, Rundong Xue, Shihui Ying, Jun-Hai Yong, Chuan Shi · Feb 23, 2026

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 52% Sparse protocol signal Freshness: Warm Status: Fallback
Expert Verification General
  • Additionally, we present HyperDocRED, a rigorously annotated benchmark for document-level knowledge hypergraph extraction.
Open paper
DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park, Hosung Song · Dec 19, 2025

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 50% Moderate protocol signal Freshness: Cold Status: Ready
Rubric Rating Expert Verification Long Horizon General
  • However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
  • To address these issues, we propose DEER, a benchmark for evaluating expert-level deep research reports.
Open paper
GraphMERT: Efficient and Scalable Distillation of Reliable Knowledge Graphs from Unstructured Data

Margarita Belova, Jiaxin Xiao, Shikhar Tuli, Niraj K. Jha · Oct 10, 2025

Citations: 0

Match reason: Matches selected tags (General, Expert Verification).

Score: 53% Moderate protocol signal Freshness: Cold Status: Fallback
Expert Verification Automatic Metrics General
  • GraphMERT + KG is the first efficient and scalable neurosymbolic model to achieve state-of-the-art benchmark accuracy along with superior symbolic representations relative to baselines.
Open paper
