Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 129 · Search mode: keyword · Ranking: eval-signal prioritized

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 95% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Experiments on the MMEB-V2 benchmark demonstrate that our model achieves a score of 71.2 with only 4B parameters, establishing a new state-of-the-art while significantly reducing reasoning overhead and inference latency.
Open paper
Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning

Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li, Yuchen Wu · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 100% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Multi Agent Math Law
  • In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
  • Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy compared to existing methods while demonstrating superior token efficiency and robustness against agent failure.
Open paper
Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Across two challenging benchmarks, E-VQA and InfoSeek, Region-R1 delivers consistent gains, achieving state-of-the-art performance by increasing conditional Recall@1 by up to 20%.
Open paper
Cost-Efficient Estimation of General Abilities Across Benchmarks

Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner · Apr 1, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 90% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error…
  • We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
Open paper
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents

Heng Zhou, Zelin Tan, Zhemeng Zhang, Yutao Fan, Yibing Lin, Li Kang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • When an LLM-based agent improves on a task, is the gain from the model itself or from the reasoning paradigm wrapped around it?
  • We study this question by comparing six inference-time paradigms, namely Direct, CoT, ReAct, Plan-Execute, Reflection, and ReCode, across four frontier LLMs and ten benchmarks, yielding roughly 18,000 runs.
Open paper
Reliable Control-Point Selection for Steering Reasoning in Large Language Models

Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Math Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Aligning Multimodal Sequential Recommendations via Robust Direct Preference Optimization with Sparse MoE

Hejin Huang, Jusheng Zhang, Kaitong Cai, Jian Wang, Rong Pan · Mar 31, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 84% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics General
  • Preference-based alignment objectives have been widely adopted, from RLHF-style pairwise learning in large language models to emerging applications in recommender systems.
  • With an optional sparse Mixture-of-Experts encoder for efficient capacity scaling, RoDPO achieves up to 5.25% NDCG@5 on three Amazon benchmarks, with nearly unchanged inference cost.
Open paper

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 89% High protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use General
  • Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood.
  • We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark.
Open paper
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Jiayuan Ye, Vitaly Feldman, Kunal Talwar · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Law
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • This suggests that iTAG-generated data can serve as a practical surrogate for scalable benchmarking of text-based causal discovery algorithms.
Open paper
Adaptive Prompt Structure Factorization: A Framework for Self-Discovering and Optimizing Compositional Prompt Programs

Haoyue Liu, Zhichao Wang, Yongxin Guo, Haoran Shou, Xiaoying Tang · Apr 8, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Across multiple advanced reasoning benchmarks, aPSF outperforms strong baselines including principle-aware optimizers, improving accuracy by up to +2.16 percentage points on average, and reduces optimization cost by 45–87% tokens on…
Open paper
Context-Aware Dialectal Arabic Machine Translation with Interactive Region and Register Selection

Afroza Nowshin, Prithweeraj Acharjee Porag, Haziq Jeelani, Fayeq Jeelani Syed · Apr 7, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Multilingual
  • Through a combination of automatic evaluation and qualitative analysis, we observe an apparent accuracy-fidelity trade-off: high-resource baselines such as NLLB (No Language Left Behind) achieve higher aggregate BLEU scores (13.75) by…
  • Supporting qualitative evaluation, including an LLM-assisted cultural authenticity analysis, suggests improved dialectal alignment compared to baseline systems (4.80/5 vs.
Open paper
Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Math Multilingual
  • Across two mathematical reasoning benchmarks, four LRMs, and 10 languages, we find that most features are positively associated with accuracy, but the strength of association varies considerably across languages and can even reverse in…
  • Our findings challenge English-centric reward designs and point toward adaptive objectives that accommodate language-specific reasoning patterns, with concrete implications for multilingual benchmark and reward design.
Open paper
Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models

Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 79% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Math Medicine
  • Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear.
  • Here, we examine how first-person emotional framing in user-side queries affects LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and…
Open paper
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Weian Mao, Xi Lin, Wei Huang, Yuxin Xie, Tianfu Fu, Bohan Zhuang · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Ready
Pairwise Preference Automatic Metrics Law
  • Via the trigonometric series, we use the distance preference characterized by these centers to score keys according to their positions, and also leverage Q/K norms as an additional signal for importance estimation.
Open paper
Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions

Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi · Apr 2, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics Coding
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 68% Moderate protocol signal Freshness: Hot Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Verify Before You Commit: Towards Faithful Reasoning in LLM Agents via Self-Auditing

Wenhao Yuan, Chenchen Lin, Jian Chen, Jinfeng Xu, Xuehe Wang, Edith Cheuk Han Ngai · Apr 9, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Long Horizon General
  • In large language model (LLM) agents, reasoning trajectories are treated as reliable internal beliefs for guiding actions and updating memory.
  • In this paper, inspired by the vulnerability of unfaithful intermediate reasoning trajectories, we propose Self-Audited Verified Reasoning (SAVeR), a novel framework that enforces verification over internal belief states within the agent…
Open paper

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 73% Moderate protocol signal Freshness: Hot Status: Fallback
Automatic Metrics Tool Use General
  • Autonomous tool-using agents in networked environments must decide which information source to query and when to stop querying and act.
  • Without principled bounds on information-acquisition costs, unconstrained agents exhibit systematic failure modes: excessive tool use under congestion, prolonged deliberation under time decay, and brittle behavior under ambiguous evidence.
Open paper
Selecting Decision-Relevant Concepts in Reinforcement Learning

Naveen Raman, Stephanie Milani, Fei Fang · Apr 6, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 63% Sparse protocol signal Freshness: Hot Status: Fallback
Expert Verification General
  • Training interpretable concept-based policies requires practitioners to manually select which human-understandable concepts an agent should reason with when making sequential decisions.
  • Our key insight is that concept selection can be viewed through the lens of state abstraction: intuitively, a concept is decision-relevant if removing it would cause the agent to confuse states that require different actions.
Open paper
