Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 129 · Search mode: keyword · Ranking: eval-signal prioritized


MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models

Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng, Kyle Lam · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Ready
Expert Verification Automatic Metrics Medicine Coding
  • Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
  • We introduce MEDSYN, a multilingual, multimodal benchmark of highly complex clinical cases with up to 7 distinct visual clinical evidence (CE) types per case.
Open paper
RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning

Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li, Zhen Qin · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 88% Moderate protocol signal Freshness: Warm Status: Ready
Rubric Rating Automatic Metrics General
  • Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
Open paper
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference

Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 4/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 93% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Tool Use General
  • Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20% to 40%.
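The cost-saving idea in this entry (query a cheap model first and escalate to a larger one only when confidence is low) follows the familiar confidence-cascade pattern. The sketch below is a minimal generic illustration under assumed interfaces — the `Model` callable signature and the 0.8 threshold are placeholders, not the paper's actual method.

```python
# Minimal confidence-driven model cascade (illustrative sketch only;
# model interface and threshold are assumptions, not the paper's design).
from typing import Callable, List, Tuple

Model = Callable[[str], Tuple[str, float]]  # returns (answer, confidence)

def cascade_answer(question: str,
                   models: List[Tuple[str, Model]],
                   threshold: float = 0.8) -> Tuple[str, str]:
    """Query models from cheapest to largest; stop at the first answer
    whose confidence clears the threshold (else fall back to the last)."""
    name, answer = "", ""
    for name, model in models:
        answer, conf = model(question)
        if conf >= threshold:
            break  # confident enough: skip the larger, costlier models
    return name, answer
```

The cost reduction comes from most queries terminating at the small model; only low-confidence cases pay for the large one.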
Open paper
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models

Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
  • Evaluation reveals substantial performance variation, with accuracy ranging from 14.29% to 99.05% across models and strategies.
Open paper
Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestion, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories.
  • We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic routing mechanism that sends queries to specialized retrieval modules to further improve retrieval performance.
Open paper
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 71% High protocol signal Freshness: Warm Status: Ready
Pairwise Preference Automatic Metrics Math
  • Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
  • Results show that pairwise self-preferences provide strong optimization signal for test-time improvement over large, discrete output spaces.
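The core loop described above — candidates compete through pairwise preferences rather than scalar rewards — can be sketched generically. In the sketch below, the `prefer` and `mutate` callables are stand-ins for the LLM self-preference call and candidate generator; nothing here reproduces the paper's actual algorithm.

```python
# Illustrative sketch: evolutionary search driven by pairwise preferences
# instead of scalar rewards. `prefer` and `mutate` are hypothetical
# stand-ins for the LLM judge and generator.
import random
from typing import Callable, List

def duel_select(pop: List[str],
                prefer: Callable[[str, str], str],
                rounds: int,
                mutate: Callable[[str], str],
                rng: random.Random) -> List[str]:
    """Evolve `pop` by repeated duels: sample two candidates, ask the
    preference oracle which is better, and replace the loser with a
    mutated copy of the winner. No scalar fitness is ever computed."""
    for _ in range(rounds):
        a, b = rng.sample(pop, 2)
        winner = prefer(a, b)               # pairwise preference only
        loser = b if winner == a else a
        pop[pop.index(loser)] = mutate(winner)
    return pop
```

Because only comparisons are needed, this style of search sidesteps reward-model calibration entirely, which is the property the abstract highlights.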
Open paper
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: high protocol signal.

Score: 71% High protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Multi Agent Math Coding
  • Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
  • Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own…
Open paper
DySCO: Dynamic Attention-Scaling Decoding for Long-Context Language Models

Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Across multiple instruction-tuned and reasoning models, DYSCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with…
Open paper

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
Open paper
GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning

Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Attention-Based SINR Estimation in User-Centric Non-Terrestrial Networks

Bruno De Filippo, Alessandro Guidotti, Alessandro Vanelli-Coralli · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
Open paper
CHESS: Context-aware Hierarchical Efficient Semantic Selection for Long-Context LLM Inference

Chao Fei, Guozhong Li, Chenxi Liu, Panos Kalnis · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Coding
  • Extensive evaluations demonstrate that CHESS surpasses Full-KV quality using only 1% of the KV cache, delivers low-latency stable inference with up to 4.56× higher throughput, and consistently outperforms other strong baselines.
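As a generic illustration of the KV-cache selection idea — retaining only a small, high-importance fraction of cached positions — here is a minimal top-k sketch. The importance scores and keep ratio are assumptions for illustration; CHESS's actual hierarchical semantic selection is more involved.

```python
# Hypothetical top-k KV-cache selection sketch: keep only the highest-
# scoring cached positions (e.g. by accumulated attention mass) and
# discard the rest. Not the paper's actual algorithm.
from typing import List

def select_kv(scores: List[float], keep_ratio: float) -> List[int]:
    """Return the indices of the cached positions to retain, keeping
    roughly `keep_ratio` of them ranked by importance score."""
    k = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # preserve original order of retained positions
```

Memory savings scale directly with the keep ratio, which is why a 1% cache retaining the right positions can still preserve answer quality.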
Open paper
ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning

Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 77% Moderate protocol signal Freshness: Warm Status: Fallback
Automatic Metrics Long Horizon Math
  • We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
  • It turns math problems into a controlled, correctness-checkable benchmark with tool sets, enabling systematic evaluation of model reliability under (1) large, overlapping tool catalogs and (2) the absence of the intended capability.
Open paper
Luna-2: Scalable Single-Token Evaluation with Small Language Models

Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 3/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 72% Moderate protocol signal Freshness: Warm Status: Fallback
Llm As Judge Automatic Metrics General
  • We present Luna-2, a novel architecture that adapts decoder-only small language models (SLMs) into a deterministic evaluation model to reliably compute complex task-specific LLMAJ metrics (e.g. …
  • Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x.
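The single-token evaluation idea — scoring a response from the probability of one verdict token instead of a generated rationale — can be illustrated with a two-token softmax. The token names and logits below are assumptions for the sketch, not Luna-2's actual evaluation head.

```python
# Sketch of single-token evaluation (assumptions, not Luna-2's
# implementation): the metric is the softmax probability of a "pass"
# verdict token against a "fail" token at a single decoding step.
import math
from typing import Dict

def single_token_score(logits: Dict[str, float],
                       pass_token: str = "yes",
                       fail_token: str = "no") -> float:
    """Softmax over the two verdict tokens' logits; returns P(pass)."""
    zp, zf = logits[pass_token], logits[fail_token]
    m = max(zp, zf)                      # stabilize the exponentials
    ep, ef = math.exp(zp - m), math.exp(zf - m)
    return ep / (ep + ef)
```

Because a single forward pass and one token position suffice, this style of judging is where the claimed 80x cost and 20x latency reductions over generate-then-parse evaluators would come from.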
Open paper
Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin · Feb 28, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 66% Moderate protocol signal Freshness: Warm Status: Fallback
Rubric Rating Automatic Metrics Medicine
  • Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.
Open paper
Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA

Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun, Linfang Shang · Feb 26, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics Law
  • We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for…
  • Experiments on an internal advertising QA dataset show consistent gains across expert-judged dimensions including accuracy, completeness, and safety, while reducing the hallucination rate by 72%.
Open paper
Revisiting Text Ranking in Deep Research

Chuan Meng, Litu Ou, Sean MacAvaney, Jeff Dalton · Feb 25, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • To tackle it, most prior work equips large language model (LLM)-based agents with opaque web search APIs, enabling agents to iteratively issue search queries, retrieve external evidence, and reason over it.
  • …passages), (ii) pipeline configurations (different retrievers, re-rankers, and re-ranking depths), and (iii) query characteristics (the mismatch between agent-issued queries and the training queries of text rankers).
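The pipeline axes listed above (retrievers, re-rankers, re-ranking depth) follow the standard retrieve-then-rerank pattern. The sketch below is a generic illustration with toy scoring functions, not the paper's specific setup; in practice the first stage would be a cheap retriever and the second a costlier cross-encoder re-ranker.

```python
# Generic retrieve-then-rerank pipeline (toy scorers for illustration).
from typing import Callable, List

def retrieve_then_rerank(query: str, corpus: List[str],
                         retrieve_score: Callable[[str, str], float],
                         rerank_score: Callable[[str, str], float],
                         depth: int) -> List[str]:
    """A cheap first-stage retriever narrows the corpus to `depth`
    candidates; the costlier re-ranker then orders only those."""
    candidates = sorted(corpus, key=lambda d: retrieve_score(query, d),
                        reverse=True)[:depth]
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)
```

The re-ranking depth is the key cost/quality knob the abstract calls out: a deeper candidate pool raises recall but multiplies re-ranker invocations.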
Open paper
Generative Pseudo-Labeling for Pre-Ranking with LLMs

Junyu Bi, Xinting Niu, Daixuan Cheng, Kun Yuan, Tao Wang, Binbin Cao · Feb 24, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
No One Size Fits All: QueryBandits for Hallucination Mitigation

Nicole Cho, William Watson, Alec Koppel, Sumitra Ganesh, Manuela Veloso · Feb 23, 2026

Citations: 0

Match reason: Keyword overlap 2/4 across title and protocol fields. Eval-signal density: moderate protocol signal.

Score: 61% Moderate protocol signal Freshness: Warm Status: Ready
Automatic Metrics General
  • Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Open paper
Eigenmood Space: Uncertainty-Aware Spectral Graph Analysis of Psychological Patterns in Classical Persian Poetry

Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 1/4 across title and protocol fields. Eval-signal density: sparse protocol signal.

Score: 40% Sparse protocol signal Freshness: Warm Status: Ready
General
  • The resulting framework supports scalable, auditable digital-humanities analysis while preserving interpretive caution by propagating uncertainty from verse-level evidence to poet-level inference.
Open paper
