Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 268 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,735) General (557) Long Horizon (344) Pairwise Preference (298) Coding (234) Simulation Env (201) Multi Agent (199) Medicine (119) Llm As Judge (113) Expert Verification (102) Human Eval (92) Rubric Rating (85) Web Browsing (84) Math (82) Demonstrations (73) Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

SPQ: An Ensemble Technique for Large Language Model Compression

Jiamin Yao, Eren Gultepe · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.

Open paper

Multi-Token Prediction via Self-Distillation

John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein · Feb 5, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning

Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su, Xiaoqing Wang · Feb 22, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training.

Open paper

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory

Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning

Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin, Bang Yang · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.

Open paper

Improving Sampling for Masked Diffusion Models via Information Gain

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Extensive evaluations across diverse architectures and tasks (reasoning, coding, creative writing, and image generation) demonstrate that Info-Gain Sampler consistently outperforms existing samplers for MDMs.

Open paper

Analyzing LLM Instruction Optimization for Tabular Fact Verification

Xiaotang Du, Giwon Hong, Wai-Chung Kwan, Rohit Saxena, Ivan Titov, Pasquale Minervini · Feb 20, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

We study three optimizers from the DSPy framework -- COPRO, MiPROv2, and SIMBA -- across four benchmarks and three model families.
We find that instruction optimization consistently improves verification accuracy, with MiPROv2 yielding the most stable gains for CoT, and SIMBA providing the largest benefits for ReAct agents, particularly at larger model scales.

Open paper

Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini · Feb 19, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Medicine

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.

Open paper

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Math

Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over…

Open paper

Quecto-V1: Empirical Analysis of 8-bit Quantized Small Language Models for On-Device Legal Retrieval

Subrit Dikshit · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics LawCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement

Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin, Dhruv Grewal · Feb 17, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
This study introduces the Linguistic eXtractor (LX), a fine-tuned, large language model trained on consumer-authored text that also has been labeled with consumers' self-reported ratings of 16 consumption-related emotions and four…

Open paper

Weight space Detection of Backdoors in LoRA Adapters

David Puertolas Merenciano, Ekaterina Vasyagina, Kevin Zhu, Javier Ferrando, Maheep Chaudhary · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech

Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang gu, Xiaobao Wang · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MedicineCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning

Qianyue Wang, Jinwu Hu, Huanxiang Lin, Bolin Chen, Zhiquan Wen, Yaofo Chen · Feb 16, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR) transforming LRMs'reasoning paradigm…

Open paper

Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Samir Abdaljalil, Erchin Serpedin, Hasan Kurban · Feb 15, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings.
Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error.

Open paper

AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum · Feb 12, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics Coding

Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference.
Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model…

Open paper

LLMs Know More About Numbers than They Can Say

Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MathCoding

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Open paper

From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Chunlin Chen · Feb 6, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 80% Moderate protocol signal Freshness: Warm Status: Ready

Automatic Metrics MedicineCoding

In the paper, we propose MA-RAG (Multi-Round Agentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic…
Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering substantial +6.8 points on average accuracy over the backbone model.

Open paper

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Citations: 0

Match reason: Keyword overlap 2/2 across title and protocol fields.

Score: 83% High protocol signal Freshness: Warm Status: Fallback

Automatic Metrics Multi Agent MathCoding

Existing Multi-Agent Systems (MAS) typically rely on homogeneous model configurations, failing to exploit the diverse expertise inherent in different post-trained architectures.
Team-of-Thoughts introduces two novel components: (1) Orchestrator Calibration, which identifies models with superior coordination and synthesis capabilities, and (2) Agent Self-Assessment, a protocol where tool agents profile their own…

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent