Researcher Tools

Human Feedback and Eval Paper Explorer

A focused feed for RLHF, preference data, rater protocols, agent evaluation, and LLM-as-judge research. Every paper includes structured metadata for quick triage.

Total papers: 16 Search mode: keyword Shortlist (0) RSS

Filter by tag

All Automatic Metrics (1,735) General (557) Long Horizon (344) Pairwise Preference (298) Coding (234) Simulation Env (201) Multi Agent (199) Medicine (119) Llm As Judge (113) Expert Verification (102) Human Eval (92) Rubric Rating (85) Web Browsing (84) Math (82) Demonstrations (73) Red Team (67)

Featured Papers

Popular high-signal papers with direct links to full protocol pages.

Spend Less, Fit Better: Budget-Efficient Scaling Law Fitting via Active Experiment Selection
Apr 24, 2026 · Citations: 0

Across a diverse benchmark of scaling-law tasks, our method consistently outperforms classical design-based baselines, and often approaches the performance of fitting on the full experimental set while using only about 10% of the total…
How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Apr 24, 2026 · Citations: 0

In this paper, we present the first systematic study of token consumption patterns in agentic coding tasks.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Apr 24, 2026 · Citations: 0

Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities.
Relaxation-Informed Training of Neural Network Surrogate Models
Apr 24, 2026 · Citations: 0

Experiments on non-convex benchmark functions and a two-stage stochastic programming problem with quantile neural network surrogates demonstrate that the proposed regularizers can reduce MILP solve times by up to four orders of magnitude…
An Undecidability Proof for the Plan Existence Problem
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Neural Recovery of Historical Lexical Structure in Bantu Languages from Modern Data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Zero-Shot Morphological Discovery in Low-Resource Bantu Languages via Cross-Lingual Transfer and Unsupervised Clustering
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Aligning Dense Retrievers with LLM Utility via DistillationAligning Dense Retrievers with LLM Utility via Distillation
Apr 24, 2026 · Citations: 0

On the QASPER benchmark, UAE improves retrieval Recall@1 by 30.59%, MAP by 30.16% and Token F1 by 17.3% over the strong semantic baseline BGE-Base.
Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
Time-Localized Parametric Decomposition of Respiratory Airflow for Sub-Breath Analysis
Apr 24, 2026 · Citations: 0

Evaluation across 8,276 breaths demonstrates high reconstruction accuracy (mean squared error < 0.001 for four-component models) and robust parameter precision under moderate noise.
CRAFT: Clustered Regression for Adaptive Filtering of Training data
Apr 24, 2026 · Citations: 0

Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.

Browse by Topic

Jump directly into tag and hub pages to crawl deeper content clusters.

Top Protocol Hubs

Start Here By Objective

Pick your immediate research objective and jump directly to high-signal pages, not generic search.

Benchmark Selection

Find papers with explicit benchmark anchors and comparable metric reporting.

Rater Protocol Design

Compare pairwise, rubric, and expert-verification setups before drafting your protocol.

LLM-as-Judge Setup

Start with established judge pipelines and then compare with human-eval references.

Scale Your Evaluation Team

Need human evaluators for your benchmark or preference study? OpenTrain sources pre-vetted domain experts into your annotation pipeline.

See How It Works →

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models

José Pombal, Ricardo Rei, André F. T. Martins · Apr 8, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 65% Moderate protocol signal Freshness: Hot Status: Ready

Pairwise PreferenceRubric Rating Llm As Judge Medicine

We present the first study of SPB in rubric-based evaluation, an increasingly popular benchmarking paradigm where judges issue binary verdicts on individual evaluation criteria, instead of assigning holistic scores or rankings.
Using IFEval, a benchmark with programmatically verifiable rubrics, we show that SPB persists even when evaluation criteria are entirely objective: among rubrics where generators fail, judges can be up to 50\% more likely to incorrectly…

Open paper

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen · Mar 27, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric RatingExpert Verification Automatic Metrics LawMedicine

To bridge this gap, we present XpertBench, a high-fidelity benchmark engineered to assess LLMs across authentic professional domains.
To facilitate scalable yet human-aligned assessment, we introduce ShotJudge, a novel evaluation paradigm that employs LLM judges calibrated with expert few-shot exemplars to mitigate self-rewarding biases.

Open paper

OMIND: Framework for Knowledge Grounded Finetuning and Multi-Turn Dialogue Benchmark for Mental Health LLMs

Suraj Racha, Prashant Harish Joshi, Utkarsh Maurya, Nitin Yadav, Mridul Sharma, Ananya Kunisetty · Mar 26, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Medicine

We highlight three primary challenges for LLMs in mental health - lack of high quality interpretable and knowledge grounded training data; training paradigms restricted to core capabilities, and evaluation of multi turn dialogue settings.
Addressing it, we present oMind framework which includes training and aligning LLM agents for diverse capabilities including conversations; high quality ~164k multi-task SFT dataset, as a result of our generation pipeline based on…

Open paper

QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang · Mar 14, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Ready

Rubric Rating Automatic Metrics Medicine

To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment.
During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading.

Open paper

PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Automatic Metrics Medicine

We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses.
Across 7 benchmarks and 5 task models, PEEM's accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001).

Open paper

ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Monica Munnangi, Saiph Savage · Mar 11, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Llm As Judge Medicine

We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns.
We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician…

Open paper

ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Rubric Rating Llm As Judge Medicine

However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows.
We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts.

Open paper

MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang · Feb 13, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Ready

Pairwise PreferenceRubric Rating Long Horizon Medicine

MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities.
For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision…

Open paper

MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation

Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng · Mar 4, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Fallback

Rubric Rating Automatic Metrics Long Horizon Medicine

Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract…

Open paper

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard · Mar 2, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% High protocol signal Freshness: Warm Status: Fallback

Rubric RatingExpert Verification Llm As JudgeAutomatic Metrics Medicine

Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.

Open paper

Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Yucheng Chu, Hang Li, Kaiqi Yang, Yasemin Copur-Gencturk, Joseph Krajcik, Namsoo Shin · Feb 28, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 58% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric Rating Automatic Metrics Medicine

Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods.

Open paper

Decomposing Physician Disagreement in HealthBench

Satya Borgohain, Roy Mariathas · Feb 26, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 55% Moderate protocol signal Freshness: Warm Status: Fallback

Rubric Rating Medicine

We decompose physician disagreement in the HealthBench medical AI evaluation dataset to understand where variance resides and what observable features can explain it.
The agreement ceiling in medical AI evaluation is thus largely structural, but the reducible/irreducible dissociation suggests that closing information gaps in evaluation scenarios could lower disagreement where inherent clinical ambiguity…

Open paper

From Evidence-Based Medicine to Knowledge Graph: Retrieval-Augmented Generation for Sports Rehabilitation and a Domain Benchmark

Jinning Zhang, Jie Song, Wenhui Tu, Zecheng Li, Jingxuan Li, Jin Li · Jan 1, 2026

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready

Rubric RatingExpert Verification Automatic Metrics Medicine

Validated in sports rehabilitation, we release a knowledge graph (357,844 nodes, 371,226 edges) and a benchmark of 1,637 QA pairs.
Five expert clinicians rated the system 4.66--4.84 on a 5-point Likert scale, and system rankings are preserved on a human-verified gold subset (n=80).

Open paper

Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework

Joe Watson, Ivan O'Connor, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell · Oct 9, 2025

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 53% Moderate protocol signal Freshness: Cold Status: Ready

Rubric Rating Automatic MetricsSimulation Env Medicine

This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data.

Open paper

A Scalable Framework for Evaluating Health Language Models

Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow, Nova Hammerquist · Mar 30, 2025

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 53% High protocol signal Freshness: Cold Status: Ready

Rubric RatingExpert Verification Automatic Metrics Medicine

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
In this work, we introduce Adaptive Precise Boolean rubrics: an evaluation framework that streamlines human and automated evaluation of open-ended questions by identifying gaps in model responses using a minimal set of targeted rubrics…

Open paper

Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince · Sep 29, 2025

Citations: 0

Match reason: Matches selected tags (Medicine, Rubric Rating).

Score: 46% Sparse protocol signal Freshness: Cold Status: Fallback

Rubric Rating Medicine

Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards.
We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness.

Open paper

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now

Human Feedback and Eval Paper Explorer

Filter by tag

Featured Papers

Browse by Topic

Popular Tags

Top Protocol Hubs

Weekly Eval Paper Digest

Start Here By Objective

Benchmark Selection

Rater Protocol Design

LLM-as-Judge Setup

Scale Your Evaluation Team

Protocol Hubs

Benchmark Hubs

Metric Hubs

Daily Archives

Join the #1 Platform for AI Training Talent