HFEPX Hub

Human Eval Papers (Last 30 Days)

Updated from current HFEPX corpus (Apr 27, 2026). 17 papers are grouped in this hub page.

Read Full Context

Updated from current HFEPX corpus (Apr 27, 2026). 17 papers are grouped in this hub page. Common evaluation modes: Human Eval, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Multi Dim Rubric. Frequent quality control: Adjudication. Frequently cited benchmark: Insightbench. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Mar 31, 2026.

Papers: 17 Last published: Mar 31, 2026 Global RSS Tag RSS

Human EvalLast 30d

Researcher Quick Triage

This hub is best used for protocol triage and replication planning from abstract-level evidence. Quality band: Developing .

All Sampled Papers (17) Replication-Ready Only (2)

High-Signal Coverage

100.0%

17 / 17 sampled papers are not low-signal flagged.

Replication-Ready Set

Benchmark + metric + eval mode explicitly present.

Judge/Human Comparability

Papers containing both `human_eval` and `llm_as_judge`.

2 papers are replication-ready (benchmark + metric + explicit evaluation mode).
0 papers support judge-vs-human agreement analysis.
4 papers report explicit quality controls (calibration/adjudication/IAA).

Primary action: Use this page for scouting only; collect additional papers before attempting replication-critical comparisons.

Need evaluators for this research workflow?

Post a Job →

Why This Matters For Eval Research

40% of papers report explicit human-feedback signals, led by rubric ratings.
human evaluation appears in 58.8% of papers in this hub.
Insightbench is a recurring benchmark anchor for cross-paper comparisons in this page.

Protocol Takeaways

Most common quality-control signal is adjudication (11.8% of papers).
Rater context is mostly domain experts, and annotation is commonly multi-dimensional rubrics; use this to scope replication staffing.
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Benchmark Interpretation

Insightbench appears in 10% of hub papers (1/17); use this cohort for benchmark-matched comparisons.
Rewardbench appears in 10% of hub papers (1/17); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 50% of hub papers (5/17); compare with a secondary metric before ranking methods.
agreement is reported in 20% of hub papers (2/17); compare with a secondary metric before ranking methods.

Researcher Checklist (Expanded)

Researcher Checklist

Moderate: Papers with explicit human feedback

Coverage is usable but incomplete (40% vs 45% target).
Strong: Papers reporting quality controls

Coverage is strong (30% vs 30% target).
Gap: Papers naming benchmarks/datasets

Coverage is a replication risk (20% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (80% vs 35% target).
Strong: Papers with known rater population

Coverage is strong (50% vs 35% target).
Moderate: Papers with known annotation unit

Coverage is usable but incomplete (30% vs 35% target).

Strengths

Quality-control evidence appears in 30% of papers.
Agentic evaluation appears in 30% of papers.

Known Gaps

No dominant metadata gap detected in current extraction coverage.

Suggested Next Analyses

Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.
Stratify by benchmark (Insightbench vs Rewardbench) before comparing methods.
Track metric sensitivity by reporting both accuracy and agreement.

Recommended Queries (Expanded)

Recommended Queries

Human Eval Protocols Benchmark Slice: Insightbench Metric Slice: accuracy IAA-Reported Evaluations Recent High-Signal Papers

Start with These 3

Use these when you need one protocol anchor, one benchmark anchor, and one recent comparison point before reading the wider hub.

Strongest protocol reference

Personalized RewardBench: Evaluating Reward Models with Human Aligned…

Highest protocol score with explicit human/eval signal plus Rewardbench.

Strongest benchmark reference

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects…

Reported benchmark with kappa gives a fast comparison anchor.

Strongest recent paper

CounselReflect: A Toolkit for Auditing Mental-Health Dialogues

Useful for current practice scanning; published Mar 31, 2026.

Start Here (Best First 6)

Ranked for protocol completeness (human signal, benchmark + metric anchors, quality controls, and judge/human overlap).

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Apr 8, 2026 · Citations: 0 · Score: 7.5

HF: Pairwise Preference, Rubric Rating · Eval: Human Eval, Automatic Metrics · Benchmark: Rewardbench · Metric: Accuracy
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Mar 31, 2026 · Citations: 0 · Score: 7.0

HF: Rubric Rating · Eval: Human Eval · Benchmark: Not Reported · Metric: Kappa
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Mar 31, 2026 · Citations: 0 · Score: 5.5

HF: Rubric Rating, Expert Verification · Eval: Human Eval · Benchmark: Not Reported · Metric: Not Reported
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Mar 29, 2026 · Citations: 0 · Score: 5.5

HF: Expert Verification · Eval: Human Eval, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Apr 7, 2026 · Citations: 0 · Score: 5.5

HF: Not reported · Eval: Human Eval · Benchmark: Insightbench · Metric: Recall
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Apr 6, 2026 · Citations: 0 · Score: 5.0

HF: Not reported · Eval: Human Eval, Automatic Metrics · Benchmark: Not Reported · Metric: Accuracy

Protocol Matrix (Top 12)

Use this to quickly compare protocol ingredients instead of scanning long prose.

Paper	HF Signal	Eval Modes	Benchmarks	Metrics	QC
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization Apr 8, 2026	Yes Pairwise Preference , Rubric Rating	Human Eval , Automatic Metrics	Rewardbench	Accuracy , Helpfulness	Not Reported
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias Mar 31, 2026	Yes Rubric Rating	Human Eval	Not Reported	Kappa , Agreement	Inter Annotator Agreement Reported , Adjudication
CounselReflect: A Toolkit for Auditing Mental-Health Dialogues Mar 31, 2026	Yes Rubric Rating , Expert Verification	Human Eval	Not Reported	Not Reported	Adjudication
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning Mar 29, 2026	Yes Expert Verification	Human Eval , Automatic Metrics	Not Reported	Accuracy	Not Reported
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling Apr 7, 2026	No Not Reported	Human Eval	Insightbench	Recall	Not Reported
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework Apr 6, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Accuracy , Agreement	Inter Annotator Agreement Reported
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization Mar 31, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Bleu , Rouge	Not Reported
Learning to Predict Future-Aligned Research Proposals with Language Models Mar 28, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Accuracy	Not Reported
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality Apr 8, 2026	No Not Reported	Human Eval , Automatic Metrics	Not Reported	Accuracy	Not Reported
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill Apr 8, 2026	No Not Reported	Human Eval , Simulation Env	Not Reported	Not Reported	Not Reported

Protocol Diff (Top Papers)

Fast side-by-side comparison for the highest-ranked papers in this hub.

Signal	Personalized RewardBench: Evaluating Reward Models…	LLM Essay Scoring Under Holistic and Analytic Rubri…	CounselReflect: A Toolkit for Auditing Mental-Healt…
Human Feedback	Pairwise Preference, Rubric Rating	Rubric Rating	Rubric Rating, Expert Verification
Evaluation Modes	Human Eval, Automatic Metrics	Human Eval	Human Eval
Benchmarks	Rewardbench	Not reported	Not reported
Metrics	Accuracy, Helpfulness	Kappa, Agreement	Not reported
Quality Controls	Not reported	Inter Annotator Agreement Reported, Adjudication	Adjudication
Rater Population	Unknown	Unknown	Domain Experts
Annotation Unit	Pairwise	Multi Dim Rubric	Multi Dim Rubric

Research Utility Snapshot

Human Feedback Mix

Rubric Rating (3)
Expert Verification (2)
Pairwise Preference (1)

Evaluation Modes

Human Eval (10)
Automatic Metrics (6)
Simulation Env (1)

Top Benchmarks

Insightbench (1)
Rewardbench (1)

Top Metrics

Accuracy (5)
Agreement (2)
Bleu (1)
Helpfulness (1)

Rater Population Mix

Domain Experts (4)
Mixed (1)

Quality Controls

Adjudication (2)
Inter Annotator Agreement Reported (2)

Coverage diagnostics (sample-based): human-feedback 23.5% · benchmarks 11.8% · metrics 52.9% · quality controls 23.5%.

Top Papers

CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0

Rubric RatingExpert Verification Human Eval Web Browsing

The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0

Rubric Rating Human Eval

We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
Qiyao Ma, Dechen Gao, Rui Cai, Boqi Zhao, Hanchu Zhou · Apr 8, 2026 · Citations: 0

Pairwise PreferenceRubric Rating Human EvalAutomatic Metrics

Pluralistic alignment has emerged as a critical frontier in the development of Large Language Models (LLMs), with reward models (RMs) serving as a central mechanism for capturing diverse human values.
Improving Clinical Diagnosis with Counterfactual Multi-Agent Reasoning
Zhiwen You, Xi Chen, Aniket Vashishtha, Simo Du, Gabriel Erion-Barner · Mar 29, 2026 · Citations: 0

Expert Verification Human EvalAutomatic Metrics Multi Agent

In this work, we propose a counterfactual multi-agent diagnostic framework inspired by clinician training that makes hypothesis testing explicit and evidence-grounded.
DataSTORM: Deep Research on Large-Scale Databases using Exploratory Data Analysis and Data Storytelling
Shicheng Liu, Yucheng Jiang, Sajid Farook, Camila Nicollier Sanchez, David Fernando Castro Pena · Apr 7, 2026 · Citations: 0

Human Eval Long Horizon

Deep research with Large Language Model (LLM) agents is emerging as a powerful paradigm for multi-step information discovery, synthesis, and analysis.
Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
Jiling Zhou, Aisvarya Adeseye, Seppo Virtanen, Antti Hakkala, Jouni Isoaho · Apr 6, 2026 · Citations: 0

Human EvalAutomatic Metrics

However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation.
Less Is More? Selective Visual Attention to High-Importance Regions for Multimodal Radiology Summarization
Mst. Fahmida Sultana Naznin, Adnan Ibney Faruq, Mushfiqur Rahman, Niloy Kumar Mondal, Md. Mehedi Hasan Shawon · Mar 31, 2026 · Citations: 0

Human EvalAutomatic Metrics

Through controlled ablations on MIMIC-CXR benchmark, we show that selectively focusing on pathology-relevant visual patches rather than full images yields substantially better performance.
Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0

Human EvalAutomatic Metrics

Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
How Long Reasoning Chains Influence LLMs' Judgment of Answer Factuality
Minzhu Tu, Shiyu Ni, Keping Bi · Apr 8, 2026 · Citations: 0

Human EvalAutomatic Metrics

Large language models (LLMs) has been widely adopted as a scalable surrogate for human evaluation, yet such judges remain imperfect and susceptible to surface-level biases.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill
Xuanbo Su, Wenhao Hu, Haibo Su, Yunzhang Chen, Le Zhan · Apr 8, 2026 · Citations: 0

Human EvalSimulation Env

We introduce SalesLLM benchmark, a bilingual (ZH/EN) benchmark derived from realistic applications covering Financial Services and Consumer Goods, built from 30,074 scripted configurations and 1,805 curated multi-turn scenarios with…
Dharma, Data and Deception: An LLM-Powered Rhetorical Analysis of Cow-Urine Health Claims on YouTube
Sheza Munir, Ratna Kandala, Anamta Khan, Deepti, Joyojeet Pal · Apr 24, 2026 · Citations: 0

Human Eval

Human evaluation of a subset of annotations yielded 90.1\% inter-annotator agreement, confirming the reliability of our taxonomy and validation process.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
Gabriel Stefan, Adrian-Marius Dumitran · Apr 9, 2026 · Citations: 0

Human Eval

We propose an agentic evaluation architecture comprising a multimodal screening agent, a heterogeneous jury of five evaluative agents, and a meta-agent for verdict synthesis and human escalation.
STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems
Hongru Ji, Yuyin Fan, Meng Zhao, Xianghua Li, Lianwei Wu · Apr 8, 2026 · Citations: 0

Human Eval

To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with…
PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han · Apr 2, 2026 · Citations: 0

Human Eval

Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
Logarithmic Scores, Power-Law Discoveries: Disentangling Measurement from Coverage in Agent-Based Evaluation
HyunJoon Jung, William Na · Apr 1, 2026 · Citations: 0

Human Eval

LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed?
ContextClaim: A Context-Driven Paradigm for Verifiable Claim Detection
Yufeng Li, Rrubaa Panchendrarajan, Arkaitz Zubiaga · Mar 31, 2026 · Citations: 0

Human Eval

Through component analysis, human evaluation, and error analysis, we further examine when and why the retrieved context contributes to more reliable verifiability judgments.
Open Machine Translation for Esperanto
Ona de Gibert, Lluís de Gibert · Mar 31, 2026 · Citations: 0

Human Eval

In this work, we present the first comprehensive evaluation of open-source MT systems for Esperanto, comparing rule-based systems, encoder-decoder models, and LLMs across model sizes.

Related Hubs

Get Started

Join the #1 Platform for AI Training Talent

Where top AI builders and expert AI Trainers connect to build the future of AI.

Self-Service

Post a Job

Post your project and get a shortlist of qualified AI Trainers and Data Labelers. Hire and manage your team in the tools you already use.

Create Account & Post a Job

Managed Service

For Large Projects

Done-for-You

We recruit, onboard, and manage a dedicated team inside your tools. End-to-end operations for large or complex projects.

Learn About Managed Service

For Freelancers

Join as an AI Trainer

Find AI training and data labeling projects across platforms, all in one place. One profile, one application process, more opportunities.

Join Now