
HFEPX Metric Hub

Agreement + Rubric Rating Metric Papers


Updated from the current HFEPX corpus (Apr 5, 2026). This metric page groups 10 papers. Common evaluation modes: Automatic Metrics and Human Eval. Most common rater population: Domain Experts. Most common annotation unit: multi-dimensional rubric. Most frequent quality control: inter-annotator agreement reporting. Frequently cited benchmark: HealthBench. Most common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Mar 31, 2026.

Papers: 10 · Last published: Mar 31, 2026

Researcher Quick Triage

Use this page to compare metric behavior across protocols and benchmarks before selecting your reporting stack. Quality band: Developing.

Metric Coverage

90.0%

9 sampled papers include metric names.

Benchmark Anchoring

20.0%

2 papers include explicit dataset/benchmark anchors for fair comparison.

Quality Controls

50.0%

5 papers report calibration/adjudication/IAA controls.

  • None of the 10 papers in this sample is flagged as low-signal.
  • Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Primary action: Treat this as a directional signal only; metric reporting is present, but benchmark anchoring is still thin. A minimal setup-matching sketch follows.
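To make "incompatible eval setups" concrete, below is a minimal setup-matching sketch: group papers by (eval mode, benchmark) and only compare metrics that every paper in a group shares. The record layout and field names (title, metrics, benchmark, eval_mode) are illustrative assumptions, not the hub's export schema.

```python
from collections import defaultdict

# Illustrative records mirroring the protocol matrix below; the field names are
# assumptions for this sketch, not the hub's actual export schema.
papers = [
    {"title": "MENLO",
     "metrics": ["agreement"], "benchmark": None, "eval_mode": "automatic_metrics"},
    {"title": "A Scalable Framework for Evaluating Health Language Models",
     "metrics": ["accuracy", "agreement"], "benchmark": None, "eval_mode": "automatic_metrics"},
    {"title": "Decomposing Physician Disagreement in HealthBench",
     "metrics": [], "benchmark": "healthbench", "eval_mode": None},
]

# Group by (eval mode, benchmark): only papers sharing both keys are compared,
# which avoids mixing incompatible protocols in one ranking.
groups = defaultdict(list)
for paper in papers:
    groups[(paper["eval_mode"], paper["benchmark"])].append(paper)

for setup, members in groups.items():
    if len(members) < 2:
        continue  # nothing to compare within this setup
    shared = set.intersection(*(set(m["metrics"]) for m in members))
    print(setup, "->", [m["title"] for m in members], "| shared metrics:", sorted(shared))
```

The same grouping key can be extended with rater population or annotation unit when those fields matter for a given comparison.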

Why This Matters (Expanded)

Why This Matters For Eval Research

  • 100% of papers report explicit human-feedback signals, led by rubric ratings.
  • Automatic metrics appear in 50% of papers in this hub.
  • HealthBench serves as a benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

  • 1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.
  • The most common quality-control signal is inter-annotator agreement reporting (50% of papers).
  • Raters are mostly domain experts and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing (a minimal agreement sketch follows this list).
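As a concrete illustration of the inter-annotator agreement signal above, here is a minimal pure-Python sketch of mean pairwise Cohen's kappa per rubric dimension for a small expert panel; the rater names, rubric dimensions, and scores are invented.

```python
from itertools import combinations

def cohen_kappa(a, b):
    """Unweighted Cohen's kappa between two raters' categorical labels."""
    labels = sorted(set(a) | set(b))
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each rater's marginal label distribution.
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Invented ratings: three expert raters scoring six items on two rubric
# dimensions, each on a 1-5 scale.
ratings = {
    "accuracy_dim":  {"rater_a": [4, 5, 3, 4, 2, 5],
                      "rater_b": [4, 4, 3, 4, 2, 5],
                      "rater_c": [5, 5, 3, 3, 2, 4]},
    "coherence_dim": {"rater_a": [3, 4, 4, 5, 3, 4],
                      "rater_b": [3, 4, 4, 4, 3, 4],
                      "rater_c": [2, 4, 5, 5, 3, 3]},
}

for dim, by_rater in ratings.items():
    pairs = [cohen_kappa(by_rater[r1], by_rater[r2])
             for r1, r2 in combinations(sorted(by_rater), 2)]
    print(f"{dim}: mean pairwise kappa = {sum(pairs) / len(pairs):.2f}")
```

For ordinal rubric scales, a weighted kappa or Krippendorff's alpha is usually the better reporting choice; the unweighted version here only shows the shape of the computation.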

Metric Interpretation

  • Agreement is reported in 100% of hub papers (10/10); compare it with a secondary metric before ranking methods (a toy example follows this list).
  • Accuracy is reported in 20% of hub papers (2/10); compare it with a secondary metric before ranking methods.
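As a toy illustration (invented labels) of why a single agreement number should not rank methods on its own: judge-human raw agreement can look strong while accuracy against an adjudicated gold set tells a different story.

```python
# Invented labels for ten items: an LLM judge, a human rater, and an
# adjudicated gold reference.
judge = ["good", "good", "good", "good", "good", "bad", "good", "good", "good", "good"]
human = ["good", "good", "good", "good", "good", "good", "good", "good", "good", "bad"]
gold  = ["good", "bad", "good", "bad", "good", "good", "good", "bad", "good", "bad"]

def raw_agreement(a, b):
    """Fraction of items on which two label lists agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print("judge-human raw agreement:", raw_agreement(judge, human))  # 0.8 -- looks strong
print("judge accuracy vs gold:   ", raw_agreement(judge, gold))   # 0.5 -- much weaker
print("human accuracy vs gold:   ", raw_agreement(human, gold))   # 0.7
```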

Benchmark Context

  • HealthBench appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.
  • Interaction2eval appears in 10% of hub papers (1/10); use this cohort for benchmark-matched comparisons.


Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Each entry lists: Metrics · Benchmarks · Eval modes · Quality controls.

LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias (Mar 31, 2026)
  Metrics: Kappa, Agreement · Benchmarks: not reported · Eval modes: Human Eval · Quality controls: Inter-Annotator Agreement Reported, Adjudication

More Human, More Efficient: Aligning Annotations with Quantized SLMs (Apr 1, 2026)
  Metrics: Agreement · Benchmarks: not reported · Eval modes: Automatic Metrics · Quality controls: Inter-Annotator Agreement Reported, Adjudication

When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools (Mar 25, 2026)
  Metrics: Agreement, Cost · Benchmarks: Interaction2eval · Eval modes: Automatic Metrics · Quality controls: not reported

From Intuition to Calibrated Judgment: A Rubric-Based Expert-Panel Study of Human Detection of LLM-Generated Korean Text (Jan 6, 2026)
  Metrics: Accuracy, Agreement · Benchmarks: not reported · Eval modes: Automatic Metrics · Quality controls: Calibration, Inter-Annotator Agreement Reported

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge (Mar 11, 2026)
  Metrics: Spearman · Benchmarks: not reported · Eval modes: LLM-as-Judge · Quality controls: not reported

MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages (Sep 30, 2025)
  Metrics: Agreement · Benchmarks: not reported · Eval modes: Automatic Metrics · Quality controls: Inter-Annotator Agreement Reported

Evaluating Austrian A-Level German Essays with Large Language Models for Automated Essay Scoring (Mar 6, 2026)
  Metrics: Agreement · Benchmarks: not reported · Eval modes: Human Eval · Quality controls: not reported

HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue (Jan 9, 2026)
  Metrics: Agreement · Benchmarks: not reported · Eval modes: Human Eval, LLM-as-Judge · Quality controls: not reported

A Scalable Framework for Evaluating Health Language Models (Mar 30, 2025)
  Metrics: Accuracy, Agreement · Benchmarks: not reported · Eval modes: Automatic Metrics · Quality controls: Inter-Annotator Agreement Reported

Decomposing Physician Disagreement in HealthBench (Feb 26, 2026)
  Metrics: not reported · Benchmarks: HealthBench · Eval modes: not reported · Quality controls: not reported

Researcher Workflow (Detailed)

Checklist

  • Strong: Papers with explicit human feedback

    Coverage is strong (100% vs 45% target).

  • Strong: Papers reporting quality controls

    Coverage is strong (50% vs 30% target).

  • Gap: Papers naming benchmarks/datasets

    Coverage is a replication risk (20% vs 35% target).

  • Strong: Papers naming evaluation metrics

    Coverage is strong (100% vs 35% target).

  • Strong: Papers with known rater population

    Coverage is strong (70% vs 35% target).

  • Strong: Papers with known annotation unit

    Coverage is strong (100% vs 35% target).

Strengths

  • Strong human-feedback signal (100% of papers).
  • Quality-control evidence appears in 50% of papers.
  • Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

  • Benchmark/dataset anchoring is the main gap: only 20% of papers name an explicit benchmark (below the 35% target), which limits cross-paper comparability.

Suggested Next Analyses

  • Compare papers that report both human evaluation and LLM-as-judge to quantify judge-human agreement drift (a minimal sketch follows this list).
  • Stratify by benchmark (HealthBench vs Interaction2eval) before comparing methods.
  • Track metric sensitivity by reporting both agreement and accuracy.
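A minimal sketch combining the first two suggestions: compute judge-human agreement within each benchmark stratum and report the gap across strata as drift. The item records and field names are invented for illustration.

```python
from collections import defaultdict

# Invented per-item labels from a human rater and an LLM judge, tagged with the
# benchmark each item came from; the field names are illustrative only.
items = [
    {"benchmark": "healthbench",      "human": "pass", "judge": "pass"},
    {"benchmark": "healthbench",      "human": "fail", "judge": "pass"},
    {"benchmark": "healthbench",      "human": "pass", "judge": "pass"},
    {"benchmark": "interaction2eval", "human": "pass", "judge": "fail"},
    {"benchmark": "interaction2eval", "human": "fail", "judge": "fail"},
    {"benchmark": "interaction2eval", "human": "fail", "judge": "pass"},
]

# Judge-human agreement within each benchmark stratum; a large gap between
# strata is the drift worth reporting before pooling results.
matches = defaultdict(list)
for item in items:
    matches[item["benchmark"]].append(item["human"] == item["judge"])

rates = {benchmark: sum(hits) / len(hits) for benchmark, hits in matches.items()}
for benchmark, rate in rates.items():
    print(f"{benchmark}: judge-human agreement = {rate:.2f}")
print(f"drift (max - min across strata): {max(rates.values()) - min(rates.values()):.2f}")
```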

Known Limitations
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
  • Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.
Research Utility Snapshot (Detailed)

Top Metrics

  • Agreement (10)
  • Accuracy (2)
  • Cost (2)
  • Kappa (1)

Evaluation Modes

  • Automatic Metrics (5)
  • Human Eval (3)
  • LLM-as-Judge (2)

Top Benchmarks

  • HealthBench (1)
  • Interaction2eval (1)
