Metric Hub

Agreement + Automatic Metrics Metric Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 11 papers. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Multi-Dimensional Rubric. Frequent quality control: Inter-Annotator Agreement Reported. Named benchmarks include ContentBench. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 11 · Last published: Feb 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers in the Agreement + Automatic Metrics metric hub. Protocol signals are automatic metrics (all 11 papers), human evaluation (1 paper), and simulation environments (1 paper). The named benchmarks are ContentBench and GSM8K (1 paper each), and the most reported metrics are agreement and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

27.3% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Multi-Objective Alignment of Language Models for Personalized Psychotherapy; A Scalable Framework for Evaluating Health Language Models; GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Automatic metrics appear in 100% of papers in this hub.

Evidence: GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Can Large Language Models Replace Human Coders? Introducing ContentBench; Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

ContentBench is the benchmark anchor for cross-paper comparisons on this page, although only 1 of 11 papers names it.

Evidence: Can Large Language Models Replace Human Coders? Introducing ContentBench; GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

Protocol Takeaways

The most common quality-control signal is inter-annotator agreement reporting (36.4% of papers).

Evidence: GATES: Self-Distillation under Privileged Context with Consensus Gating; Revisiting Northrop Frye's Four Myths Theory with Large Language Models; Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language; Incentive-Aligned Multi-Source LLM Summaries

Where rater context is reported (18.2% of papers), it is mostly domain experts, and annotation commonly uses multi-dimensional rubrics; use this to scope replication staffing.

Evidence: Multi-Objective Alignment of Language Models for Personalized Psychotherapy; A Scalable Framework for Evaluating Health Language Models; GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Pair this hub with the llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language; GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Can Large Language Models Replace Human Coders? Introducing ContentBench

Benchmark Interpretation

ContentBench appears in 9.1% of hub papers (1/11); use this paper for benchmark-matched comparisons.
GSM8K appears in 9.1% of hub papers (1/11); use this paper for benchmark-matched comparisons.

Metric Interpretation

Agreement is reported in 100% of hub papers (11/11); compare with a secondary metric before ranking methods.
Accuracy is reported in 63.6% of hub papers (7/11); compare with a secondary metric before ranking methods.
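
As an illustration of why a secondary metric helps, the sketch below contrasts raw percent agreement with Cohen's kappa, a standard chance-corrected agreement statistic. The source does not say which agreement statistic each paper reports, so the choice of kappa, the function names, and the example labels are all assumptions made for this example.

```python
from collections import Counter


def raw_agreement(a, b):
    """Fraction of items on which two raters give the same label."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(a)
    p_o = raw_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected if each rater labelled independently at their own base rates.
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined here (0/0); returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: a human coder and an LLM judge on 10 items, skewed toward "pos".
human = ["pos"] * 8 + ["neg"] * 2
judge = ["pos"] * 9 + ["neg"] * 1

print(round(raw_agreement(human, judge), 2))  # 0.9
print(round(cohens_kappa(human, judge), 2))   # 0.62
```

With skewed labels, 90% raw agreement corresponds to a kappa of about 0.62, which is why an agreement figure alone can overstate how well two raters align.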

Researcher Checklist

Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (27.3% vs 45% target).
Maintain strength on Papers reporting quality controls. Coverage is strong (45.5% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (18.2% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (18.2% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (27.3% vs 35% target).
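
The status labels in this checklist can be reproduced with a small threshold rule. The sketch below is one possible reading, not the hub's documented logic: the paper counts are inferred from the reported percentages over 11 papers, and the 0.6 cutoff separating "usable but incomplete" from "replication risk" is an assumed value chosen only because it matches the six rows above.

```python
def coverage_pct(count, total):
    """Share of papers as a percentage, rounded to one decimal place."""
    return round(100 * count / total, 1)


def coverage_status(pct, target, risk_ratio=0.6):
    """Label coverage against a target percentage.

    risk_ratio is an ASSUMED cutoff (not stated in the source): coverage below
    risk_ratio * target is flagged as a replication risk.
    """
    if pct >= target:
        return "strong"
    if pct >= risk_ratio * target:
        return "usable but incomplete"
    return "replication risk"


TOTAL = 11
# (criterion, paper count inferred from the reported percentage, target %)
rows = [
    ("explicit human feedback", 3, 45),
    ("quality controls", 5, 30),
    ("benchmarks/datasets", 2, 35),
    ("evaluation metrics", 11, 35),
    ("known rater population", 2, 35),
    ("known annotation unit", 3, 35),
]

for name, count, target in rows:
    pct = coverage_pct(count, TOTAL)
    print(f"{name}: {pct}% vs {target}% target -> {coverage_status(pct, target)}")
```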

Known Limitations

Rater population is under-specified (18.2% coverage).
Benchmark coverage is thin (18.2% of papers mention benchmarks/datasets).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: ContentBench - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: agreement - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=10

1 paper uses both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=1, left_only=10, right_only=0

1 paper uses both Automatic Metrics and Simulation Env.

human_eval vs simulation_env

both=0, left_only=1, right_only=1

0 papers use both Human Eval and Simulation Env.
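
The pairwise counts above are plain set arithmetic, where "left" is the first-named mode and "right" the second. The sketch below shows one way to compute them. The paper IDs `p1` to `p11` and the `overlap` helper are hypothetical; the mode assignments are chosen only to mirror the counts reported on this page.

```python
def overlap(left, right):
    """Return (both, left_only, right_only) counts for two sets of paper IDs."""
    left, right = set(left), set(right)
    return len(left & right), len(left - right), len(right - left)


# Hypothetical paper IDs; assignments mirror the counts on this page.
all_papers = {f"p{i}" for i in range(1, 12)}
modes = {
    "human_eval": {"p9"},
    "automatic_metrics": all_papers,
    "simulation_env": {"p4"},
}

pairs = [
    ("human_eval", "automatic_metrics"),
    ("automatic_metrics", "simulation_env"),
    ("human_eval", "simulation_env"),
]

for a, b in pairs:
    both, left_only, right_only = overlap(modes[a], modes[b])
    print(f"{a} vs {b}: both={both}, left_only={left_only}, right_only={right_only}")
```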

Benchmark Brief

ContentBench

Coverage: 1 paper (9.1%)

Examples: Can Large Language Models Replace Human Coders? Introducing ContentBench

Benchmark Brief

GSM8K

Coverage: 1 paper (9.1%)

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

agreement

Coverage: 11 papers (100%)

Examples: GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Can Large Language Models Replace Human Coders? Introducing ContentBench

Metric Brief

accuracy

Coverage: 7 papers (63.6%)

Examples: GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

Metric Brief

cost

Coverage: 4 papers (36.4%)

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Can Large Language Models Replace Human Coders? Introducing ContentBench; Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: GATES: Self-Distillation under Privileged Context with Consensus Gating; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Can Large Language Models Replace Human Coders? Introducing ContentBench

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026

Automatic Metrics Math

Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.

Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026

Automatic Metrics Coding

This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.

Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.

Multi-Objective Alignment of Language Models for Personalized Psychotherapy
Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli · Feb 17, 2026

Automatic Metrics Medicine

While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety.

Revisiting Northrop Frye's Four Myths Theory with Large Language Models
Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado · Feb 17, 2026

Automatic Metrics General

Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather t

Mechanistic Indicators of Steering Effectiveness in Large Language Models
Mehdi Jafari, Hao Xue, Flora Salim · Feb 2, 2026

Automatic Metrics Coding

Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges.

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025

Automatic Metrics General

Large language models (LLMs) are increasingly used as raters for evaluation tasks.

Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025

Human Eval Automatic Metrics Coding

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n

Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025

Automatic Metrics General

Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and a

A Scalable Framework for Evaluating Health Language Models
Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025

Automatic Metrics Medicine

As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.
