Metric Hub

Agreement + Human Eval Metric Papers

Updated from current HFEPX corpus (Feb 27, 2026). 12 papers are grouped in this metric page. Common evaluation modes: Human Eval, Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: Retrieval. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 24, 2026.

Papers: 12 · Last published: Feb 24, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 12 papers in the Agreement + Human Eval metric hub. Dominant protocol signals are human evaluation, automatic metrics, and LLM-as-judge; the most frequent benchmark focus is Retrieval, and the most frequent metric focus is agreement, followed by cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

25% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments; HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue; PreScience: A Benchmark for Forecasting Scientific Contributions

Human evaluation appears in 100% of papers in this hub.

Evidence: PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments; Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Retrieval is the only benchmark named in this hub (1 of 12 papers), so it is the sole anchor for benchmark-matched comparisons on this page.

Evidence: Validating Political Position Predictions of Arguments; PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Protocol Takeaways

1 sampled paper reports both human evaluation and LLM-as-judge, supporting direct agreement checks.

Evidence: HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue; PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments

The most common quality-control signal is inter-annotator agreement reporting (58.3% of papers).

Evidence: PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification; ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Where the rater population is known, raters are mostly domain experts, and the most common annotation unit is pairwise; use this to scope replication staffing.

Evidence: Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems; PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments
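
A direct agreement check between a human rater and an LLM judge is often summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch, not a method taken from any paper in this hub: the function name `cohen_kappa` and the toy pairwise labels ("A" or "B" for the preferred response) are assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)

# Toy pairwise preferences, invented for illustration only.
human = ["A", "A", "B", "B", "A", "B", "A", "A"]
judge = ["A", "A", "B", "A", "A", "B", "B", "A"]

raw = sum(h == j for h, j in zip(human, judge)) / len(human)
print(f"raw agreement = {raw:.3f}")                         # 0.750
print(f"kappa         = {cohen_kappa(human, judge):.3f}")   # 0.467
```

Reporting raw agreement next to kappa, as in this toy output, makes it visible how much of the agreement could be explained by chance.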

Benchmark Interpretation

Retrieval appears in 8.3% of hub papers (1/12); use this cohort for benchmark-matched comparisons.

Metric Interpretation

Agreement is reported in 100% of hub papers (12/12); compare with a secondary metric before ranking methods.
Cost is reported in 16.7% of hub papers (2/12); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (25% vs 45% target).
Maintain strength on Papers reporting quality controls. Coverage is strong (66.7% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (8.3% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (8.3% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (33.3% vs 35% target).
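
The checklist labels can be reproduced from the coverage and target percentages with a simple ratio rule. This is a sketch under a stated assumption: the page does not say how the labels are derived, so the `near_ratio=0.9` cutoff and the function name `coverage_status` are hypothetical, chosen only because they match the six rows above.

```python
def coverage_status(coverage_pct, target_pct, near_ratio=0.9):
    """Label a coverage figure relative to its target.

    near_ratio is an assumed cutoff, not one documented by the page.
    """
    ratio = coverage_pct / target_pct
    if ratio >= 1.0:
        return "strong"
    if ratio >= near_ratio:
        return "usable but incomplete"
    return "replication risk"

# (checklist item, coverage %, target %) as listed in the checklist above.
checklist = [
    ("explicit human feedback", 25.0, 45.0),
    ("quality controls", 66.7, 30.0),
    ("benchmarks/datasets", 8.3, 35.0),
    ("evaluation metrics", 100.0, 35.0),
    ("known rater population", 8.3, 35.0),
    ("known annotation unit", 33.3, 35.0),
]

statuses = {name: coverage_status(cov, tgt) for name, cov, tgt in checklist}
for name, status in statuses.items():
    print(f"{name}: {status}")
```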

Known Limitations

Rater population is under-specified (8.3% coverage).
Benchmark coverage is thin (8.3% of papers mention benchmarks/datasets).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: agreement - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=1, left_only=11, right_only=0

1 paper uses both Human Eval and LLM-as-Judge.

human_eval vs automatic_metrics

both=1, left_only=11, right_only=0

1 paper uses both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=1

No papers use both LLM-as-Judge and Automatic Metrics.
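
The `both`, `left_only`, and `right_only` counts above are plain set arithmetic over the papers tagged with each evaluation mode. A minimal sketch follows; the paper ids `p1` to `p12` are placeholders, and the tag assignments mirror the paper cards on this page (all 12 papers tagged Human Eval, HEART also tagged LLM-as-Judge, "Beyond Understanding" also tagged Automatic Metrics).

```python
def overlap_counts(left, right):
    """Count papers in both sets, only in the left set, and only in the right set."""
    return {
        "both": len(left & right),
        "left_only": len(left - right),
        "right_only": len(right - left),
    }

# Placeholder ids for the 12 hub papers (hypothetical, for illustration).
papers = [f"p{i}" for i in range(1, 13)]

human_eval = set(papers)      # every paper in the hub is tagged Human Eval
llm_as_judge = {"p11"}        # stands in for HEART
automatic_metrics = {"p12"}   # stands in for "Beyond Understanding"

print(overlap_counts(human_eval, llm_as_judge))
# {'both': 1, 'left_only': 11, 'right_only': 0}
print(overlap_counts(human_eval, automatic_metrics))
# {'both': 1, 'left_only': 11, 'right_only': 0}
print(overlap_counts(llm_as_judge, automatic_metrics))
# {'both': 0, 'left_only': 1, 'right_only': 1}
```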

Benchmark Brief

Retrieval

Coverage: 1 paper (8.3%) mentions Retrieval.

Examples: Validating Political Position Predictions of Arguments

Metric Brief

agreement

Coverage: 12 papers (100%) mention agreement.

Examples: PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments

Metric Brief

cost

Coverage: 2 papers (16.7%) mention cost.

Examples: Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification; Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis

Metric Brief

accuracy

Coverage: 1 paper (8.3%) mentions accuracy.

Examples: Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: PreScience: A Benchmark for Forecasting Scientific Contributions; Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language; Validating Political Position Predictions of Arguments

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

PreScience: A Benchmark for Forecasting Scientific Contributions
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski · Feb 24, 2026

Human Eval Simulation Env General

We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language
Toheeb Aduramomi Jimoh, Tabea De Wille, Nikola S. Nikolov · Feb 21, 2026

Human Eval General

One annotator pair achieved almost perfect agreement (κ = 0.8743; 93.8% raw agreement), exceeding a number of reported benchmarks for English sarcasm research works.
Validating Political Position Predictions of Arguments
Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

Human Eval General

Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.
Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems
Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan · Feb 19, 2026

Human Eval Law Coding

Human evaluation further demonstrates substantial agreement between LLM and expert annotations.
Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026

Human Eval Coding Multilingual

Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
Are LLMs Ready to Replace Bangla Annotators?
Md. Najib Hasan, Touseef Hasan, Souvika Sarkar · Feb 18, 2026

Human Eval General

In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences.
BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR
Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026

Human Eval Multilingual

Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.
ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics
Hend Al-Khalifa, Nadia Ghezaiel, Maria Bounnit, Hend Hamed Alhazmi, Noof Abdullah Alfear · Feb 14, 2026

Human Eval General

We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models.
propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale
Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries · Feb 12, 2026

Human Eval Multilingual

We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose,
Large Language Models as Automatic Annotators and Annotation Adjudicators for Fine-Grained Opinion Analysis
Gaurav Negi, MA Waskow, John McCrae, Paul Buitelaar · Jan 23, 2026

Human Eval General

Although this level of detail is sound, it requires considerable human effort and substantial cost to annotate opinions in datasets for training models, especially across diverse domains and real-world applications.
HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue
Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026

Human Eval LLM-as-Judge General

Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.
Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025

Human Eval Automatic Metrics Coding

We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n
