
Metric Hub

Agreement in cs.CL Papers

Updated from the current HFEPX corpus (Feb 27, 2026). 23 papers are grouped on this metric page. Common evaluation modes: human evaluation and automatic metrics. Most common rater population: domain experts. Common annotation unit: pairwise. Frequent quality control: inter-annotator agreement reported. Frequently cited benchmark: ContentBench. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 23. Last published: Feb 24, 2026.

Research Narrative

Grounded narrative. Model: deterministic-grounded. Source: persisted.

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 23 papers for Agreement in cs.CL Papers. Dominant protocol signals include human evaluation, automatic metrics, and LLM-as-judge, with frequent benchmark focus on ContentBench and GSM8K and metric focus on agreement and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • ContentBench appears in 4.3% of hub papers (1/23); use this cohort for benchmark-matched comparisons.
  • GSM8K appears in 4.3% of hub papers (1/23); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • agreement is reported in 100% of hub papers (23/23); compare with a secondary metric before ranking methods.
  • accuracy is reported in 26.1% of hub papers (6/23); compare with a secondary metric before ranking methods.
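Since agreement is reported in every hub paper, it is worth recalling how a chance-corrected agreement score is computed before comparing numbers across papers. A minimal sketch of Cohen's kappa for two raters (the labels below are toy data, not drawn from the hub):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy pairwise-preference labels from two hypothetical annotators.
a = ["A", "A", "B", "B", "A", "B"]
b = ["A", "B", "B", "B", "A", "A"]
print(round(cohens_kappa(a, b), 3))
```

Raw percent agreement alone (p_o above) overstates reliability when the label distribution is skewed, which is one reason to pair agreement with a secondary metric as suggested above.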

Researcher Checklist

  • Close gap on Papers with explicit human feedback. Coverage is a replication risk (26.1% vs 45% target).
  • Maintain strength on Papers reporting quality controls. Coverage is strong (56.5% vs 30% target).
  • Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (13% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (13% vs 35% target).
  • Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (30.4% vs 35% target).
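The checklist verdicts above follow a simple coverage-vs-target comparison. A minimal sketch that reproduces the three verdict labels (the 10-point slack band separating "usable but incomplete" from "replication risk" is an assumption, not stated by the hub):

```python
# Coverage/target pairs copied from the checklist above (fractions of 23 papers).
CHECKLIST = {
    "explicit human feedback": (0.261, 0.45),
    "quality controls": (0.565, 0.30),
    "named benchmarks/datasets": (0.130, 0.35),
    "named evaluation metrics": (1.000, 0.35),
    "known rater population": (0.130, 0.35),
    "known annotation unit": (0.304, 0.35),
}

def triage(coverage, target, slack=0.10):
    """Map coverage vs. target to a verdict; slack width is an assumption."""
    if coverage >= target:
        return "strong"
    if coverage >= target - slack:
        return "usable but incomplete"
    return "replication risk"

for name, (cov, tgt) in CHECKLIST.items():
    print(f"{name}: {triage(cov, tgt)}")
```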


Suggested Reading Order

  1. GATES: Self-Distillation under Privileged Context with Consensus Gating

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. MERRY: Semantically Decoupled Evaluation of Multimodal Emotional and Role Consistencies of Role-Playing Agents

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  3. PreScience: A Benchmark for Forecasting Scientific Contributions

    Continues the detailed protocol reporting, including rater and quality-control evidence.

  4. Yor-Sarc: A gold-standard dataset for sarcasm detection in a low-resource African language

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  6. Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

    Adds automatic metrics for broader coverage within this hub.

  7. Can Large Language Models Replace Human Coders? Introducing ContentBench

    Adds automatic metrics with critique/edit feedback for broader coverage within this hub.

  8. Validating Political Position Predictions of Arguments

    Adds human evaluation with pairwise preferences for broader coverage within this hub.

Known Limitations

  • Rater population is under-specified (13% coverage).
  • Benchmark coverage is thin (13% of papers mention benchmarks/datasets).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=1, left_only=11, right_only=2

1 paper uses both human eval and LLM-as-judge.

human_eval vs automatic_metrics

both=1, left_only=11, right_only=9

1 paper uses both human eval and automatic metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=3, right_only=10

No papers use both LLM-as-judge and automatic metrics.
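The both/left_only/right_only counts above are plain set-overlap statistics over the papers tagged with each protocol. A minimal sketch with hypothetical paper IDs (the ID values are illustrative only):

```python
def protocol_overlap(left_ids, right_ids):
    """Count papers tagged with both protocols, only the left, or only the right."""
    left, right = set(left_ids), set(right_ids)
    return {
        "both": len(left & right),        # tagged with both protocols
        "left_only": len(left - right),   # only the left protocol
        "right_only": len(right - left),  # only the right protocol
    }

# Hypothetical paper IDs for two protocol tags.
human_eval = {"p1", "p2", "p3"}
llm_as_judge = {"p3", "p4"}
print(protocol_overlap(human_eval, llm_as_judge))
```

Running the same computation over the hub's tag sets yields the pairings reported above.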

Benchmark Brief

ContentBench

Coverage: 1 paper (4.3%)

1 paper (4.3%) mentions ContentBench.

Examples: Can Large Language Models Replace Human Coders? Introducing ContentBench

Benchmark Brief

GSM8K

Coverage: 1 paper (4.3%)

1 paper (4.3%) mentions GSM8K.

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Benchmark Brief

Retrieval

Coverage: 1 paper (4.3%)

1 paper (4.3%) mentions Retrieval.

Examples: Validating Political Position Predictions of Arguments
