
Metric Hub

Agreement In CS.AI Papers

Updated from the current HFEPX corpus (Feb 27, 2026). 13 papers are grouped on this metric page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Mixed. Common annotation unit: Pairwise. Frequent quality control: Inter-Annotator Agreement Reported. Frequently cited benchmark: ContentBench. Common metric signal: agreement. Use this page to compare protocol setup, judge behavior, and labeling-design decisions before running new eval experiments. The newest paper in this set is from Feb 24, 2026.

Papers: 13 · Last published: Feb 24, 2026

Research Narrative


Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 13 papers for Agreement In CS.AI Papers. Dominant protocol signals include automatic metrics, human evaluation, and simulation environments, with frequent benchmark focus on ContentBench and GSM8K and metric focus on agreement and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

  • ContentBench appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.
  • GSM8K appears in 7.7% of hub papers (1/13); use this cohort for benchmark-matched comparisons.

Metric Interpretation

  • agreement is reported in 100% of hub papers (13/13); compare with a secondary metric before ranking methods.
  • accuracy is reported in 38.5% of hub papers (5/13); compare with a secondary metric before ranking methods.
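Raw percent agreement can look strong purely by chance, which is one reason to pair it with a secondary metric before ranking methods. A minimal stdlib-only sketch of this check, using Cohen's kappa as the chance-corrected companion statistic (the label sequences below are made-up illustrations, not data from any hub paper):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent raters with the same marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations from two raters on six items.
rater_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
rater_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]

raw_agreement = sum(x == y for x, y in zip(rater_1, rater_2)) / len(rater_1)
print(round(raw_agreement, 3))                    # 0.667
print(round(cohen_kappa(rater_1, rater_2), 3))    # 0.333
```

Here 66.7% raw agreement shrinks to kappa = 0.33 once chance matches on a balanced two-label task are discounted, which is why reporting agreement alone can overstate reliability.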

Researcher Checklist

  • Tighten coverage on Papers with explicit human feedback. Coverage is usable but incomplete (30.8% vs 45% target).
  • Maintain strength on Papers reporting quality controls. Coverage is strong (46.2% vs 30% target).
  • Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (23.1% vs 35% target).
  • Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
  • Close gap on Papers with known rater population. Coverage is a replication risk (7.7% vs 35% target).
  • Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (23.1% vs 35% target).


Suggested Reading Order

  1. PreScience: A Benchmark for Forecasting Scientific Contributions

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  2. Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  3. Can Large Language Models Replace Human Coders? Introducing ContentBench

    Start here for detailed protocol reporting, including rater and quality-control evidence.

  4. Validating Political Position Predictions of Arguments

    Include a human-eval paper to anchor calibration against automated judge settings.

  5. HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Include an LLM-as-judge paper to assess judge design and agreement assumptions.

  6. Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

    Adds automatic metrics for broader coverage within this hub.

  7. Are LLMs Ready to Replace Bangla Annotators?

    Adds human evaluation for broader coverage within this hub.

  8. Revisiting Northrop Frye's Four Myths Theory with Large Language Models

    Adds automatic metrics for broader coverage within this hub.

Known Limitations

  • Rater population is under-specified (7.7% coverage).
  • Annotation unit is under-specified (23.1% coverage).
  • Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

human_eval vs llm_as_judge

both=1, left_only=4, right_only=0

1 paper uses both Human Eval and LLM As Judge.

human_eval vs automatic_metrics

both=0, left_only=5, right_only=7

No papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=0, left_only=1, right_only=7

No papers use both LLM As Judge and Automatic Metrics.
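The both/left_only/right_only counts above are plain set operations over the paper sets tagged with each protocol. A minimal sketch (the paper IDs are hypothetical placeholders, not the hub's real identifiers; the set sizes mirror the human_eval vs llm_as_judge comparison above):

```python
# Hypothetical paper-ID sets; the real hub derives these from paper metadata tags.
human_eval = {"p1", "p2", "p3", "p4", "p5"}
llm_as_judge = {"p5"}

both = human_eval & llm_as_judge         # papers tagged with both protocols
left_only = human_eval - llm_as_judge    # human-eval-only papers
right_only = llm_as_judge - human_eval   # judge-only papers

print(len(both), len(left_only), len(right_only))  # 1 4 0
```

Because the three sets partition the union, len(both) + len(left_only) + len(right_only) always equals the number of distinct papers in the comparison.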

Top Papers Reporting This Metric

  • PreScience: A Benchmark for Forecasting Scientific Contributions

    Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski · Feb 24, 2026

Human Eval Simulation Env General

    We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.

  • Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

    Arindam Khaled · Feb 23, 2026

    Automatic Metrics Math

    In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.

  • Can Large Language Models Replace Human Coders? Introducing ContentBench

    Michael Haman · Feb 23, 2026

    Automatic Metrics Coding

    This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.

  • Validating Political Position Predictions of Arguments

    Jordan Robinson, Angus R. Williams, Katie Atkinson, Anthony G. Cohn · Feb 20, 2026

    Human Eval General

    Real-world knowledge representation often requires capturing subjective, continuous attributes -- such as political positions -- that conflict with pairwise validation, the widely accepted gold standard for human evaluation.

  • Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation

    Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic Metrics Simulation Env General

    When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.

  • Are LLMs Ready to Replace Bangla Annotators?

    Md. Najib Hasan, Touseef Hasan, Souvika Sarkar · Feb 18, 2026

    Human Eval General

    In this work, we study the behavior of LLMs as zero-shot annotators for Bangla hate speech, a task where even human agreement is challenging, and annotator bias can have serious downstream consequences.

  • Revisiting Northrop Frye's Four Myths Theory with Large Language Models

    Edirlei Soares de Lima, Marco A. Casanova, Antonio L. Furtado · Feb 17, 2026

    Automatic Metrics General

Northrop Frye's theory of four fundamental narrative genres (comedy, romance, tragedy, satire) has profoundly influenced literary criticism, yet computational approaches to his framework have focused primarily on narrative patterns rather […]

  • BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

    Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique · Feb 16, 2026

    Human Eval Multilingual

    Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity.

  • DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations

    Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen · Feb 2, 2026

    Simulation Env General

    However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC.

  • HEART: A Unified Benchmark for Assessing Humans and LLMs in Emotional Support Dialogue

    Laya Iyer, Kriti Aggarwal, Sanmi Koyejo, Gail Heyman, Desmond C. Ong · Jan 9, 2026

Human Eval Llm As Judge General

    Despite rapid progress in language models, we still lack a clear way to understand how their abilities in these interpersonal domains compare to those of humans.

  • Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

    Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025

    Automatic Metrics General

    Large language models (LLMs) are increasingly used as raters for evaluation tasks.

  • Incentive-Aligned Multi-Source LLM Summaries

    Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025

    Automatic Metrics General

Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate […]

  • A Scalable Framework for Evaluating Health Language Models

    Neil Mallinar, A. Ali Heydari, Xin Liu, Anthony Z. Faranesh, Brent Winslow · Mar 30, 2025

    Automatic Metrics Medicine

    As LLM-driven health applications are increasingly adopted, rigorous and efficient one-sided evaluation methodologies are crucial to ensure response quality across multiple dimensions, including accuracy, personalization and safety.

Other Metric Hubs