HFEPX Metric Hub

Kappa In CS.CL Papers

Updated from current HFEPX corpus (Jun 30, 2026). 23 papers are grouped in this metric page.

Read Full Context

Updated from current HFEPX corpus (Jun 30, 2026). 23 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Human Eval. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Inter Annotator Agreement Reported. Frequently cited benchmark: Adversabench. Common metric signal: kappa. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Jun 17, 2026.

Papers: 23 Last published: Jun 17, 2026 Global RSS

When This Metric Page Is Useful

Useful for background comparison, but still validate benchmark and protocol details in the linked papers. Quality band: Medium .

Metric Coverage

43.5%

10 sampled papers include metric names.

Benchmark Anchoring

13.0%

Papers with explicit dataset/benchmark anchors for fair comparison.

Quality Controls

43.5%

10 papers report calibration/adjudication/IAA controls.

23 papers are not low-signal flagged in this sample.
Use the protocol matrix below to avoid comparing metrics across incompatible eval setups.

Recommended next step: Treat this as directional signal only; metric reporting is present but benchmark anchoring is still thin.

Main limitation: Benchmark coverage is still thin, so avoid treating this page as a definitive guide to the metric.

What This Metric Page Tells You

90% of papers report explicit human-feedback signals, led by pairwise preferences.
automatic metrics appears in 39.1% of papers in this hub.
Adversabench is a recurring benchmark anchor for cross-paper comparisons in this page.

Metric Notes (Expanded)

Metric-Driven Protocol Takeaways

1 sampled papers report both human evaluation and LLM-as-judge, supporting direct agreement checks.
Most common quality-control signal is inter-annotator agreement reporting (39.1% of papers).
Rater context is mostly domain experts, and annotation is commonly pairwise annotation; use this to scope replication staffing.

Metric Interpretation

kappa is reported in 100% of hub papers (10/23); compare with a secondary metric before ranking methods.
agreement is reported in 60% of hub papers (6/23); compare with a secondary metric before ranking methods.

Benchmark Context

Adversabench appears in 10% of hub papers (1/23); use this cohort for benchmark-matched comparisons.
FEVER appears in 10% of hub papers (1/23); use this cohort for benchmark-matched comparisons.

Start Here (Metric-Reliable First 6)

Ranked for metric reporting completeness and comparability.

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
Jun 17, 2026 · Citations: 0 · Score: 10.5

Metrics: Exact match, Kappa · Eval: Llm As Judge, Automatic Metrics
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Jun 23, 2026 · Citations: 0 · Score: 10.5

Metrics: Kappa, Agreement · Eval: Automatic Metrics
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
May 3, 2026 · Citations: 0 · Score: 10.0

Metrics: Accuracy, Kappa · Eval: Automatic Metrics
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Mar 31, 2026 · Citations: 0 · Score: 8.0

Metrics: Kappa, Agreement · Eval: Human Eval
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
Mar 10, 2026 · Citations: 0 · Score: 8.0

Metrics: Accuracy, Kappa · Eval: Automatic Metrics
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Mar 30, 2026 · Citations: 0 · Score: 8.0

Metrics: Kappa, Agreement · Eval: Automatic Metrics

Metric Protocol Matrix (Top 10)

Compare metric, benchmark, and evaluation context side by side.

Paper	Metrics	Benchmarks	Eval Modes	Quality Controls
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias Jun 17, 2026	Exact match, Kappa	MT Bench, Judgebench	Llm As Judge, Automatic Metrics	Inter Annotator Agreement Reported
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability Jun 23, 2026	Kappa, Agreement	Adversabench	Automatic Metrics	Inter Annotator Agreement Reported
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning May 3, 2026	Accuracy, Kappa	FEVER	Automatic Metrics	Inter Annotator Agreement Reported
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias Mar 31, 2026	Kappa, Agreement	Not reported	Human Eval	Inter Annotator Agreement Reported, Adjudication
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring Mar 10, 2026	Accuracy, Kappa	Not reported	Automatic Metrics	Adjudication
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories Mar 30, 2026	Kappa, Agreement	Not reported	Automatic Metrics	Inter Annotator Agreement Reported
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model Mar 22, 2026	Accuracy, Kappa	Not reported	Automatic Metrics	Inter Annotator Agreement Reported
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation Mar 20, 2026	Kappa, Faithfulness	Not reported	Automatic Metrics	Inter Annotator Agreement Reported
Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins Feb 23, 2026	Accuracy, F1	Not reported	Automatic Metrics	Inter Annotator Agreement Reported
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith Mar 25, 2026	Accuracy, Kappa	Not reported	Human Eval, Llm As Judge	Inter Annotator Agreement Reported

How To Use This Page

Checklist

Strong: Papers with explicit human feedback

Coverage is strong (90% vs 45% target).
Strong: Papers reporting quality controls

Coverage is strong (100% vs 30% target).
Moderate: Papers naming benchmarks/datasets

Coverage is usable but incomplete (30% vs 35% target).
Strong: Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).
Gap: Papers with known rater population

Coverage is a replication risk (20% vs 35% target).
Strong: Papers with known annotation unit

Coverage is strong (70% vs 35% target).

Strengths

Strong human-feedback signal (90% of papers).
Quality-control evidence appears in 100% of papers.
Contains both human-eval and LLM-as-judge protocols for head-to-head methodology comparison.

Known Gaps

Rater population is under-specified (20% coverage).

Suggested Next Analyses

Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.
Stratify by benchmark (Adversabench vs FEVER) before comparing methods.
Track metric sensitivity by reporting both kappa and agreement.

Recommended Queries

Judge vs Human Agreement Benchmark Slice: Adversabench Metric Slice: kappa IAA-Reported Evaluations Recent High-Signal Papers

Known Limitations

Rater population is under-specified (20% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.

Coverage Snapshot

Top Metrics

Kappa (10)
Agreement (6)
Accuracy (5)
Faithfulness (2)

Evaluation Modes

Automatic Metrics (9)
Human Eval (2)
Llm As Judge (2)

Top Benchmarks

Adversabench (1)
FEVER (1)
Judgebench (1)
MT Bench (1)

Agentic Mix

Long Horizon (2)
Multi Agent (1)
Tool Use (1)

Top Papers Reporting This Metric

Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias
Justin D. Norman, Michael U. Rivera, D. Alex Hughes · Jun 17, 2026 · Citations: 0

Llm As JudgeAutomatic Metrics General

We present the largest systematic evaluation of LLM-as-a-Judge to date: 21 judges from nine providers across MT-Bench, JudgeBench, and RewardBench, evaluated under three protocols (agreement, consistency, bias audit) over 118 runs and…
AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Khanak Khandelwal · Jun 23, 2026 · Citations: 0

Automatic Metrics Coding

We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker.
The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning
Kwan Soo Shin · May 3, 2026 · Citations: 0

Automatic Metrics General

Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers.
LLM Essay Scoring Under Holistic and Analytic Rubrics: Prompt Effects and Bias
Filip J. Kucia, Anirban Chakraborty, Anna Wróblewska · Mar 31, 2026 · Citations: 0

Human Eval General

We present a systematic evaluation of instruction-tuned LLMs across three open essay-scoring datasets (ASAP 2.0, ELLIPSE, and DREsS) that cover both holistic and analytic scoring.
From Days to Minutes: An Autonomous AI Agent Achieves Reliable Clinical Triage in Remote Patient Monitoring
Seunghwan Kim, Tiffany H. Kung, Heena Verma, Dilan Edirisinghe, Kaveh Sedehi · Mar 10, 2026 · Citations: 0

Automatic Metrics Medicine

Results: Against a human majority-vote standard (N=467), the agent achieved 95.8% emergency sensitivity and 88.5% sensitivity for all actionable alerts (85.7% specificity).
From Consensus to Split Decisions: ABC-Stratified Sentiment in Holocaust Oral Histories
Daban Q. Jaff · Mar 30, 2026 · Citations: 0

Automatic Metrics General

After assembling model outputs, we introduce an agreement-based stability taxonomy (ABC) to stratify inter-model output stability.
SleepVLM: Explainable and Rule-Grounded Sleep Staging via a Vision-Language Model
Guifeng Deng, Pan Wang, Jiquan Wang, Shuying Rao, Junyi Xie · Mar 22, 2026 · Citations: 0

Automatic Metrics Medicine

Expert evaluations further validated the quality of the model's reasoning, with mean scores exceeding 4.0/5.0 for factual accuracy, evidence comprehensiveness, and logical coherence.
Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation
Richard J. Young · Mar 20, 2026 · Citations: 0

Automatic Metrics General

Three classifiers (a regex-only detector, a regex-plus-LLM pipeline, and a Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
Personalized Prediction of Perceived Message Effectiveness Using Large Language Model Based Digital Twins
Jasmin Han, Janardan Devkota, Joseph Waring, Amanda Luken, Felix Naughton · Feb 23, 2026 · Citations: 0

Automatic Metrics General

Model performance was assessed on three held-out messages per participant using accuracy, Cohen's kappa, and F1.
Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith
Somaya Eltanbouly, Samer Rashwani · Mar 25, 2026 · Citations: 0

Human EvalLlm As Judge Coding

Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments.
GradeLegal: Automated Grading for German Legal Cases
Abdullah Al Zubaer, Lorenz Wendlinger, Simon Alexander Nonn, Michael Granitzer, Jelena Mitrovic · May 20, 2026 · Citations: 0
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour · May 18, 2026 · Citations: 0
Given, When, Then, Again: Mining Subscenario Refactoring Candidates in Behaviour-Driven Test Suites with ML Classifiers and LLM-Judge Baselines
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal · May 14, 2026 · Citations: 0
Exploring Applications of Transfer-State Large Language Models: Cognitive Profiling and Socratic AI Tutoring
Minori Noguchi · Apr 30, 2026 · Citations: 0
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
Dongxin Guo, Jikun Wu, Siu Ming Yiu · Apr 26, 2026 · Citations: 0
Reducing Maintenance Burden in Behaviour-Driven Development: A Paraphrase-Robust Duplicate-Step Detector with a 1.1M-Step Open Benchmark
Ali Hassaan Mughal, Noor Fatima, Muhammad Bilal · Apr 22, 2026 · Citations: 0
IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
Rajveer Singh Pall · Apr 21, 2026 · Citations: 0
IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
David Gringras · Apr 9, 2026 · Citations: 0
Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks
Fatih Uenal · Mar 24, 2026 · Citations: 0
Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study
Weihang Huang, Mengna Liu · Mar 11, 2026 · Citations: 0
LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs
Serene Wang, Lavanya Pobbathi, Haihua Chen · Mar 9, 2026 · Citations: 0
TSVer: A Benchmark for Fact Verification Against Time-Series Evidence
Marek Strong, Andreas Vlachos · Nov 2, 2025 · Citations: 0
CAPC-CG: A Large-Scale, Expert-Directed LLM-Annotated Corpus of Adaptive Policy Communication in China
Bolun Sun, Charles Chang, Yuen Yuen Ang, Ruotong Mu, Yuchen Xu · Oct 10, 2025 · Citations: 0

Related Metrics And Hubs

Kappa Metric Papers Agreement & Reliability Metric Papers In CS.CL Agreement & Reliability Metric Papers CS.CL Human Feedback And Eval Papers Kappa + Automatic Metrics Metric Papers Kappa In CS.AI Papers Coherence + General Metric Papers (16) Helpfulness In CS.CL Papers (17) Helpfulness Metric Papers (20) Relevance + Automatic Metrics Metric Papers (22) Relevance + General Metric Papers (12) Relevance In CS.CL Papers (68) Relevance In CS.LG Papers (21) Relevance Metric Papers (96) Throughput In CS.LG Papers (26) Inference Cost In CS.AI Papers (18)