Metric Hub

Calibration In CS.AI Papers

Updated from the current HFEPX corpus (Feb 27, 2026). This metric page groups 11 papers. Common evaluation mode: Automatic Metrics. Most common rater population: Domain Experts. Common annotation unit: Pairwise. Frequent quality control: Calibration. Frequently cited benchmark: MMLU. Common metric signal: calibration. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. The newest paper in this set is from Feb 23, 2026.

Papers: 11 · Last published: Feb 23, 2026

Research Narrative

Grounded narrative · Model: deterministic-grounded · Source: persisted

Updated from the current HFEPX corpus (Feb 27, 2026). This page tracks 11 papers for Calibration In CS.AI Papers. The dominant protocol signal is automatic metrics, with frequent benchmark focus on MMLU and Retrieval and metric focus on calibration and accuracy. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

18.2% of papers report explicit human-feedback signals, led by expert verification.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment

Automatic metrics appear in 100% of papers in this hub.

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

MMLU is a recurring benchmark anchor for cross-paper comparisons on this page.

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Humanity's Last Exam; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment

Protocol Takeaways

The most common quality-control signal is rater calibration (100% of papers).

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Where reported, raters are mostly domain experts and the annotation unit is commonly pairwise; use this to scope replication staffing.

Evidence: Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling; Humanity's Last Exam; KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Stratify by benchmark (MMLU vs Retrieval) before comparing methods.

Evidence: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment; Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling

Benchmark Interpretation

MMLU appears in 18.2% of hub papers (2/11); use this cohort for benchmark-matched comparisons.
Retrieval appears in 18.2% of hub papers (2/11); use this cohort for benchmark-matched comparisons.
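
A benchmark-matched comparison can be sketched as follows. This is a minimal illustration, assuming results are available as (paper, benchmark, method, score) records; the record layout, the names `RESULTS` and `rank_within_benchmark`, and all scores are hypothetical placeholders, not values reported by any paper in this hub.

```python
from collections import defaultdict

# Hypothetical records: (paper, benchmark, method, score).
# Scores are illustrative placeholders, not results from hub papers.
RESULTS = [
    ("paper_a", "MMLU", "method_x", 0.71),
    ("paper_a", "MMLU", "method_y", 0.68),
    ("paper_b", "Retrieval", "method_x", 0.55),
    ("paper_b", "Retrieval", "method_y", 0.60),
    ("paper_c", "GSM8K", "method_x", 0.80),
]


def rank_within_benchmark(records):
    """Group records by benchmark and rank methods inside each group only."""
    by_benchmark = defaultdict(list)
    for _paper, benchmark, method, score in records:
        by_benchmark[benchmark].append((method, score))
    return {
        benchmark: [m for m, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
        for benchmark, rows in by_benchmark.items()
    }


rankings = rank_within_benchmark(RESULTS)
for benchmark, methods in rankings.items():
    print(benchmark, methods)
```

Ranking inside each benchmark cohort avoids mixing scores that were produced under different protocols; in this toy data the method order flips between the MMLU and Retrieval cohorts, which a pooled ranking would hide.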

Metric Interpretation

calibration is reported in 100% of hub papers (11/11); compare with a secondary metric before ranking methods.
accuracy is reported in 36.4% of hub papers (4/11); compare with a secondary metric before ranking methods.
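
One way to pair calibration with a secondary metric is to report expected calibration error (ECE) next to accuracy. The sketch below is an assumption about how that could be done: this page does not state which calibration measure each paper uses, and the equal-width binning, the function names, and the toy predictions are all illustrative.

```python
def accuracy(correct):
    """Fraction of predictions that were correct (1) rather than wrong (0)."""
    return sum(correct) / len(correct)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |bin accuracy - bin confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        avg_acc = sum(o for _, o in items) / len(items)
        ece += (len(items) / n) * abs(avg_acc - avg_conf)
    return ece


# Hypothetical systems with identical accuracy but different confidence.
correct = [1, 1, 1, 0]
overconfident = [0.9, 0.9, 0.9, 0.9]
well_calibrated = [0.75, 0.75, 0.75, 0.75]

acc = accuracy(correct)
ece_over = expected_calibration_error(overconfident, correct)
ece_well = expected_calibration_error(well_calibrated, correct)
print(acc, ece_over, ece_well)
```

Both toy systems have the same accuracy, so accuracy alone cannot separate them, while ECE does; this is the reason to read the two metrics together before ranking methods.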

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (18.2% vs 45% target).
Maintain strength on Papers reporting quality controls. Coverage is strong (100% vs 30% target).
Maintain strength on Papers naming benchmarks/datasets. Coverage is strong (36.4% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (18.2% vs 35% target).
Tighten coverage on Papers with known annotation unit. Coverage is usable but incomplete (27.3% vs 35% target).
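
The checklist labels above can be reproduced with a simple threshold rule. This is only a sketch: the paper counts are back-computed from the percentages shown (for example 18.2% of 11 papers is 2), and the 0.75 cutoff separating "usable but incomplete" from "replication risk" is an assumption chosen to match the six items on this page, not a documented HFEPX rule.

```python
TOTAL_PAPERS = 11

# (dimension, papers covered, target percent). Counts are inferred from the
# percentages in the checklist; they are not stated directly on the page.
CHECKLIST = [
    ("explicit human feedback", 2, 45),
    ("quality controls", 11, 30),
    ("benchmarks/datasets", 4, 35),
    ("evaluation metrics", 11, 35),
    ("known rater population", 2, 35),
    ("known annotation unit", 3, 35),
]

# ASSUMPTION: coverage at or above 75% of the target counts as "usable".
USABLE_RATIO = 0.75


def coverage_status(covered, total, target_pct, usable_ratio=USABLE_RATIO):
    """Return (coverage percent rounded to 0.1, status label)."""
    pct = round(100 * covered / total, 1)
    if pct >= target_pct:
        label = "strong"
    elif pct >= usable_ratio * target_pct:
        label = "usable but incomplete"
    else:
        label = "replication risk"
    return pct, label


for name, covered, target in CHECKLIST:
    pct, label = coverage_status(covered, TOTAL_PAPERS, target)
    print(f"{name}: {pct}% vs {target}% target -> {label}")
```

With a different cutoff the middle label would shift, so treat the labels as a triage aid rather than a fixed classification.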

Known Limitations

Rater population is under-specified (18.2% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.
Cross-page comparisons should be benchmark- and metric-matched to avoid protocol confounding.

Research Utility Links

Benchmark Slice: MMLU - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: calibration - Finds papers where reported metrics are directly comparable.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

Benchmark Brief

MMLU

Coverage: 2 papers (18.2%)

2 papers (18.2%) mention MMLU.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Humanity's Last Exam

Benchmark Brief

Retrieval

Coverage: 2 papers (18.2%)

2 papers (18.2%) mention Retrieval.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Humanity's Last Exam

Benchmark Brief

GSM8K

Coverage: 1 paper (9.1%)

1 paper (9.1%) mentions GSM8K.

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Metric Brief

calibration

Coverage: 11 papers (100%)

11 papers (100%) mention calibration.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment

Metric Brief

accuracy

Coverage: 4 papers (36.4%)

4 papers (36.4%) mention accuracy.

Examples: Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning; CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning

Metric Brief

cost

Coverage: 2 papers (18.2%)

2 papers (18.2%) mention cost.

Examples: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration; Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference; Who can we trust? LLM-as-a-jury for Comparative Assessment

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics · Math

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp

Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics · Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.

Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026

Automatic Metrics · General

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements.

Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Automatic Metrics · Coding

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.

PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026

Automatic Metrics · General

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.

Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng · Jan 24, 2026

Automatic Metrics · Medicine, Coding

Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety.

Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery
Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta · Oct 10, 2025

Automatic Metrics · General

The Mar Menor, Europe's largest hypersaline coastal lagoon, located in southeastern Spain, has undergone severe eutrophication crises, with devastating impacts on biodiversity and water quality.

LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025

Automatic Metrics · General

A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.

CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025

Automatic Metrics · General

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.

ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis · May 5, 2025

Automatic Metrics · General

Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal c

Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025

Automatic Metrics · Math

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
