Metric Hub

Calibration Metric Papers

Updated from current HFEPX corpus (Feb 26, 2026). 24 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequently cited benchmark: MMLU. Common metric signal: calibration. Newest paper in this set is from Feb 25, 2026.

Papers: 24 Last published: Feb 25, 2026 Global RSS

Research Utility Snapshot

Human Feedback Mix

Critique Edit (1)
Expert Verification (1)
Pairwise Preference (1)
Rubric Rating (1)

Agentic Evaluation

Long Horizon (1)
Multi Agent (1)
Tool Use (1)

Top Papers Reporting This Metric

Scalable Kernel-Based Distances for Statistical Inference and Integration
Masha Naslidnyk · Feb 25, 2026

Simulation Env Coding

Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning.
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026

Automatic Metrics Math

We validate across five benchmarks, five models from three families, and both synthetic and real data.
KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration
Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi · Feb 23, 2026

Automatic Metrics Math

Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-sp
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja Yujia Bao · Feb 22, 2026

Automatic Metrics General

Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
Who can we trust? LLM-as-a-jury for Comparative Assessment
Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill · Feb 18, 2026

Automatic Metrics General

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment often using pairwise comparative judgements.
Team of Thoughts: Efficient Test-time Scaling of Agentic Systems through Orchestrated Tool Calling
Jeffrey T. H. Wong, Zixi Zhang, Junyi Liu, Yiren Zhao · Feb 18, 2026

Automatic Metrics Coding

Existing Multi-Agent Systems (MAS) typically rely on static, homogeneous model configurations, limiting their ability to exploit the distinct strengths of differently post-trained models.
Discrete Stochastic Localization for Non-autoregressive Generation
Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis · Feb 18, 2026

Automatic Metrics General

On OpenWebText, \textsc{DSL} fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with $\sim$4$\times$ fewer denoiser evaluations, and matches autoregressive quality at high budgets.
MCPShield: A Security Cognition Layer for Adaptive Trust Calibration in Model Context Protocol Agents
Zhenhong Zhou, Yuanhe Zhang, Hongwei Cai, Moayad Aloqaily, Ouns Bouachir · Feb 15, 2026

Automatic Metrics General

The Model Context Protocol (MCP) standardizes tool use for LLM-based agents and enable third-party servers.
PMG: Parameterized Motion Generator for Human-like Locomotion Control
Chenxi Han, Yuheng Min, Zihao Huang, Ao Hong, Hang Liu · Feb 13, 2026

Automatic Metrics General

Recent advances in data-driven reinforcement learning and motion tracking have substantially improved humanoid locomotion, yet critical practical challenges remain.
Building Safe and Deployable Clinical Natural Language Processing under Temporal Leakage Constraints
Ha Na Cho, Sairam Sutari, Alexander Lopez, Hansen Bow, Kai Zheng · Jan 24, 2026

Automatic Metrics MedicineCoding

Such behavior poses substantial risks for real-world deployment, where overconfident or temporally invalid predictions can disrupt clinical workflows and compromise patient safety.
Cross-Cultural Expert-Level Art Critique Evaluation with Vision-Language Models
Haorui Yu, Xuehang Wen, Fengrui Zhang, Qiufeng Yi · Jan 12, 2026

Automatic Metrics Coding

Existing benchmarks assess perception without interpretation, and common evaluation proxies, such as automated metrics and LLM-judge averaging, are unreliable for culturally sensitive generative tasks.
WISE: Web Information Satire and Fakeness Evaluation
Gaurab Chhetri, Subasish Das, Tausif Islam Chowdhury · Dec 30, 2025

Automatic Metrics General

This study develops WISE (Web Information Satire and Fakeness Evaluation) framework which benchmarks eight lightweight transformer models alongside two baseline models on a balanced dataset of 20,000 samples from Fakeddit, annotated as eith
A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025

Automatic Metrics MedicineCoding

Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
Annotation-Efficient Universal Honesty Alignment
Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu · Oct 20, 2025

Automatic Metrics General

To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals.
Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery
Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta · Oct 10, 2025

Automatic Metrics General

The Mar Menor, Europe's largest hypersaline coastal lagoon, located in southeastern Spain, has undergone severe eutrophication crises, with devastating impacts on biodiversity and water quality.
LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025

Automatic Metrics General

A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025

Automatic Metrics General

Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng · Sep 18, 2025

Automatic Metrics MathCoding

Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency.
Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia · Aug 21, 2025

Automatic Metrics Math

With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-
ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis · May 5, 2025

Automatic Metrics General

Applied to several large language models (LLMs), ReplaceMe achieves up to 25\% pruning while retaining approximately 90\% of the original model's performance on open benchmarks - without any training or healing steps, resulting in minimal c
Compressing Language Models for Specialized Domains
Miles Williams, George Chrysostomou, Vitor Jeronymo, Nikolaos Aletras · Feb 25, 2025

Automatic Metrics LawMedicine

Compression techniques such as pruning and quantization offer a practical path towards efficient LM deployment, exemplified by their ability to preserve performance on general-purpose benchmarks.
Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025

Automatic Metrics Math

Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
Calibrating Large Language Models with Sample Consistency
Qing Lyu, Kumar Shridhar, Chaitanya Malaviya, Li Zhang, Yanai Elazar · Feb 21, 2024

Automatic Metrics General

We perform an extensive evaluation across various open and closed-source models on nine reasoning datasets.

Other Metric Hubs

Cost Metric Papers (78) Accuracy Metric Papers (218) Latency Metric Papers (34) Recall Metric Papers (33) F1 Metric Papers (32) Precision Metric Papers (31) Agreement Metric Papers (24) Throughput Metric Papers (15) Success Rate Metric Papers (14) Perplexity Metric Papers (14)