Metric Hub

Accuracy In CS.AI Papers

Updated from current HFEPX corpus (Feb 27, 2026). 142 papers are grouped in this metric page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 142 Last published: Feb 26, 2026 Global RSS

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 142 papers for Accuracy In CS.AI Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, MATH and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

7.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
automatic metrics appears in 100% of papers in this hub.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment , MoDora: Tree-Based Semi-Structured Document Analysis System
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , MoDora: Tree-Based Semi-Structured Document Analysis System , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance , Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem

Protocol Takeaways

Most common quality-control signal is rater calibration (3.5% of papers).

Evidence: A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: Distill and Align Decomposition for Enhanced Claim Verification , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Benchmark Interpretation

Retrieval appears in 9.9% of hub papers (14/142); use this cohort for benchmark-matched comparisons.
MATH appears in 4.9% of hub papers (7/142); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 100% of hub papers (142/142); compare with a secondary metric before ranking methods.
cost is reported in 11.3% of hub papers (16/142); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (7.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (7% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (29.6% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (100% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (11.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (8.5% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (7.7% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (7% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (29.6% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (100% vs 35% target).

Papers with known rater population

Coverage is a replication risk (11.3% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (8.5% vs 35% target).

Known Limitations

Only 7% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (11.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=1, left_only=0, right_only=141

1 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=11, left_only=131, right_only=0

11 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=11, right_only=1

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 14 papers (9.9%)

14 papers (9.9%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , MoDora: Tree-Based Semi-Structured Document Analysis System , Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance

Benchmark Brief

MATH

Coverage: 7 papers (4.9%)

7 papers (4.9%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

GSM8K

Coverage: 4 papers (2.8%)

4 papers (2.8%) mention GSM8K.

Examples: Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration , Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference , Weight space Detection of Backdoors in LoRA Adapters

Metric Brief

accuracy

Coverage: 142 papers (100%)

142 papers (100%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Metric Brief

cost

Coverage: 16 papers (11.3%)

16 papers (11.3%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision? , From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators

Metric Brief

latency

Coverage: 10 papers (7%)

10 papers (7%) mention latency.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG , Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers Reporting This Metric

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026

Automatic Metrics MathCoding

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026

Automatic Metrics General

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026

Automatic Metrics General

Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026

Automatic Metrics Coding

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026

Automatic Metrics MathCoding

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026

Automatic Metrics General

Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026

Automatic Metrics MathMedicine

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026

Automatic Metrics General

Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026

Automatic Metrics MedicineCoding

We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026

Automatic Metrics General

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026

Automatic Metrics Coding

Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026

Automatic Metrics Coding

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026

Automatic Metrics Coding

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026

Automatic Metrics General

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026

Automatic Metrics General

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026

Human EvalAutomatic Metrics General

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026

Automatic Metrics General

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026

Automatic Metrics MedicineCoding

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026

Automatic Metrics General

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026

Automatic Metrics Medicine

To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026

Automatic Metrics MathCoding

Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026

Automatic Metrics General

On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models
Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai · Feb 25, 2026

Automatic Metrics Coding

Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks.
GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026

Automatic Metrics General

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems.
Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026

Automatic Metrics Math

Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026

Automatic Metrics Math

We validate across five benchmarks, five models from three families, and both synthetic and real data.
XMorph: Explainable Brain Tumor Analysis Via LLM-Assisted Hybrid Deep Intelligence
Sepehr Salem Ghahfarokhi, M. Moein Esfahani, Raj Sunderraman, Vince Calhoun, Mohammed Alser · Feb 24, 2026

Automatic Metrics MedicineCoding

Deep learning has significantly advanced automated brain tumor diagnosis, yet clinical adoption remains limited by interpretability and computational constraints.
Attention-Based SINR Estimation in User-Centric Non-Terrestrial Networks
Bruno De Filippo, Alessandro Guidotti, Alessandro Vanelli-Coralli · Feb 24, 2026

Automatic Metrics General

These results enable the integration of DMHSA-based estimators into scheduling procedures, allowing the evaluation of multiple candidate user groups and the selection of those offering the highest average SINR and capacity.
HELP: HyperNode Expansion and Logical Path-Guided Evidence Localization for Accurate and Efficient GraphRAG
Yuqi Huang, Ning Liao, Kai Yang, Anning Hu, Shengchao Hu · Feb 24, 2026

Automatic Metrics General

Extensive experiments demonstrate that HELP achieves competitive performance across multiple simple and multi-hop QA benchmarks and up to a 28.8$\times$ speedup over leading Graph-based RAG baselines.
Predicting Sentence Acceptability Judgments in Multimodal Contexts
Hyewon Jang, Nikolai Ilinykh, Sharid Loáiciga, Jey Han Lau, Shalom Lappin · Feb 24, 2026

Automatic Metrics General

Previous work has examined the capacity of deep neural networks (DNNs), particularly transformers, to predict human sentence acceptability judgments, both independently of context, and in document contexts.
Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs
Dhita Putri Pratama, Soyeon Caren Han, Yihao Ding · Feb 24, 2026

Automatic Metrics Coding

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning.
OrthoDiffusion: A Generalizable Multi-Task Diffusion Foundation Model for Musculoskeletal MRI Interpretation
Tian Lan, Lei Xu, Zimu Yuan, Shanggui Liu, Jiajun Liu · Feb 24, 2026

Automatic Metrics Medicine

Our evaluation demonstrates that OrthoDiffusion achieves excellent performance in the segmentation of 11 knee structures and the detection of 8 knee abnormalities.
Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams
Darvan Shvan Khairaldeen, Hossein Hassani · Feb 24, 2026

Automatic Metrics General

On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8% .
Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026

Automatic MetricsSimulation Env Coding

Inspecting Chain-of-Thought reasoning is among the most common means of understanding why an LLM produced its output.
CAMEL: Confidence-Gated Reflection for Reward Modeling
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar · Feb 24, 2026

Automatic Metrics General

Reward models play a fundamental role in aligning large language models with human preferences.
Enhancing Hate Speech Detection on Social Media: A Comparative Analysis of Machine Learning Models and Text Transformation Approaches
Saurabh Mishra, Shivani Thakur, Radhika Mamidi · Feb 24, 2026

Automatic Metrics General

The proliferation of hate speech on social media platforms has necessitated the development of effective detection and moderation tools.
Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji · Feb 24, 2026

Automatic Metrics General

Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
To Reason or Not to: Selective Chain-of-Thought in Medical Question Answering
Zaifu Zhan, Min Zeng, Shuang Zhou, Yiran Song, Xiaoyi Chen · Feb 23, 2026

Automatic Metrics Medicine

Two open-source LLMs (Llama-3.1-8B and Qwen-2.5-7B) were evaluated on four biomedical QA benchmarks-HeadQA, MedQA-USMLE, MedMCQA, and PubMedQA.
NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026

Automatic Metrics Coding

Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
Multilingual Large Language Models do not comprehend all natural languages to equal degrees
Natalia Moskvina, Raquel Montero, Masaya Yoshida, Ferdy Hubers, Paolo Morosi · Feb 23, 2026

Automatic Metrics Multilingual

Large Language Models (LLMs) play a critical role in how humans access information.
When Pretty Isn't Useful: Investigating Why Modern Text-to-Image Models Fail as Reliable Training Data Generators
Krzysztof Adamkiewicz, Brian Moser, Stanislav Frolov, Tobias Christian Nauen, Federico Raue · Feb 23, 2026

Automatic Metrics General

Recent text-to-image (T2I) diffusion models produce visually stunning images and demonstrate excellent prompt following.
Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026

Automatic Metrics MedicineCoding

The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
Classroom Final Exam: An Instructor-Tested Reasoning Benchmark
Chongyang Gao, Diji Yang, Shuyan Zhou, Xichen Yan, Luchuan Song · Feb 23, 2026

Automatic Metrics Coding

We introduce \CFE{} (\textbf{C}lassroom \textbf{F}inal \textbf{E}xam), a multimodal benchmark for evaluating the reasoning capabilities of large language models across more than 20 STEM domains.
Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026

Automatic Metrics Math

In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations
Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore · Feb 22, 2026

Automatic Metrics General

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows.
Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026

Automatic Metrics Multilingual

Personal AI agents incur substantial cost via repeated LLM calls.
ArabicNumBench: Evaluating Arabic Number Reading in Large Language Models
Anas Alhumud, Abdulaziz Alhammadi, Muhammad Badruddin Khan · Feb 21, 2026

Automatic Metrics General

We present ArabicNumBench, a comprehensive benchmark for evaluating large language models on Arabic number reading tasks across Eastern Arabic-Indic numerals (0-9 in Arabic script) and Western Arabic numerals (0-9).
Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026

Automatic Metrics Coding

Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026

Automatic MetricsSimulation Env Coding

Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.
Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026

Automatic MetricsSimulation Env General

When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao jin · Feb 20, 2026

Automatic Metrics MathCoding

Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
FENCE: A Financial and Multimodal Jailbreak Detection Dataset
Mirae Kim, Seonghun Jeong, Youngjun Kwak · Feb 20, 2026

Automatic Metrics General

A baseline detector trained on FENCE achieves 99 percent in-distribution accuracy and maintains strong performance on external benchmarks, underscoring the dataset's robustness for training reliable detection models.
Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026

Automatic Metrics Law

Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026

Automatic Metrics Math

Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
CLEF HIPE-2026: Evaluating Accurate and Efficient Person-Place Relation Extraction from Multilingual Historical Texts
Juri Opitz, Corina Raclé, Emanuela Boros, Andrianos Michail, Matteo Romanello · Feb 19, 2026

Automatic Metrics Multilingual

HIPE-2026 is a CLEF evaluation lab dedicated to person-place relation extraction from noisy, multilingual historical texts.
The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?
Jayadev Billa · Feb 19, 2026

Automatic Metrics General

Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades.
Evaluating Chain-of-Thought Reasoning through Reusability and Verifiability
Shashank Aggarwal, Ram Vikas Mishra, Amit Awekar · Feb 19, 2026

Automatic Metrics General

In multi-agent IR pipelines for tasks such as search and ranking, LLM-based agents exchange intermediate reasoning in terms of Chain-of-Thought (CoT) with each other.
What Do LLMs Associate with Your Name? A Human-Centered Black-Box Audit of Personal Data
Dimitri Staufer, Kirsten Morehouse · Feb 19, 2026

Automatic Metrics General

Large language models (LLMs), and conversational agents based on them, are exposed to personal data (PD) during pre-training and during user interactions.
Mechanistic Interpretability of Cognitive Complexity in LLMs via Linear Probing using Bloom's Taxonomy
Bianca Raimondi, Maurizio Gabbrielli · Feb 19, 2026

Automatic Metrics Coding

The black-box nature of Large Language Models necessitates novel evaluation frameworks that transcend surface-level performance metrics.
ALPS: A Diagnostic Challenge Set for Arabic Linguistic & Pragmatic Reasoning
Hussein S. Al-Olimat, Ahmad Alshareef · Feb 19, 2026

Automatic Metrics Multilingual

While recent Arabic NLP benchmarks focus on scale, they often rely on synthetic or translated data which may benefit from deeper linguistic verification.

Accuracy In CS.AI Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers Reporting This Metric

Other Metric Hubs