HFEPX Hub

CS.AI + Automatic Metrics Papers

Updated from current HFEPX corpus (Feb 27, 2026). 533 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 533 Last published: Feb 26, 2026 Global RSS Tag RSS

Cs.AIAutomatic Metrics

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 533 papers for CS.AI + Automatic Metrics Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, MATH and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

13.7% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence , LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
automatic metrics appears in 100% of papers in this hub.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , MoDora: Tree-Based Semi-Structured Document Analysis System , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought , LLM Novice Uplift on Dual-Use, In Silico Biology Tasks

Protocol Takeaways

Most common quality-control signal is rater calibration (2.8% of papers).

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pair this hub with llm_as_judge pages to benchmark automated-vs-human evaluation tradeoffs.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Benchmark Interpretation

Retrieval appears in 9.9% of hub papers (53/533); use this cohort for benchmark-matched comparisons.
MATH appears in 2.3% of hub papers (12/533); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 26.6% of hub papers (142/533); compare with a secondary metric before ranking methods.
cost is reported in 8.6% of hub papers (46/533); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (13.7% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (3.9% vs 30% target).
Tighten coverage on Papers naming benchmarks/datasets. Coverage is usable but incomplete (24% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (49% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (10.3% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (9.6% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (13.7% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (3.9% vs 30% target).

Papers naming benchmarks/datasets

Coverage is usable but incomplete (24% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (49% vs 35% target).

Papers with known rater population

Coverage is a replication risk (10.3% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (9.6% vs 35% target).

Known Limitations

Only 3.9% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (10.3% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Human Eval Protocols - Surfaces human-rating workflows for rubric and annotator quality analysis.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs automatic_metrics

both=3, left_only=0, right_only=530

3 papers use both Human Eval and Automatic Metrics.

automatic_metrics vs simulation_env

both=13, left_only=520, right_only=0

13 papers use both Automatic Metrics and Simulation Env.

simulation_env vs human_eval

both=0, left_only=13, right_only=3

0 papers use both Simulation Env and Human Eval.

Benchmark Brief

Retrieval

Coverage: 53 papers (9.9%)

53 papers (9.9%) mention Retrieval.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , MoDora: Tree-Based Semi-Structured Document Analysis System , TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought

Benchmark Brief

MATH

Coverage: 12 papers (2.3%)

12 papers (2.3%) mention MATH.

Examples: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning , Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards , RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models

Benchmark Brief

DROP

Coverage: 8 papers (1.5%)

8 papers (1.5%) mention DROP.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models , Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration

Metric Brief

accuracy

Coverage: 142 papers (26.6%)

142 papers (26.6%) mention accuracy.

Examples: AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Metric Brief

cost

Coverage: 46 papers (8.6%)

46 papers (8.6%) mention cost.

Examples: Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching , Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue , How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?

Metric Brief

Coverage: 20 papers (3.8%)

20 papers (3.8%) mention f1.

Examples: SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , Probing for Knowledge Attribution in Large Language Models , A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables , AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Papers: An artificial intelligence framework for end-to-end rare disease phenotyping from clinical notes using large language models

Top Papers

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.
SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0

Automatic Metrics

Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0

Automatic Metrics Multi Agent

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0

Automatic Metrics

Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa · Feb 26, 2026 · Citations: 0

Automatic Metrics

Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture.
Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026 · Citations: 0

Automatic Metrics

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026 · Citations: 0

Automatic Metrics

Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0

Automatic Metrics

Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee · Feb 26, 2026 · Citations: 0

Automatic Metrics

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization.
NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0

Automatic Metrics

On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong · Feb 26, 2026 · Citations: 0

Automatic Metrics Tool Use

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods.
Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0

Critique Edit Automatic Metrics

NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026 · Citations: 0

Automatic Metrics

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Siyue Su, Jian Yang, Bo Li, Guanglin Niu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents.
dLLM: Simple Diffusion Language Modeling
Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs.
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen · Feb 26, 2026 · Citations: 0

Automatic Metrics

Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0

Automatic Metrics

Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
Ruyi2 Technical Report
Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.
Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Samay Bhojwani, Swarnima Kain, Lisong Xu · Feb 26, 2026 · Citations: 0

Automatic Metrics

These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
Ryan Liu, Dilip Arumugam, Cedegao E. Zhang, Sean Escola, Xaq Pitkow · Feb 26, 2026 · Citations: 0

Automatic Metrics

This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms.
Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
An-Ci Peng, Kuan-Tang Huang, Tien-Hong Lo, Hung-Shin Lee, Hsin-Min Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin).
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0

Automatic Metrics

We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička, Hana Bednářová · Feb 25, 2026 · Citations: 0

Automatic Metrics

The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons.
VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan, Xue · Feb 25, 2026 · Citations: 0

Automatic Metrics

An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026 · Citations: 0

Automatic Metrics

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
Arno Simons · Feb 25, 2026 · Citations: 0

Automatic Metrics

This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels.
Decoder-based Sense Knowledge Distillation
Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0

Automatic Metrics

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads
Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim · Feb 25, 2026 · Citations: 0

Automatic Metrics

Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation.
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity.
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0

Automatic Metrics

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Human EvalAutomatic Metrics

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
xai-cola: A Python library for sparsifying counterfactual explanations
Lin Zhu, Lei You · Feb 25, 2026 · Citations: 0

Automatic Metrics

Counterfactual explanation (CE) is an important domain within post-hoc explainability.
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0

Automatic Metrics

In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng · Feb 25, 2026 · Citations: 0

Automatic Metrics

Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
PPCR-IM: A System for Multi-layer DAG-based Public Policy Consequence Reasoning and Social Indicator Mapping
Zichen Song, Weijia Li · Feb 25, 2026 · Citations: 0

Automatic Metrics

For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a rela
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal · Feb 25, 2026 · Citations: 0

Automatic Metrics

We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026 · Citations: 0

Automatic Metrics

To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.

CS.AI + Automatic Metrics Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs