HFEPX Hub

Automatic Metrics Papers

Updated from current HFEPX corpus (Feb 26, 2026). 876 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Frequently cited benchmark: retrieval. Common metric signal: accuracy. Newest paper in this set is from Feb 25, 2026.

Papers: 876 Last published: Feb 25, 2026 Global RSS Tag RSS

Automatic Metrics

Why This Matters For Eval Research

Common evaluation patterns here: Automatic Metrics, Simulation Env.
Benchmark signals emphasize: retrieval, MATH.
Top reported metrics include: accuracy, cost.

Research Utility Snapshot

Human Feedback Mix

Pairwise Preference (51)
Expert Verification (19)
Red Team (19)
Critique Edit (16)

Evaluation Modes

Automatic Metrics (876)
Simulation Env (20)
Human Eval (9)
Llm As Judge (2)

Top Benchmarks

Retrieval (86)
MATH (28)
MMLU (15)
GSM8K (11)

Top Metrics

Accuracy (218)
Cost (66)
F1 (32)
Latency (31)

Top Papers

Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0

Automatic Metrics

The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0

Automatic Metrics

Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script.
Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0

Automatic Metrics

We study reasoning for accessing world knowledge stored in a language model's parameters.
GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0

Automatic Metrics

The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity.
NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
IndicIFEval: A Benchmark for Verifiable Instruction-Following Evaluation in 14 Indic Languages
Thanmay Jayakumar, Mohammed Safi Ur Rahman Khan, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan · Feb 25, 2026 · Citations: 0

Automatic Metrics

Instruction-following benchmarks remain predominantly English-centric, leaving a critical evaluation gap for the hundreds of millions of Indic language speakers.
SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics Tool Use

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0

Automatic Metrics

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu · Feb 25, 2026 · Citations: 0

Automatic Metrics

We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United Sta
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary · Feb 25, 2026 · Citations: 0

Automatic Metrics

Diversity has been gaining interest in the NLP community in recent years.
CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
Miyu Oba, Saku Sugawara · Feb 25, 2026 · Citations: 0

Automatic Metrics

Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention.
RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more r
MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote · Feb 25, 2026 · Citations: 0

Automatic Metrics

Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best m
Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0

Automatic MetricsSimulation Env

Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
ExpLang: Improved Exploration and Exploitation in LLM Reasoning with On-Policy Thinking Language Selection
Changjiang Gao, Zixian Huang, Kaichen Yang, Jiajun Chen, Jixing Li · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Analysis shows that, by enabling on-policy thinking language selection as an action during RL, ExpLang effectively extends the RL exploration space with diversified language preference and improves the RL exploitation outcome with leveraged
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Human EvalAutomatic Metrics

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
xai-cola: A Python library for sparsifying counterfactual explanations
Lin Zhu, Lei You · Feb 25, 2026 · Citations: 0

Automatic Metrics

Counterfactual explanation (CE) is an important domain within post-hoc explainability.
JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit · Feb 25, 2026 · Citations: 0

Automatic Metrics

We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis.
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption.
fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0

Automatic Metrics

In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
MD. Sagor Chowdhury, Adiba Fairooz Chowdhury · Feb 25, 2026 · Citations: 0

Automatic Metrics

We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng · Feb 25, 2026 · Citations: 0

Automatic Metrics

Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement
SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0

Expert Verification Automatic Metrics

Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final h
Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.
Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
PPCR-IM: A System for Multi-layer DAG-based Public Policy Consequence Reasoning and Social Indicator Mapping
Zichen Song, Weijia Li · Feb 25, 2026 · Citations: 0

Automatic Metrics

For each policy episode, the system outputs a structured record containing the DAG, indicator mappings, and three evaluation measures: an expected-indicator coverage score, a discovery rate for overlooked but relevant indicators, and a rela
Mitigating Structural Noise in Low-Resource S2TT: An Optimized Cascaded Nepali-English Pipeline with Punctuation Restoration
Tangsang Chongbang, Pranesh Pyara Shrestha, Amrit Sarki, Anku Jaiswal · Feb 25, 2026 · Citations: 0

Automatic Metrics

We first establish highly proficient ASR and NMT components: a Wav2Vec2-XLS-R-300m model achieved a state-of-the-art 2.72% CER on OpenSLR-54, and a multi-stage fine-tuned MarianMT model reached a 28.32 BLEU score on the FLORES-200 benchmark
Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
Multi-dimensional Assessment and Explainable Feedback for Counselor Responses to Client Resistance in Text-based Counseling with LLMs
Anqi Li, Ruihan Wang, Zhaoming Chen, Yuqian Chen, Yu Lu · Feb 25, 2026 · Citations: 0

Automatic Metrics

Although current NLP research has examined overall counseling quality and general therapeutic skills, it fails to provide granular evaluations of high-stakes moments where clients exhibit resistance.
RuCL: Stratified Rubric-Based Curriculum Learning for Multimodal Large Language Model Reasoning
Yukun Chen, Jiaming Li, Longze Chen, Ze Gong, Jingpeng Li · Feb 25, 2026 · Citations: 0

Rubric Rating Automatic Metrics

Extensive experiments on various visual reasoning benchmarks show that RuCL yields a remarkable +7.83% average improvement over the Qwen2.5-VL-7B model, achieving a state-of-the-art accuracy of 60.06%.
When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
Muku Akasaka, Soyeon Caren Han · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks.
Virtual Biopsy for Intracranial Tumors Diagnosis on MRI
Xinzhe Luo, Shuai Shao, Yan Wang, Jiangtao Wang, Yutong Bai · Feb 25, 2026 · Citations: 0

Automatic Metrics

To address these challenges, we construct the ICT-MRI dataset - the first public biopsy-verified benchmark with 249 cases across four categories.
Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026 · Citations: 0

Human EvalAutomatic Metrics

We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.
Exploring Human-Machine Coexistence in Symmetrical Reality
Zhenliang Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

In the context of the evolution of artificial intelligence (AI), the interaction between humans and AI entities has become increasingly salient, challenging the conventional human-centric paradigms of human-machine interaction.
Power and Limitations of Aggregation in Compound AI Systems
Nivasini Ananthakrishnan, Meena Jagadeesan · Feb 25, 2026 · Citations: 0

Automatic Metrics

In this work, we investigate the power and limitations of aggregation within a stylized principal-agent framework.
Revisiting RAG Retrievers: An Information Theoretic Benchmark
Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert · Feb 25, 2026 · Citations: 0

Automatic Metrics

Existing benchmarks primarily compare entire RAG pipelines or introduce new datasets, providing little guidance on selecting or combining retrievers themselves.
From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026 · Citations: 0

Automatic Metrics

On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
Enhancing Multilingual Embeddings via Multi-Way Parallel Text Alignment
Barah Fazili, Koustava Goswami · Feb 25, 2026 · Citations: 0

Automatic Metrics

This leads to substantial performance gains across both seen and unseen languages for multiple tasks from the MTEB benchmark evaluated for XLM-Roberta and multilingual BERT base models.
One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models
Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks.
Training Generalizable Collaborative Agents via Strategic Risk Aversion
Chengrui Qu, Yizhou Zhang, Nicholas Lanzetti, Eric Mazumdar · Feb 25, 2026 · Citations: 0

Automatic Metrics Multi Agent

Many emerging agentic paradigms require agents to collaborate with one another (or people) to achieve shared goals.
Beyond Refusal: Probing the Limits of Agentic Self-Correction for Semantic Sensitive Information
Umid Suleymanov, Zaur Rajabov, Emil Mirzazada, Murat Kantarcioglu · Feb 25, 2026 · Citations: 0

Critique Edit Automatic Metrics

To address this, we introduce SemSIEdit, an inference-time framework where an agentic "Editor" iteratively critiques and rewrites sensitive spans to preserve narrative flow rather than simply refusing to answer.
GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems.

Automatic Metrics Papers

Why This Matters For Eval Research

Research Utility Snapshot

Top Papers

Related Hubs