HFEPX Hub

Automatic Metrics + General Papers

Updated from current HFEPX corpus (Feb 27, 2026). 514 papers are grouped in this hub page. Common evaluation modes: Automatic Metrics, Simulation Env. Most common rater population: Domain Experts. Common annotation unit: Trajectory. Frequent quality control: Calibration. Frequently cited benchmark: Retrieval. Common metric signal: accuracy. Use this page to compare protocol setup, judge behavior, and labeling design decisions before running new eval experiments. Newest paper in this set is from Feb 26, 2026.

Papers: 514 Last published: Feb 26, 2026 Global RSS Tag RSS

Automatic MetricsGeneral

Research Narrative

Grounded narrative Model: deterministic-grounded Source: persisted

Updated from current HFEPX corpus (Feb 27, 2026). This page tracks 514 papers for Automatic Metrics + General Papers. Dominant protocol signals include automatic metrics, simulation environments, human evaluation, with frequent benchmark focus on Retrieval, DROP and metric focus on accuracy, cost. Use the grounded sections below to prioritize reproducible protocol choices, benchmark-matched comparisons, and judge-vs-human evaluation checks.

Why This Matters For Eval Research

14.4% of papers report explicit human-feedback signals, led by pairwise preferences.

Evidence: Moral Preferences of LLMs Under Directed Contextual Influence , LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
automatic metrics appears in 100% of papers in this hub.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Retrieval is a recurring benchmark anchor for cross-paper comparisons in this page.

Evidence: MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations , LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Protocol Takeaways

Most common quality-control signal is rater calibration (3.3% of papers).

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Rater context is mostly domain experts, and annotation is commonly trajectory-level annotation; use this to scope replication staffing.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Compare papers that report both human_eval and llm_as_judge to quantify judge-human agreement drift.

Evidence: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems , Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Benchmark Interpretation

Retrieval appears in 10.7% of hub papers (55/514); use this cohort for benchmark-matched comparisons.
DROP appears in 1.8% of hub papers (9/514); use this cohort for benchmark-matched comparisons.

Metric Interpretation

accuracy is reported in 22.8% of hub papers (117/514); compare with a secondary metric before ranking methods.
cost is reported in 7.6% of hub papers (39/514); compare with a secondary metric before ranking methods.

Researcher Checklist

Close gap on Papers with explicit human feedback. Coverage is a replication risk (14.4% vs 45% target).
Close gap on Papers reporting quality controls. Coverage is a replication risk (4.3% vs 30% target).
Close gap on Papers naming benchmarks/datasets. Coverage is a replication risk (20.2% vs 35% target).
Maintain strength on Papers naming evaluation metrics. Coverage is strong (44.9% vs 35% target).
Close gap on Papers with known rater population. Coverage is a replication risk (6.6% vs 35% target).
Close gap on Papers with known annotation unit. Coverage is a replication risk (10.3% vs 35% target).

Papers with explicit human feedback

Coverage is a replication risk (14.4% vs 45% target).

Papers reporting quality controls

Coverage is a replication risk (4.3% vs 30% target).

Papers naming benchmarks/datasets

Coverage is a replication risk (20.2% vs 35% target).

Papers naming evaluation metrics

Coverage is strong (44.9% vs 35% target).

Papers with known rater population

Coverage is a replication risk (6.6% vs 35% target).

Papers with known annotation unit

Coverage is a replication risk (10.3% vs 35% target).

Known Limitations

Only 4.3% of papers report quality controls; prioritize calibration/adjudication evidence.
Rater population is under-specified (6.6% coverage).
Narrative synthesis is grounded in metadata and abstracts only; full-paper implementation details are not parsed.

Research Utility Links

Judge vs Human Agreement - Compares papers that evaluate with both human raters and LLM judges.
Benchmark Slice: Retrieval - Prioritizes benchmark-specific protocol comparisons.
Metric Slice: accuracy - Finds papers where reported metrics are directly comparable.
IAA-Reported Evaluations - Highlights evaluations that explicitly report inter-annotator agreement.
Recent High-Signal Papers - Keeps the hub connected to the latest HFEPX corpus updates.

human_eval vs llm_as_judge

both=0, left_only=4, right_only=2

0 papers use both Human Eval and Llm As Judge.

human_eval vs automatic_metrics

both=4, left_only=0, right_only=510

4 papers use both Human Eval and Automatic Metrics.

llm_as_judge vs automatic_metrics

both=2, left_only=0, right_only=512

2 papers use both Llm As Judge and Automatic Metrics.

Benchmark Brief

Retrieval

Coverage: 55 papers (10.7%)

55 papers (10.7%) mention Retrieval.

Examples: MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations , Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training , Personalized Graph-Empowered Large Language Model for Proactive Information Access

Benchmark Brief

DROP

Coverage: 9 papers (1.8%)

9 papers (1.8%) mention DROP.

Examples: Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models , Beyond Subtokens: A Rich Character Embedding for Low-resource and Morphologically Complex Languages , Online Algorithms with Unreliable Guidance

Benchmark Brief

MMLU

Coverage: 6 papers (1.2%)

6 papers (1.2%) mention MMLU.

Examples: Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference , D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models , Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Metric Brief

accuracy

Coverage: 117 papers (22.8%)

117 papers (22.8%) mention accuracy.

Examples: Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent , Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody , Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Metric Brief

cost

Coverage: 39 papers (7.6%)

39 papers (7.6%) mention cost.

Examples: Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA , Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue , Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization

Metric Brief

Coverage: 20 papers (3.9%)

20 papers (3.9%) mention f1.

Examples: A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features , Probing for Knowledge Attribution in Large Language Models

Most Cited In This Hub

Fast path to methods with the strongest citation traction in this scope.

Papers: LLM Novice Uplift on Dual-Use, In Silico Biology Tasks , A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations , Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems

Most Recent

Fast path to latest protocol changes and newly published evaluation setups.

Best Protocol Detail

Papers with explicit rater/unit metadata and quality-control signals for reproducibility.

Top Papers

LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.
A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy · Feb 26, 2026 · Citations: 0

Automatic Metrics

Experiments on three benchmark datasets-IEMOCAP, MELD, and MOSI-show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems.
Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026 · Citations: 0

Automatic Metrics

Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.
Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations.
MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa · Feb 26, 2026 · Citations: 0

Automatic Metrics

We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models.
Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026 · Citations: 0

Automatic Metrics

In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026 · Citations: 0

Automatic Metrics

While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026 · Citations: 0

Automatic Metrics

Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee · Feb 26, 2026 · Citations: 0

Automatic Metrics

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization.
Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg, Oren Gal · Feb 26, 2026 · Citations: 0

Automatic Metrics

Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream?
OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong · Feb 26, 2026 · Citations: 0

Automatic Metrics Tool Use

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
Mohammad Yeghaneh Abkenar, Weixing Wang, Manfred Stede, Davide Picca, Mark A. Finlayson · Feb 26, 2026 · Citations: 0

Automatic Metrics

Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic.
Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian · Feb 26, 2026 · Citations: 0

Automatic Metrics

This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian.
Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026 · Citations: 0

Automatic Metrics

Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0

Critique Edit Automatic Metrics

Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple res
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0

Demonstrations Automatic Metrics

We introduce AuditBench, an alignment auditing benchmark.
Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79\% table compression and a 2
Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Siyue Su, Jian Yang, Bo Li, Guanglin Niu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents.
Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies
Shinnosuke Nozue, Yuto Nakano, Yotaro Watanabe, Meguru Takasaki, Shoji Moriya · Feb 26, 2026 · Citations: 0

Automatic Metrics

Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions.
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026 · Citations: 0

Automatic Metrics

Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang · Feb 26, 2026 · Citations: 0

Automatic Metrics

Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.
Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0

Automatic Metrics Long Horizon

Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
Ruyi2 Technical Report
Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An · Feb 26, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.
RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format
Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He · Feb 26, 2026 · Citations: 0

Automatic Metrics

Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality.
Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Samay Bhojwani, Swarnima Kain, Lisong Xu · Feb 26, 2026 · Citations: 0

Automatic Metrics

These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
Ryan Liu, Dilip Arumugam, Cedegao E. Zhang, Sean Escola, Xaq Pitkow · Feb 26, 2026 · Citations: 0

Automatic Metrics

This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms.
Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
An-Ci Peng, Kuan-Tang Huang, Tien-Hong Lo, Hung-Shin Lee, Hsin-Min Wang · Feb 26, 2026 · Citations: 0

Automatic Metrics

Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin).
Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička, Hana Bednářová · Feb 25, 2026 · Citations: 0

Automatic Metrics

The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons.
Mind the Gap in Cultural Alignment: Task-Aware Culture Management for Large Language Models
Binchi Zhang, Xujiang Zhao, Jundong Li, Haifeng Chen, Zhengzhang Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models (LLMs) are increasingly deployed in culturally sensitive real-world tasks.
A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026 · Citations: 0

Automatic Metrics

Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
Detecting Hate and Inflammatory Content in Bengali Memes: A New Multimodal Dataset and Co-Attention Framework
Rakib Ullah, Mominul islam, Md Sanjid Hossain, Md Ismail Hossain · Feb 25, 2026 · Citations: 0

Automatic Metrics

Internet memes have become a dominant form of expression on social media, including within the Bengali-speaking community.
Scaling In, Not Up? Testing Thick Citation Context Analysis with GPT-5 and Fragile Prompts
Arno Simons · Feb 25, 2026 · Citations: 0

Automatic Metrics

This paper tests whether large language models (LLMs) can support interpretative citation context analysis (CCA) by scaling in thick, text-grounded readings of a single hard case rather than scaling up typological labels.
LiCQA : A Lightweight Complex Question Answering System
Sourav Saha, Dwaipayan Roy, Mandar Mitra · Feb 25, 2026 · Citations: 0

Automatic Metrics

The results of our experiments show that LiCQA significantly outperforms these two state-of-the-art systems on benchmark data with noteworthy reduction in latency.
Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads
Kunpeng Zhang, Poppy Zhang, Shawndra Hill, Amel Awadelkarim · Feb 25, 2026 · Citations: 0

Automatic Metrics

Traditional methods often miss the nuanced interplay of these components, requiring advanced frameworks for thorough evaluation.
When AI Writes, Whose Voice Remains? Quantifying Cultural Marker Erasure Across World English Varieties in Large Language Models
Satyam Kumar Navneet, Joydeep Chandra, Yong Zhang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Language Models (LLMs) are increasingly used to ``professionalize'' workplace communication, often at the cost of linguistic identity.
Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0

Automatic Metrics Tool Use

Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
Understanding Artificial Theory of Mind: Perturbed Tasks and Reasoning in Large Language Models
Christian Nickel, Laura Schrewe, Florian Mai, Lucie Flek · Feb 25, 2026 · Citations: 0

Automatic Metrics

Theory of Mind (ToM) refers to an agent's ability to model the internal states of others.
A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary · Feb 25, 2026 · Citations: 0

Automatic Metrics

Diversity has been gaining interest in the NLP community in recent years.
CxMP: A Linguistic Minimal-Pair Benchmark for Evaluating Constructional Understanding in Language Models
Miyu Oba, Saku Sugawara · Feb 25, 2026 · Citations: 0

Automatic Metrics

Most existing benchmarks focus on judging grammatical acceptability, whereas the ability to interpret meanings conveyed by grammatical forms has received much less attention.
RADAR: Reasoning as Discrimination with Aligned Representations for LLM-based Knowledge Graph Reasoning
Bo Xue, Yuan Jin, Luoyi Fu, Jiaxin Ding, Xinbing Wang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Across four benchmarks, RADAR achieves 5-6% relative gains on link prediction and triple classification over strong LLM baselines, while increasing task-relevant mutual information in intermediate representations by 62.9%, indicating more r
Large Language Models are Algorithmically Blind
Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote · Feb 25, 2026 · Citations: 0

Automatic Metrics

Models produce ranges far wider than true confidence intervals yet still fail to contain the true algorithmic mean in the majority of instances; most perform worse than random guessing and the marginal above-random performance of the best m
DynamicGTR: Leveraging Graph Topology Representation Preferences to Boost VLM Capabilities on Graph QAs
Yanbin Wei, Jiangyue Yan, Chun Kang, Yang Chen, Hua Liu · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

This ``one-size-fits-all'' strategy often neglects model-specific and task-specific preferences, resulting in inaccurate or over-lengthy responses to graph-related queries.
Personalized Graph-Empowered Large Language Model for Proactive Information Access
Chia Cheng Chang, An-Zi Yen, Hen-Hsen Huang, Hsin-Hsi Chen · Feb 25, 2026 · Citations: 0

Automatic Metrics

Since individuals may struggle to recall all life details and often confuse events, establishing a system to assist users in recalling forgotten experiences is essential.
Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026 · Citations: 0

Human EvalAutomatic Metrics

Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
FewMMBench: A Benchmark for Multimodal Few-Shot Learning
Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

In this paper, we introduce FewMMBench, a comprehensive benchmark designed to evaluate MLLMs under few-shot conditions, with a focus on In-Context Learning (ICL) and Chain-of-Thought (CoT) prompting.
JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem
Heejin Jo · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large language models consistently fail the "car wash problem," a viral reasoning benchmark requiring implicit physical constraint inference.
D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

Chain-of-Thought (CoT) distillation from Large Language Models (LLMs) often induces "overthinking" in Small Language Models (SLMs), leading to performance degradation and excessive token consumption.
fEDM+: A Risk-Based Fuzzy Ethical Decision Making Framework with Principle-Level Explainability and Pluralistic Validation
Abeer Dyoub, Francesca A. Lisi · Feb 25, 2026 · Citations: 0

Automatic Metrics

In a previous work, we introduced the fuzzy Ethical Decision-Making framework (fEDM), a risk-based ethical reasoning architecture grounded in fuzzy logic.
The ASIR Courage Model: A Phase-Dynamic Framework for Truth Transitions in Human and AI Systems
Hyo Jin Kim · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Although initially formulated for human truth-telling under asymmetric stakes, the same phase-dynamic architecture extends to AI systems operating under policy constraints and alignment filters.
Explore-on-Graph: Incentivizing Autonomous Exploration of Large Language Models on Knowledge Graphs with Path-refined Reward Modeling
Shiqi Yan, Yubo Chen, Ruiqi Zhou, Zhengxi Yao, Shuai Chen · Feb 25, 2026 · Citations: 0

Demonstrations Automatic Metrics

Extensive experiments on five KGQA benchmark datasets demonstrate that, to the best of our knowledge, our method achieves state-of-the-art performance, outperforming not only open-source but also even closed-source LLMs.
Evaluating the relationship between regularity and learnability in recursive numeral systems using Reinforcement Learning
Andrea Silvi, Ponrawee Prasertsom, Jennifer Culbertson, Devdatt Dubhashi, Moa Johansson · Feb 25, 2026 · Citations: 0

Automatic Metrics

Human recursive numeral systems (i.e., counting systems such as English base-10 numerals), like many other grammatical systems, are highly regular.
Dynamic Multimodal Activation Steering for Hallucination Mitigation in Large Vision-Language Models
Jianghao Yin, Qin Chen, Kedi Chen, Jie Zhou, Xingjiao Wu · Feb 25, 2026 · Citations: 0

Automatic Metrics

Large Vision-Language Models (LVLMs) exhibit outstanding performance on vision-language tasks but struggle with hallucination problems.
Hierarchical LLM-Based Multi-Agent Framework with Prompt Optimization for Multi-Robot Task Planning
Tomoya Kawabe, Rin Takano · Feb 25, 2026 · Citations: 0

Automatic Metrics Long Horizon

We present a hierarchical multi-agent LLM-based planner with prompt optimization: an upper layer decomposes tasks and assigns them to lower-layer agents, which generate PDDL problems solved by a classical planner.
DWA-KD: Dual-Space Weighting and Time-Warped Alignment for Cross-Tokenizer Knowledge Distillation
Duc Trung Vu, Pham Khanh Chi, Dat Phi Van, Linh Ngo Van, Sang Dinh · Feb 25, 2026 · Citations: 0

Automatic Metrics

Extensive experiments across diverse NLP benchmarks demonstrate that DWA-KD outperforms state-of-the-art KD baselines, while ablation studies confirm the complementary contributions of entropy-based token weighting and embedding and final h
CCCaption: Dual-Reward Reinforcement Learning for Complete and Correct Image Captioning
Zhijiang Tang, Linhua Wang, Jiaxin Qi, Weihao Jiang, Peng Hou · Feb 25, 2026 · Citations: 0

Pairwise Preference Automatic Metrics

Image captioning remains a fundamental task for vision language understanding, yet ground-truth supervision still relies predominantly on human-annotated references.

Automatic Metrics + General Papers

Research Narrative

Why This Matters For Eval Research

Protocol Takeaways

Benchmark Interpretation

Metric Interpretation

Researcher Checklist

Suggested Reading Order

Known Limitations

Research Utility Links

Top Papers

Related Hubs