- Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang · Feb 26, 2026 · Citations: 0
Automatic Metrics
With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and t
- LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.
- A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems.
- SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0
Automatic Metrics
Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026 · Citations: 0
Automatic Metrics
Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations.
- MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa · Feb 26, 2026 · Citations: 0
Automatic Metrics
We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models.
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
- Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa · Feb 26, 2026 · Citations: 0
Automatic Metrics
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture.
- Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
- Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026 · Citations: 0
Automatic Metrics
Ultimately, this work outlines a highly optimized dual pipeline achieving a ~0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
- Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department
Gabriela Anna Kaczmarek, Pietro Ferrazzi, Lorenzo Porta, Vicky Rubini, Bernardo Magnini · Feb 26, 2026 · Citations: 0
Automatic Metrics
We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee · Feb 26, 2026 · Citations: 0
Automatic Metrics
Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization.
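The unit-sum constraint mentioned above can be seen directly in a minimal NumPy sketch of standard softmax attention (an illustration of the baseline the paper modifies, not of the proposed affine-scaled variant; the function name and shapes are assumptions for the example):

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard softmax attention: weights are non-negative
    and sum to exactly 1 for every query position."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row max for numerical stability before exponentiating.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))  # 3 tensors of 4 tokens x 8 dims
out, w = softmax_attention(q, k, v)
print(np.allclose(w.sum(axis=-1), 1.0))  # unit-sum constraint holds
```

Relaxing this normalization, as the title suggests, means replacing the fixed unit-sum rows with a more flexible scaling.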
- Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher · Feb 26, 2026 · Citations: 0
Automatic Metrics
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
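The core idea (remapping token IDs so the most frequent tokens get the smallest values, which Zipf-distributed text makes highly compressible) can be sketched as follows; the function name and toy data are hypothetical, not taken from the paper:

```python
from collections import Counter

def frequency_order(tokens):
    """Remap tokens so the most frequent token gets ID 0, the next ID 1, etc.
    Under Zipf's law, small IDs then dominate, aiding entropy coders."""
    rank = {tok: i for i, (tok, _) in enumerate(Counter(tokens).most_common())}
    return [rank[t] for t in tokens]

text = "the cat sat on the mat and the cat ran".split()
ids = frequency_order(text)
print(ids)  # → [0, 1, 2, 3, 0, 4, 5, 0, 1, 6]  ("the" is most frequent, so ID 0)
```

A downstream lossless compressor then benefits because the ID stream is skewed toward small, frequently repeated values.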
- Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg, Oren Gal · Feb 26, 2026 · Citations: 0
Automatic Metrics
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream?
- NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
On the SlimOrca benchmark, NoRA breaks this linear barrier: remarkably, NoRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
- OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong · Feb 26, 2026 · Citations: 0
Automatic Metrics Tool Use
Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding.
- Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan · Feb 26, 2026 · Citations: 0
Automatic Metrics
Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation.
- Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
Mohammad Yeghaneh Abkenar, Weixing Wang, Manfred Stede, Davide Picca, Mark A. Finlayson · Feb 26, 2026 · Citations: 0
Automatic Metrics
Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic.
- Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
- TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods.
- TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian · Feb 26, 2026 · Citations: 0
Automatic Metrics
This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian.
- Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0
Critique Edit Automatic Metrics
NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
- Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026 · Citations: 0
Automatic Metrics
Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
- Imagination Helps Visual Reasoning, But Not Yet in Latent Space
You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Latent visual reasoning aims to mimic the human imagination process by mediating it through the hidden states of Multimodal Large Language Models.
- Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0
Critique Edit Automatic Metrics
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple res
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
We introduce AuditBench, an alignment auditing benchmark.
- Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Jakub Šmíd, Pavel Přibáň, Pavel Král · Feb 26, 2026 · Citations: 0
Automatic Metrics
The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
- Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2
- Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Siyue Su, Jian Yang, Bo Li, Guanglin Niu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
- Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents.
- Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies
Shinnosuke Nozue, Yuto Nakano, Yotaro Watanabe, Meguru Takasaki, Shoji Moriya · Feb 26, 2026 · Citations: 0
Automatic Metrics
Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions.
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- dLLM: Simple Diffusion Language Modeling
Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026 · Citations: 0
Automatic Metrics
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0
Automatic Metrics
In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
- ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
- pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.
- TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
- Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026 · Citations: 0
Automatic Metrics
We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
- Ruyi2 Technical Report
Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.
- RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format
Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality.
- Dynamic Level Sets
Michael Stephen Fiske · Feb 26, 2026 · Citations: 0
Automatic Metrics
A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester).
- Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Samay Bhojwani, Swarnima Kain, Lisong Xu · Feb 26, 2026 · Citations: 0
Automatic Metrics
These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
- Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
Ryan Liu, Dilip Arumugam, Cedegao E. Zhang, Sean Escola, Xaq Pitkow · Feb 26, 2026 · Citations: 0
Automatic Metrics
This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms.
- Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
An-Ci Peng, Kuan-Tang Huang, Tien-Hong Lo, Hung-Shin Lee, Hsin-Min Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin).
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0
Automatic Metrics
We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline, from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
- Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička, Hana Bednářová · Feb 25, 2026 · Citations: 0
Automatic Metrics
The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons.
- Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0
Automatic Metrics
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
- SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0
Automatic Metrics
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script.
- Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0
Automatic Metrics
We study reasoning for accessing world knowledge stored in a language model's parameters.