- Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang · Feb 26, 2026 · Citations: 0
Automatic Metrics
With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and t
- LLM Novice Uplift on Dual-Use, In Silico Biology Tasks
Chen Bo Calvin Zhang, Christina Q. Knight, Nicholas Kruus, Jason Hausenloy, Pedro Medeiros · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large language models (LLMs) perform increasingly well on biology benchmarks, but it remains unclear whether they uplift novice users -- i.e., enable humans to perform better than with internet-only resources.
- A Mixture-of-Experts Model for Multimodal Emotion Recognition in Conversations
Soumya Dutta, Smruthi Balaji, Sriram Ganapathy · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that our proposal achieves 70.9%, 69.5%, and 87.9% weighted F1-scores respectively, outperforming several baseline speech-text ERC systems.
- SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0
Automatic Metrics
Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems
Siyuan Liu, Jiahui Xu, Feng Jiang, Kuang Wang, Zefeng Zhao · Feb 26, 2026 · Citations: 0
Automatic Metrics
Achieving human-like responsiveness is a critical yet challenging goal for cascaded spoken dialogue systems.
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Chungpa Lee, Jy-yong Sohn, Kangwook Lee · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations.
- MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations
Sara Rosenthal, Yannis Katsis, Vraj Shah, Lihong He, Lucian Popa · Feb 26, 2026 · Citations: 0
Automatic Metrics
We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval augmented generation, a popular use of large language models.
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
- Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa · Feb 26, 2026 · Citations: 0
Automatic Metrics
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture.
- Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
- Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody
Yuqi Shi, Hao Yang, Xiyao Lu, Jinsong Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
While second language (L2) learners may acquire target syntactic word order, mapping this syntax onto appropriate prosodic structures remains a persistent challenge.
- Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment
Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan · Feb 26, 2026 · Citations: 0
Automatic Metrics
Ultimately, this work outlines a highly optimized dual pipeline achieving a ~0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
- Toward Automatic Filling of Case Report Forms: A Case Study on Data from an Italian Emergency Department
Gabriela Anna Kaczmarek, Pietro Ferrazzi, Lorenzo Porta, Vicky Rubini, Bernardo Magnini · Feb 26, 2026 · Citations: 0
Automatic Metrics
We provide an analysis of the data, define the CRF-filling task and metric for its evaluation, and report on pilot experiments where we use an open-source state-of-the-art LLM to automatically execute the task.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention
Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee · Feb 26, 2026 · Citations: 0
Automatic Metrics
Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization.
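The unit-sum constraint mentioned above can be seen directly in a minimal NumPy sketch of standard softmax attention (an illustration of the baseline the paper modifies, not of the proposed affine-scaled variant; the function name and shapes are assumptions for the example):

```python
import numpy as np

def softmax_attention(q, k, v):
    """Standard softmax attention: weights are non-negative
    and sum to exactly 1 for every query position."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Subtract the row max for numerical stability before exponentiating.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 4, 8))  # 3 tensors of 4 tokens x 8 dims
out, w = softmax_attention(q, k, v)
print(np.allclose(w.sum(axis=-1), 1.0))  # unit-sum constraint holds
```

Relaxing this normalization, as the title suggests, means replacing the fixed unit-sum rows with a more flexible scaling.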
- Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher · Feb 26, 2026 · Citations: 0
Automatic Metrics
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
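The core idea (remapping token IDs so the most frequent tokens get the smallest values, which Zipf-distributed text makes highly compressible) can be sketched as follows; the function name and toy data are hypothetical, not taken from the paper:

```python
from collections import Counter

def frequency_order(tokens):
    """Remap tokens so the most frequent token gets ID 0, the next ID 1, etc.
    Under Zipf's law, small IDs then dominate, aiding entropy coders."""
    rank = {tok: i for i, (tok, _) in enumerate(Counter(tokens).most_common())}
    return [rank[t] for t in tokens]

text = "the cat sat on the mat and the cat ran".split()
ids = frequency_order(text)
print(ids)  # → [0, 1, 2, 3, 0, 4, 5, 0, 1, 6]  ("the" is most frequent, so ID 0)
```

A downstream lossless compressor then benefits because the ID stream is skewed toward small, frequently repeated values.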
- Where Vision Becomes Text: Locating the OCR Routing Bottleneck in Vision-Language Models
Jonathan Steinberg, Oren Gal · Feb 26, 2026 · Citations: 0
Automatic Metrics
Vision-language models (VLMs) can read text from images, but where does this optical character recognition (OCR) information enter the language processing stream?
- NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
On the SlimOrca benchmark, NoRA breaks this linear barrier: remarkably, NoRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
- OmniGAIA: Towards Native Omni-Modal AI Agents
Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Shijian Wang, Guanting Dong · Feb 26, 2026 · Citations: 0
Automatic Metrics Tool Use
Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding.
- Effective QA-driven Annotation of Predicate-Argument Relations Across Languages
Jonathan Davidov, Aviv Slobodkin, Shmuel Tomi Klein, Reut Tsarfaty, Ido Dagan · Feb 26, 2026 · Citations: 0
Automatic Metrics
Explicit representations of predicate-argument relations form the basis of interpretable semantic analysis, supporting reasoning, generation, and evaluation.
- Improving Neural Argumentative Stance Classification in Controversial Topics with Emotion-Lexicon Features
Mohammad Yeghaneh Abkenar, Weixing Wang, Manfred Stede, Davide Picca, Mark A. Finlayson · Feb 26, 2026 · Citations: 0
Automatic Metrics
Argumentation mining comprises several subtasks, among which stance classification focuses on identifying the standpoint expressed in an argumentative text toward a specific target topic.
- Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie · Feb 26, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences.
- TCM-DiffRAG: Personalized Syndrome Differentiation Reasoning Method for Traditional Chinese Medicine based on Knowledge Graph and Chain of Thought
Jianmin Li, Ying Chang, Su-Kit Tang, Yujia Liu, Yanwen Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Additionally, TCM-DiffRAG outperformed directly supervised fine-tuned (SFT) LLMs and other benchmark RAG methods.
- TARAZ: Persian Short-Answer Question Benchmark for Cultural Evaluation of Language Models
Reihaneh Iranmanesh, Saeedeh Davoudi, Pasha Abrishamchian, Ophir Frieder, Nazli Goharian · Feb 26, 2026 · Citations: 0
Automatic Metrics
This paper presents a comprehensive evaluation framework for assessing the cultural competence of large language models (LLMs) in Persian.
- Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0
Critique Edit Automatic Metrics
NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
- Probing for Knowledge Attribution in Large Language Models
Ivo Brink, Alexander Boer, Dennis Ulmer · Feb 26, 2026 · Citations: 0
Automatic Metrics
Probes trained on AttriWiki data reveal a strong attribution signal, achieving up to 0.96 Macro-F1 on Llama-3.1-8B, Mistral-7B, and Qwen-7B, transferring to out-of-domain benchmarks (SQuAD, WebQuestions) with 0.94-0.99 Macro-F1 without retr
- Imagination Helps Visual Reasoning, But Not Yet in Latent Space
You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Latent visual reasoning aims to mimic the human imagination process by mediating it through the hidden states of Multimodal Large Language Models.
- Towards Better RL Training Data Utilization via Second-Order Rollout
Zhe Yang, Yudong Wang, Rang Li, Zhifang Sui · Feb 26, 2026 · Citations: 0
Critique Edit Automatic Metrics
Reinforcement Learning (RL) has empowered Large Language Models (LLMs) with strong reasoning capabilities, but vanilla RL mainly focuses on generation capability improvement by training with only first-order rollout (generating multiple res
- AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors
Abhay Sheshadri, Aidan Ewart, Kai Fronsdal, Isha Gupta, Samuel R. Bowman · Feb 26, 2026 · Citations: 0
Demonstrations Automatic Metrics
We introduce AuditBench, an alignment auditing benchmark.
- Extending Czech Aspect-Based Sentiment Analysis with Opinion Terms: Dataset and LLM Benchmarks
Jakub Šmíd, Pavel Přibáň, Pavel Král · Feb 26, 2026 · Citations: 0
Automatic Metrics
The dataset establishes a new benchmark for Czech ABSA, and our proposed translation-alignment approach offers a scalable solution for adapting ABSA resources to other low-resource languages.
- Replacing Multi-Step Assembly of Data Preparation Pipelines with One-Step LLM Pipeline Generation for Table QA
Fengyu Li, Junhao Zhu, Kaishi Song, Lu Chen, Zhongming Yao · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on two benchmark datasets show that, with the same LLM backbone, Operation-R1 achieves average absolute accuracy gains of 9.55 and 6.08 percentage points over multi-step preparation baselines, with 79% table compression and a 2
- Tokenization, Fusion and Decoupling: Bridging the Granularity Mismatch Between Large Language Models and Knowledge Graphs
Siyue Su, Jian Yang, Bo Li, Guanglin Niu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experimental results show that KGT consistently outperforms state-of-the-art methods across multiple benchmarks.
- Reinforcing Real-world Service Agents: Balancing Utility and Cost in Task-oriented Dialogue
Ning Gao, Wei Zhang, Yuqin Dai, Ling Shi, Ziyin Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
The rapid evolution of Large Language Models (LLMs) has accelerated the transition from conversational chatbots to general agents.
- Enhancing Persuasive Dialogue Agents by Synthesizing Cross-Disciplinary Communication Strategies
Shinnosuke Nozue, Yuto Nakano, Yotaro Watanabe, Meguru Takasaki, Shoji Moriya · Feb 26, 2026 · Citations: 0
Automatic Metrics
Current approaches to developing persuasive dialogue agents often rely on a limited set of predefined persuasive strategies that fail to capture the complexity of real-world interactions.
- Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Qianben Chen, Tianrui Qin, King Zhu, Qiexiang Wang, Chengjun Yu · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Recent deep research agents primarily improve performance by scaling reasoning depth, but this leads to high inference cost and latency in search-intensive scenarios.
- dLLM: Simple Diffusion Language Modeling
Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026 · Citations: 0
Automatic Metrics
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0
Automatic Metrics
In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
- ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
- pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.
- TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion
Donghong Cai, Jiarui Feng, Yanbo Wang, Da Zheng, Yixin Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on diverse benchmarks demonstrate the effectiveness of TabDLM compared to strong diffusion- and LLM-based baselines.
- Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA
Wenwei Li, Ming Xu, Tianle Xia, Lingxiang Hu, Yiding Sun · Feb 26, 2026 · Citations: 0
Automatic Metrics
We propose a reinforced co-adaptation framework that jointly optimizes retrieval and generation through two components: (1) Graph-aware Retrieval (GraphRAG), which models entity-relation structure over a high-citation knowledge subgraph for
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
- Ruyi2 Technical Report
Huan Song, Shuyu Tian, Junyi Hao, Minxiu Xu, Hongjun An · Feb 26, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) face significant challenges regarding deployment costs and latency, necessitating adaptive computing strategies.
- RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format
Zhehao Huang, Yuhang Liu, Baijiong Lin, Yixin Lou, Zhengbao He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across four instruction-following benchmarks and nine reasoning & general capability benchmarks, RAIN-Merging substantially improves instruction adherence while maintaining reasoning quality.
- Dynamic Level Sets
Michael Stephen Fiske · Feb 26, 2026 · Citations: 0
Automatic Metrics
A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester).
- Iterative Prompt Refinement for Dyslexia-Friendly Text Summarization Using GPT-4o
Samay Bhojwani, Swarnima Kain, Lisong Xu · Feb 26, 2026 · Citations: 0
Automatic Metrics
These findings establish an empirical baseline for accessibility-driven NLP summarization and motivate further human-centered evaluation with dyslexic readers.
- Cognitive Models and AI Algorithms Provide Templates for Designing Language Agents
Ryan Liu, Dilip Arumugam, Cedegao E. Zhang, Sean Escola, Xaq Pitkow · Feb 26, 2026 · Citations: 0
Automatic Metrics
This position paper argues that potential blueprints for designing such modular language agents can be found in the existing literature on cognitive models and artificial intelligence (AI) algorithms.
- Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing
An-Ci Peng, Kuan-Tang Huang, Tien-Hong Lo, Hung-Shin Lee, Hsin-Min Wang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Taiwanese Hakka is a low-resource, endangered language that poses significant challenges for automatic speech recognition (ASR), including high dialectal variability and the presence of two distinct writing systems (Hanzi and Pinyin).
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0
Automatic Metrics
We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection accuracy over the baseline, from 0.669 to 0.785 with GPT-5 and from 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
- Sydney Telling Fables on AI and Humans: A Corpus Tracing Memetic Transfer of Persona between LLMs
Jiří Milička, Hana Bednářová · Feb 25, 2026 · Citations: 0
Automatic Metrics
The way LLM-based entities conceive of the relationship between AI and humans is an important topic for both cultural and safety reasons.
- Recovered in Translation: Efficient Pipeline for Automated Translation of Benchmarks and Datasets
Hanna Yukhymenko, Anton Alexandrov, Martin Vechev · Feb 25, 2026 · Citations: 0
Automatic Metrics
The reliability of multilingual Large Language Model (LLM) evaluation is currently compromised by the inconsistent quality of translated benchmarks.
- SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0
Automatic Metrics
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script.
- Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0
Automatic Metrics
We study reasoning for accessing world knowledge stored in a language model's parameters.