- SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables
Sungho Park, Jueun Kim, Wook-Shin Han · Feb 26, 2026 · Citations: 0
Automatic Metrics
Yet existing benchmarks are small, manually curated - and therefore error-prone - and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in n
- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics Multi Agent
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
- Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa · Feb 26, 2026 · Citations: 0
Automatic Metrics
Multimodal LLMs can process speech and images, but they cannot hear a speaker's voice or see an object's texture.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026 · Citations: 0
Automatic Metrics
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- Frequency-Ordered Tokenization for Better Text Compression
Maximilian Kalcher · Feb 26, 2026 · Citations: 0
Automatic Metrics
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law).
- NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
On the SlimOrca benchmark, NoRA breaks this linear barrier: NoRA remarkably at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Rejection Mixing: Fast Semantic Propagation of Mask Tokens for Efficient DLLM Inference
Yushi Ye, Feng Hong, Huangjie Zheng, Xu Chen, Zhiyong Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
Diffusion Large Language Models (DLLMs) promise fast non-autoregressive inference but suffer a severe quality-speed trade-off in parallel decoding.
- Natural Language Declarative Prompting (NLD-P): A Modular Governance Method for Prompt Design Under Model Drift
Hyunwoo Kim, Hanau Yi, Jaehee Bae, Yumin Kim · Feb 26, 2026 · Citations: 0
Critique Edit Automatic Metrics
NLD-P is formalized as a modular control abstraction that separates provenance, constraint logic, task content, and post-generation evaluation, encoded directly in natural language without reliance on external orchestration code.
- Imagination Helps Visual Reasoning, But Not Yet in Latent Space
You Li, Chi Chen, Yanghao Li, Fanhu Zeng, Kaiyu Huang · Feb 26, 2026 · Citations: 0
Automatic Metrics
Latent visual reasoning aims to mimic human's imagination process by meditating through hidden states of Multimodal Large Language Models.
- dLLM: Simple Diffusion Language Modeling
Zhanhui Zhou, Lingjie Chen, Hanghang Tong, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
To address this gap, we introduce dLLM, an open-source framework that unifies the core components of diffusion language modeling -- training, inference, and evaluation -- and makes them easy to customize for new designs.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026 · Citations: 0
Automatic Metrics
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- Vectorizing the Trie: Efficient Constrained Decoding for LLM-based Generative Retrieval on Accelerators
Zhengyang Su, Isay Katsman, Yueqi Wang, Ruining He, Lukasz Heldt · Feb 26, 2026 · Citations: 0
Automatic Metrics
In addition, evaluation on academic benchmarks demonstrates that STATIC can considerably improve cold-start performance for generative retrieval.
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Dynamic Level Sets
Michael Stephen Fiske · Feb 26, 2026 · Citations: 0
Automatic Metrics
A mathematical concept is identified and analyzed that is implicit in the 2012 paper Turing Incomputable Computation, presented at the Alan Turing Centenary Conference (Turing 100, Manchester).
- Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles, Patrick Schrempf, David Harris-Birtill · Feb 25, 2026 · Citations: 0
Automatic Metrics
We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical d
- VeRO: An Evaluation Harness for Agents to Optimize Agents
Varun Ursekar, Apaar Shanker, Veronica Chatrath, Yuan, Xue · Feb 25, 2026 · Citations: 0
Automatic Metrics
An important emerging application of coding agents is agent optimization: the iterative improvement of a target agent through edit-execute-evaluate cycles.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
- Causality $\neq$ Invariance: Function and Concept Vectors in LLMs
Gustaw Opiełka, Hannes Rosenbusch, Claire E. Stevenson · Feb 25, 2026 · Citations: 0
Automatic Metrics
Do large language models (LLMs) represent concepts abstractly, i.e., independent of input format?
- Decoder-based Sense Knowledge Distillation
Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on diverse benchmarks demonstrate that DSKD significantly enhances knowledge distillation performance for decoders, enabling generative models to inherit structured semantics while maintaining efficient training.
- SumTablets: A Transliteration Dataset of Sumerian Tablets
Cole Simmons, Richard Diehl Martinez, Dan Jurafsky · Feb 25, 2026 · Citations: 0
Automatic Metrics
Sumerian transliteration is a conventional system for representing a scholar's interpretation of a tablet in the Latin script.
- Improving Parametric Knowledge Access in Reasoning Language Models
Melody Ma, John Hewitt · Feb 25, 2026 · Citations: 0
Automatic Metrics
We study reasoning for accessing world knowledge stored in a language model's parameters.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- DySCO: Dynamic Attention-Scaling Decoding for Long-Context LMs
Xi Ye, Wuwei Zhang, Fangcong Yin, Howard Yen, Danqi Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Across multiple instruction-tuned and reasoning models, DySCO consistently improves performance on challenging long-context reasoning benchmarks, yielding relative gains of up to 25% on MRCR and LongBenchV2 at 128K context length with modes
- NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors
Lingfeng Ren, Weihao Yu, Runpeng Yu, Xinchao Wang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Object hallucination is a critical issue in Large Vision-Language Models (LVLMs), where outputs include objects that do not appear in the input image.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain
Walter Hernandez Cruz, Peter Devine, Nikhil Vadgama, Paolo Tasca, Jiahua Xu · Feb 25, 2026 · Citations: 0
Automatic Metrics
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United Sta
- TG-ASR: Translation-Guided Learning with Parallel Gated Cross Attention for Low-Resource Automatic Speech Recognition
Cheng-Yeh Yang, Chien-Chun Wang, Li-Wei Chen, Hung-Shin Lee, Hsin-Min Wang · Feb 25, 2026 · Citations: 0
Simulation Env
Low-resource automatic speech recognition (ASR) continues to pose significant challenges, primarily due to the limited availability of transcribed data for numerous languages.
- MEDSYN: Benchmarking Multi-EviDence SYNthesis in Complex Clinical Cases for Multimodal Large Language Models
Boqi Chen, Xudong Liu, Jiachuan Peng, Marianne Frey-Marti, Bang Zheng · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Multimodal large language models (MLLMs) have shown great potential in medical applications, yet existing benchmarks inadequately capture real-world clinical complexity.
- Small Wins Big: Comparing Large Language Models and Domain Fine-Tuned Models for Sarcasm Detection in Code-Mixed Hinglish Text
Bitan Majumder, Anirban Sen · Feb 25, 2026 · Citations: 0
Automatic MetricsSimulation Env
Sarcasm detection in multilingual and code-mixed environments remains a challenging task for natural language processing models due to structural variations, informal expressions, and low-resource linguistic availability.
- Scalable Kernel-Based Distances for Statistical Inference and Integration
Masha Naslidnyk · Feb 25, 2026 · Citations: 0
Simulation Env
Representing, comparing, and measuring the distance between probability distributions is a key task in computational statistics and machine learning.
- xai-cola: A Python library for sparsifying counterfactual explanations
Lin Zhu, Lei You · Feb 25, 2026 · Citations: 0
Automatic Metrics
Counterfactual explanation (CE) is an important domain within post-hoc explainability.
- DocDjinn: Controllable Synthetic Document Generation with VLMs and Handwriting Diffusion
Marcel Lamott, Saifullah Saifullah, Nauman Riaz, Yves-Noel Weweler, Tobias Alt-Veit · Feb 25, 2026 · Citations: 0
Automatic Metrics
We evaluate across eleven benchmarks spanning key information extraction, question answering, document classification, and document layout analysis.
- Robust Long-Form Bangla Speech Processing: Automatic Speech Recognition and Speaker Diarization
MD. Sagor Chowdhury, Adiba Fairooz Chowdhury · Feb 25, 2026 · Citations: 0
Automatic Metrics
We describe our end-to-end system for Bengali long-form speech recognition (ASR) and speaker diarization submitted to the DL Sprint 4.0 competition on Kaggle.
- Two-Stage Active Distribution Network Voltage Control via LLM-RL Collaboration: A Hybrid Knowledge-Data-Driven Approach
Xu Yang, Chenhui Lin, Xiang Ma, Dong Liu, Ran Zheng · Feb 25, 2026 · Citations: 0
Automatic Metrics
Considering the operational scenarios and requirements in real-world ADNs, in this paper, we propose a hybrid knowledge-data-driven approach that leverages dynamic collaboration between a large language model (LLM) agent and a reinforcement
- SurGo-R1: Benchmarking and Modeling Contextual Reasoning for Operative Zone in Surgical Video
Guanyi Qin, Xiaozhen Wang, Zhu Zhuo, Chang Han Low, Yuancan Xiao · Feb 25, 2026 · Citations: 0
Expert Verification Automatic Metrics
Existing AI systems offer binary safety verification or static detection, ignoring the phase-dependent nature of intraoperative reasoning.
- Following the Diagnostic Trace: Visual Cognition-guided Cooperative Network for Chest X-Ray Diagnosis
Shaoxuan Wu, Jingkun Chen, Chong Ma, Cong Shen, Xiao Zhang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Human-AI collaboration seeks to enhance the reliability of diagnostic models by integrating the behaviors of controllable radiologists.
- Sparsity Induction for Accurate Post-Training Pruning of Large Language Models
Minhao Jiang, Zhikai Li, Xuewen Liu, Jing Zhang, Mengjuan Chen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large language models have demonstrated capabilities in text generation, while their increasing parameter scales present challenges in computational and memory efficiency.
- Scalable Multilingual Multimodal Machine Translation with Speech-Text Fusion
Yexing Du, Youcheng Pan, Zekun Wang, Zheng Chu, Yichong Huang · Feb 25, 2026 · Citations: 0
Automatic Metrics
Experimental results demonstrate that our framework surpasses all existing methods on the Multi30K multimodal machine translation benchmark, achieving new state-of-the-art results.
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics Long Horizon
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- MixSarc: A Bangla-English Code-Mixed Corpus for Implicit Meaning Identification
Kazi Samin Yasar Alam, Md Tanbir Chowdhury, Tamim Ahmed, Ajwad Abrar, Md Rafid Haque · Feb 25, 2026 · Citations: 0
Human EvalAutomatic Metrics
We benchmark transformer-based models and evaluate zero-shot large language models under structured prompting.
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- One Brain, Omni Modalities: Towards Unified Non-Invasive Brain Decoding with Large Language Models
Changli Tang, Shurui Li, Junliang Wang, Qinfan Xiao, Zhonghao Zhai · Feb 25, 2026 · Citations: 0
Automatic Metrics
Extensive evaluations demonstrate that NOBEL serves as a robust generalist across standard single-modal tasks.
- MINAR: Mechanistic Interpretability for Neural Algorithmic Reasoning
Jesse He, Helen Jenne, Max Vargas, Davis Brown, Gal Mishne · Feb 24, 2026 · Citations: 0
Automatic Metrics
The recent field of neural algorithmic reasoning (NAR) studies the ability of graph neural networks (GNNs) to emulate classical algorithms like Bellman-Ford, a phenomenon known as algorithmic alignment.
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
- FedVG: Gradient-Guided Aggregation for Enhanced Federated Learning
Alina Devkota, Jacob Thrasher, Donald Adjeroh, Binod Bhattarai, Prashnna K. Gyawali · Feb 24, 2026 · Citations: 0
Automatic Metrics
Extensive experiments on both natural and medical image benchmarking datasets, across diverse model architectures, demonstrate that FedVG consistently improves performance, particularly in highly heterogeneous settings.
- MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
Daniel Tamayo, Iñaki Lacunza, Paula Rivera-Hidalgo, Severino Da Dalt, Javier Aula-Blasco · Feb 24, 2026 · Citations: 0
Automatic Metrics
We introduce MrBERT, a family of 150M-300M parameter encoders built on the ModernBERT architecture and pre-trained on 35 languages and code.
- The Mean is the Mirage: Entropy-Adaptive Model Merging under Heterogeneous Domain Shifts in Medical Imaging
Sameer Ambekar, Reza Nasirigerdeh, Peter J. Schuffler, Lina Felsner, Daniel M. Lang · Feb 24, 2026 · Citations: 0
Automatic Metrics
We extensively evaluate our method with state-of-the-art baselines using two backbones across nine medical and natural-domain generalization image classification datasets, showing consistent gains across standard evaluation and challenging
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026 · Citations: 0
Automatic Metrics
We validate across five benchmarks, five models from three families, and both synthetic and real data.
- Representation Theorems for Cumulative Propositional Dependence Logics
Juha Kontinen, Arne Meier, Kai Sauerwald · Feb 24, 2026 · Citations: 0
Automatic Metrics
This paper establishes and proves representation theorems for cumulative propositional dependence logic and for cumulative propositional logic with team semantics.
- A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives
Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis.
- Scaling View Synthesis Transformers
Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann · Feb 24, 2026 · Citations: 0
Automatic Metrics
Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto fronti
- Equitable Evaluation via Elicitation
Elbert Du, Cynthia Dwork, Lunjia Hu, Reid McIlroy-Young, Han Shao · Feb 24, 2026 · Citations: 0
Automatic Metrics
To obtain sufficient training data, we train an LLM to act as synthetic humans.
- Multi-Vector Index Compression in Any Modality
Hanxiang Qin, Alexander Martin, Rohan Jha, Chunsheng Zuo, Reno Kriz · Feb 24, 2026 · Citations: 0
Automatic Metrics
We study efficient multi-vector retrieval for late interaction in any modality.
- Aletheia tackles FirstProof autonomously
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov · Feb 24, 2026 · Citations: 0
Automatic Metrics
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge.
- Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0
Automatic Metrics
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.