- AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning
Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding · Feb 26, 2026 · Citations: 0
Automatic Metrics Multi Agent
While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information generated by individual participants.
- Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?
Pengxiang Li, Dilxat Muhtar, Lu Yin, Tianlong Chen, Shiwei Liu · Feb 26, 2026 · Citations: 0
Automatic Metrics
Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases.
- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring
Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, Jim Weatherall · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred f
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026 · Citations: 0
Automatic Metrics Long Horizon
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance
Weida Liang, Yiyou Sun, Shuyuan Nan, Chuang Li, Dawn Song · Feb 26, 2026 · Citations: 0
Automatic Metrics
Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, lea
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
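The abstract describes the core idea of Duel-Evolve: selection driven only by pairwise preferences rather than scalar rewards. A minimal sketch of that general pattern (names and logic are hypothetical illustrations, not the paper's actual algorithm) is a tournament in which a comparator, e.g. an LLM judge, decides each duel:

```python
import random

def duel_select(population, prefer):
    """Tournament-style selection driven only by pairwise preferences.
    A sketch of the general idea (hypothetical, not Duel-Evolve's exact
    procedure). `prefer(a, b)` returns True if `a` beats `b` --
    in the paper's setting, this role is played by the generating LLM."""
    winners = list(population)
    while len(winners) > 1:
        random.shuffle(winners)
        next_round = []
        for i in range(0, len(winners) - 1, 2):
            a, b = winners[i], winners[i + 1]
            next_round.append(a if prefer(a, b) else b)
        if len(winners) % 2:          # odd candidate out gets a bye
            next_round.append(winners[-1])
        winners = next_round
    return winners[0]

# Toy usage: prefer longer strings; the longest candidate always wins.
best = duel_select(["a", "abc", "ab"], lambda a, b: len(a) > len(b))
print(best)  # → "abc"
```

Note that no numeric fitness is ever computed: only the comparator's binary verdicts drive selection, which is the property the abstract highlights.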
- Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0
Automatic Metrics
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
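For readers comparing Pass@k against Pass@1, the standard unbiased Pass@k estimator (the widely used formula from code-generation evaluation, not specific to this paper) is easy to state: with n samples of which c are correct, Pass@k = 1 − C(n−c, k) / C(n, k).

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn without replacement from n generations (c of them
    correct) is correct."""
    if n - c < k:          # too few failures to fill k slots: certain pass
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples and 3 correct: pass@1 = 0.3, and pass@k grows with k.
print(round(pass_at_k(10, 3, 1), 3))                 # → 0.3
print(pass_at_k(10, 3, 5) > pass_at_k(10, 3, 1))     # → True
```

The monotone growth in k is exactly why optimizing Pass@k and optimizing Pass@1 can pull a model in different directions, which is the tension the title points at.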
- LogicGraph : Benchmarking Multi-Path Logical Reasoning via Neuro-Symbolic Generation and Verification
Yanrui Wu, Lingling Zhang, Xinyu Zhang, Jiayu Chang, Pengyu Li · Feb 24, 2026 · Citations: 0
Automatic Metrics
Evaluations of large language models (LLMs) primarily emphasize convergent logical reasoning, where success is defined by producing a single correct proof.
- ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
Xindian Ma, Rundong Kong, Peng Zhang, Ruoxiang Huang, Yongyu Jiang · Feb 24, 2026 · Citations: 0
Automatic Metrics
We evaluate ID-LoRA on five diverse benchmarks: Mathematical Reasoning, Code Generation, MMLU, CommonsenseQA, and Safety Alignment.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026 · Citations: 0
Automatic Metrics Long Horizon
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0
Automatic Metrics
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
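The "exponential volume growth" claim can be made precise with a standard fact of hyperbolic geometry (general background, not a result of this paper):

```latex
In $n$-dimensional hyperbolic space $\mathbb{H}^n$ of curvature $-1$, the
volume of a geodesic ball of radius $r$ is
\[
\operatorname{Vol}(B_r) \;=\; \omega_{n-1} \int_0^r \sinh^{n-1}(t)\, dt
\;\sim\; C_n\, e^{(n-1)r} \quad (r \to \infty),
\]
where $\omega_{n-1}$ is the surface area of the unit $(n-1)$-sphere.
```

Since a tree of branching factor $b$ has $\sim b^d$ nodes at depth $d$, this exponential growth lets trees embed with low distortion in hyperbolic space, whereas Euclidean ball volumes grow only polynomially ($\propto r^n$).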
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026 · Citations: 0
Automatic Metrics Long Horizon
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- SPQ: An Ensemble Technique for Large Language Model Compression
Jiamin Yao, Eren Gultepe · Feb 20, 2026 · Citations: 0
Automatic Metrics Simulation Env
Applied to LLaMA-2-7B, SPQ achieves up to 75% memory reduction while maintaining or improving perplexity (e.g., WikiText-2 5.47 to 4.91) and preserving accuracy on downstream benchmarks such as C4, TruthfulQA, and GSM8K.
- VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026 · Citations: 0
Automatic Metrics
However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning
Lexiang Tang, Weihao Gao, Bingchen Zhao, Lu Ma, Qiao Jin · Feb 20, 2026 · Citations: 0
Automatic Metrics
Experiments show that CCD significantly improves accuracy across mathematical reasoning benchmarks while substantially reducing output length, with minimal KV-cache overhead.
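The abstract does not spell out CCD's mechanics, but the "thinking by subtraction" framing matches the general contrastive-decoding pattern, in which a weaker distribution is subtracted from a stronger one in log-probability space. A generic sketch of that pattern (all names and the masking rule are illustrative assumptions, not CCD's exact procedure):

```python
import numpy as np

def contrastive_scores(strong_logits, weak_logits, alpha=1.0, tau=0.1):
    """Generic contrastive-decoding step (a sketch, not CCD's exact rule):
    score tokens by the strong model's log-prob minus the weak model's,
    restricted to tokens the strong model already finds plausible."""
    strong = strong_logits - np.logaddexp.reduce(strong_logits)  # log-softmax
    weak = weak_logits - np.logaddexp.reduce(weak_logits)
    # Plausibility mask: keep tokens within log(1/tau) of the best token.
    mask = strong >= strong.max() + np.log(tau)
    return np.where(mask, strong - alpha * weak, -np.inf)

vocab = 5
rng = np.random.default_rng(0)
scores = contrastive_scores(rng.normal(size=vocab), rng.normal(size=vocab))
next_token = int(np.argmax(scores))
```

The plausibility mask is the usual safeguard in this family of methods: subtraction alone could promote tokens both models consider unlikely.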
- RFEval: Benchmarking Reasoning Faithfulness under Counterfactual Reasoning Intervention in Large Reasoning Models
Yunseok Han, Yejoon Lee, Jaeyoung Do · Feb 19, 2026 · Citations: 0
Automatic Metrics
To operationalize this, we present RFEval, a benchmark of 7,186 instances across seven tasks that probes faithfulness via controlled, output-level counterfactual interventions.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026 · Citations: 0
Automatic Metrics
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0
Automatic Metrics
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0
Expert Verification Automatic Metrics
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation.
- Search or Accelerate: Confidence-Switched Position Beam Search for Diffusion Language Models
Mingyu Cao, Alvaro H. C. Correia, Christos Louizos, Shiwei Liu, Lu Yin · Feb 11, 2026 · Citations: 0
Automatic Metrics
Across mathematical reasoning and code generation benchmarks (GSM8K, MBPP, HumanEval) on Dream-7B and LLaDA-8B, SOAR improves generation quality while maintaining competitive inference speed, offering a practical way to balance quality and
- LLMs Know More About Numbers than They Can Say
Fengting Yuchi, Li Du, Jason Eisner · Feb 8, 2026 · Citations: 0
Automatic Metrics
Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big
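The quoted comparison is trivial to verify mechanically, which is what makes the reported model errors striking: 5.7 × 10² = 570, which is smaller than 580.

```python
# The example from the abstract: which is larger, 5.7 x 10^2 or 580?
a = 5.7 * 10**2   # scientific notation expanded: 570.0
b = 580
print(a < b)  # → True: 580 is larger
```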
- Accelerating Scientific Research with Gemini: Case Studies and Common Techniques
David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo · Feb 3, 2026 · Citations: 0
Automatic Metrics
Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer.
- CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025 · Citations: 0
Automatic Metrics
The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025 · Citations: 0
Automatic Metrics
We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
- Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti · Sep 18, 2025 · Citations: 0
Automatic Metrics
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges.
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng · Sep 18, 2025 · Citations: 0
Automatic Metrics
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency.
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025 · Citations: 0
Automatic Metrics
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming.
- NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0
Automatic Metrics
However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
- Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025 · Citations: 0
Automatic Metrics
Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
- Hidden Dynamics of Massive Activations in Transformer Training
Jorge Gallego-Feliciano, S. Aaron McClendon, Juan Morinelli, Stavros Zervoudakis, Antonios Saravanos · Aug 5, 2025 · Citations: 0
Automatic Metrics
We present the first comprehensive analysis of massive activation development throughout transformer training, using the Pythia model family as our testbed, and release our full dataset publicly to support further research.
- Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025 · Citations: 0
Automatic Metrics
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
- Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025 · Citations: 0
Automatic Metrics
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
- MathFimer: Enhancing Mathematical Reasoning by Expanding Reasoning Steps through Fill-in-the-Middle Task
Yuchen Yan, Yongliang Shen, Yang Liu, Jin Jiang, Xin Xu · Feb 17, 2025 · Citations: 0
Automatic Metrics
Through comprehensive experiments on multiple mathematical reasoning datasets, including MathInstruct and MetaMathQA, we demonstrate that models trained on MathFimer-expanded data consistently outperform their counterparts trained on