- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026 · Citations: 0
Automatic Metrics
Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- NoRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion
Hung-Hsuan Chen · Feb 26, 2026 · Citations: 0
Automatic Metrics
On the SlimOrca benchmark, NoRA breaks this linear barrier: remarkably, NoRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency.
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026 · Citations: 0
Automatic Metrics
Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026 · Citations: 0
Automatic Metrics
We validate across five benchmarks, five models from three families, and both synthetic and real data.
- Equitable Evaluation via Elicitation
Elbert Du, Cynthia Dwork, Lunjia Hu, Reid McIlroy-Young, Han Shao · Feb 24, 2026 · Citations: 0
Automatic Metrics
To obtain sufficient training data, we train an LLM to act as synthetic humans.
- Aletheia tackles FirstProof autonomously
Tony Feng, Junehyuk Jung, Sang-hyun Kim, Carlo Pagano, Sergei Gukov · Feb 24, 2026 · Citations: 0
Automatic Metrics
We report the performance of Aletheia (Feng et al., 2026b), a mathematics research agent powered by Gemini 3 Deep Think, on the inaugural FirstProof challenge.
- Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026 · Citations: 0
Automatic Metrics
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
- Group Orthogonalized Policy Optimization: Group Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Feb 24, 2026 · Citations: 0
Automatic Metrics
Experiments on mathematical reasoning benchmarks show that GOPO achieves competitive generalization while maintaining stable gradient dynamics and entropy preservation in regimes where clipping-based methods plateau.
- ToolMATH: A Math Tool Benchmark for Realistic Long-Horizon Multi-Tool Reasoning
Hyeonje Choi, Jeongsoo Lee, Hyojun Lee, Jay-Yoon Lee · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce ToolMATH, a math-grounded benchmark that evaluates tool-augmented language models in realistic multi-tool environments where the output depends on calling schema-specified tools and sustaining multi-step execution.
- GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026 · Citations: 0
Automatic Metrics Long Horizon
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0% to 62.0%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2% to 35.4%.
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026 · Citations: 0
Automatic Metrics
In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
- Hyperbolic Busemann Neural Networks
Ziheng Chen, Bernhard Schölkopf, Nicu Sebe · Feb 21, 2026 · Citations: 0
Automatic Metrics
Hyperbolic spaces provide a natural geometry for representing hierarchical and tree-structured data due to their exponential volume growth.
- VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig · Feb 20, 2026 · Citations: 0
Automatic Metrics
However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026 · Citations: 0
Automatic Metrics
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- TFL: Targeted Bit-Flip Attack on Large Language Model
Jingkai Guo, Chaitali Chakrabarti, Deliang Fan · Feb 19, 2026 · Citations: 0
Automatic Metrics
Large language models (LLMs) are increasingly deployed in safety and security critical applications, raising concerns about their robustness to model parameter fault injection attacks.
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu · Feb 18, 2026 · Citations: 0
Automatic Metrics
Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoR
- From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026 · Citations: 0
Automatic Metrics
Looping (reusing a block of layers across depth) and depth growing (training shallow-to-deep models by duplicating middle layers) have both been linked to stronger reasoning, but their relationship remains unclear.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026 · Citations: 0
Automatic Metrics
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026 · Citations: 0
Automatic Metrics
Using large-scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via
- Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026 · Citations: 0
Automatic Metrics
We evaluate the method on 500 LoRA adapters -- 400 clean and 100 poisoned -- for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE.
- Scaling Beyond Masked Diffusion Language Models
Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu · Feb 16, 2026 · Citations: 0
Automatic Metrics
Among discrete diffusion approaches, masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks.
- Cold-Start Personalization via Training-Free Priors from Structured World Models
Avinandan Bose, Shuyue Stella Li, Faeze Brahman, Pang Wei Koh, Simon Shaolei Du · Feb 16, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Cold-start personalization requires inferring user preferences through interaction when no user-specific historical data is available.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang · Feb 12, 2026 · Citations: 0
Expert Verification Automatic Metrics
On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distil
- Orthogonalized Policy Optimization: Policy Optimization as Orthogonal Projection in Hilbert Space
Wang Zixian · Jan 18, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on MATH benchmarks show that the Hilbert projection formulation prevents gradient saturation typical of KL-constrained methods.
- Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan · Dec 8, 2025 · Citations: 0
Automatic Metrics
We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions.
- CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025 · Citations: 0
Automatic Metrics
The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
- A Proof of Learning Rate Transfer under $μ$P
Soufiane Hayou · Nov 3, 2025 · Citations: 0
Automatic Metrics
We provide the first proof of learning rate transfer with width in a linear multi-layer perceptron (MLP) parametrized with $μ$P, a neural network parameterization designed to "maximize" feature learning in the infinite-width limit.
- From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025 · Citations: 0
Simulation Env
Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient.
- Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti · Sep 18, 2025 · Citations: 0
Automatic Metrics
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges.
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025 · Citations: 0
Automatic Metrics
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming.
- NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025 · Citations: 0
Automatic Metrics
However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
- Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia · Aug 21, 2025 · Citations: 0
Automatic Metrics
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-
- SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025 · Citations: 0
Automatic Metrics
Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.
- Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025 · Citations: 0
Automatic Metrics
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
- Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025 · Citations: 0
Automatic Metrics
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0
Automatic Metrics
On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
- BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng · May 18, 2025 · Citations: 0
Automatic Metrics
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning.
- Lean Formalization of Generalization Error Bound by Rademacher Complexity and Dudley's Entropy Integral
Sho Sonoda, Kazumi Kasaura, Yuma Mizuno, Kei Tsukamoto, Naoto Onda · Mar 25, 2025 · Citations: 0
Automatic Metrics
Understanding and certifying the generalization performance of machine learning algorithms -- i.e.
- Humanity's Last Exam
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu · Jan 24, 2025 · Citations: 0
Automatic Metrics
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities.
- Multi-agent deep reinforcement learning with centralized training and decentralized execution for transportation infrastructure management
M. Saifullah, K. G. Papakonstantinou, A. Bhattacharya, S. M. Stoffels, C. P. Andriotis · Jan 23, 2024 · Citations: 0
Simulation Env Multi Agent
To tackle the high dimensionality of state and action spaces, we propose DDMAC-CTDE, a Deep Decentralized Multi-Agent Actor-Critic (DDMAC) reinforcement learning architecture with Centralized Training and Decentralized Execution (CTDE).
- Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise
Zhenkai Zhang, Krista A. Ehinger, Tom Drummond · Oct 26, 2023 · Citations: 0
Automatic Metrics
This paper introduces two key contributions aimed at improving the speed and quality of images generated through inverse diffusion processes.