- InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models
Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross · Feb 26, 2026
Automatic Metrics MathCoding
Our evaluation experiments on Llama models shows that InnerQ maintains a few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
- Assessing Deanonymization Risks with Stylometry-Assisted LLM Agent
Boyang Zhang, Yang Zhang · Feb 26, 2026
Automatic Metrics General
In this work, we introduce an LLM agent designed to evaluate and mitigate such risks through a structured, interpretable pipeline.
- MoDora: Tree-Based Semi-Structured Document Analysis System
Bangrui Xu, Qihang Yao, Zirui Tang, Xuanhe Zhou, Yeye He · Feb 26, 2026
Automatic Metrics Coding
Semi-structured documents integrate diverse interleaved data elements (e.g., tables, charts, hierarchical paragraphs) arranged in various and often irregular layouts.
- ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL
Xingyu Lu, Jinpeng Wang, YiFan Zhang, Shijie Ma, Xiao Hu · Feb 26, 2026
Automatic Metrics General
Experimental results on 11 perception and reasoning benchmarks show that ContextRL significantly improves knowledge discovery efficiency.
- pQuant: Towards Effective Low-Bit Language Models via Decoupled Linear Quantization-Aware Training
Wenzheng Zhang, Bingzheng Liu, Yang Hu, Xiaoying Bai, Wentao Zhang · Feb 26, 2026
Automatic Metrics General
Quantization-Aware Training from scratch has emerged as a promising approach for building efficient large language models (LLMs) with extremely low-bit weights (sub 2-bit), which can offer substantial advantages for edge deployment.
- Search-P1: Path-Centric Reward Shaping for Stable and Efficient Agentic RAG Training
Tianle Xia, Ming Xu, Lingxiang Hu, Yiding Sun, Wenwei Li · Feb 26, 2026
Automatic Metrics General
Agentic RAG addresses this by enabling LLMs to dynamically decide when and what to retrieve, but current RL-based training methods suffer from sparse outcome rewards that discard intermediate signals and low sample efficiency where failed s
- Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian · Feb 26, 2026
Automatic Metrics General
Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries.
- A Fusion of context-aware based BanglaBERT and Two-Layer Stacked LSTM Framework for Multi-Label Cyberbullying Detection
Mirza Raquib, Asif Pervez Polok, Kedar Nath Biswas, Rahat Uddin Azad, Saydul Akbar Murad · Feb 25, 2026
Automatic Metrics General
Evaluation uses multiple metrics, including accuracy, precision, recall, F1-score, Hamming loss, Cohens kappa, and AUC-ROC.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026
Automatic Metrics Coding
Latent reasoning has been recently proposed as a reasoning paradigm and performs multi-step reasoning through generating steps in the latent space instead of the textual space.
- GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL
Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang · Feb 25, 2026
Automatic Metrics Coding
Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks.
- Distill and Align Decomposition for Enhanced Claim Verification
Jabez Magomere, Elena Kochkina, Samuel Mensah, Simerjot Kaur, Fernando Acero · Feb 25, 2026
Human EvalAutomatic Metrics General
Across six evaluation settings, our trained 8B decomposer improves downstream verification performance to (71.75%) macro-F1, outperforming prompt-based approaches ((+1.99), (+6.24)) and existing RL methods ((+5.84)).
- JSAM: Privacy Straggler-Resilient Joint Client Selection and Incentive Mechanism Design in Differentially Private Federated Learning
Ruichen Xu, Ying-Jun Angela Zhang, Jianwei Huang · Feb 25, 2026
Automatic Metrics General
Extensive evaluations on MNIST and CIFAR-10 demonstrate that JSAM achieves up to 15% improvement in test accuracy compared to existing unbiased selection mechanisms while maintaining cost efficiency across varying data heterogeneity levels.
- Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu · Feb 25, 2026
Automatic Metrics MathCoding
Building on this observation, we introduce Duel-Evolve, an evolutionary optimization algorithm that replaces external scalar rewards with pairwise preferences elicited from the same LLM used to generate candidates.
- From Basis to Basis: Gaussian Particle Representation for Interpretable PDE Operators
Zhihao Li, Yu Feng, Zhilu Lai, Wei Wang · Feb 25, 2026
Automatic Metrics General
On standard PDE benchmarks and real datasets, our method attains state-of-the-art competitive accuracy while providing intrinsic interpretability.
- GradAlign: Gradient-Aligned Data Selection for LLM Reinforcement Learning
Ningyuan Yang, Weihua Du, Weiwei Sun, Sean Welleck, Yiming Yang · Feb 25, 2026
Automatic Metrics General
Reinforcement learning (RL) has become a central post-training paradigm for large language models (LLMs), but its performance is highly sensitive to the quality of training problems.
- Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for Reinforcement Learning
Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang · Feb 24, 2026
Automatic Metrics Math
Evaluated on MATH-500 and AIME 2025, ACE composes seamlessly with existing methods and consistently improves the full Pass@k spectrum across all three model families and benchmarks.
- Black-Box Reliability Certification for AI Agents via Self-Consistency Sampling and Conformal Calibration
Charafeddine Mouzouni · Feb 24, 2026
Automatic Metrics Math
We validate across five benchmarks, five models from three families, and both synthetic and real data.
- GATES: Self-Distillation under Privileged Context with Consensus Gating
Alex Stein, Furong Huang, Tom Goldstein · Feb 24, 2026
Automatic Metrics Math
Held-out in-domain accuracy under asymmetric evaluation improves from 46.0\% to 62.0\%, and average (maj@8) accuracy on public document-free math benchmarks improves from 20.2\% to 35.4\%.
- Protein Language Models Diverge from Natural Language: Comparative Analysis and Improved Inference
Anna Hart, Chi Han, Jeonghwan Kim, Huimin Zhao, Heng Ji · Feb 24, 2026
Automatic Metrics General
Modern Protein Language Models (PLMs) apply transformer-based model architectures from natural language processing to biological sequences, predicting a variety of protein functions and properties.
- NanoKnow: How to Know What Your Language Model Knows
Lingwei Gu, Nour Jedidi, Jimmy Lin · Feb 23, 2026
Automatic Metrics Coding
Towards the goal of understanding how knowledge is encoded by LLMs, we release NanoKnow, a benchmark dataset that partitions questions from Natural Questions and SQuAD into splits based on whether their answers are present in nanochat's pre
- Janus-Q: End-to-End Event-Driven Trading via Hierarchical-Gated Reward Modeling
Xiang Li, Zikai Wei, Yiyan Qi, Wanyun Zhou, Xiang Liu · Feb 23, 2026
Automatic Metrics General
Financial market movements are often driven by discrete financial events conveyed through news, whose impacts are heterogeneous, abrupt, and difficult to capture under purely numerical prediction objectives.
- DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning
Zhongwei Wan, Yun Shen, Zhihao Dou, Donghao Zhou, Yu Zhang · Feb 23, 2026
Automatic Metrics Coding
Experiments on multiple reasoning benchmarks demonstrate consistent improvements in accuracy and pass@k, highlighting the importance of dual-scale diversity for deep exploration in RLVR.
- Pyramid MoA: A Probabilistic Framework for Cost-Optimized Anytime Inference
Arindam Khaled · Feb 23, 2026
Automatic Metrics Math
In this work, we propose "Pyramid MoA", a hierarchical Mixture-of-Agents architecture that uses a lightweight Router to dynamically escalate queries only when necessary.
- Adaptive Data Augmentation with Multi-armed Bandit: Sample-Efficient Embedding Calibration for Implicit Pattern Recognition
Minxue Tang, Yangyang Yu, Aolin Ding, Maziyar Baran Pouyan, Taha Belkhouja Yujia Bao · Feb 22, 2026
Automatic Metrics General
Recognizing implicit visual and textual patterns is essential in many real-world applications of modern AI.
- IAPO: Information-Aware Policy Optimization for Token-Efficient Reasoning
Yinhan He, Yaochen Zhu, Mingjia Shi, Wendy Zheng, Lin Su · Feb 22, 2026
Automatic Metrics Coding
Extensive empirical evaluations demonstrate that information-aware advantage shaping is a powerful and general direction for token-efficient post-training.
- Why Agent Caching Fails and How to Fix It: Structured Intent Canonicalization with Few-Shot Learning
Abhinaba Basu · Feb 21, 2026
Automatic Metrics Multilingual
Personal AI agents incur substantial cost via repeated LLM calls.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Automatic Metrics Coding
Real-time guardrails require evaluation that is accurate, cheap, and fast - yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- On the Semantic and Syntactic Information Encoded in Proto-Tokens for One-Step Text Reconstruction
Ivan Bondarenko, Egor Palkin, Fedor Tikunov · Feb 20, 2026
Automatic Metrics Coding
Autoregressive large language models (LLMs) generate text token-by-token, requiring n forward passes to produce a sequence of length n.
- Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory
Usman Anwar, Tim Bakker, Dana Kianfar, Cristina Pinneri, Christos Louizos · Feb 20, 2026
Automatic MetricsSimulation Env Coding
Chain-of-thought (CoT) monitors are LLM-based systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest, such as test-hacking behavior during code generation.
- Agentic Adversarial QA for Improving Domain-Specific LLMs
Vincent Grari, Ciprian Tomoiaga, Sylvain Lamprier, Tatsunori Hashimoto, Marcin Detyniecki · Feb 20, 2026
Automatic Metrics Law
Evaluation on specialized subsets of the LegalBench corpus demonstrates that our method achieves greater accuracy with substantially fewer synthetic samples.
- Gradient Regularization Prevents Reward Hacking in Reinforcement Learning from Human Feedback and Verifiable Rewards
Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama · Feb 20, 2026
Automatic Metrics Math
Reinforcement Learning from Human Feedback (RLHF) or Verifiable Rewards (RLVR) are two key steps in the post-training of modern Language Models (LMs).
- ABCD: All Biases Come Disguised
Mateusz Nowak, Xavier Cadet, Peter Chin · Feb 19, 2026
Automatic Metrics General
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions.
- Training Large Reasoning Models Efficiently via Progressive Thought Encoding
Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu · Feb 18, 2026
Automatic Metrics Math
Experiments on three models, including Qwen2.5-3B-Instruct, Qwen2.5-7B-Instruct, and DeepSeek-R1-Distill-Llama-8B, on six widely used challenging mathematical benchmarks show consistent gains: our method achieves +19.3% improvement over LoR
- References Improve LLM Alignment in Non-Verifiable Domains
Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty · Feb 18, 2026
Automatic Metrics General
First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs.
- From Growing to Looping: A Unified View of Iterative Computation in LLMs
Ferdinand Kapl, Emmanouil Angelis, Kaitlin Maile, Johannes von Oswald, Stefan Bauer · Feb 18, 2026
Automatic Metrics Math
Looping, reusing a block of layers across depth, and depth growing, training shallow-to-deep models by duplicating middle layers, have both been linked to stronger reasoning, but their relationship remains unclear.
- Emotion Collider: Dual Hyperbolic Mirror Manifolds for Sentiment Recovery via Anti Emotion Reflection
Rong Fu, Ziming Wang, Shuo Yin, Wenxin Zhang, Haiyun Wei · Feb 18, 2026
Automatic Metrics General
Emotional expression underpins natural communication and effective human-computer interaction.
- Anatomy of Capability Emergence: Scale-Invariant Representation Collapse and Top-Down Reorganization in Neural Networks
Jayadev Billa · Feb 17, 2026
Automatic Metrics Coding
Capability emergence during neural network training remains mechanistically opaque.
- Recursive Concept Evolution for Compositional Reasoning in Large Language Models
Sarim Chaudhry · Feb 17, 2026
Automatic Metrics MathCoding
Large language models achieve strong performance on many complex reasoning tasks, yet their accuracy degrades sharply on benchmarks that require compositional reasoning, including ARC-AGI-2, GPQA, MATH, BBH, and HLE.
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities
Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade · Feb 17, 2026
Automatic Metrics MathLaw
Using large scale observational evaluations with 5k observational and 2k newly sampled data on model performance, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre training FLOPs, via
- Weight space Detection of Backdoors in LoRA Adapters
David Puertolas Merenciano, Ekaterina Vasyagina, Raghav Dixit, Kevin Zhu, Ruizhe Li · Feb 16, 2026
Automatic Metrics Math
We evaluate the method on 500 LoRA adapters -- 400 clean, and 100 poisoned for Llama-3.2-3B on instruction and reasoning datasets: Alpaca, Dolly, GSM8K, ARC-Challenge, SQuADv2, NaturalQuestions, HumanEval, and GLUE dataset.
- Seeing to Generalize: How Visual Data Corrects Binding Shortcuts
Nicolas Buzeta, Felipe del Rio, Cristian Hinostroza, Denis Parra, Hans Lobel · Feb 16, 2026
Automatic Metrics CodingMultilingual
Vision Language Models (VLMs) are designed to extend Large Language Models (LLMs) with visual capabilities, yet in this work we observe a surprising phenomenon: VLMs can outperform their underlying LLMs on purely text-only tasks, particular
- LLMStructBench: Benchmarking Large Language Model Structured Data Extraction
Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner · Feb 16, 2026
Automatic Metrics General
We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text.
- Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats
Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin · Feb 13, 2026
Automatic Metrics General
As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency.
- Think like a Scientist: Physics-guided LLM Agent for Equation Discovery
Jianke Yang, Ohm Venkatachalam, Mohammad Kianezhad, Sharvaree Vadgama, Rose Yu · Feb 12, 2026
Automatic Metrics General
We introduce KeplerAgent, an agentic framework that explicitly follows this scientific reasoning process.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026
Automatic MetricsSimulation Env Coding
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar · Jan 26, 2026
Automatic Metrics Coding
Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess.
- In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025
Automatic Metrics Coding
We investigate the mechanisms that arise when transformers are trained to solve arithmetic on sequences where tokens are variables whose meaning is determined only through their interactions in-context.
- KD-OCT: Efficient Knowledge Distillation for Clinical-Grade Retinal OCT Classification
Erfan Nourbakhsh, Nasrin Sanjari, Ali Nourbakhsh · Dec 9, 2025
Automatic Metrics MedicineCoding
Age-related macular degeneration (AMD) and choroidal neovascularization (CNV)-related conditions are leading causes of vision loss worldwide, with optical coherence tomography (OCT) serving as a cornerstone for early detection and managemen
- Conflict-Aware Fusion: Resolving Logic Inertia in Large Language Models via Structured Cognitive Priors
Qiming Bao, Xiaoxuan Fu, Michael Witbrock · Dec 6, 2025
Automatic Metrics Law
We present a controlled evaluation framework consisting of four stress tests: (1) rule deletion (redundant vs.
- CDLM: Consistency Diffusion Language Models For Faster Sampling
Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun · Nov 24, 2025
Automatic Metrics MathCoding
The full training and evaluation code is available at https://github.com/SqueezeAILab/CDLM.
- Intelligence per Watt: Measuring Intelligence Efficiency of Local AI
Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya · Nov 11, 2025
Automatic Metrics General
Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure.
- Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025
Automatic Metrics Coding
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
- MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025
Automatic Metrics General
We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
- Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025
Automatic Metrics Coding
Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
- The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf · Oct 10, 2025
Automatic Metrics General
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs.
- Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba · Oct 9, 2025
Automatic Metrics General
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models.
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025
Automatic Metrics Coding
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty · Sep 26, 2025
Automatic Metrics Coding
We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making.
- PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca · Sep 15, 2025
Automatic Metrics Medicine
BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations.
- MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification
Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn · Sep 9, 2025
Automatic Metrics MedicineCoding
Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance.