- What If We Allocate Test-Time Compute Adaptively?
Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan · Feb 1, 2026
Long Horizon
For each problem, the agent runs multiple inference iterations.
- From Associations to Activations: Comparing Behavioral and Hidden-State Semantic Geometry in LLMs
Louis Schiekiera, Max Zimmer, Christophe Roux, Sebastian Pokutta, Fritz Günther · Jan 31, 2026
Using representational similarity analysis, we compare behavioral geometries to layerwise hidden-state similarity and benchmark against FastText, BERT, and cross-model consensus.
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs
Xiang Zheng, Weiqi Zhai, Wei Wang, Boyu Yang, Wenbo Li · Jan 31, 2026
Multi Agent
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about their ability to diagnose genuine reasoning competence.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Long Horizon
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal, Deeksha Varshney, Mamta, Asif Ekbal · Jan 29, 2026
We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low resource languages.
- INSURE-Dial: A Phase-Aware Conversational Dataset & Benchmark for Compliance Verification and Phase Detection
Shubham Kulkarni, Alexander Lyzhov, Preetam Joshi, Shiva Chaitanya · Jan 28, 2026
Web Browsing
We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification.
- Understanding LLM Failures: A Multi-Tape Turing Machine Analysis of Systematic Errors in Language Model Reasoning
Magnus Boman · Jan 27, 2026
Large language models (LLMs) exhibit failure modes on seemingly trivial tasks.
- One Token Is Enough: Improving Diffusion Language Models with a Sink Token
Zihou Zhang, Zheyong Xie, Li Zhong, Haifeng Liu, Yao Hu · Jan 27, 2026
Diffusion Language Models (DLMs) have emerged as a compelling alternative to autoregressive approaches, enabling parallel text generation with competitive performance.
- FROST: Filtering Reasoning Outliers with Attention for Efficient Reasoning
Haozheng Luo, Zhuolin Jiang, Md Zahid Hasan, Yan Chen, Soumalya Sarkar · Jan 26, 2026
Empirically, we validate FROST on four benchmarks using two strong reasoning models (Phi-4-Reasoning and GPT-OSS-20B), outperforming state-of-the-art methods such as TALE and ThinkLess.
- Flatter Tokens are More Valuable for Speculative Draft Model Training
Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu · Jan 26, 2026
Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset.