- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching
Roy Miles, Aysim Toker, Andreea-Maria Oncescu, Songcen Xu, Jiankang Deng · Feb 26, 2026
Automatic Metrics Math Coding
This modular pipeline separates exploration (diffusion) from evaluation and solution synthesis, avoiding monolithic unified hybrids while preserving broad search.
- Deepfake Word Detection by Next-token Prediction using Fine-tuned Whisper
Hoan My Tran, Xin Wang, Wanying Ge, Xuechen Liu, Junichi Yamagishi · Feb 26, 2026
Automatic Metrics Coding
Deepfake speech utterances can be forged by replacing one or more words in a bona fide utterance with semantically different words synthesized by speech generative models.
- How Do Latent Reasoning Methods Perform Under Weak and Strong Supervision?
Yingqian Cui, Zhenwei Dai, Bing He, Zhan Shi, Hui Liu · Feb 25, 2026
Automatic Metrics Coding
Latent reasoning has recently been proposed as a reasoning paradigm that performs multi-step reasoning by generating steps in the latent space instead of the textual space.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026
Automatic Metrics Coding
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- Why Pass@k Optimization Can Degrade Pass@1: Prompt Interference in LLM Post-training
Anas Barakat, Souradip Chakraborty, Khushbu Pahwa, Amrit Singh Bedi · Feb 24, 2026
Automatic Metrics Math Coding
Pass@k is a widely used performance metric for verifiable large language model tasks, including mathematical reasoning, code generation, and short-answer reasoning.
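For context on the metric this entry discusses: pass@k is usually computed with the standard unbiased combinatorial estimator (given n generations per problem, c of which are correct). A minimal sketch of that estimator, not specific to this paper:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n generations (c correct)
    is correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # fewer than k incorrect samples exist, so some draw must succeed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Note that optimizing this quantity for k > 1 rewards diverse sampling rather than a single best answer, which is the tension between pass@k and pass@1 the paper examines.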
- SparkMe: Adaptive Semi-Structured Interviewing for Qualitative Insight Discovery
David Anugraha, Vishakh Padmakumar, Diyi Yang · Feb 24, 2026
Automatic Metrics Coding
Based on this formulation, we introduce SparkMe, a multi-agent LLM interviewer that performs deliberative planning via simulated conversation rollouts to select questions with high expected utility.
- HiSAC: Hierarchical Sparse Activation Compression for Ultra-long Sequence Modeling in Recommenders
Kun Yuan, Junyu Bi, Daixuan Cheng, Changfa Wu, Shuwen Xiao · Feb 24, 2026
Automatic Metrics Coding
Modern recommender systems leverage ultra-long user behavior sequences to capture dynamic preferences, but end-to-end modeling is infeasible in production due to latency and memory constraints.
- Continuous Telemonitoring of Heart Failure using Personalised Speech Dynamics
Yue Pan, Xingyao Wang, Hanyue Zhang, Liwei Liu, Changxin Li · Feb 23, 2026
Automatic Metrics Medicine Coding
The model's high sensitivity was further corroborated by additional follow-up data, confirming its efficacy in predicting HF deterioration and its potential to secure patient safety in remote, home-based settings.
- Nacrith: Neural Lossless Compression via Ensemble Context Modeling and High-Precision CDF Coding
Roberto Tacconelli · Feb 23, 2026
Automatic Metrics Coding
An out-of-distribution (OOD) evaluation on a document published after the model's training cutoff confirms these gains are not memorization artifacts, achieving 0.723 bpb on unseen text.
- Can Large Language Models Replace Human Coders? Introducing ContentBench
Michael Haman · Feb 23, 2026
Automatic Metrics Coding
This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question by tracking how much agreement low-cost LLMs achieve and what they cost on the same interpretive coding tasks.
- Do LLMs and VLMs Share Neurons for Inference? Evidence and Mechanisms of Cross-Modal Transfer
Chenhang Cui, An Zhang, Yuxin Chen, Gelei Deng, Jingnan Zheng · Feb 22, 2026
Automatic Metrics Math Coding
Across diverse mathematics and perception benchmarks, SNRF consistently enhances LVLM inference performance while preserving perceptual capabilities.
- Watermarking LLM Agent Trajectories
Wenlong Meng, Chen Gong, Terry Yue Zhuo, Fan Zhang, Kecen Li · Feb 21, 2026
Automatic Metrics Math Coding
LLM agents rely heavily on high-quality trajectory data to guide their problem-solving behaviors, yet producing such data requires substantial task design, high-capacity model generation, and manual filtering.
- Luna-2: Scalable Single-Token Evaluation with Small Language Models
Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel · Feb 20, 2026
Automatic Metrics Coding
Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation.
- Sink-Aware Pruning for Diffusion Language Models
Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen · Feb 19, 2026
Automatic Metrics Coding
Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning.
- Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents
Wenxuan Ding, Nicholas Tomlin, Greg Durrett · Feb 18, 2026
Simulation Env Coding
Each problem has a latent environment state that can be reasoned about via a prior, which is passed to the LLM agent.
- Supercharging Agenda Setting Research: The ParlaCAP Dataset of 28 European Parliaments and a Scalable Multilingual LLM-Based Classification
Taja Kuzman Pungeršek, Peter Rupnik, Daniela Širinić, Nikola Ljubešić · Feb 18, 2026
Human Eval Coding Multilingual
Agreement between the LLM and human annotators is comparable to inter-annotator agreement among humans, and the resulting model outperforms existing CAP classifiers trained on manually-annotated but out-of-domain data.
- MAEB: Massive Audio Embedding Benchmark
Adnan El Assadi, Isaac Chung, Chenghao Xiao, Roman Solomatin, Animesh Jha · Feb 17, 2026
Simulation Env Coding Multilingual
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages.
- Extracting Consumer Insight from Text: A Large Language Model Approach to Emotion and Evaluation Measurement
Stephan Ludwig, Peter J. Danaher, Xiaohao Yang, Yu-Ting Lin, Ehsan Abedin · Feb 17, 2026
Automatic Metrics Coding
Accurately measuring consumer emotions and evaluations from unstructured text remains a core challenge for marketing research and practice.
- Breaking Data Efficiency Dilemma: A Federated and Augmented Learning Framework For Alzheimer's Disease Detection via Speech
Xiao Wei, Bin Wen, Yuqin Lin, Kai Li, Mingyang Gu · Feb 16, 2026
Automatic Metrics Medicine Coding
Early diagnosis of Alzheimer's Disease (AD) is crucial for delaying its progression.
- Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering
Tao Xu · Feb 15, 2026
Automatic Metrics Coding
On CircuitVQA, a public benchmark (9,315 questions), retrieval ImgR@3 achieves 31.2%.
- Step 3.5 Flash: Open Frontier-Level Intelligence with 11B Active Parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao · Feb 11, 2026
Simulation Env Math Coding
We introduce Step 3.5 Flash, a sparse Mixture-of-Experts (MoE) model that bridges frontier-level agentic intelligence and computational efficiency.
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026
Simulation Env Coding
While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning.
- QSTN: A Modular Framework for Robust Questionnaire Inference with Large Language Models
Maximilian Kreutner, Jens Rupprecht, Georg Ahnert, Ahmed Salem, Markus Strohmaier · Dec 9, 2025
Automatic Metrics Coding
QSTN enables robust evaluation of questionnaire presentation, prompt perturbations, and response generation methods.
- Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers
Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li · Dec 3, 2025
Automatic Metrics Coding
Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths.
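To see why the KV cache becomes prohibitive at long sequence lengths, a back-of-envelope sketch using generic transformer arithmetic (illustrative parameter values, not this paper's model):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Memory for a decoder's KV cache: one key and one value vector
    (hence the factor of 2) per layer, per KV head, per position,
    at fp16/bf16 precision (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A hypothetical 7B-class model (32 layers, 32 KV heads, head_dim 128)
# at a 128K context holds a 64 GiB cache per sequence:
print(kv_cache_bytes(32, 32, 128, 131072) / 2**30)  # -> 64.0
```

This quadratic-free but still linear-in-length growth is what motivates cross-layer sharing schemes like the one the paper proposes.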
- Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025
Automatic Metrics Coding
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025
Automatic Metrics Coding
The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve generation?
- Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025
Automatic Metrics Math Coding
Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
- Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025
Automatic Metrics Coding
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.
- "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025
Simulation Env Math Coding
Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
- Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields, Casey Kennington · Nov 11, 2024
Automatic Metrics Coding
To conduct these experiments, we introduce a VL evaluation framework called Renaissance.
- LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang · Nov 7, 2024
Automatic Metrics Coding
The LLM-enhanced CLIP delivers consistent improvements across a wide range of downstream tasks, including linear-probe classification, zero-shot image-text retrieval with both short and long captions (in English and other languages), and other zero-shot settings.
- Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning
Yong Liu, Zirui Zhu, Chaoyu Gong, Minhao Cheng, Cho-Jui Hsieh · Feb 24, 2024
Automatic Metrics Coding
While fine-tuning large language models (LLMs) for specific tasks often yields impressive results, it comes at the cost of memory inefficiency due to back-propagation in gradient-based training.
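The memory-efficient alternative this line of work builds on is zeroth-order optimization: estimating a gradient from two forward passes with a shared random perturbation, so no backpropagation state is kept. A minimal SPSA-style sketch (generic estimator, not the paper's sparse variant):

```python
import numpy as np

def spsa_grad(loss_fn, theta: np.ndarray, eps: float = 1e-3,
              seed: int = 0) -> np.ndarray:
    """Zeroth-order gradient estimate from two forward passes.

    Samples a Gaussian direction z, evaluates the loss at theta +/- eps*z,
    and scales z by the resulting finite difference. Only the RNG seed
    needs to be stored to regenerate z, which is the memory trick MeZO
    exploits at LLM scale.
    """
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(theta.shape)
    g_scalar = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return g_scalar * z

# Sanity check on a quadratic loss f(x) = x.x (true gradient 2x):
theta = np.ones(4)
g = spsa_grad(lambda x: float(x @ x), theta)
```

The estimate is the true gradient projected onto a random direction, so it is unbiased in expectation but noisy; sparsifying which parameters are perturbed (the paper's focus) is one way to reduce that noise.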