- `SPECS`: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025
Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.
- Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025
Large Language Models (LLMs) display remarkable capabilities in understanding and even producing political discourse, but have been found to consistently exhibit a progressive, left-leaning bias.
- Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
- Probabilistic distances-based hallucination detection in LLMs with RAG
Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, Alexey Zaytsev · Jun 11, 2025
Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications.
- ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson · Jun 11, 2025
We introduce **ICE-ID**, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703–1920), with 226,864 expert-curated person identifiers.
- Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim · Jun 10, 2025
Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.
- Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025
Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning.
- AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025
Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
- From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025
Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
- When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025
In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.