- TaP: A Taxonomy-Guided Framework for Automated and Scalable Preference Data Generation
Renren Jin, Tianhao Shen, Xinwei Wu, Dan Shi, Haoran Sun · Jun 30, 2025
Pairwise Preference
Conducting supervised and preference fine-tuning of large language models (LLMs) requires high-quality datasets to improve their ability to follow instructions and align with human preferences and values.
- Unveiling Decision-Making in LLMs for Text Classification: Extraction of Influential and Interpretable Concepts with Sparse Autoencoders
Mathis Le Bail, Jérémie Dentan, Davide Buscaldi, Sonia Vanier · Jun 30, 2025
These concepts are linear combinations of neuron activations that correspond to human-interpretable features.
- Complexity-aware fine-tuning
Andrey Goncharov, Daniil Vyazhev, Petr Sychev, Edvard Khalafyan, Alexey Zaytsev · Jun 26, 2025
General-purpose Large Language Models (LLMs) are frequently fine-tuned through supervised fine-tuning (SFT) to enhance performance in specific domains.
- Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
Jonathan Cook, Silvia Sapora, Arash Ahmadian, Akbir Khan, Tim Rocktaschel · Jun 23, 2025
Demonstrations
Though execution of instructions in training data remains less reliable than when instructions are given in-context, our results demonstrate that procedural knowledge can be noisily 'programmed' into LLMs through PBB, with important implications.
- Parallel Continuous Chain-of-Thought with Jacobi Iteration
Haoyi Wu, Zhihao Teng, Kewei Tu · Jun 23, 2025
Continuous chain-of-thought has been shown to be effective in saving reasoning tokens for large language models.
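The Jacobi iteration named in the title is, in its textbook form, a fixed-point scheme in which every coordinate of the new iterate is computed from the previous iterate, so all updates can run in parallel. This generic numeric sketch illustrates that idea only; it is not the paper's continuous chain-of-thought algorithm:

```python
def jacobi(A, b, iters=200, tol=1e-12):
    """Textbook Jacobi iteration for solving A x = b. Each coordinate of
    x_new depends only on the *previous* iterate x, so all n updates are
    independent and could execute in parallel. Illustrative sketch only."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = [
            (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new
        x = x_new
    return x

# Diagonally dominant system (so Jacobi converges): 2x + y = 3, x + 3y = 4
A = [[2.0, 1.0], [1.0, 3.0]]
b = [3.0, 4.0]
solution = jacobi(A, b)  # converges to x = y = 1
```

Diagonal dominance guarantees convergence here; the parallel-update structure is what the title's "Jacobi iteration" refers to in general.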
- PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents
Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin · Jun 20, 2025
We evaluate our system on three benchmarks: TriviaQA, HotpotQA, and DiaASQ, and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task.
- DistillNote: Toward a Functional Evaluation Framework of LLM-Generated Clinical Note Summaries
Heloisa Oss Boll, Antonio Oss Boll, Leticia Puttlitz Boll, Ameen Abu Hanna, Iacer Calixto · Jun 20, 2025
Expert Verification
This study introduces DistillNote, an evaluation framework for LLM summaries that targets their functional utility by applying the generated summary downstream in a complex clinical prediction task, explicitly quantifying how much predictive performance the summaries preserve.
- A Scoping Review of Synthetic Data Generation by Language Models in Biomedical Research and Application: Data Utility and Quality Perspectives
Hanshu Rao, Weisi Liu, Haohan Wang, I-Chan Huang, Zhe He · Jun 19, 2025
Evaluations were heterogeneous: intrinsic metrics (27.1%), human-in-the-loop assessments (44.1%), and LLM-based evaluations (13.6%).
- Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang · Jun 19, 2025
We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones.
- DeVisE: Behavioral Testing of Medical Large Language Models
Camila Zurdo Tagliabue, Heloisa Oss Boll, Aykut Erdem, Erkut Erdem, Iacer Calixto · Jun 18, 2025
Large language models (LLMs) are increasingly applied in clinical decision support, yet current evaluations rarely reveal whether their outputs reflect genuine medical reasoning or superficial correlations.
- SPECS: Faster Test-Time Scaling through Speculative Drafts
Mert Cemri, Nived Rajaraman, Rishabh Tiwari, Xiaoxuan Liu, Kurt Keutzer · Jun 15, 2025
Scaling test-time compute has driven the recent advances in the reasoning capabilities of large language models (LLMs), typically by allocating additional computation for more thorough exploration.
- Persona-driven Simulation of Voting Behavior in the European Parliament with Large Language Models
Maximilian Kreutner, Marlene Lutz, Markus Strohmaier · Jun 13, 2025
Large Language Models (LLMs) display remarkable capabilities to understand or even produce political discourse but have been found to consistently exhibit a progressive left-leaning bias.
- Spurious Rewards: Rethinking Training Signals in RLVR
Rulin Shao, Shuyue Stella Li, Rui Xin, Scott Geng, Yiping Wang · Jun 12, 2025
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain language models even with spurious rewards that have little, no, or even negative correlation with the correct answer.
- Probabilistic distances-based hallucination detection in LLMs with RAG
Rodion Oblovatny, Alexandra Kuleshova, Konstantin Polev, Alexey Zaytsev · Jun 11, 2025
Detecting hallucinations in large language models (LLMs) is critical for their safety in many applications.
- ICE-ID: A Novel Historical Census Dataset for Longitudinal Identity Resolution
Gonçalo Hora de Carvalho, Lazar S. Popov, Sander Kaatee, Mário S. Correia, Kristinn R. Thórisson · Jun 11, 2025
We introduce **ICE-ID**, a benchmark dataset comprising 984,028 records from 16 Icelandic census waves spanning 220 years (1703–1920), with 226,864 expert-curated person identifiers.
- Towards Robust Real-World Multivariate Time Series Forecasting: A Unified Framework for Dependency, Asynchrony, and Missingness
Jinkwan Jang, Hyungjin Park, Jinmyeong Choi, Taesup Kim · Jun 10, 2025
Extensive experiments on public benchmark datasets reflecting practical settings, along with one private real-world industrial dataset, demonstrate the superior robustness and accuracy of ChannelTokenFormer under challenging real-world conditions.
- Structure-Augmented Reasoning Generation
Jash Rajesh Parekh, Pengcheng Jiang, Jiawei Han · Jun 10, 2025
Extensive experiments on open-domain QA benchmarks and specialized reasoning datasets in finance and medicine demonstrate that SARG significantly outperforms state-of-the-art flat-context RAG baselines in both factual accuracy and reasoning quality.
- AbstRaL: Augmenting LLMs' Reasoning by Reinforcing Abstract Thinking
Silin Gao, Antoine Bosselut, Samy Bengio, Emmanuel Abbe · Jun 9, 2025
Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks.
- From Raw Corpora to Domain Benchmarks: Automated Evaluation of LLM Domain Expertise
Nitin Sharma, Thomas Wolfers, Çağatay Yıldız · Jun 9, 2025
Expert Verification
Accurate domain-specific benchmarking of LLMs is essential, specifically in domains with direct implications for humans, such as law, healthcare, and education.
- When Style Breaks Safety: Defending LLMs Against Superficial Style Alignment
Yuxin Xiao, Sana Tonekaboni, Walter Gerych, Vinith Suriyakumar, Marzyeh Ghassemi · Jun 9, 2025
Red Team
In this work, we seek to understand whether style patterns compromise LLM safety, how superficial style alignment increases model vulnerability, and how best to mitigate these risks during alignment.
- A dependently-typed calculus of event telicity and culminativity
Pavel Kovalev, Carlo Angiuli · Jun 8, 2025
We present a dependently-typed cross-linguistic framework for analyzing the telicity and culminativity of events, accompanied by examples of using our framework to model English sentences.
- DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation
Jingyu Xiao, Man Ho Lam, Ming Wang, Yuxuan Wan, Junliang Liu · Jun 6, 2025
However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks.
- Simple Yet Effective: Extracting Private Data Across Clients in Federated Fine-Tuning of Large Language Models
Yingqi Hu, Zhuo Zhang, Jingyuan Zhang, Jinghua Wang, Qifan Wang · Jun 6, 2025
These findings highlight concrete privacy risks in FedLLMs and establish a benchmark and evaluation framework for future research on privacy-preserving federated learning.
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models
Cheonbok Park, Jeonghoon Kim, Joosung Lee, Sanghwan Bae, Jaegul Choo · Jun 6, 2025
Reinforcement learning with verifiable reward (RLVR) has been instrumental in eliciting strong reasoning capabilities from large language models (LLMs) via long chains of thought (CoT).
- When to use Graphs in RAG: A Comprehensive Analysis for Graph Retrieval-Augmented Generation
Zhishang Xiang, Chuanjie Wu, Qinggang Zhang, Shengyuan Chen, Zijin Hong · Jun 6, 2025
To address this, we propose GraphRAG-Bench, a comprehensive benchmark designed to evaluate GraphRAG models on both hierarchical knowledge retrieval and deep contextual reasoning.
- Voice Impression Control in Zero-Shot TTS
Kenichi Fujita, Shota Horiguchi, Yusuke Ijima · Jun 6, 2025
The results of both objective and subjective evaluations have demonstrated our method's effectiveness in impression control.
- Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang · Jun 5, 2025
Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities.
- Resisting Contextual Interference in RAG via Parametric-Knowledge Reinforcement
Chenyu Lin, Yilin Wen, Du Su, Hexiang Tan, Fei Sun · Jun 5, 2025
Retrieval-augmented generation (RAG) improves performance on knowledge-intensive tasks but can be derailed by wrong, irrelevant, or conflicting retrieved text, causing models to rely on inaccurate evidence and cascade errors.
- Sensory-Motor Control with Large Language Models via Iterative Policy Refinement
Jônata Tyska Carvalho, Stefano Nolfi · Jun 5, 2025
We propose a method that enables large language models (LLMs) to control embodied agents through the generation of control policies that directly map continuous observation vectors to continuous action vectors.
- "Don't Do That!": Guiding Embodied Systems through Large Language Model-based Constraint Generation
Amin Seffo, Aladin Djuhera, Masataro Asai, Holger Boche · Jun 4, 2025
Web Browsing
Recent advancements in large language models (LLMs) have spurred interest in robotic navigation that incorporates complex spatial, mathematical, and conditional constraints from natural language into the planning problem.
- Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Apurv Verma, NhatHai Phan, Shubhendu Trivedi · Jun 4, 2025
In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection.
- Toward Beginner-Friendly LLMs for Language Learning: Controlling Difficulty in Conversation
Meiqing Jin, Liam Dugan, Chris Callison-Burch · Jun 4, 2025
We further introduce a new token-level evaluation metric, Token Miss Rate (TMR), that quantifies the proportion of incomprehensible tokens per utterance and correlates strongly with human judgments.
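As described, TMR is the proportion of tokens in an utterance that a learner cannot understand. A minimal illustrative sketch follows; the whitespace tokenization and the vocabulary list are hypothetical stand-ins, not the paper's actual setup:

```python
def token_miss_rate(utterance, known_vocab):
    """Fraction of tokens falling outside a learner's known vocabulary.
    Illustrative only: real tokenization and vocabulary levels would
    follow the paper's definition of TMR."""
    tokens = utterance.lower().split()
    if not tokens:
        return 0.0
    missed = sum(1 for t in tokens if t not in known_vocab)
    return missed / len(tokens)

beginner_vocab = {"the", "cat", "is", "on", "mat", "a"}  # hypothetical word list
tmr = token_miss_rate("The cat is on the ottoman", beginner_vocab)
# one unknown token ("ottoman") out of six
```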
- HSSBench: Benchmarking Humanities and Social Sciences Ability for Multimodal Large Language Models
Zhaolu Kang, Junhao Gong, Jiaxu Yan, Wanke Xia, Yian Wang · Jun 4, 2025
Expert Verification
However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences.
- EuroGEST: Investigating gender stereotypes in multilingual language models
Jacqueline Rowe, Mateusz Klimaszewski, Liane Guillou, Shannon Vallor, Alexandra Birch · Jun 4, 2025
Large language models increasingly support multiple languages, yet most benchmarks for gender bias remain English-centric.
- Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Xiaoying Zhang, Yipeng Zhang, Hao Sun, Kaituo Feng, Chaochao Lu · Jun 3, 2025
Critique Edit
Recent advances in reinforcement learning (RL) using numerical rewards have significantly enhanced the complex reasoning capabilities of large language models (LLMs).
- Automated Web Application Testing: End-to-End Test Case Generation with Large Language Models and Screen Transition Graphs
Nguyen-Khang Le, Quan Minh Bui, Minh Ngoc Nguyen, Hiep Nguyen, Trung Vo · Jun 3, 2025
Web Browsing
Web applications are critical to modern software ecosystems, yet ensuring their reliability remains challenging due to the complexity and dynamic nature of web interfaces.
- Esoteric Language Models: Bridging Autoregressive and Masked Diffusion LLMs
Subham Sekhar Sahoo, Zhihan Yang, Yash Akhauri, Johnna Liu, Deepansha Singh · Jun 2, 2025
Diffusion-based language models offer a compelling alternative to autoregressive (AR) models by enabling parallel and controllable generation.
- iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering
Shuai Wang, Yinan Yu · Jun 2, 2025
Long Horizon
Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.
- Synthesis of discrete-continuous quantum circuits with multimodal diffusion models
Florian Fürrutter, Zohim Chandani, Ikko Hamamura, Hans J. Briegel, Gorka Muñoz-Gil · Jun 2, 2025
We benchmark the model over different experiments, analyzing the method's accuracy across varying qubit counts and circuit depths, showcasing the ability of the method to outperform existing approaches in gate counts and under noisy conditions.
- When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025
To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
- Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025
Pairwise Preference
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025
Red Team · Web Browsing
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
- PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
- FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025
Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
- Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning.
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025
On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO.
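The general shape of a KL-regularized, clipped policy-gradient objective can be sketched per token as below. This is a generic PPO-style surrogate with a simple KL estimate toward a reference policy, offered only as background; it is not the paper's RPG-REINFORCE formulation, and the coefficients are arbitrary:

```python
import math

def clipped_kl_objective(logp_new, logp_old, logp_ref, advantage,
                         clip_eps=0.2, kl_coef=0.05):
    """Per-token surrogate: clipped importance-weighted advantage minus a
    KL penalty toward a reference policy. Generic sketch, not the exact
    loss studied in the paper; clip_eps and kl_coef are illustrative."""
    ratio = math.exp(logp_new - logp_old)            # importance weight
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    kl = logp_new - logp_ref                         # crude per-token KL estimate
    return surrogate - kl_coef * kl

obj = clipped_kl_objective(logp_new=-1.0, logp_old=-1.2, logp_ref=-1.1,
                           advantage=1.0)
# ratio = e^0.2 ≈ 1.221 is clipped to 1.2, so obj = 1.2 - 0.05 * 0.1 = 1.195
```

The clip keeps large importance weights from dominating the gradient, while the KL term anchors the policy to the reference model; how these two pieces are designed and combined is the subject of the paper.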
- Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025
Red Team
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
- Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern.
- Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025
Pairwise Preference
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025
Pairwise Preference
However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning tasks.
- Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025
Automated fact-checking has been a challenging task for the research community.
- Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?
- Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025
Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation.
- What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025
We further benchmark against other English deception datasets following similar data collection protocols.
- Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart · May 19, 2025
The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that, unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1–99 are highly non-transparent and cannot be straightforwardly decomposed.
- BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng · May 18, 2025
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning.
- EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang · May 17, 2025
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs).
- Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang · May 16, 2025
Web Browsing
In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.