- Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025
Large Language Models (LLMs) excel at problem solving by generating chain of thoughts in natural language, but such verbal thinking is computationally costly and prone to overthinking.
- LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji · Sep 30, 2025
Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores compared to state-of-the-art baselines, across a diverse set of benchmarks.
- Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025
Reinforcement learning fine-tuning (RLFT) is a dominant paradigm for improving pretrained policies for downstream tasks.
- Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness · Sep 29, 2025
We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*.
- Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh · Sep 29, 2025
Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended v
- Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025
Large language models (LLMs) are increasingly used in modern search and answer systems to synthesize multiple, sometimes conflicting, texts into a single response, yet current pipelines offer weak incentives for sources to be accurate and a
- TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang · Sep 29, 2025
TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs.
- Inducing Dyslexia in Vision Language Models
Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf · Sep 29, 2025
Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses.
- Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
- SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025
This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
- Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025
Pairwise Preference
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
- Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025
Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
- mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar · Sep 27, 2025
We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data.
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025
The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025
Long Horizon
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li, Changdae Oh, Sharon Li · Sep 27, 2025
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism.
- Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai · Sep 27, 2025
To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively call back historical memories for non-linear reasoning.
- HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty · Sep 26, 2025
We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making.
- From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025
Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient.
- FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Haorui Chen, Chengze Li, Jia Li · Sep 26, 2025
However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge.
- LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025
A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
- SciTS: Scientific Time Series Understanding and Generation with LLMs
Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Jimin Zhuang · Sep 26, 2025
To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k instances, covering both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10 MHz in frequency.
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
- Fine-tuning Done Right in Model Editing
Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao · Sep 26, 2025
Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing.
- ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025
Pairwise Preference
In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
- Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025
Rubric Rating
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
- UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay · Sep 25, 2025
Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality.
- EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025
Expert Verification · Multi Agent
We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and
- Diversity Boosts AI-Generated Text Detection
Advik Raj Basani, Pin-Yu Chen · Sep 23, 2025
Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features.
- KANO: Kolmogorov-Arnold Neural Operator
Jin Lee, Ziming Liu, Xinling Yu, Yixuan Wang, Haewon Jeong · Sep 20, 2025
In the quantum Hamiltonian learning benchmark, KANO reconstructs ground-truth Hamiltonians in closed-form symbolic representations accurate to the fourth decimal place in coefficients and attains $\approx 6\times10^{-6}$ state infidelity fr
- Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation
Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti · Sep 18, 2025
Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges.
- ATTS: Asynchronous Test-Time Scaling via Conformal Prediction
Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng · Sep 18, 2025
Large language models (LLMs) benefit from test-time scaling but are often hampered by high inference latency.
- ClearFairy: Capturing Creative Workflows through Decision Structuring, In-Situ Questioning, and Rationale Inference
Kihoon Son, DaEun Choi, Tae Soo Kim, Young-Ho Kim, Sangdoo Yun · Sep 18, 2025
Critique Edit
Furthermore, exploratory applications demonstrate that captured steps can enhance generative AI agents in Figma, yielding predictions better aligned with professionals and producing coherent outcomes.
- A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness
Xuan Luo, Yue Wang, Zefeng He, Geng Tu, Jing Li · Sep 17, 2025
Red Team
This study reveals a critical safety blind spot in modern LLMs: learning-style queries, which closely resemble ordinary educational questions, can reliably elicit harmful responses.
- The AI Memory Gap: Users Misremember What They Created With AI or Without
Tim Zindulka, Sven Goller, Daniela Fernandes, Robin Welsch, Daniel Buschek · Sep 15, 2025
Our findings reveal a significant gap in memory: After AI use, the odds of correct attribution dropped, with the steepest decline in mixed human-AI workflows, where either the idea or elaboration was created with AI.
- Collaborative Document Editing with Multiple Users and AI Agents
Florian Lehmann, Krystsina Shauchenka, Daniel Buschek · Sep 15, 2025
Multi Agent
We propose integrating AI agents directly into collaborative writing environments.
- PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams -- Dataset Construction and Evaluation
Rodrigo M. Carrillo-Larco, Jesus Lovón Melgarejo, Manuel Castillo-Cara, Gusseppe Bravo-Rocca · Sep 15, 2025
BACKGROUND: Medical large language models (LLMs) have demonstrated remarkable performance in answering medical examinations.
- CogniAlign: Survivability-Grounded Multi-Agent Moral Reasoning for Safe and Transparent AI
Hasin Jawad Ali, Ilhamul Azam, Ajwad Abrar, Md. Kamrul Hasan, Hasan Mahmud · Sep 14, 2025
Multi Agent
The challenge of aligning artificial intelligence (AI) with human values persists due to the abstract and often conflicting nature of moral principles and the opacity of existing approaches.
- MedicalPatchNet: A Patch-Based Self-Explainable AI Architecture for Chest X-ray Classification
Patrick Wienholt, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn · Sep 9, 2025
Deep neural networks excel in radiological image classification but frequently suffer from poor interpretability, limiting clinical acceptance.
- Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios
Jingen Qu, Lijun Li, Bo Zhang, Yichen Yan, Jing Shao · Sep 4, 2025
Multimodal large language models (MLLMs) are rapidly evolving, presenting increasingly complex safety challenges.
- Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection
Shan Wang, Maying Shen, Nadine Chang, Chuong Nguyen, Hongdong Li · Sep 3, 2025
Experiments across multiple benchmarks demonstrate that GACD effectively reduces hallucinations and improves the visual grounding of MLLM outputs.
- Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR
Jiaming Li, Longze Chen, Ze Gong, Yukun Chen, Lu Wang · Sep 2, 2025
Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have empowered large language models (LLMs) to tackle challenging reasoning tasks such as mathematics and programming.
- Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba, Edward Raff, Ponnurangam Kumaraguru · Sep 2, 2025
To address these questions, we conduct controlled experiments across multiple explanation benchmark datasets (general and domain-specific) and label definition conditions, including expert-curated, LLM-generated, perturbed, and swapped defi
- Error Notebook-Guided, Training-Free Part Retrieval in 3D CAD Assemblies via Vision-Language Models
Yunqing Liu, Nan Zhang, Zhiming Tan · Sep 1, 2025
Pairwise Preference · Long Horizon
We additionally contribute a CAD dataset with human preference annotations.
- EO-1: An Open Unified Embodied Foundation Model for General Robot Control
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao · Aug 28, 2025
Long Horizon
The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general purpose embodied intelligent systems.
- NPG-Muse: Scaling Long Chain-of-Thought Reasoning with NP-Hard Graph Problems
Yuyao Wang, Bowen Liu, Jianheng Tang, Nuo Chen, Yuhan Li · Aug 28, 2025
However, developing these Long CoT behaviors relies heavily on post-training with high-quality datasets, which are typically costly and human-curated (e.g., mathematics and code), leaving scalable alternatives unexplored.
- Diffusion Language Models Know the Answer Before Decoding
Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan · Aug 27, 2025
Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality.
- Your AI Bosses Are Still Prejudiced: The Emergence of Stereotypes in LLM-Based Multi-Agent Systems
Jingyu Guo, Yingying Xu · Aug 27, 2025
Multi Agent
While stereotypes are well-documented in human social interactions, AI systems are often presumed to be less susceptible to such biases.
- Language and Experience: A Computational Model of Social Learning in Complex Tasks
Cédric Colas, Tracey Mills, Ben Prystawski, Michael Henry Tessler, Noah Goodman · Aug 26, 2025
The ability to combine linguistic guidance from others with direct experience is central to human development, enabling safe and rapid learning in new environments.
- Hybrid Deep Searcher: Scalable Parallel and Sequential Search Reasoning
Dayoon Ko, Jihyuk Kim, Haeju Park, Sohyeon Kim, Dahyun Lee · Aug 26, 2025
Long Horizon
Large reasoning models (LRMs) combined with retrieval-augmented generation (RAG) have enabled deep research agents capable of multi-step reasoning with external knowledge retrieval.
- Why Synthetic Isn't Real Yet: A Diagnostic Framework for Contact Center Dialogue Generation
Rishikesh Devanathan, Varun Nathan, Ayush Kumar · Aug 25, 2025
In this work, we benchmark multiple generation strategies guided by structured supervision on call attributes (Intent Summaries, Topic Flows, and Quality Assurance (QA) Forms) across multiple languages.
- Classification errors distort findings in automated speech processing: examples and solutions from child-development research
Lucas Gautheron, Evan Kidd, Anton Malko, Marvin Lavechin, Alejandrina Cristia · Aug 21, 2025
With the advent of wearable recorders, scientists are increasingly turning to automated methods of analysis of audio and video data in order to measure children's experience, behavior, and outcomes, with a sizable literature employing long-
- HebID: Detecting Social Identities in Hebrew-language Political Text
Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav · Aug 21, 2025
We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-$F_1$ = 0.74).
- Tokens with Meaning: A Hybrid Tokenization Approach for Turkish
M. Ali Bayram, Ali Arda Fincan, Ahmet Semih Gümüş, Sercan Karakaş, Banu Diri · Aug 19, 2025
We further validate practical utility with downstream sentence embedding benchmarks under a strict *random initialization* control to isolate tokenizer inductive bias.
- Depth-Breadth Synergy in RLVR: Unlocking LLM Reasoning Gains with Adaptive Exploration
Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Dongchun Xie · Aug 19, 2025
Reinforcement Learning with Verifiable Reward (RLVR) has emerged as a powerful paradigm for unlocking reasoning capabilities in large language models, yet its full potential is hindered by two under-explored dimensions: Depth-the hardest pr
- TASER: Table Agents for Schema-guided Extraction and Recommendation
Nicole Cho, Kirsty Fielding, William Watson, Sumitra Ganesh, Manuela Veloso · Aug 18, 2025
Critique Edit
To address this, we present TASER (Table Agents for Schema-guided Extraction and Recommendation), a continuously learning, agentic table extraction system that converts highly unstructured, multi-page, heterogeneous tables into normalized,
- SNAP-UQ: Self-supervised Next-Activation Prediction for Single-Pass Uncertainty in TinyML
Ismail Lamaakal, Chaymae Yahyati, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh · Aug 18, 2025
Reliable uncertainty estimation is a key missing piece for on-device monitoring in TinyML: microcontrollers must detect failures, distribution shift, or accuracy drops under strict flash/latency budgets, yet common uncertainty approaches (d
- CORE: Measuring Multi-Agent LLM Interaction Quality under Game-Theoretic Pressures
Punya Syon Pandey, Yongjin Yang, Jiarui Liu, Zhijing Jin · Aug 16, 2025
Pairwise Preference · Multi Agent
Game-theoretic interactions between agents with Large Language Models (LLMs) have revealed many emergent capabilities, yet the linguistic diversity of these interactions has not been sufficiently quantified.
- UbiQTree: Uncertainty Quantification in XAI with Tree Ensembles
Akshat Dubey, Aleksandar Anžel, Bahar İlgen, Georges Hattab · Aug 13, 2025
Explainable Artificial Intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), have become essential tools for interpreting complex ensemble tree-based models, especially in high-stakes domains such as healthcare analyt
- 1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
Wenkai Li, Liwen Sun, Zhenxiang Guan, Xuhui Zhou, Maarten Sap · Aug 11, 2025
Multi Agent
We introduce a multi-agent framework that decomposes privacy reasoning into specialized subtasks (extraction, classification), reducing the information load on any single agent while enabling iterative validation and more reliable adherence