- Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
- SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025
This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
- Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
- Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025
Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
- mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar · Sep 27, 2025
We build upon vec2vec, a procedure designed to align text embedding spaces without parallel data.
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Ziwei He · Sep 27, 2025
The remarkable success of Chain-of-Thought (CoT), which enhances performance by scaling generation steps at test-time, inspires us to ask: can we leverage a similar scaling of computational steps during pretraining to improve the generation…
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li, Changdae Oh, Sharon Li · Sep 27, 2025
Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods often fail to realize optimism.
- Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai · Sep 27, 2025
To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively recall historical memories for non-linear reasoning.
- HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty · Sep 26, 2025
We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making.
- From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025
Despite its recent successes, Deep Reinforcement Learning (DRL) is notoriously sample-inefficient.
- FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Haorui Chen, Chengze Li, Jia Li · Sep 26, 2025
However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a significant challenge.
- LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025
A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
- SciTS: Scientific Time Series Understanding and Generation with LLMs
Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Jimin Zhuang · Sep 26, 2025
To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k instances, covering both univariate and multivariate signals ranging from $10^0$ to $10^7$ in length and up to 10~MHz in frequency.
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025
Post-training compression of large language models (LLMs) often relies on low-rank weight approximations that represent each column of the weight matrix in a shared low-dimensional subspace.
- Fine-tuning Done Right in Model Editing
Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao · Sep 26, 2025
Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing.
- ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025
In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
- Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
- UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay · Sep 25, 2025
Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality.
- EpidemIQs: Prompt-to-Paper LLM Agents for Epidemic Modeling and Analysis
Mohammad Hossein Samaei, Faryad Darabi Sahneh, Lee W. Cohnstaedt, Caterina Scoglio · Sep 24, 2025
We introduce EpidemIQs, a novel multi-agent LLM framework that integrates user inputs and autonomously conducts literature review, analytical derivation, network modeling, mechanistic modeling, stochastic simulations, data visualization and…
- Diversity Boosts AI-Generated Text Detection
Advik Raj Basani, Pin-Yu Chen · Sep 23, 2025
Motivated by the observation that human-authored text exhibits richer variability in lexical and structural unpredictability than LLM outputs, DivEye captures this signal through a set of interpretable statistical features.