- Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links
Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson · Sep 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025 · Citations: 0
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
- VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu · Sep 28, 2025 · Citations: 0
- SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025 · Citations: 0
This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
- From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin · Sep 28, 2025 · Citations: 0
Multi-Agent
In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task.
- Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025 · Citations: 0
Pairwise Preference
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
- M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li · Sep 28, 2025 · Citations: 0
- AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li · Sep 28, 2025 · Citations: 0
- Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025 · Citations: 0
Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
- Internal Planning in Language Models: Characterizing Horizon and Branch Awareness
Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, Guannan Qu · Sep 28, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra · Sep 27, 2025 · Citations: 0
However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized.
- Mapping Overlaps in Benchmarks through Perplexity in the Wild
Siyang Wu, Honglin Bao, Sida Li, Ari Holtzman, James A. Evans · Sep 27, 2025 · Citations: 0
We introduce benchmark signatures to characterize the capacity demands of LLM benchmarks and their overlaps.
- Your Models Have Thought Enough: Training Large Reasoning Models to Stop Overthinking
Jinyi Han, Ying Huang, Ying Liao, Zishang Jiang, Xikun Lu · Sep 27, 2025 · Citations: 0
Long Horizon
Notably, DeepSeek-Distill-Qwen-1.5B achieves a 4.6% accuracy gain while reducing output length by 46.3% on the Olympiad benchmark.
- Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization
Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng · Sep 27, 2025 · Citations: 0
Pairwise Preference
To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training.
- Dual-Space Smoothness for Robust and Balanced LLM Unlearning
Han Yan, Zheyuan Liu, Meng Jiang · Sep 27, 2025 · Citations: 0
Red Team
As large language models evolve, machine unlearning has emerged to address growing concerns around user privacy, copyright infringement, and overall safety.
- Learning to Reason in Structured In-context Environments with Reinforcement Learning
Peng Yu, Zeyuan Zhao, Shao Zhang, Luoyi Fu, Xinbing Wang · Sep 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling
Jonas Ngnawé, Maxime Heuillet, Sabyasachi Sahoo, Yann Pequignot, Ola Ahmad · Sep 27, 2025 · Citations: 0
- mini-vec2vec: Scaling Universal Geometry Alignment with Linear Transformations
Guy Dar · Sep 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Runyan Tan, Shuang Wu, Phillip Howard · Sep 27, 2025 · Citations: 0
- AutoEP: LLMs-Driven Automation of Hyperparameter Evolution for Metaheuristic Algorithms
Zhenxing Xu, Yizhe Zhang, Weidong Bao, Hao Wang, Ming Chen · Sep 27, 2025 · Citations: 0
Evaluated on three distinct metaheuristics across diverse combinatorial optimization benchmarks, AutoEP consistently outperforms state-of-the-art tuners, including neural evolution and other LLM-based methods.
- PonderLM-2: Pretraining LLM with Latent Thoughts in Continuous Space
Boyi Zeng, He Li, Shixiang Song, Yixuan Wang, Zitong Wang · Sep 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Test-Time Policy Adaptation for Enhanced Multi-Turn Interactions with LLMs
Chenxing Wei, Hong Wang, Ying He, Fei Yu, Yao Shu · Sep 27, 2025 · Citations: 0
Pairwise Preference
To address this limitation, we first propose a new paradigm: Test-Time Policy Adaptation for Multi-Turn Interactions (T2PAM), which utilizes user feedback from the ongoing interaction as a reward signal to estimate a latent optimal policy…
- Non-Collaborative User Simulators for Tool Agents
Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon Kook, Yohan Jo · Sep 27, 2025 · Citations: 0
- RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
Haoyu He, Haozheng Luo, Yan Chen, Qi R. Wang · Sep 27, 2025 · Citations: 0
Long Horizon
Predicting human mobility is inherently challenging due to complex long-range dependencies and multi-scale periodic behaviors.
- General Exploratory Bonus for Optimistic Exploration in RLHF
Wendi Li, Changdae Oh, Sharon Li · Sep 27, 2025 · Citations: 0
Optimistic exploration is central to improving sample efficiency in reinforcement learning from human feedback, yet existing exploratory bonus methods often fail to realize optimism.
- Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding
Na Min An, Inha Kang, Minhyun Lee, Hyunjung Shim · Sep 27, 2025 · Citations: 0
Motivated by these findings, we identify an underexplored pathway within VLE mid-layers to construct a spatial map, applicable for improving zero-shot RIS by 1-7 mIoU on nine RefCOCO benchmarks.
- d$^2$Cache: Accelerating Diffusion-Based LLMs via Dual Adaptive Caching
Yuchu Jiang, Yue Cai, Xiangzhong Luo, Jiale Fu, Jiarui Wang · Sep 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Semantic Voting: A Self-Evaluation-Free Approach for Efficient LLM Self-Improvement on Unverifiable Open-ended Tasks
Chunyang Jiang, Yonggang Zhang, Yiyang Cai, Chi-Min Chan, Yulong Liu · Sep 27, 2025 · Citations: 0
As a result, self-evaluation mechanisms (e.g., self-judging and entropy minimization) are predominantly used to derive pseudo-labels.
- Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents
Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai · Sep 27, 2025 · Citations: 0
To tackle these challenges, we present ReMemR1, which integrates the mechanism of memory retrieval into the memory update process, enabling the agent to selectively recall historical memories for non-linear reasoning.
- AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents
Hossein Sholehrasa, Amirhossein Ghanaatian, Doina Caragea, Lisa A. Tell, Jim E. Riviere · Sep 26, 2025 · Citations: 0
Pharmacokinetics (PK) plays a critical role in drug development and regulatory decision-making for human and veterinary medicine, directly affecting public health through drug safety and efficacy assessments.
- Induction Signatures Are Not Enough: A Matched-Compute Study of Load-Bearing Structure in In-Context Learning
Mohammed Sabry, Anya Belz · Sep 26, 2025 · Citations: 0
Across 0.13B-1B decoder-only models, we evaluate (i) few-shot performance on standard LM benchmarks and function-style ICL probes, (ii) head-level copy telemetry, and (iii) held-out perplexity as a guardrail.
- Compute-Optimal Quantization-Aware Training
Aleksandr Dremov, David Grangier, Angelos Katharopoulos, Awni Hannun · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings
Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton · Sep 26, 2025 · Citations: 0
- HEART: Emotionally-Driven Test-Time Scaling of Language Models
Gabriela Pinto, Palash Goyal, Mihir Parmar, Yiwen Song, Souradip Chakraborty · Sep 26, 2025 · Citations: 0
We introduce HEART, a framework that uses emotional cues to guide the model's focus, much like how feelings contribute to human decision-making.
- Critique-Coder: Enhancing Coder Models by Critique Reinforcement Learning
Chi Ruan, Dongfu Jiang, Yubo Wang, Wenhu Chen · Sep 26, 2025 · Citations: 0
Critique Edit
We fine-tune multiple models (Critique-Coder) and evaluate them on different benchmarks to show their advantages over RL-only models.
- Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty · Sep 26, 2025 · Citations: 0
Pairwise Preference
We investigate the relationship between this notion of creativity and n-gram novelty through 8,618 expert writer annotations of novelty, pragmaticality, and sensicality via close reading of human- and AI-generated text.
- StateX: Enhancing RNN Recall via Post-training State Expansion
Xingyu Shen, Yingfa Chen, Zhen Leng Thai, Xu Han, Zhiyuan Liu · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- IA2: Alignment with ICL Activations Improves Supervised Fine-Tuning
Aayush Mishra, Daniel Khashabi, Anqi Liu · Sep 26, 2025 · Citations: 0
Demonstrations
Performing IA2 as a priming step before SFT significantly improves the accuracy and calibration of model outputs, as shown by our extensive empirical results on 12 popular benchmarks and two model families.
- Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective
Siwei Wang, Yifei Shen, Haoran Sun, Shi Feng, Shang-Hua Teng · Sep 26, 2025 · Citations: 0
Finally, applying our framework to the real-world planning benchmark Blocksworld, we confirm that these behaviors manifest in practice.
- From Formal Language Theory to Statistical Learning: Finite Observability of Subregular Languages
Katsuhiko Hayashi, Hidetaka Kamigaito · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- From Parameters to Behaviors: Unsupervised Compression of the Policy Space
Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation
Shichao Weng, Zhiqiang Wang, Yuhua Zhou, Rui Lu, Ting Liu · Sep 26, 2025 · Citations: 0
- Bridging Kolmogorov Complexity and Deep Learning: Asymptotically Optimal Description Length Objectives for Transformers
Peter Shaw, James Cohan, Jacob Eisenstein, Kristina Toutanova · Sep 26, 2025 · Citations: 0
Demonstrations
The Minimum Description Length (MDL) principle offers a formal framework for applying Occam's razor in machine learning.
- FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation
Haorui Chen, Chengze Li, Jia Li · Sep 26, 2025 · Citations: 0
To address these limitations, we propose FeatBench, a new benchmark that introduces the following advances: (1) Realistic Task Inputs.
- LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning
Tiago Fernandes Tavares · Sep 26, 2025 · Citations: 0
A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture.
- Bridging Draft Policy Misalignment: Group Tree Optimization for Speculative Decoding
Shijing Hu, Jingyang Li, Zhihui Lu, Pan Zhou · Sep 26, 2025 · Citations: 0
- SciTS: Scientific Time Series Understanding and Generation with LLMs
Wen Wu, Ziyang Zhang, Liwei Liu, Xuenan Xu, Jimin Zhuang · Sep 26, 2025 · Citations: 0
To address these gaps, we introduce SciTS, a benchmark spanning 12 scientific domains and 43 tasks, with over 50k instances covering both univariate and multivariate signals, ranging from 10^0 to 10^7 in length and up to 10 MHz in frequency.
- SecureVibeBench: Evaluating Secure Coding Capabilities of Code Agents with Realistic Vulnerability Scenarios
Junkai Chen, Huihui Huang, Yunbo Lyu, Junwen An, Jieke Shi · Sep 26, 2025 · Citations: 0
Large language model-powered code agents are rapidly transforming software engineering, yet the security risks of their generated code have become a critical concern.
- CoSpaDi: Compressing LLMs via Calibration-Guided Sparse Dictionary Learning
Denis Makhov, Dmitriy Shopkhoev, Magauiya Zhussip, Ammar Ali, Stamatios Lefkimmiatis · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Fine-tuning Done Right in Model Editing
Wanli Yang, Rui Tang, Hongyu Zang, Du Su, Qi Cao · Sep 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference
Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella · Sep 26, 2025 · Citations: 0
Long Horizon
In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement.
- ERGO: Efficient High-Resolution Visual Understanding for Vision-Language Models
Jewon Lee, Wooksu Shin, Seungmin Yang, Ki-Ung Song, DongUk Lim · Sep 26, 2025 · Citations: 0
For instance, ERGO surpasses Qwen2.5-VL-7B on the V* benchmark by 4.7 points while using only 23% of the vision tokens, achieving a 3x inference speedup.
- Leveraging Wireless Sensor Networks for Real-Time Monitoring and Control of Industrial Environments
Muhammad Junaid Asif, Abdul Rehman, Asim Mehmood, Rana Fayyaz Ahmad, Shazia Saqib · Sep 26, 2025 · Citations: 0
- ProPerSim: Developing Proactive and Personalized AI Assistants through User-Assistant Simulation
Jiho Kim, Junseong Choi, Woosog Chay, Daeun Kyung, Yeonsu Kwon · Sep 26, 2025 · Citations: 0
Pairwise Preference
In our simulation environment, a user agent with a rich persona interacts with the assistant, providing ratings on how well each suggestion aligns with its preferences and context.
- ReviewScore: Misinformed Peer Review Detection with Large Language Models
Hyun Ryu, Doohyuk Jang, Hyemin S. Lee, Joonhyun Jeong, Gyeongman Kim · Sep 25, 2025 · Citations: 0
We build a human expert-annotated ReviewScore dataset to assess whether LLMs can automate ReviewScore evaluation.
- Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices
Yilong Li, Shuai Zhang, Yijing Zeng, Hao Zhang, Xinmiao Xiong · Sep 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Quokka: Accelerating Program Verification with LLMs via Invariant Synthesis
Anjiang Wei, Tianran Sun, Tarun Suresh, Haoze Wu, Ke Wang · Sep 25, 2025 · Citations: 0
We introduce Quokka, an evaluation-oriented framework for LLM-based invariant synthesis that provides sound evaluation and achieves state-of-the-art performance.
- AutoClimDS: Climate Data Science Agentic AI -- A Knowledge Graph is All You Need
Ahmed Jaber, Wangshu Zhu, Ayon Roy, Karthick Jayavelu, Justin Downes · Sep 25, 2025 · Citations: 0
- Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training
Junkai Zhang, Zihao Wang, Lin Gui, Swarnashree Mysore Sathyendra, Jaehwan Jeong · Sep 25, 2025 · Citations: 0
Rubric Rating
Reinforcement fine-tuning (RFT) often suffers from reward over-optimization, where a policy model hacks the reward signals to achieve high scores while producing low-quality outputs.
- UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages
Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay · Sep 25, 2025 · Citations: 0
Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality.