- BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley · Sep 30, 2025 · Citations: 0
Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading…
- PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh · Sep 30, 2025 · Citations: 0
Pairwise Preference · Rubric Rating
We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…
- DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh · Sep 30, 2025 · Citations: 0
Long Horizon
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings.
- MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi · Sep 30, 2025 · Citations: 0
Pairwise Preference · Rubric Rating
To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
- OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li · Sep 30, 2025 · Citations: 0
- On Deepfake Voice Detection -- It's All in the Presentation
Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro · Sep 30, 2025 · Citations: 0
- Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo · Sep 30, 2025 · Citations: 0
Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities.
- EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu · Sep 30, 2025 · Citations: 0
Pairwise Preference
To address this critical bottleneck, we built EditReward, trained with our new large-scale human preference dataset, meticulously annotated by trained experts following a rigorous protocol containing over 200K preference pairs.
- Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi, Jacopo Staiano, Antonio Liotta · Sep 30, 2025 · Citations: 0
Critique Edit
ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores.
- SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Christoph Timmermann, Hyunse Lee, Woojin Lee · Sep 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares · Sep 30, 2025 · Citations: 0
We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages.
- Vector sketch animation generation with differentiable motion trajectories
Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao · Sep 30, 2025 · Citations: 0
- Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang · Sep 30, 2025 · Citations: 0
Long Horizon
Experimental results show DECS reduces reasoning tokens by over 50% across seven benchmarks while maintaining or even improving performance.
- v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui · Sep 30, 2025 · Citations: 0
AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions.
- LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts
Yuan Zhuang, Yi Shen, Yuexin Bian, Qing Su, Shihao Ji · Sep 30, 2025 · Citations: 0
Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores across a diverse set of benchmarks, compared to state-of-the-art baselines.
- Calibrating Verbalized Confidence with Self-Generated Distractors
Victor Wang, Elias Stengel-Eskin · Sep 29, 2025 · Citations: 0
- The Rise of AfricaNLP: A Survey of Contributions, Contributors, Community Impact, and Bibliometric Analysis
Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse · Sep 29, 2025 · Citations: 0
We quantitatively examine two decades (2005-2025) of contributions to AfricaNLP research, using a dataset of 2.2K NLP papers, 4.9K contributing authors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions), along with…
- Polychromic Objectives for Reinforcement Learning
Jubayer Ibn Hamid, Ifdita Hasan Orney, Ellen Xu, Chelsea Finn, Dorsa Sadigh · Sep 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs
Shane Bergsma, Nolan Dey, Joel Hestness · Sep 29, 2025 · Citations: 0
We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*.
- Generative Value Conflicts Reveal LLM Priorities
Andy Liu, Kshitish Ghate, Mona Diab, Daniel Fried, Atoosa Kasirzadeh · Sep 29, 2025 · Citations: 0
Comparing results between multiple-choice and open-ended evaluations, we find that models shift away from supporting protective values, such as harmlessness, and toward supporting personal values, such as user autonomy, in more open-ended…
- Pretraining with hierarchical memories: separating long-tail and common knowledge
Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel · Sep 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Incentive-Aligned Multi-Source LLM Summaries
Yanchen Jiang, Zhe Feng, Aranyak Mehta · Sep 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- EasySteer: A Unified Framework for High-Performance and Extensible LLM Steering
Haolei Xu, Xinyu Mei, Yuchen Yan, Rui Zhou, Wenqi Zhang · Sep 29, 2025 · Citations: 0
Demonstrations
We present EasySteer, a unified framework for high-performance, extensible LLM steering built on vLLM.
- Pretraining Large Language Models with NVFP4
NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben · Sep 29, 2025 · Citations: 0
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang · Sep 29, 2025 · Citations: 0
- Dive into the Agent Matrix: A Realistic Evaluation of Self-Replication Risk in LLM Agents
Boxuan Zhang, Yi Yu, Jiaxuan Guo, Jing Shao · Sep 29, 2025 · Citations: 0
The widespread deployment of Large Language Model agents such as OpenClaw unlocks potential in real-world applications while amplifying safety concerns.
- Towards Personalized Deep Research: Benchmarks and Evaluations
Yuan Liang, Jiaxian Li, Yuqing Wang, Piaohong Wang, Motong Tian · Sep 29, 2025 · Citations: 0
- Scaling with Collapse: Efficient and Predictable Training of LLM Families
Shane Bergsma, Bin Claire Zhang, Nolan Dey, Shaheer Muhammad, Gurpreet Gosal · Sep 29, 2025 · Citations: 0
- Scaling Generalist Data-Analytic Agents
Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang · Sep 29, 2025 · Citations: 0
Long Horizon
Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI.
- Ultra-Fast Language Generation via Discrete Diffusion Divergence Instruct
Haoyang Zheng, Xinyang Liu, Cindy Xiangrui Kong, Nan Jiang, Zheyuan Hu · Sep 29, 2025 · Citations: 0
On the OpenWebText benchmark, DiDi-Instruct achieves perplexity ranging from 62.2 (8 NFEs) to 18.4 (128 NFEs), outperforming prior accelerated dLLMs and the GPT-2 baseline.
- Agentic Exploration of Physics Models
Maximilian Nägele, Florian Marquardt · Sep 29, 2025 · Citations: 0
Here, we introduce SciExplorer, an agent that leverages large language model tool-use capabilities to enable exploration of systems without any domain-specific blueprints, and apply it to physical systems that are initially unknown to the…
- MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes
Changsheng Zhao, Ernie Chang, Zechun Liu, Chia-Jung Chang, Wei Wen · Sep 29, 2025 · Citations: 0
- Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball · Sep 29, 2025 · Citations: 0
Rubric Rating
Despite their support capabilities, whether LLMs can safely detect and respond to crises such as suicidal ideation and self-harm remains unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards.
- TimeOmni-1: Incentivizing Complex Reasoning with Time Series in Large Language Models
Tong Guan, Zijie Meng, Dianqi Li, Shiyu Wang, Chao-Han Huck Yang · Sep 29, 2025 · Citations: 0
TSR-Suite is the first comprehensive time series reasoning suite that supports not only thorough evaluation but also the data pipeline and training of TSRMs.
- VSSFlow: Unifying Video-conditioned Sound and Speech Generation via Joint Learning
Xin Cheng, Yuyue Wang, Xihua Wang, Yihan Wu, Kaisi Guan · Sep 29, 2025 · Citations: 0
Extensive experiments on V2S, VisualTTS and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the critical potential of unified generative…
- ProxyAttn: Guided Sparse Attention via Representative Heads
Yixuan Wang, Huang He, Siqi Bao, Hua Wu, Haifeng Wang · Sep 29, 2025 · Citations: 0
By combining the scores from representative proxy heads with multi-head dynamic budgets, we achieve a more fine-grained block importance evaluation at low computational cost.
- Stop Before You Fail: Operational Capability Boundaries for Mitigating Unproductive Reasoning in Large Reasoning Models
Qingjie Zhang, Yujia Fu, Yang Wang, Liu Yan, Tao Wei · Sep 29, 2025 · Citations: 0
In white-box settings, we show that the hidden states of the last input token contain information that is predictive of whether a question will not be solved correctly under our evaluation setup.
- Inducing Dyslexia in Vision Language Models
Melika Honarmand, Ayati Sharma, Badr AlKhamissi, Johannes Mehrer, Martin Schrimpf · Sep 29, 2025 · Citations: 0
Using stimuli from cognitive neuroscience, we identify visual-word-form-selective units within VLMs and demonstrate that they predict human VWFA neural responses.
- Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
Hamna Hamna, Gayatri Bhat, Sourabrata Mukherjee, Faisal Lalani, Evan Hadfield · Sep 29, 2025 · Citations: 0
Large Language Models (LLMs) are typically evaluated through general or domain-specific benchmarks testing capabilities that often lack grounding in the lived realities of end users.
- SUIT: Knowledge Editing with Subspace-Aware Key-Value Mappings
Haewon Park, Sangwoo Kim, Yohan Jo · Sep 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention
Yichi Zhang, Yue Ding, Jingwen Yang, Tianwei Luo, Dongbai Li · Sep 29, 2025 · Citations: 0
Pairwise Preference · Red Team
Motivated by these, we propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong…
- HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
Langqi Yang, Tianhang Zheng, Yixuan Chen, Kedong Xiu, Hao Zhou · Sep 29, 2025 · Citations: 0
To address this gap, we present HarmMetric Eval, a systematic benchmark for assessing the quality of harmfulness metrics and judges with varying formats and scales.
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models
Zherui Li, Zheng Nie, Zhenhong Zhou, Yue Liu, Yitong Zhang · Sep 29, 2025 · Citations: 0
Red Team
Experimental results reveal a harmful bias inherent in the standard greedy remasking strategy and identify a critical phenomenon we term Denoising-path Dependence, where the safety of early-stage tokens decisively influences the final…
- SimuHome: A Temporal- and Environment-Aware Benchmark for Smart Home LLM Agents
Gyuhyeon Seo, Jungwoo Yang, Junseong Pyo, Nalim Kim, Jonggeun Lee · Sep 29, 2025 · Citations: 0
We introduce SimuHome, a high-fidelity smart home simulator and a benchmark of 600 episodes for LLM-based smart home agents.
- G-reasoner: Foundation Models for Unified Reasoning over Graph-structured Knowledge
Linhao Luo, Zicheng Zhao, Junnan Liu, Zhangchi Qiu, Junnan Dong · Sep 29, 2025 · Citations: 0
- Prompt and Parameter Co-Optimization for Large Language Models
Xiaohe Bo, Rui Li, Zexu Sun, Quanyu Dai, Zeyu Zhang · Sep 29, 2025 · Citations: 0
Extensive experiments across diverse benchmarks show that our method consistently outperforms the baselines.
- BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre · Sep 29, 2025 · Citations: 0
- Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models
Yuhui Wang, Changjiang Li, Guangke Chen, Jiacheng Liang, Ting Wang · Sep 29, 2025 · Citations: 0
- Pragmatic Inference for Moral Reasoning Acquisition: Generalization via Metapragmatic Links
Guangliang Liu, Xi Chen, Bocheng Chen, Xitong Zhang, Kristen Johnson · Sep 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Uncovering Grounding IDs: How External Cues Shape Multimodal Binding
Hosein Hasani, Amirmohammad Izadi, Fatemeh Askari, Mobin Bagherian, Sadegh Mohammadian · Sep 28, 2025 · Citations: 0
Large vision-language models (LVLMs) show strong performance across multimodal benchmarks but remain limited in structured reasoning and precise grounding.
- VoiceBridge: General Speech Restoration with One-step Latent Bridge Models
Chi Zhang, Kaiwen Zheng, Zehua Chen, Jun Zhu · Sep 28, 2025 · Citations: 0
- SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models
Ziyi Yang, Weizhou Shen, Chenliang Li, Ruijun Chen, Fanqi Wan · Sep 28, 2025 · Citations: 0
This gap arises not only from the intrinsic difficulty of processing long texts, but also from the scarcity of reliable human annotations and programmatically verifiable reward signals.
- From What to Why: A Multi-Agent System for Evidence-based Chemical Reaction Condition Reasoning
Cheng Yang, Jiaxuan Lu, Haiyuan Wan, Junchi Yu, Feiwei Qin · Sep 28, 2025 · Citations: 0
Multi-Agent
In this work, we propose ChemMAS, a multi-agent system that reframes condition prediction as an evidence-based reasoning task.
- Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning
Yucheng Wang, Yifan Hou, Aydin Javadov, Mubashara Akhtar, Mrinmaya Sachan · Sep 28, 2025 · Citations: 0
Pairwise Preference
These inconsistencies stem from a lack of controlled evaluation frameworks and analysis of models' internals to isolate when and why modality interactions support or undermine reasoning.
- M3DLayout: A Multi-Source Dataset of 3D Indoor Layouts and Structured Descriptions for 3D Generation
Yiheng Zhang, Zhuojiang Cai, Mingdao Wang, Meitong Guo, Tianxiao Li · Sep 28, 2025 · Citations: 0
- AudioMoG: Guiding Audio Generation with Mixture-of-Guidance
Junyou Wang, Zehua Chen, Binjie Yuan, Kaiwen Zheng, Chang Li · Sep 28, 2025 · Citations: 0
- Characteristic Root Analysis and Regularization for Linear Time Series Forecasting
Zheng Wang, Kaixuan Zhang, Wanfang Chen, Xiaonan Lu, Longyuan Li · Sep 28, 2025 · Citations: 0
Extensive experiments on standard benchmarks demonstrate the effectiveness of both approaches, validating our theoretical insights and achieving state-of-the-art results in several settings.
- Internal Planning in Language Models: Characterizing Horizon and Branch Awareness
Muhammed Ustaomeroglu, Baris Askin, Gauri Joshi, Carlee Joe-Wong, Guannan Qu · Sep 28, 2025 · Citations: 0
Long Horizon
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra · Sep 27, 2025 · Citations: 0
However, the nature of and interaction between these dependencies within current benchmark evaluations remains poorly characterized.