- SECA: Semantically Equivalent and Coherent Attacks for Eliciting LLM Hallucinations
Buyun Liang, Liangzu Peng, Jinqi Luo, Darshan Thaker, Kwan Ho Ryan Chan · Oct 5, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Large Language Models Hallucination: A Comprehensive Survey
Aisha Alansari, Hamzah Luqman · Oct 5, 2025 · Citations: 0
We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLM hallucinations.
- Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation
Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary · Oct 5, 2025 · Citations: 0
Rubric Rating
We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model's underlying success probability and credible intervals, yielding stable rankings and…
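The abstract's core idea, replacing Pass@k with a posterior over the model's underlying success probability, follows the standard Beta-Bernoulli setup. A minimal sketch of that general construction (prior choice, function name, and Monte Carlo interval estimation are illustrative assumptions, not the paper's exact method):

```python
import random

def posterior_summary(successes, trials, alpha=1.0, beta=1.0,
                      level=0.95, draws=20000, seed=0):
    """Beta-Bernoulli posterior over a model's success probability.

    With a Beta(alpha, beta) prior and `successes` out of `trials`,
    the posterior is Beta(alpha + successes, beta + trials - successes).
    Returns the posterior mean and an equal-tailed credible interval,
    estimated here by Monte Carlo sampling (stdlib only).
    """
    a = alpha + successes
    b = beta + trials - successes
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(a, b) for _ in range(draws))
    lo = samples[int((1 - level) / 2 * draws)]
    hi = samples[int((1 + level) / 2 * draws) - 1]
    return a / (a + b), (lo, hi)

# 7 successes in 10 trials: a point estimate of 0.7 hides how wide
# the remaining uncertainty is at this sample size.
m, (lo, hi) = posterior_summary(7, 10)
print(f"posterior mean={m:.3f}, {int(0.95*100)}% CI=({lo:.3f}, {hi:.3f})")
```

Unlike avg@N, the interval width makes explicit when two models' rankings are not yet statistically distinguishable.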
- AlphaApollo: A System for Deep Agentic Reasoning
Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li · Oct 5, 2025 · Citations: 0
Tool Use
We present AlphaApollo, an agentic reasoning system that targets two bottlenecks in foundation-model reasoning: (1) limited reasoning capacity for complex, long-horizon problem solving and (2) unreliable test-time evolution without…
- PoLi-RL: A Point-to-List Reinforcement Learning Framework for Conditional Semantic Textual Similarity
Zixin Song, Bowen Zhang, Qian-Wen Zhang, Di Yin, Xing Sun · Oct 5, 2025 · Citations: 0
Pairwise Preference
On the official C-STS benchmark, PoLi-RL achieves a Spearman correlation coefficient of 48.18, establishing a new SOTA for the cross-encoder architecture.
- Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Ziyan Wang, Zheng Wang, Xingwei Qu, Qi Cheng, Jie Fu · Oct 5, 2025 · Citations: 0
Long Horizon
Specifically, it outperforms GRPO by up to 2.80 points on average across math reasoning benchmarks.
- What Scales in Cross-Entropy Scaling Law?
Junxi Yan, Zixi Wei, Qingyao Ai, Yiqun Liu, Jingtao Zhan · Oct 5, 2025 · Citations: 0
- Person-Centric Annotations of LAION-400M: Auditing Bias and Its Transfer to Models
Leander Girrbach, Stephan Alaniz, Genevieve Smith, Trevor Darrell, Zeynep Akata · Oct 4, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning
Wenlong Deng, Yi Ren, Yushu Li, Boying Gong, Danica J. Sutherland · Oct 4, 2025 · Citations: 0
We validate the efficacy of this algorithm on diverse math reasoning benchmarks.
- MonitorVLM: A Vision Language Framework for Safety Violation Detection in Mining Operations
Jiang Wu, Sichao Wu, Yinsong Ma, Guangyuan Yu, Haoyuan Xu · Oct 4, 2025 · Citations: 0
- Expressive Power of Implicit Models: Rich Equilibria and Test-Time Scaling
Jialin Liu, Lisang Ding, Stanley Osher, Wotao Yin · Oct 4, 2025 · Citations: 0
- AgentHub: A Registry for Discoverable, Verifiable, and Reproducible AI Agents
Erik Pautsch, Tanmay Singla, Parv Kumar, Wenxin Jiang, Huiyun Peng · Oct 3, 2025 · Citations: 0
LLM-based agents are rapidly proliferating, yet the infrastructure for discovering, evaluating, and governing them remains fragmented compared to mature ecosystems like software package registries (e.g., npm) and model hubs (e.g., Hugging…
- Attention-Aligned Reasoning for Large Language Models
Hongxiang Zhang, Yuan Tian, Tianyi Zhang · Oct 3, 2025 · Citations: 0
Our experiments show that ATAR outperforms SOTA methods across six benchmarks, achieving up to 15.39% absolute improvement.
- Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai · Oct 3, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Markovian Reeb Graphs for Simulating Spatiotemporal Patterns of Life
Anantajit Subrahmanya, Chandrakanth Gudavalli, Connor Levenson, B. S. Manjunath · Oct 3, 2025 · Citations: 0
- Finding Diamonds in Conversation Haystacks: A Benchmark for Conversational Data Retrieval
Yohan Lee, Yongwoo Song, Sangyeop Kim · Oct 3, 2025 · Citations: 0
We present the Conversational Data Retrieval (CDR) benchmark, the first comprehensive test set for evaluating systems that retrieve conversation data for product insights.
- Disentangling Recall and Reasoning in Transformer Models through Layer-wise Attention and Activation Analysis
Harshwardhan Fartale, Ashish Kattamuri, Rahul Raja, Arpita Vats, Ishita Prasad · Oct 3, 2025 · Citations: 0
- Unraveling Syntax: How Language Models Learn Context-Free Grammars
Laura Ying Schulz, Daniel Mitropolsky, Tomaso Poggio · Oct 2, 2025 · Citations: 0
- Tree-based Dialogue Reinforced Policy Optimization for Red-Teaming Attacks
Ruohao Guo, Afshin Oroojlooy, Roshan Sridhar, Miguel Ballesteros, Alan Ritter · Oct 2, 2025 · Citations: 0
- BioX-Bridge: Model Bridging for Unsupervised Cross-Modal Knowledge Transfer across Biosignals
Chenqi Li, Yu Liu, Timothy Denison, Tingting Zhu · Oct 2, 2025 · Citations: 0
Biosignals offer valuable insights into the physiological states of the human body.
- Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei · Oct 2, 2025 · Citations: 0
To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across…
- ExGRPO: Learning to Reason from Experience
Runzhe Zhan, Yafu Li, Zhi Wang, Xiaoye Qu, Dongrui Liu · Oct 2, 2025 · Citations: 0
- AccurateRAG: A Framework for Building Accurate Retrieval-Augmented Question-Answering Applications
Linh The Nguyen, Chi Tran, Dung Ngoc Nguyen, Van-Cuong Pham, Hoang Ngo · Oct 2, 2025 · Citations: 0
We introduce AccurateRAG -- a novel framework for constructing high-performance question-answering applications based on retrieval-augmented generation (RAG).
- StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets?
Yanxu Chen, Zijun Yao, Yantao Liu, Amy Xin, Jin Ye · Oct 2, 2025 · Citations: 0
Tool Use
Large language models (LLMs) demonstrate strong potential as autonomous agents, with promising capabilities in reasoning, tool use, and sequential decision-making.
- Dynamic Stress Detection: A Study of Temporal Progression Modelling of Stress in Speech
Vishakha Lall, Yisi Liu · Oct 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FOR-Prompting: From Objection to Revision via an Asymmetric Prompting Protocol
He Zhang, Anzhou Zhang, Jian Dai · Oct 2, 2025 · Citations: 0
Pairwise Preference
Critique Edit
Beyond structured math tasks, FOR-Prompting supports refinement in open-ended and multi-stage tasks: qualitative analysis shows improved exploration, coverage, and specificity, and a blind study of human preferences found that participants…
- VL-KnG: Persistent Spatiotemporal Knowledge Graphs from Egocentric Video for Embodied Scene Understanding
Mohamad Al Mdfaa, Svetlana Lukina, Timur Akhtyamov, Arthur Nigmatzyanov, Dmitrii Nalberskii · Oct 1, 2025 · Citations: 0
- Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort
Xinpeng Wang, Nitish Joshi, Barbara Plank, Rico Angell, He He · Oct 1, 2025 · Citations: 0
To detect implicit reward hacking, we propose TRACE (Truncated Reasoning AUC Evaluation).
- Energy-Regularized Sequential Model Editing on Hyperspheres
Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng · Oct 1, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Where Do Backdoors Live? A Component-Level Analysis of Backdoor Propagation in Speech Language Models
Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal · Oct 1, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs
Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai · Oct 1, 2025 · Citations: 0
Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 points and +4.82 points with 1.5B and 7B models, respectively, and exceeds the best prior sample efficient methods by +2.12 points on…
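Several entries above report gains over GRPO (Group Relative Policy Optimization). The baseline they share normalizes each sampled response's reward against the other responses drawn for the same prompt, replacing a learned value baseline. A minimal sketch of that group-relative advantage (names and the 0/1 reward are illustrative):

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each response's reward
    against the group sampled for the same prompt, so no separate
    value network is needed as a baseline."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored 0/1 for correctness:
# correct answers get a positive advantage, the wrong one negative.
print(group_relative_advantages([1.0, 0.0, 1.0, 1.0]))
```

Curriculum methods like CurES then change *which* prompts and samples feed this estimator, not the estimator itself.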
- Hypothesis-Driven Feature Manifold Analysis in LLMs via Supervised Multi-Dimensional Scaling
Federico Tiblias, Irina Bigoulaeva, Jingcheng Niu, Simone Balloccu, Iryna Gurevych · Oct 1, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- On Discovering Algorithms for Adversarial Imitation Learning
Shashank Reddy Chirra, Jayden Teoh, Praveen Paruchuri, Pradeep Varakantham · Oct 1, 2025 · Citations: 0
Demonstrations
RA functions in AIL are typically derived from divergence minimization objectives, relying heavily on human design and ingenuity.
- ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs
Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor · Oct 1, 2025 · Citations: 0
As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical.
- Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese
Jenny Kunz, Iben Nyholm Debess, Annika Simonsen · Oct 1, 2025 · Citations: 0
To address the lack of existing Faroese evaluation resources, we construct two new minimal-pair probing benchmarks, one for linguistic acceptability and one for text comprehension, and complement them with human evaluations conducted by…
- Stochastic Self-Organization in Multi-Agent Systems
Nurbek Tastan, Samuel Horvath, Karthik Nandakumar · Oct 1, 2025 · Citations: 0
Multi Agent
Multi-agent systems (MAS) based on Large Language Models (LLMs) have the potential to solve tasks that are beyond the reach of any single LLM.
- MILCO: Learned Sparse Retrieval Across Languages via a Multilingual Connector
Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates · Oct 1, 2025 · Citations: 0
MILCO achieves state-of-the-art multilingual and cross-lingual LSR performance, outperforming leading dense, sparse, and multi-vector baselines such as BGE-M3 and Qwen3-Embed on standard multilingual benchmarks, while supporting dynamic…
- Hearing the Order: Investigating Position Bias in Large Audio-Language Models
Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen · Oct 1, 2025 · Citations: 0
We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts.
- When Silence Matters: The Impact of Irrelevant Audio on Text Reasoning in Large Audio-Language Models
Chen-An Li, Tzu-Han Lin, Hung-yi Lee · Oct 1, 2025 · Citations: 0
Across three text-based benchmarks, we find that even non-informative audio reduces accuracy and increases prediction volatility; the severity of interference scales with longer durations, higher amplitudes, and elevated decoding…
- Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao · Oct 1, 2025 · Citations: 0
Long Horizon
To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation.
- TokMem: One-Token Procedural Memory for Large Language Models
Zijun Wu, Yongchang Hao, Lili Mou · Oct 1, 2025 · Citations: 0
Long Horizon
We introduce TokMem, a procedural memory framework that compiles each reusable task procedure into a single trainable memory token.
- Training Large Language Models To Reason In Parallel With Global Forking Tokens
Sheng Jia, Xiao Wang, Shiva Prasad Kasiviswanathan · Oct 1, 2025 · Citations: 0
- PromptLoop: Plug-and-Play Prompt Refinement via Latent Feedback for Diffusion Model Alignment
Suhyeon Lee, Jong Chul Ye · Oct 1, 2025 · Citations: 0
- Towards Self-Evolving Benchmarks: Synthesizing Agent Trajectories via Test-Time Exploration under Validate-by-Reproduce Paradigm
Dadi Guo, Tianyi Zhou, Dongrui Liu, Chen Qian, Qihan Ren · Oct 1, 2025 · Citations: 0
- KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning
Yinyi Luo, Zhexian Zhou, Hao Chen, Kai Qiu, Marios Savvides · Oct 1, 2025 · Citations: 0
However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation.
- BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses
Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley · Sep 30, 2025 · Citations: 0
Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading…
- PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh · Sep 30, 2025 · Citations: 0
Pairwise Preference
Rubric Rating
We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a…
- DRBench: A Realistic Benchmark for Enterprise Deep Research
Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh · Sep 30, 2025 · Citations: 0
Long Horizon
We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings.
- MENLO: From Preferences to Proficiency -- Evaluating and Modeling Native-like Quality Across 47 Languages
Chenxi Whitehouse, Sebastian Ruder, Tony Lin, Oksana Kurylo, Haruka Takagi · Sep 30, 2025 · Citations: 0
Pairwise Preference
Rubric Rating
To address this, we introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience design-inspired mechanisms.
- OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!
Jingdi Lei, Varun Gumma, Rishabh Bhardwaj, Seok Min Lim, Chuan Li · Sep 30, 2025 · Citations: 0
- On Deepfake Voice Detection -- It's All in the Presentation
Héctor Delgado, Giorgio Ramondetti, Emanuele Dalmasso, Gennady Karvitsky, Daniele Colibro · Sep 30, 2025 · Citations: 0
- Your Agent May Misevolve: Emergent Risks in Self-evolving LLM Agents
Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo · Sep 30, 2025 · Citations: 0
Advances in Large Language Models (LLMs) have enabled a new class of self-evolving agents that autonomously improve through interaction with the environment, demonstrating strong capabilities.
- EditReward: A Human-Aligned Reward Model for Instruction-Guided Image Editing
Keming Wu, Sicong Jiang, Max Ku, Ping Nie, Minghao Liu · Sep 30, 2025 · Citations: 0
Pairwise Preference
To address this critical bottleneck, we built EditReward, trained on our new large-scale human preference dataset of over 200K preference pairs, meticulously annotated by trained experts following a rigorous protocol.
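Reward models trained on preference pairs like EditReward's typically optimize a Bradley-Terry objective. A minimal sketch of that standard loss (the abstract does not specify EditReward's exact formulation; this is the generic technique the "Pairwise Preference" tag names):

```python
import math

def bradley_terry_loss(score_chosen, score_rejected):
    """Standard Bradley-Terry objective for reward-model training:
    loss = -log sigmoid(s_chosen - s_rejected), which is small when
    the model scores the human-preferred item above the rejected one."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs little loss; an inverted pair is
# penalized heavily, pushing scores toward the human ordering.
print(bradley_terry_loss(2.0, -1.0))
print(bradley_terry_loss(-1.0, 2.0))
```

At a margin of zero the loss is exactly log 2, the uninformed-coin-flip baseline.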
- Latent Thinking Optimization: Your Latent Reasoning Language Model Secretly Encodes Reward Signals in Its Latent Thoughts
Hanwen Du, Yuxin Dong, Xia Ning · Sep 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation
Edoardo Bianchi, Jacopo Staiano, Antonio Liotta · Sep 30, 2025 · Citations: 0
Critique Edit
ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores.
- SeMoBridge: Semantic Modality Bridge for Efficient Few-Shot Adaptation of CLIP
Christoph Timmermann, Hyunse Lee, Woojin Lee · Sep 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Bringing Emerging Architectures to Sequence Labeling in NLP
Ana Ezquerro, Carlos Gómez-Rodríguez, David Vilares · Sep 30, 2025 · Citations: 0
We study how these architectures adapt across tagging tasks that vary in structural complexity, label space, and token dependencies, with evaluation spanning multiple languages.
- Vector sketch animation generation with differentiable motion trajectories
Xinding Zhu, Xinye Yang, Shuyang Zheng, Zhexin Zhang, Fei Gao · Sep 30, 2025 · Citations: 0
- Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang · Sep 30, 2025 · Citations: 0
Long Horizon
Experimental results show DECS reduces reasoning tokens by over 50% across seven benchmarks while maintaining or even improving performance.
- v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
Zhengpeng Shi, Yanpeng Zhao, Jianqun Zhou, Yuxuan Wang, Qinrong Cui · Sep 30, 2025 · Citations: 0
AI models capable of comprehending humor hold real-world promise -- for example, enhancing engagement in human-machine interactions.