- SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng · Jun 1, 2025 · Citations: 0
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results.
- DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving
Dawood Wasif, Terrence J. Moore, Chandan K. Reddy, Frederica Free-Nelson, Seunghyun Yoon · Jun 1, 2025 · Citations: 0
- The LLM Bottleneck: Why Open-Source Vision LLMs Struggle with Hierarchical Visual Recognition
Yuwen Tan, Yuan Qing, Boqing Gong · May 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi · May 30, 2025 · Citations: 0
- Online Fair Division with Additional Information
Tzeh Yuan Neoh, Jannik Peters, Nicholas Teh · May 30, 2025 · Citations: 0
We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be allocated irrevocably.
- When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025 · Citations: 0
To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
- SwingArena: Competitive Programming Arena for Long-context GitHub Issue Solving
Wendong Xu, Jing Xiong, Chenyang Zhao, Qiujiang Chen, Haoran Wang · May 29, 2025 · Citations: 0
We present SwingArena, a competitive evaluation framework for Large Language Models (LLMs) that closely mirrors real-world software development workflows.
- Probing Association Biases in LLM Moderation Over-Sensitivity
Yuxin Wang, Botao Yu, Ivory Yang, Saeed Hassanpour, Soroush Vosoughi · May 29, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Formula-R1: Incentivizing LLM Reasoning over Complex Tables with Numerical Computation via Formula-Driven Reinforcement Learning
Lang Cao, Jingxian Xu, Hanbing Liu, Jinyu Wang, Mengyu Zhou · May 29, 2025 · Citations: 0
Long Horizon
We demonstrate the effectiveness of Formula Tuning through extensive experiments on seven table reasoning benchmarks.
- AJF: Adaptive Jailbreak Framework Based on the Comprehension Ability of Black-Box Large Language Models
Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin, Fei Gao · May 29, 2025 · Citations: 0
Red Team
Building on this insight, we propose an Adaptive Jailbreak Framework (AJF) based on the comprehension ability of black-box large language models.
- Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025 · Citations: 0
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
- Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0
Pairwise Preference
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
- StressTest: Can YOUR Speech LM Handle the Stress?
Iddo Yosha, Gallil Maimon, Yossi Adi · May 28, 2025 · Citations: 0
Despite the crucial role of sentence stress in shaping meaning and intent, it remains largely overlooked in evaluation and development of SLMs.
- Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi · May 28, 2025 · Citations: 0
- Flying Pigs, FaR and Beyond: Evaluating LLM Reasoning in Counterfactual Worlds
Anish R Joishy, Ishwar B Balappanawar, Vamshi Krishna Bonagiri, Manas Gaur, Krishnaprasad Thirunarayan · May 28, 2025 · Citations: 0
Evaluation of 11 LLMs across six diverse reasoning datasets reveals a consistent failure: model accuracy plummets by an average of 14% in counterfactual scenarios compared to knowledge-aligned ones.
- Mixture-of-Retrieval Experts for Reasoning-Guided Multimodal Knowledge Exploitation
Chunyi Peng, Zhipeng Xu, Zhenghao Liu, Yishan Li, Yukun Yan · May 28, 2025 · Citations: 0
Experimental results on diverse open-domain QA benchmarks demonstrate the effectiveness of MoRE, achieving average performance gains of over 7% compared to competitive baselines.
- Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Tianmai M. Zhang, Neil F. Abernethy · May 28, 2025 · Citations: 0
Expert Verification
However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation.
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0
Red Team Web Browsing
Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
- VeriTrail: Closed-Domain Hallucination Detection with Traceability
Dasha Metropolitansky, Jonathan Larson · May 27, 2025 · Citations: 0
- R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang · May 27, 2025 · Citations: 0
- How Does Alignment Enhance LLMs' Multilingual Capabilities? A Language Neurons Perspective
Shimao Zhang, Zhejian Lai, Xiang Liu, Shuaijie She, Xiao Liu · May 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025 · Citations: 0
Tool Use
Experiments on causality, physics, and chemistry benchmarks demonstrate that RefTool outperforms existing tool-creation and domain-specific reasoning methods by 12.3% on average accuracy, while being cost-efficient and broadly generalizable…
- Augmenting Research Ideation with Data: An Empirical Investigation in Social Science
Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang · May 27, 2025 · Citations: 0
- RPM: Reasoning-Level Personalization for Black-Box Large Language Models
Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee · May 27, 2025 · Citations: 0
Pairwise Preference
While black-box large language models are widely deployed, they produce generic outputs that overlook individual user preferences.
- Generalizable Heuristic Generation Through LLMs with Meta-Optimization
Yiding Shi, Jianan Zhou, Wen Song, Jieyi Bi, Yaoxin Wu · May 27, 2025 · Citations: 0
- Tracing and Reversing Edits in LLMs
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer · May 27, 2025 · Citations: 0
- Do LLMs Understand Collaborative Signals? Diagnosis and Repair
Shahrooz Pouryousef, Ali Montazeralghaem · May 27, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Wideband RF Radiance Field Modeling Using Frequency-embedded 3D Gaussian Splatting
Zechen Li, Lanqing Yang, Yiheng Bian, Hao Pan, Yongjian Fu · May 27, 2025 · Citations: 0
- PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025 · Citations: 0
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
- FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025 · Citations: 0
Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
- VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen · May 26, 2025 · Citations: 0
The rapid advancement of Large Multimodal Models (LMMs) for 2D images and videos has motivated extending these models to understand 3D scenes, aiming for human-like visual-spatial intelligence.
- Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko · May 26, 2025 · Citations: 0
- Token Distillation: Attention-aware Input Embeddings For New Tokens
Konstantin Dobler, Desmond Elliott, Gerard de Melo · May 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ERC-SVD: Error-Controlled SVD for Large Language Model Compression
Haolei Bai, Siyong Jian, Tuo Liang, Yu Yin, Huan Wang · May 26, 2025 · Citations: 0
- Inference-time Alignment in Continuous Space
Yige Yuan, Teng Xiao, Li Yunfan, Bingbing Xu, Shuchang Tao · May 26, 2025 · Citations: 0
Aligning large language models with human feedback at inference time has received increasing attention due to its flexibility.
- Incentivizing Strong Reasoning from Weak Supervision
Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao · May 26, 2025 · Citations: 0
Demonstrations
Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
- REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang · May 26, 2025 · Citations: 0
Critique Edit
To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision.
- Types of Relations: Defining Analogies with Category Theory
Claire Ott, Frank Jäkel · May 26, 2025 · Citations: 0
In order to behave intelligently, both humans and machines have to represent their knowledge adequately for how it is used.
- Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel · May 26, 2025 · Citations: 0
Pairwise Preference
We present a fine-grained theoretical analysis of the performance gap between reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under a representation gap.
- Graceful Forgetting in Generative Language Models
Chunyang Jiang, Chi-min Chan, Yiyang Cai, Yulong Liu, Wei Xue · May 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Your Classifier Can Do More: Towards Balancing the Gaps in Classification, Robustness, and Generation
Kaichao Jiang, He Wang, Xiaoshuai Hao, Xiulong Yang, Ajian Liu · May 26, 2025 · Citations: 0
- Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments
Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake · May 25, 2025 · Citations: 0
- LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen · May 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do LLMs have a Gender (Entropy) Bias?
Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta · May 24, 2025 · Citations: 0
We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace ), developed from real-world questions across…
- Disentangling Knowledge Representations for Large Language Model Editing
Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren · May 24, 2025 · Citations: 0
To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge.
- ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song · May 24, 2025 · Citations: 0
To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities.
- Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling
Jarrod Ragsdale, Rajendra Boppana · May 23, 2025 · Citations: 0
- BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak · May 23, 2025 · Citations: 0
Long Horizon
We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
- Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0
Demonstrations
We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
- Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha · May 23, 2025 · Citations: 0
Pairwise Preference
Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods.
- Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang · May 23, 2025 · Citations: 0
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025 · Citations: 0
In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning…
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0
On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO.
- Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0
Red Team
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
- Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang · May 22, 2025 · Citations: 0
- Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks
Jianing Geng, Biao Yi, Zekun Fei, Ruiqi He, Lihai Nie · May 22, 2025 · Citations: 0
- AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng · May 22, 2025 · Citations: 0
The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness.