- When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025
To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
- Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025
Pairwise Preference
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025
Red Team
Web Browsing
Computer-use agents (CUAs) promise to automate complex tasks across operating systems (OS) and the web, but remain vulnerable to indirect prompt injection.
- PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
- FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025
Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
- Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025
Cross-capability transfer is a key challenge in large language model (LLM) research, with applications in multi-task integration, model compression, and continual learning.
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025
Embodied agents operating in smart homes must understand human behavior through diverse sensory inputs and communicate via natural language.
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025
On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to $+6$ absolute percentage points over DAPO.
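The entry above concerns KL-regularized policy-gradient objectives with clipping. As a point of reference, here is a minimal numpy sketch of a generic PPO-style clipped surrogate with a KL penalty; it illustrates the family of objectives the paper studies, not the paper's exact RPG loss, and all names and defaults (`eps`, `beta`) are illustrative assumptions.

```python
import numpy as np

def kl_clipped_pg_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.01):
    """Clipped surrogate policy-gradient loss with a KL penalty.

    logp_new / logp_old: per-token log-probs under the current and
    behavior policies. Generic PPO-style sketch, not the paper's RPG loss.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    surrogate = np.minimum(unclipped, clipped)
    # Nonnegative sample-based KL estimate: r - 1 - log r (Schulman's "k3").
    r = np.exp(logp_old - logp_new)
    kl = r - 1.0 - np.log(r)
    # Minimize negative surrogate plus KL penalty.
    return -(surrogate - beta * kl).mean()
```

When the two policies coincide the KL term vanishes and the loss reduces to the negative mean advantage, which is a quick sanity check on any implementation of this objective.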
- Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025
Red Team
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
- Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern.
- Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025
Pairwise Preference
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025
Pairwise Preference
However, existing reward benchmarks focus on preference comparisons between responses rather than evaluating verification against ground-truth references, leaving a critical gap in our ability to evaluate verification systems used in reasoning.
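The distinction drawn above, reference-based verification versus pairwise preference ranking, can be made concrete with a minimal sketch: extract a final answer from a response and check it against the ground truth. The `Answer:` convention and the normalization rules here are illustrative assumptions, not VerifyBench's protocol.

```python
import re

def verify_against_reference(response: str, reference: str) -> bool:
    """Reference-based verification: extract a final answer from a single
    response and compare it to the ground-truth reference, instead of
    ranking two responses against each other.
    Hypothetical convention: the answer follows an 'Answer:' marker.
    """
    def normalize(s: str) -> str:
        s = s.lower().strip()
        s = re.sub(r"[\s,]+", "", s)  # drop whitespace and thousands commas
        return s.rstrip(".")

    m = re.search(r"answer:\s*(.+)", response, flags=re.IGNORECASE)
    extracted = m.group(1) if m else response.splitlines()[-1]
    return normalize(extracted) == normalize(reference)
```

A preference-based reward model would instead score two candidate responses and return which it prefers; nothing in that setup guarantees either candidate matches the reference, which is the gap the snippet above points at.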
- Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025
Automated fact-checking has been a challenging task for the research community.
- Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025
How do language models (LMs) represent characters' beliefs, especially when those beliefs may differ from reality?
- Towards Reliable Proof Generation with LLMs: A Neuro-Symbolic Approach
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025
Large language models (LLMs) struggle with formal domains that require rigorous logical deduction and symbolic reasoning, such as mathematical proof generation.
- What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025
We further benchmark against other English deception datasets following similar data collection protocols.
- Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart · May 19, 2025
The numeral systems of Indo-Aryan languages such as Hindi, Gujarati, and Bengali are highly unusual in that, unlike most numeral systems (e.g., those of English, Chinese, etc.), forms referring to 1–99 are highly non-transparent.
- BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs
Junxiao Yang, Jinzhe Tu, Haoran Liu, Xiaoce Wang, Chujie Zheng · May 18, 2025
Recent advances in Large Reasoning Models (LRMs) have shown impressive capabilities in mathematical and logical reasoning.
- EAMET: Robust Massive Model Editing via Embedding Alignment Optimization
Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang · May 17, 2025
Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs).
- Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang · May 16, 2025
Web Browsing
In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions.
- CodePDE: An Inference Framework for LLM-driven PDE Solver Generation
Shanda Li, Tanya Marwah, Junhong Shen, Weiwei Sun, Andrej Risteski · May 13, 2025
With CodePDE, we present a thorough evaluation of critical capacities of LLMs for PDE solving: reasoning, debugging, self-refinement, and test-time scaling.
- Multimodal Integrated Knowledge Transfer to Large Language Models through Preference Optimization with Biomedical Applications
Zhanliang Wang, Da Wu, Quan Nguyen, Zhuoran Xu, Kai Wang · May 9, 2025
Pairwise Preference
To address this challenge, we introduce MINT (Multimodal Integrated kNowledge Transfer), a framework that aligns unimodal large decoder models with domain-specific decision patterns from multimodal biomedical data through preference optimization.
- ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization
Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis · May 5, 2025
Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model's performance on open benchmarks, without any training or healing steps.
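The core idea named in the title, replacing a pruned transformer block with a single linear map, can be sketched with a ridge-regularized least-squares fit on calibration activations. This is a generic illustration of block linearization; ReplaceMe's actual estimator may differ, and the function name and `ridge` parameter are assumptions.

```python
import numpy as np

def fit_linear_replacement(X, Y, ridge=1e-3):
    """Fit a linear map A minimizing ||X A - Y||_F^2 + ridge * ||A||_F^2,
    so a pruned transformer block with recorded input activations X and
    output activations Y can be replaced by the cheap map x -> x @ A.

    X, Y: (num_tokens, d_model) calibration activations.
    Illustrative sketch of block linearization, not ReplaceMe's exact method.
    """
    d = X.shape[1]
    # Ridge-regularized normal equations for numerical stability.
    return np.linalg.solve(X.T @ X + ridge * np.eye(d), X.T @ Y)
```

Because the replacement is linear, it can often be folded into adjacent weight matrices, which is why this style of pruning needs no retraining or "healing" pass afterwards.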
- Large Language Model Compression with Global Rank and Sparsity Optimization
Changhai Zhou, Qian Qiao, Yuhua Zhou, Yuxin Wu, Shichao Weng · May 2, 2025
Low-rank and sparse composite approximation is a natural idea to compress Large Language Models (LLMs).
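The composite approximation mentioned above writes a weight matrix as W ≈ L + S with L low-rank and S sparse. A minimal numpy sketch of that decomposition follows: truncated SVD for L, then keeping the largest-magnitude residual entries for S. This illustrates the general idea only; the paper's contribution is a global optimization over ranks and sparsity budgets, which this greedy per-matrix sketch does not perform.

```python
import numpy as np

def low_rank_plus_sparse(W, rank, sparsity):
    """Approximate W ~ L + S: L is the best rank-`rank` approximation
    (truncated SVD); S keeps the `sparsity` fraction of residual entries
    with the largest magnitude. Generic sketch, not the paper's method.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]        # best rank-k approximation
    R = W - L                                        # residual to sparsify
    k = int(sparsity * R.size)                       # number of entries to keep
    S = np.zeros_like(R)
    if k > 0:
        idx = np.unravel_index(np.argsort(np.abs(R), axis=None)[-k:], R.shape)
        S[idx] = R[idx]
    return L, S
```

Storing L as two thin factors plus S in a sparse format is what yields the compression: the parameter count drops from `m*n` to roughly `rank*(m+n)` plus the retained sparse entries.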