- SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng · Jun 1, 2025 · Citations: 0
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results.
- DeepQuestion: Systematic Generation of Real-World Challenges for Evaluating LLMs Performance
Ali Khoramfar, Ali Ramezani, Mohammad Mahdi Mohajeri, Mohammad Javad Dousti, Majid Nili Ahmadabadi · May 30, 2025 · Citations: 0
- When Large Multimodal Models Confront Evolving Knowledge: Challenges and Explorations
Kailin Jiang, Yuntao Du, Yukai Ding, Yuchen Ren, Ning Jiang · May 30, 2025 · Citations: 0
To address this, we first propose a pipeline to construct MMEVOKE, a benchmark for evaluating LMMs' ability in multimodal evolving knowledge injection.
- Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation
Arthur S. Bianchessi, Yasmin C. Aguirre, Rodrigo C. Barros, Lucas S. Kupssinskü · May 28, 2025 · Citations: 0
However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims.
- Counting trees: A treebank-driven exploration of syntactic variation in speech and writing across languages
Kaja Dobrovoljc · May 28, 2025 · Citations: 0
Pairwise Preference
Results show that, across both languages, spoken corpora contain fewer and less diverse syntactic structures than their written counterparts, with consistent cross-linguistic preferences for certain structural types across modalities.
- Measuring Sycophancy of Language Models in Multi-turn Dialogues
Jiseung Hong, Grace Byun, Seungone Kim, Kai Shu, Jinho D. Choi · May 28, 2025 · Citations: 0
- RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments
Zeyi Liao, Jaylen Jones, Linxi Jiang, Yuting Ning, Eric Fosler-Lussier · May 28, 2025 · Citations: 0
Red Team Web Browsing
Using RedTeamCUA, we develop RTC-Bench, a comprehensive benchmark with 864 examples that investigate realistic, hybrid web-OS attack scenarios and fundamental security vulnerabilities.
- VeriTrail: Closed-Domain Hallucination Detection with Traceability
Dasha Metropolitansky, Jonathan Larson · May 27, 2025 · Citations: 0
- R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang · May 27, 2025 · Citations: 0
- RefTool: Reference-Guided Tool Creation for Knowledge-Intensive Reasoning
Xiao Liu, Da Yin, Zirui Wu, Yansong Feng · May 27, 2025 · Citations: 0
- Augmenting Research Ideation with Data: An Empirical Investigation in Social Science
Xiao Liu, Xinyi Dong, Xinyang Gao, Yansong Feng, Xun Pang · May 27, 2025 · Citations: 0
- RPM: Reasoning-Level Personalization for Black-Box Large Language Models
Jieyong Kim, Tongyoung Kim, Soojin Yoon, Jaehyung Kim, Dongha Lee · May 27, 2025 · Citations: 0
- Tracing and Reversing Edits in LLMs
Paul Youssef, Zhixue Zhao, Christin Seifert, Jörg Schlötterer · May 27, 2025 · Citations: 0
- PonderLM: Pretraining Language Models to Ponder in Continuous Space
Boyi Zeng, Shixiang Song, Siyuan Huang, Yixuan Wang, He Li · May 27, 2025 · Citations: 0
Humans ponder before articulating complex sentence elements, enabling deeper cognitive processing through focused effort.
- FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information
Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang · May 27, 2025 · Citations: 0
Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents.
- Characterizing Pattern Matching and Its Limits on Compositional Task Structures
Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko · May 26, 2025 · Citations: 0
- REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jun Rao, Min Zhang · May 26, 2025 · Citations: 0
Critique Edit
To address these issues, we propose REA-RL, which introduces a small reflection model for efficient scaling in online training, offering both parallel sampling and sequential revision.
- Types of Relations: Defining Analogies with Category Theory
Claire Ott, Frank Jäkel · May 26, 2025 · Citations: 0
In order to behave intelligently both humans and machines have to represent their knowledge adequately for how it is used.