- Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments
Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake · May 25, 2025 · Citations: 0
- LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models
Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen · May 25, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Do LLMs have a Gender (Entropy) Bias?
Sonal Prabhune, Balaji Padmanabhan, Kaushik Dutta · May 24, 2025 · Citations: 0
We investigate the existence and persistence of a specific type of gender bias in some of the popular LLMs and contribute a new benchmark dataset, RealWorldQuestioning (released on HuggingFace), developed from real-world questions across…
- Disentangling Knowledge Representations for Large Language Model Editing
Mengqi Zhang, Zisheng Zhou, Xiaotian Ye, Qiang Liu, Zhaochun Ren · May 24, 2025 · Citations: 0
To rigorously evaluate fine-grained irrelevant knowledge preservation, we construct FINE-KED, a new benchmark comprising fine-grained irrelevant knowledge at different levels of relational similarity to the edited knowledge.
- ReasonMap: Towards Fine-Grained Visual Reasoning from Transit Maps
Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song · May 24, 2025 · Citations: 0
To bridge this gap, we introduce ReasonMap, a novel benchmark specifically designed to evaluate these capabilities.
- SEW: Self-Evolving Agentic Workflows for Automated Code Generation
Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng · May 24, 2025 · Citations: 0
- Knowledge Fusion of Large Language Models Via Modular SkillPacks
Guodong Du, Zhuo Li, Xuanning Zhou, Junlin Li, Zesheng Shi · May 24, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling
Jarrod Ragsdale, Rajendra Boppana · May 23, 2025 · Citations: 0
- BiomedSQL: Text-to-SQL for Scientific Reasoning on Biomedical Knowledge Bases
Mathew J. Koretsky, Maya Willey, Owen Bianchi, Chelsea X. Alvarado, Tanay Nayak · May 23, 2025 · Citations: 0
Long Horizon
We introduce BiomedSQL, the first benchmark explicitly designed to evaluate scientific reasoning in text-to-SQL generation over a real-world biomedical knowledge base.
- One RL to See Them All: Visual Triple Unified Reinforcement Learning
Yan Ma, Linge Du, Xuyang Shen, Shaoxiang Chen, Pengfei Li · May 23, 2025 · Citations: 0
The final Orsta models improve over their backbones on MEGA-Bench, compare favorably with strong multi-task RL-VLM baselines, and transfer these gains to a broad set of downstream benchmarks.
- Training with Pseudo-Code for Instruction Following
Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor · May 23, 2025 · Citations: 0
Demonstrations
We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models.
- Just as Humans Need Vaccines, So Do Models: Model Immunization to Combat Falsehoods
Shaina Raza, Rizwan Qureshi, Azib Farooq, Marcelo Lotif, Aman Chadha · May 23, 2025 · Citations: 0
Pairwise Preference
Unlike post-hoc filtering or preference-based alignment, immunization introduces direct negative supervision on labeled falsehoods.
- Two-Stage Regularization-Based Structured Pruning for LLMs
Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin · May 23, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Evaluation Faking: Unveiling Observer Effects in Safety Evaluation of Frontier AI Systems
Yihe Fan, Wenqi Zhang, Xudong Pan, Min Yang · May 23, 2025 · Citations: 0
- HoloLLM: Multisensory Foundation Model for Language-Grounded Human Sensing and Reasoning
Chuhao Zhou, Jianfei Yang · May 23, 2025 · Citations: 0
In this paper, we introduce HoloLLM, a Multimodal Large Language Model (MLLM) that integrates uncommon but powerful sensing modalities, such as LiDAR, infrared, mmWave radar, and WiFi, to enable seamless human perception and reasoning…
- On the Design of KL-Regularized Policy Gradient Algorithms for LLM Reasoning
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Yang Yuan, Quanquan Gu · May 23, 2025 · Citations: 0
On mathematical reasoning benchmarks (AIME24, AIME25), RPG-REINFORCE with RPG-Style Clip improves accuracy by up to +6 absolute percentage points over DAPO.
- Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang, Mingyang Wang, Yihong Liu, Hinrich Schütze, Barbara Plank · May 22, 2025 · Citations: 0
Red Team
Refusal mechanisms in large language models (LLMs) are essential for ensuring safety.
- Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning
Adnan Oomerjee, Zafeirios Fountas, Haitham Bou-Ammar, Jun Wang · May 22, 2025 · Citations: 0
- Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin · May 22, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Hiding in Plain Sight: A Steganographic Approach to Stealthy LLM Jailbreaks
Jianing Geng, Biao Yi, Zekun Fei, Ruiqi He, Lihai Nie · May 22, 2025 · Citations: 0
- AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng · May 22, 2025 · Citations: 0
The rapid development and widespread adoption of Audio Large Language Models (ALLMs) demand rigorous evaluation of their trustworthiness.
- Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task
Mengyang Qiu, Zoe Brisebois, Siena Sun · May 22, 2025 · Citations: 0
Pairwise Preference
Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear.
- Dynamic Token Reweighting for Robust Vision-Language Models
Tanqiu Jiang, Jiacheng Liang, Rongyi Zhu, Jiawei Zhou, Fenglong Ma · May 22, 2025 · Citations: 0
Red Team
Large vision-language models (VLMs) are highly vulnerable to multimodal jailbreak attacks that exploit visual-textual interactions to bypass safety guardrails.
- Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy, Mostafa Elhoushi, Amr Alanwar · May 22, 2025 · Citations: 0
Red Team
Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning.
- Efficient PRM Training Data Synthesis via Formal Verification
Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang · May 21, 2025 · Citations: 0
However, existing approaches for constructing PRM training data remain costly and noisy, as they typically rely on human annotation or sampling-based labeling methods that require repeated LLM calls.
- VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai · May 21, 2025 · Citations: 0
Pairwise Preference
In this paper, we introduce VerifyBench and its challenging variant VerifyBench-Hard, two benchmarks specifically designed to assess reference-based reward systems.
- Reward Is Enough: LLMs Are In-Context Reinforcement Learners
Kefan Song, Amir Moeini, Peng Wang, Lei Gong, Rohan Chandra · May 21, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases
Pingqing Zheng, Jiayin Qin, Fuqi Zhang, Niraj Chitla, Zishen Wan · May 21, 2025 · Citations: 0
- Explainable embeddings with Distance Explainer
Christiaan Meijer, E. G. Patrick Bos · May 21, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- ALIEN: Aligned Entropy Head for Improving Uncertainty Estimation of LLMs
Artem Zabolotnyi, Roman Makarov, Mile Mitrovic, Polina Proskura, Oleg Travkin · May 21, 2025 · Citations: 0
Experiments across seven classification datasets and two NER benchmarks, evaluated on five language models (RoBERTa, ELECTRA, LLaMA-2, Qwen2.5, and Qwen3), show that ALIEN consistently outperforms strong baselines across all considered…
- Guided Policy Optimization under Partial Observability
Yueheng Li, Guangming Xie, Zongqing Lu · May 21, 2025 · Citations: 0
- Understanding the Anchoring Effect of LLM with Synthetic Data: Existence, Mechanism, and Potential Mitigations
Yiming Huang, Biquan Bie, Zuqiu Na, Weilin Ruan, Songxin Lei · May 21, 2025 · Citations: 0
Combining refined evaluation metrics, we benchmark current widely used LLMs.
- A quantitative analysis of semantic information in deep representations of text and images
Santiago Acevedo, Andrea Mascaretti, Riccardo Rende, Matéo Mahaut, Marco Baroni · May 21, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SAKE: Structured Agentic Knowledge Extrapolation for Complex LLM Reasoning via Reinforcement Learning
Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth · May 21, 2025 · Citations: 0
Long Horizon
We propose SAKE (Structured Agentic Knowledge Extrapolation), an RL-powered agentic framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge through tool-augmented reinforcement learning.
- MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation
Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo · May 21, 2025 · Citations: 0
- Entailed Opinion Matters: Improving the Fact-Checking Performance of Language Models by Relying on their Entailment Ability
Gaurav Kumar, Ayush Garg, Debajyoti Mazumder, Aditya Kishore, Babu kumar · May 21, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision
Zixuan Ke, Austin Xu, Yifei Ming, Xuan-Phi Nguyen, Ryan Chin · May 21, 2025 · Citations: 0
Critique Edit Multi Agent
Multi-agent systems (MAS) leveraging the impressive capabilities of Large Language Models (LLMs) hold significant potential for tackling complex tasks.
- REFLEX: Metacognitive Reasoning for Reflective Zero-Shot Robotic Planning with Large Language Models
Wenjie Lin, Jin Wei-Kocsis, Jiansong Zhang, Byung-Cheol Min, Dongming Gan · May 20, 2025 · Citations: 0
Demonstrations
Inspired by human metacognitive learning and creative problem-solving, we address this limitation by exploring a fundamental question: Can LLMs be empowered with metacognitive capabilities to reason, reflect, and create, thereby enhancing…
- Language Models use Lookbacks to Track Beliefs
Nikhil Prakash, Natalie Shapira, Arnab Sen Sharma, Christoph Riedl, Yonatan Belinkov · May 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Language Models
Xiaojie Gu, Ziying Huang, Jia-Chen Gu, Kai Zhang · May 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Neuro-Symbolic Approach for Reliable Proof Generation with LLMs: A Case Study in Euclidean Geometry
Oren Sultan, Eitan Stern, Dafna Shahaf · May 20, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding
Jiaang Li, Yifei Yuan, Wenyan Li, Mohammad Aliannejadi, Daniel Hershcovich · May 20, 2025 · Citations: 0
To bridge this gap, we introduce RAVENEA (Retrieval-Augmented Visual culturE uNdErstAnding), a new benchmark designed to advance visual culture understanding through retrieval, focusing on two tasks: culture-focused visual question…
- AAPO: Enhancing the Reasoning Capabilities of LLMs with Advantage Margin
Jian Xiong, Jingbo Zhou, Jingyong Ye, Qiang Huang, Dejing Dou · May 20, 2025 · Citations: 0
- Phonetic Perturbations Reveal Tokenizer-Rooted Safety Gaps in LLMs
Darpan Aswal, Siddharth D Jaiswal · May 20, 2025 · Citations: 0
Red Team
Safety-aligned LLMs remain vulnerable to digital phenomena like textese that introduce non-canonical perturbations to words but preserve the phonetics.
- MSDformer: Multi-scale Discrete Transformer For Time Series Generation
Shibo Feng, Zhicheng Chen, Xi Xiao, Zhong Zhang, Qing Li · May 20, 2025 · Citations: 0
- Integration of TinyML and LargeML: A Survey of 6G and Beyond
Thai-Hoc Vu, Ngo Hoang Tu, Thien Huynh-The, Kyungchun Lee, Sunghwan Kim · May 20, 2025 · Citations: 0
- Not Minds, but Signs: Reframing LLMs through Semiotics
Davide Picca · May 20, 2025 · Citations: 0
Rather than assuming that LLMs understand language or simulate human thought, we propose that their primary function is to recombine, recontextualize, and circulate linguistic forms based on probabilistic associations.
- Word length predicts word order: "Min-max"-ing drives language evolution
Hiram Ring · May 20, 2025 · Citations: 0
This paper proposes a general universal explanation for word order change based on a theory of communicative interaction (the Min-Max theory of language behavior) in which agents seek to minimize effort while maximizing information.
- Efficient Agent Training for Computer Use
Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0
Demonstrations Long Horizon
We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
- Let's Verify Math Questions Step by Step
Chengyu Shen, Zhen Hao Wong, Runming He, Hao Liang, Meiyi Qiang · May 20, 2025 · Citations: 0
In this work, we present ValiMath, a benchmark consisting of 2147 human-verified mathematical questions covering a wide range of domains such as arithmetic, algebra, and geometry, which are synthesized and curated from the NuminaMath…
- Structured Agent Distillation for Large Language Model
Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li · May 20, 2025 · Citations: 0
Demonstrations
Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.
- Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian · May 19, 2025 · Citations: 0
Rubric Rating
Current benchmarks usually involve simplified tasks.
- Advancing Software Quality: A Standards-Focused Review of LLM-Based Assurance Techniques
Avinash Patil · May 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Reality Check of Language Models as Formalizers on Constraint Satisfaction Problems
Rikhil Amonkar, Ceyhun Efe Kayan, Qimei Lai, Ronan Le Bras, Li Zhang · May 19, 2025 · Citations: 0
We systematically investigate the formalization capability of LLMs on real-life constraint satisfaction problems on 4 benchmarks, 6 LLMs, and 2 types of formal languages.
- What if Deception Cannot be Detected? A Cross-Linguistic Study on the Limits of Deception Detection from Text
Aswathy Velutharambath, Kai Sassenberg, Roman Klinger · May 19, 2025 · Citations: 0
We further benchmark against other English deception datasets following similar data collection protocols.
- Iterative Formalization and Planning in Partially Observable Environments
Liancheng Gong, Wang Zhu, Jesse Thomason, Li Zhang · May 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference
Guangda Liu, Chengwei Li, Zhenyu Ning, Jing Lin, Yiwu Yao · May 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Complexity counts: global and local perspectives on Indo-Aryan numeral systems
Chundra Cathcart · May 19, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan, Jingwei Ni, Jakob Merane, Yang Tian, Yoan Hermstrüwer · May 19, 2025 · Citations: 0
Long Horizon
To address this, we introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels.