- BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025
Pairwise Preference Long Horizon
Recent advances in Vision-Language Models (VLMs) have propelled embodied agents by enabling direct perception, reasoning, and planning task-oriented actions from visual inputs.
- When Distributions Shifts: Causal Generalization for Low-Resource Languages
Mahi Aliyu Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye · Oct 31, 2025
Machine learning models often fail under distribution shifts, a problem exacerbated in low-resource settings where limited data restricts robust generalization.
- Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative coh
- Probability Distributions Computed by Autoregressive Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski · Oct 31, 2025
Most expressivity results for transformers treat them as language recognizers (which accept or reject strings), and not as they are used in practice, as language models (which generate strings autoregressively and probabilistically).
- Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025
Red Team
Our finetuned models achieve consistent improvements on instruction following and instruction hierarchy benchmarks, achieving roughly a 20% improvement on the IHEval conflict setup.
- Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon · Oct 30, 2025
Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.
- LLMs Process Lists With General Filter Heads
Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau · Oct 30, 2025
Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming patt
- Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025
Large Language Models (LLMs) face significant inference latency challenges stemming from their autoregressive design and large size.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025
Demonstrations Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025
Large language models (LLMs) are increasingly used as raters for evaluation tasks.
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025
Long Horizon
Real-world language agents must handle complex, multi-step workflows across diverse Apps.
- From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025
Multi Agent
To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
- World Simulation with Video Foundation Models for Physical AI
NVIDIA, :, Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025
Long Horizon
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
- Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu · Oct 28, 2025
In natural language processing, there remains a notable scarcity of grammar focused evaluation protocols, a gap that is even more pronounced for low-resource languages.
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025
LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks.
- Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural n
- A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li · Oct 27, 2025
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks.
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025
Pairwise Preference
Large Language Models (LLMs) are widely used as judges to evaluate response quality, providing a scalable alternative to human evaluation.
- Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
- ATLAS: Adaptive Transfer Scaling Laws for Multilingual Pretraining, Finetuning, and Decoding the Curse of Multilinguality
Shayne Longpre, Sneha Kudugunta, Niklas Muennighoff, I-Hung Hsu, Isaac Caswell · Oct 24, 2025
In this work, we undertake the largest multilingual scaling laws study to date, totaling 774 multilingual training experiments, spanning 10M-8B model parameters, 400+ training languages and 48 evaluation languages.
- PARL: Prompt-based Agents for Reinforcement Learning
Yarik Menchaca Resendiz, Roman Klinger · Oct 24, 2025
However, limited work evaluates LLMs as agents in reinforcement learning (RL) tasks (e.g., playing games), where learning occurs through interaction with an environment and a reward system.
- Estonian Native Large Language Model Benchmark
Helena Grete Lillepalu, Tanel Alumäe · Oct 24, 2025
The availability of LLM benchmarks for the Estonian language is limited, and a comprehensive evaluation comparing the performance of different LLMs on Estonian tasks has yet to be conducted.
- Designing and Evaluating Chain-of-Hints for Scientific Question Answering
Anubhav Jangra, Smaranda Muresan · Oct 24, 2025
Pairwise Preference
Using the best performing LLM as the backbone of a quantitative study with 41 participants, we uncover distinct user preferences across hinting strategies, and identify the limitations of automatic evaluation metrics to capture them.
- RELOOP: Recursive Retrieval with Multi-Hop Reasoner and Planners for Heterogeneous QA
Ruiyi Yang, Hao Xue, Imran Razzak, Hakim Hacid, Flora D. Salim · Oct 23, 2025
Long Horizon
A Head Agent provides guidance that leads retrieval, while an Iteration Agent selects and expands HSeq via structure-respecting actions (e.g., parent/child hops, table row/column neighbors, KG relations); Finally the head agent composes can
- Robust Preference Alignment via Directional Neighborhood Consensus
Ruochen Mao, Yuling Shi, Xiaodong Gu, Jiaheng Wei · Oct 23, 2025
Pairwise Preference
Aligning large language models with human preferences is critical for creating reliable and controllable AI systems.
- CreativityPrism: A Holistic Evaluation Framework for Large Language Model Creativity
Zhaoyi Joey Hou, Bowei Alvin Zhang, Yining Lu, Bhiman Kumar Baghel, Anneliese Brei · Oct 23, 2025
Creativity is often seen as a hallmark of human intelligence.
- A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee · Oct 22, 2025
Despite the rapid expansion of Large Language Models (LLMs) in healthcare, robust and explainable evaluation of their ability to assess clinical trial reporting according to CONSORT standards remains an open challenge.
- PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis · Oct 21, 2025
Rubric Rating
While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge.
- MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation
Chengshu Li, Mengdi Xu, Arpit Bahety, Hang Yin, Yunfan Jiang · Oct 21, 2025
Demonstrations Long Horizon
Imitation learning from large-scale, diverse human demonstrations has been shown to be effective for training robots, but collecting such data is costly and time-consuming.
- Latent-Augmented Discrete Diffusion Models
Dario Shariatian, Alain Durmus, Umut Simsekli, Stefano Peluchetti · Oct 20, 2025
Discrete diffusion models have emerged as a powerful class of models and a promising route to fast language generation, but practical implementations typically rely on factored reverse transitions that ignore cross-token dependencies and de
- SPACeR: Self-Play Anchoring with Centralized Reference Models
Wei-Jer Chang, Akshay Rangesh, Kevin Joseph, Matthew Strong, Masayoshi Tomizuka · Oct 20, 2025
Demonstrations Multi Agent
Developing autonomous vehicles (AVs) requires not only safety and efficiency, but also realistic, human-like behaviors that are socially aware and predictable.
- Annotation-Efficient Universal Honesty Alignment
Shiyu Ni, Keping Bi, Jiafeng Guo, Minghao Tang, Jingtong Wu · Oct 20, 2025
To support a large-scale study, we release HonestyBench, a benchmark covering ten free-form QA datasets with 560k training and 70k evaluation instances annotated with correctness and self-consistency signals.
- FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni · Oct 18, 2025
Human communication heavily relies on laconism and inferential pragmatics, allowing listeners to successfully reconstruct rich meaning from sparse, telegraphic speech.
- MNO: Multiscale Neural Operator for 3D Computational Fluid Dynamics
Qinxuan Wang, Chuang Wang, Mingyu Zhang, Jingwei Sun, Peipei Yang · Oct 17, 2025
We evaluate MNO on diverse benchmarks, covering steady-state and unsteady flow scenarios with up to 300k points.
- Detecting Early and Implicit Suicidal Ideation via Longitudinal and Information Environment Signals on Social Media
Soorya Ram Shimgekar, Ruining Zhao, Agam Goyal, Violeta J. Rodriguez, Paul A. Bloom · Oct 16, 2025
On social media, several individuals experiencing suicidal ideation (SI) do not disclose their distress explicitly.
- LUMI: Unsupervised Intent Clustering with Multiple Pseudo-Labels
I-Fan Lin, Faegheh Hasibi, Suzan Verberne · Oct 16, 2025
Our evaluation on four benchmark sets shows that our approach achieves competitive results, better than recent state-of-the-art baselines, while avoiding the need to estimate the number of clusters during embedding refinement, as is require
- Understanding the Ability of LLMs to Handle Character-Level Perturbation
Anyuan Zhuo, Xuefei Ning, Ningyuan Li, Jingyi Zhu, Yu Wang · Oct 16, 2025
Surprisingly, even under severe perturbation, such as shuffling nearly all words character-wise to produce text that is almost unreadable to humans, or inserting invisible characters which are several times more than the visible ones as noi
- Assessing Web Search Credibility and Response Groundedness in Chat Assistants
Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko · Oct 15, 2025
Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources.
- Closing the Gap Between Text and Speech Understanding in LLMs
Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu · Oct 15, 2025
Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech d
- MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning
Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan · Oct 15, 2025
Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%.
- Embedding-Based Context-Aware Reranker
Ye Yuan, Mohammad Amin Shabani, Siqi Liu · Oct 15, 2025
We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
- Toward LLM-Supported Automated Assessment of Critical Thinking Subskills
Marisa C. Peczuh, Nischal Ashok Kumar, Ryan Baker, Blair Lehman, Danielle Eisenberg · Oct 14, 2025
Rubric Rating
As the world becomes increasingly saturated with AI-generated content, disinformation, and algorithmic persuasion, critical thinking - the capacity to evaluate evidence, detect unreliable claims, and exercise independent judgment - is becom
- PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation
Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu · Oct 14, 2025
Long Horizon
Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining s
- Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing
Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu · Oct 14, 2025
Pairwise Preference
Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference.
- FML-bench: Benchmarking Machine Learning Agents for Scientific Research
Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen · Oct 12, 2025
Large language models (LLMs) have sparked growing interest in machine learning research agents that can autonomously propose ideas and conduct experiments.
- Mapping Semantic & Syntactic Relationships with Geometric Rotation
Michael Freenor, Lauren Alvarez · Oct 10, 2025
Demonstrations
Understanding how language and embedding models encode semantic relationships is fundamental to model interpretability.
- The Speech-LLM Takes It All: A Truly Fully End-to-End Spoken Dialogue State Tracking Approach
Nizar El Ghazal, Antoine Caubrière, Valentin Vielzeuf · Oct 10, 2025
This paper presents a comparative study of context management strategies for end-to-end Spoken Dialog State Tracking using Speech-LLMs.
- Chlorophyll-a Mapping and Prediction in the Mar Menor Lagoon Using C2RCC-Processed Sentinel 2 Imagery
Antonio Martínez-Ibarra, Aurora González-Vidal, Adrián Cánovas-Rodríguez, Antonio F. Skarmeta · Oct 10, 2025
The Mar Menor, Europe's largest hypersaline coastal lagoon, located in southeastern Spain, has undergone severe eutrophication crises, with devastating impacts on biodiversity and water quality.
- Verifying Chain-of-Thought Reasoning via Its Computational Graph
Zheng Zhao, Yeskendir Koishekenov, Xianjun Yang, Naila Murray, Nicola Cancedda · Oct 10, 2025
Current Chain-of-Thought (CoT) verification methods predict reasoning correctness based on outputs (black-box) or activations (gray-box), but offer limited insight into why a computation fails.
- Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs
Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang · Oct 10, 2025
To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection proces
- FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs
Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao · Oct 10, 2025
We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings.
- Lossless Vocabulary Reduction for Auto-Regressive Language Models
Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba · Oct 9, 2025
Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models.
- Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility
Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger · Oct 9, 2025
We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by L
- PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing
Anthony Hughes, Vasisht Duddu, N. Asokan, Nikolaos Aletras, Ning Ma · Oct 8, 2025
Language models (LMs) may memorize personally identifiable information (PII) from training data, enabling adversaries to extract it during inference.
- EconCausal: A Context-Aware Causal Reasoning Benchmark for Large Language Models in Social Science
Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park · Oct 8, 2025
To address this, we introduce EconCausal, a large-scale benchmark comprising 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies published in top-tier economics and finance journals.
- Multi-hop Deep Joint Source-Channel Coding with Deep Hash Distillation for Semantically Aligned Image Recovery
Didrik Bergström, Deniz Gündüz, Onur Günlü · Oct 8, 2025
We consider image transmission via deep joint source-channel coding (DeepJSCC) over multi-hop additive white Gaussian noise (AWGN) channels by training a DeepJSCC encoder-decoder pair with a pre-trained deep hash distillation (DHD) module t
- Peeking inside the Black-Box: Reinforcement Learning for Explainable and Accurate Relation Extraction
Xinyu Guo, Zhengliang Shi, Minglai Yang, Mahdi Rahimi, Mihai Surdeanu · Oct 7, 2025
Finally, human evaluation shows that our best model generates relational keywords closely aligned with gold labels, increasing human explanation quality ratings by 54% (relative).
- Early Multimodal Prediction of Cross-Lingual Meme Virality on Reddit: A Time-Window Analysis
Sedat Dogan, Nina Dethlefs, Debarati Chakraborty · Oct 7, 2025
We benchmark interpretable baselines (XGBoost, MLP) against end-to-end deep models (BERT, InceptionV3, CLIP) across early observation windows from 30 to 420 minutes.
- Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies
Chunsan Hong, Seonho An, Min-Soo Kim, Jong Chul Ye · Oct 7, 2025
Empirically, across four benchmarks, our learned policy consistently outperforms max-confidence: for example, on SUDOKU, where unmasking order is critical, it yields a 20.1% gain over random and a 11.2% gain over max-confidence.
- AgentDR: Dynamic Recommendation with Implicit Item-Item Relations via LLM-based Agents
Mingdai Yang, Nurendra Choudhary, Jiangshu Du, Edward W. Huang, Philip S. Yu · Oct 7, 2025
Recent agent-based recommendation frameworks aim to simulate user behaviors by incorporating memory mechanisms and prompting strategies, but they struggle with hallucinating non-existent items and full-catalog ranking.