- Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning
Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen · Nov 2, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Belief Dynamics Reveal the Dual Nature of In-Context Learning and Activation Steering
Eric Bigelow, Daniel Wurgaft, YingQiao Wang, Noah Goodman, Tomer Ullman · Nov 1, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Addressing Longstanding Challenges in Cognitive Science with Language Models
Dirk U. Wulff, Rui Mata · Oct 31, 2025 · Citations: 0
- Can SAEs reveal and mitigate racial biases of LLMs in healthcare?
Hiba Ahsan, Byron C. Wallace · Oct 31, 2025 · Citations: 0
- BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning
Qiusi Zhan, Hyeonjeong Ha, Rui Yang, Sirui Xu, Hanyang Chen · Oct 31, 2025 · Citations: 0
Pairwise Preference · Long Horizon
We introduce BEAT, the first framework to inject such visual backdoors into VLM-based embodied agents using objects in the environments as triggers.
- When Distributions Shift: Causal Generalization for Low-Resource Languages
Mahi Aliyu Aminu, Chisom Chibuike, Fatimo Adebanjo, Omokolade Awosanya, Samuel Oyeneye · Oct 31, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Analysing Environmental Efficiency in AI for X-Ray Diagnosis
Liam Kearns · Oct 31, 2025 · Citations: 0
This provides a benchmark study of 14 different model configurations for comparison of diagnostic accuracy and environmental impact.
- DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains
Tian Liang, Wenxiang Jiao, Zhiwei He, Jiahao Xu, Haitao Mi · Oct 31, 2025 · Citations: 0
Experimental results on challenging mathematical benchmarks show that DeepCompress consistently outperforms baseline methods, achieving superior accuracy while significantly improving token efficiency.
- Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs
Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani · Oct 31, 2025 · Citations: 0
Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative…
- Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton, York Hay Ng, Aditya Khan, Phuong Hanh Hoang, Xiang Lu · Oct 31, 2025 · Citations: 0
Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with performance gains up to 6% in certain setups.
- Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler · Oct 31, 2025 · Citations: 0
Multi Agent
Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts?
- Probability Distributions Computed by Autoregressive Transformers
Andy Yang, Anej Svete, Jiaoda Li, Anthony Widjaja Lin, Jonathan Rawski · Oct 31, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- VISTA: Verification In Sequential Turn-based Assessment
Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White · Oct 30, 2025 · Citations: 0
Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines.
- Reasoning Up the Instruction Ladder for Controllable Language Models
Zishuo Zheng, Vidhisha Balachandran, Chan Young Park, Faeze Brahman, Sachin Kumar · Oct 30, 2025 · Citations: 0
Red Team
Our finetuned models achieve consistent improvements on instruction-following and instruction-hierarchy benchmarks, including a roughly 20% gain on the IHEval conflict setup.
- Frame Semantic Patterns for Identifying Underreporting of Notifiable Events in Healthcare: The Case of Gender-Based Violence
Lívia Dutra, Arthur Lorenzi, Laís Berno, Franciany Campos, Karoline Biscardi · Oct 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Temporal Sparse Autoencoders: Leveraging the Sequential Nature of Language for Interpretability
Usha Bhalla, Alex Oesterling, Claudio Mayrink Verdun, Himabindu Lakkaraju, Flavio P. Calmon · Oct 30, 2025 · Citations: 0
Translating the internal representations and computations of models into concepts that humans can understand is a key goal of interpretability.
- LLMs Process Lists With General Filter Heads
Arnab Sen Sharma, Giordano Rogers, Natalie Shapira, David Bau · Oct 30, 2025 · Citations: 0
Our results reveal that transformer LMs can develop human-interpretable implementations of abstract computational operations that generalize in ways that are surprisingly similar to strategies used in traditional functional programming…
- Evontree: Ontology Rule-Guided Self-Evolution of Large Language Models
Mingchen Tu, Zhiqiang Liu, Juan Li, Liangyurui Liu, Junjie Wang · Oct 30, 2025 · Citations: 0
Extensive evaluations on medical QA benchmarks using Llama3-8B-Instruct and Med42-V2 demonstrate the effectiveness of Evontree, which outperforms both the base models and strong baselines, achieving up to a 3.7% improvement in accuracy.
- Inference-Cost-Aware Dynamic Tree Construction for Efficient Inference in Large Language Models
Yinrong Hong, Zhiquan Tan, Kai Hu · Oct 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- GraphKeeper: Graph Domain-Incremental Learning via Knowledge Disentanglement and Preservation
Zihao Guo, Qingyun Sun, Ziwei Zhang, Haonan Yuan, Huiping Zhuang · Oct 30, 2025 · Citations: 0
- Co-Evolving Latent Action World Models
Yucen Wang, Fengming Zhang, De-Chuan Zhan, Li Zhao, Kaixin Wang · Oct 30, 2025 · Citations: 0
- The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration
Kotaro Furuya, Yuichi Kitagawa · Oct 30, 2025 · Citations: 0
Pairwise Preference · Multi Agent
While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition.
- SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection
Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan · Oct 30, 2025 · Citations: 0
SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions.
- Are Language Models Borrowing-Blind? A Multilingual Evaluation of Loanword Identification across 10 Languages
Mérilin Sousa Silva, Sina Ahmadi · Oct 30, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa · Oct 30, 2025 · Citations: 0
We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans.
- Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning
Yihe Deng, I-Hung Hsu, Jun Yan, Zifeng Wang, Rujun Han · Oct 29, 2025 · Citations: 0
Demonstrations · Long Horizon
Beyond reasoning benchmarks, SRL generalizes effectively to agentic software engineering tasks, establishing it as a robust and versatile training framework for reasoning-oriented LLMs.
- RECAP: Reproducing Copyrighted Data from LLMs Training with an Agentic Pipeline
André V. Duarte, Xuying Li, Bin Zeng, Arlindo L. Oliveira, Lei Li · Oct 29, 2025 · Citations: 0
Red Team
As such, we propose RECAP, an agentic pipeline designed to elicit and verify memorized training data from LLM outputs.
- Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters
Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye · Oct 29, 2025 · Citations: 0
Large language models (LLMs) are increasingly used as raters for evaluation tasks.
- TheraMind: A Strategic and Adaptive Agent for Longitudinal Psychological Counseling
He Hu, Chiyuan Ma, Qianning Wang, Lin Liu, Yucheng Zhou · Oct 29, 2025 · Citations: 0
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu · Oct 29, 2025 · Citations: 0
Long Horizon
To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation.
- From Medical Records to Diagnostic Dialogues: A Clinical-Grounded Approach and Dataset for Psychiatric Comorbidity
Tianxi Wan, Jiaming Luo, Siyuan Chen, Kunyao Lan, Jianhua Chen · Oct 29, 2025 · Citations: 0
Multi Agent
To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation.
- Activation-Space Personality Steering: Hybrid Layer Selection for Stable Trait Control in LLMs
Pranav Bhandari, Nicolas Fay, Sanjeevan Selvaganapathy, Amitava Datta, Usman Naseem · Oct 29, 2025 · Citations: 0
We propose a novel pipeline that extracts hidden state activations from transformer layers using the Big Five Personality Traits (Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism), which is a comprehensive and…
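The snippet above ends mid-sentence, but the underlying idea of activation-space steering is a generic, well-documented technique that can be sketched independently of the paper's pipeline. The sketch below is illustrative only (the function names, toy data, and mean-difference direction are assumptions, not the authors' implementation): compute a trait direction from contrastive activations, then shift a hidden state along it.

```python
import numpy as np

def trait_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-norm mean-difference steering vector between trait-positive
    and trait-negative activation sets (rows = examples)."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden-state vector along the trait direction with strength alpha."""
    return hidden + alpha * direction

# Toy stand-ins for layer activations gathered on high- vs. low-trait prompts.
rng = np.random.default_rng(0)
pos = rng.normal(1.0, 0.1, size=(32, 8))
neg = rng.normal(-1.0, 0.1, size=(32, 8))

d = trait_direction(pos, neg)
h = steer(np.zeros(8), d, alpha=2.0)  # steered state now has positive projection on d
```

In practice the direction would be extracted per layer from real transformer activations, and alpha tuned for stability, which is where the paper's hybrid layer selection comes in.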
- World Simulation with Video Foundation Models for Physical AI
NVIDIA: Arslan Ali, Junjie Bai, Maciej Bala · Oct 28, 2025 · Citations: 0
Long Horizon
These capabilities enable more reliable synthetic data generation, policy evaluation, and closed-loop simulation for robotics and autonomous systems.
- Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish
Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu · Oct 28, 2025 · Citations: 0
In natural language processing, there remains a notable scarcity of grammar-focused evaluation protocols, a gap that is even more pronounced for low-resource languages.
- Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou · Oct 28, 2025 · Citations: 0
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang · Oct 28, 2025 · Citations: 0
Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy.
- Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin · Oct 28, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI: Bowen Ma, Cheng Zou, ChengKun Du · Oct 28, 2025 · Citations: 0
- LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Siwen Guo, Jordi Cabot · Oct 28, 2025 · Citations: 0
Following generation, we apply a quality-assurance process employing an LLM-as-a-judge approach, retaining 227,507 high-quality instruction-answer pairs.
- SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu · Oct 28, 2025 · Citations: 0
- Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
Lingyue Fu, Bolun Zhang, Hao Guan, Yaoming Zhu, Lin Qiu · Oct 28, 2025 · Citations: 0
Expert Verification
To address these challenges, we propose an agent-driven benchmark construction pipeline that leverages human supervision to efficiently generate diverse project-level tasks.
- Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren · Oct 28, 2025 · Citations: 0
Long Horizon
To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct…
- MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Aaron Scott, Maike Züfle, Jan Niehues · Oct 28, 2025 · Citations: 0
- GIFT: Group-Relative Implicit Fine-Tuning Integrates GRPO with DPO and UNA
Zhichao Wang · Oct 27, 2025 · Citations: 0
Pairwise Preference
This paper proposes Group-relative Implicit Fine-Tuning (GIFT), a reinforcement learning framework for aligning large language models (LLMs) that unifies on-policy optimization with implicit preference learning.
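GIFT's "implicit preference learning" component builds on the publicly documented DPO objective; a minimal sketch of that standard pairwise loss is shown below for context. This is the textbook DPO formula, not GIFT's combined objective (the function name and inputs are illustrative): the loss is the negative log-sigmoid of the policy's log-probability margin over the chosen vs. rejected response, relative to a frozen reference model, scaled by beta.

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """Standard DPO pairwise loss for one (chosen, rejected) pair:
    -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

When policy and reference agree (zero margin) the loss is log 2; it shrinks as the policy assigns relatively more probability to the preferred response, which is the sense in which the preference signal is "implicit" rather than coming from an explicit reward model.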
- Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab · Oct 27, 2025 · Citations: 0
We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural…
- A Survey of Data Agents: Emerging Paradigm or Overstated Hype?
Yizhang Zhu, Liangwei Wang, Chenyu Yang, Xiaotian Lin, Boyan Li · Oct 27, 2025 · Citations: 0
The rapid advancement of large language models (LLMs) has spurred the emergence of data agents, autonomous systems designed to orchestrate Data + AI ecosystems for tackling complex data-related tasks.
- RobotArena ∞: Scalable Robot Benchmarking via Real-to-Sim Translation
Yash Jangir, Yidi Zhang, Pang-Chi Lo, Kashu Yamazaki, Chenyu Zhang · Oct 27, 2025 · Citations: 0
- An Information-Theoretic Analysis of OOD Generalization in Meta-Reinforcement Learning
Xingtu Liu · Oct 27, 2025 · Citations: 0
- Quantifying Systemic Vulnerability in the Foundation Model Industry
Claudio Pirrone, Stefano Fricano, Gioacchino Fazio · Oct 27, 2025 · Citations: 0
- SwiftEmbed: Ultra-Fast Text Embeddings via Static Token Lookup for Real-Time Applications
Edouard Lansiaux, Antoine Simonet, Eric Wiel · Oct 27, 2025 · Citations: 0
Evaluation demonstrates exceptional duplicate detection performance (90.1% AP) and strong semantic similarity (76.1% Spearman correlation).
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning
Ran Xu, Jingjing Chen, Jiayu Ye, Yu Wu, Jun Yan · Oct 27, 2025 · Citations: 0
Pairwise Preference
Motivated by the success of tool-integrated reasoning (TIR) in numerous tasks, we propose TIR-Judge, an end-to-end RL framework for training LLM judges that integrates a code executor for precise evaluation.
- Understanding In-Context Learning Beyond Transformers: An Investigation of State Space and Hybrid Architectures
Shenran Wang, Timothy Tin-Long Tse, Jian Zhu · Oct 27, 2025 · Citations: 0
- Batch Speculative Decoding Done Right
Ranran Haoran Zhang, Soumik Dey, Ashirbad Mishra, Hansi Wu, Binbin Li · Oct 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao · Oct 26, 2025 · Citations: 0
Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation.
- Low-Resource Dialect Adaptation of Large Language Models: A French Dialect Case-Study
Eeham Khan, Firas Saidani, Owen Van Esbroeck, Richard Khoury, Leila Kosseim · Oct 26, 2025 · Citations: 0
- REVISION: Reflective Intent Mining and Online Reasoning Auxiliary for E-commerce Visual Search System Optimization
Yiwen Tang, Qiuyu Zhao, Zenghui Sun, Jinsong Lan, Xiaoyong Zhu · Oct 26, 2025 · Citations: 0
Critique · Edit
To alleviate the issue, we propose a novel framework REVISION.
- Rule-Based Explanations for Retrieval-Augmented LLM Systems
Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta · Oct 26, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Towards Scalable Oversight via Partitioned Human Supervision
Ren Yin, Takashi Ishida, Masashi Sugiyama · Oct 26, 2025 · Citations: 0
As artificial intelligence (AI) systems approach and surpass expert human performance across a broad range of tasks, obtaining high-quality human supervision for evaluation and training becomes increasingly challenging.
- VisJudge-Bench: Aesthetics and Quality Assessment of Visualizations
Yupeng Xie, Zhiyang Zhang, Yifan Wu, Sirong Lu, Jiayi Zhang · Oct 25, 2025 · Citations: 0
To address this, we propose VisJudge-Bench, the first comprehensive benchmark for evaluating MLLMs' performance in assessing visualization aesthetics and quality.
- WAON: Large-Scale Japanese Image-Text Pair Dataset for Improving Model Performance on Japanese Cultural Tasks
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Daisuke Kawahara, Yasuo Okabe · Oct 25, 2025 · Citations: 0
To improve the quality and reliability of evaluation on Japanese cultural tasks, we also construct WAON-Bench, a manually curated benchmark for Japanese cultural image classification comprising 374 classes, which addresses issues in the…