- From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang · Dec 21, 2025 · Citations: 0
Long Horizon
Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale.
- Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale
Saad Alqithami · Dec 21, 2025 · Citations: 0
- NASTaR: NovaSAR Automated Ship Target Recognition Dataset
Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim · Dec 20, 2025 · Citations: 0
- Exploration vs. Fixation: Scaffolding Divergent and Convergent Thinking for Human-AI Co-Creation with Generative Models
Chao Wen, Tung Phung, Pronita Mehrotra, Sumit Gulwani, Roger E. Beaty · Dec 20, 2025 · Citations: 0
We examine an approach grounded in the Geneplore model of creative cognition and instantiate it in a human-AI co-creation system, HAICo, for creative image generation.
- Towards Efficient Agents: A Co-Design of Inference Architecture and System
Weizhe Lin, Hui-Ling Zhen, Shuai Yang, Xian Wang, Renxi Liu · Dec 20, 2025 · Citations: 0
Long Horizon
The rapid development of large language model (LLM)-based agents has unlocked new possibilities for autonomous multi-turn reasoning and tool-augmented decision-making.
- DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation
Janghoon Han, Heegyu Kim, Changho Lee, Dahm Lee, Min Hyung Park · Dec 19, 2025 · Citations: 0
Rubric Rating Expert Verification Long Horizon
However, evaluating such reports remains challenging: report quality is multifaceted, making it difficult to determine what to assess and by what criteria; LLM-based judges may miss errors that require domain expertise to identify; and…
- Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad · Dec 19, 2025 · Citations: 0
ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
- RadImageNet-VQA: A Large-Scale CT and MRI Dataset for Radiologic Visual Question Answering
Léo Butsanets, Charles Corbière, Julien Khlaut, Pierre Manceron, Corentin Dancette · Dec 19, 2025 · Citations: 0
The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
- Value Under Ignorance in Universal Artificial Intelligence
Cole Wyeth, Marcus Hutter · Dec 18, 2025 · Citations: 0
We generalize the AIXI reinforcement learning agent to admit a wider class of utility functions.
- Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL
Khushboo Thaker, Yony Bresler · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- SFBD-OMNI: Bridge models for lossy measurement restoration with limited clean samples
Haoye Lu, Yaoliang Yu, Darren Lo · Dec 18, 2025 · Citations: 0
- Generative Adversarial Reasoner: Enhancing LLM Reasoning with Adversarial Reinforcement Learning
Qihao Liu, Luoxin Ye, Wufei Ma, Yu-Cheng Chou, Alan Yuille · Dec 18, 2025 · Citations: 0
Pairwise Preference
Across various mathematical benchmarks, the method delivers consistent gains over strong baselines with standard RL post-training.
- In-Context Algebra
Eric Todd, Jannik Brinkmann, Rohit Gandikota, David Bau · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Refusal Steering: Fine-grained Control over LLM Refusal Behaviour for Sensitive Topics
Iker García-Ferrero, David Montero, Roman Orus · Dec 18, 2025 · Citations: 0
Red Team
We replace fragile pattern-based refusal detection with an LLM-as-a-judge that assigns refusal confidence scores and we propose a ridge-regularized variant to compute steering vectors that better isolate the refusal-compliance direction.
- TTP: Test-Time Padding for Adversarial Detection and Robust Adaptation on Vision-Language Models
Zhiwei Li, Yitian Pang, Weining Wang, Zhenan Sun, Qi Li · Dec 18, 2025 · Citations: 0
Vision-Language Models (VLMs), such as CLIP, have achieved impressive zero-shot recognition performance but remain highly susceptible to adversarial perturbations, posing significant risks in safety-critical scenarios.
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
Sara Papi, Javier Garcia Gilabert, Zachary Hopton, Vilém Zouhar, Carlos Escolano · Dec 18, 2025 · Citations: 0
We present Hearing to Translate, the first comprehensive test suite rigorously benchmarking 6 state-of-the-art SpeechLLMs against 16 strong direct and cascade systems that couple leading speech foundation models (SFM), with multilingual…
- Pretrained battery transformer (PBT): A foundation model for universal battery life prediction
Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang · Dec 18, 2025 · Citations: 0
- Agent Tools Orchestration Leaks More: Dataset, Benchmark, and Mitigation
Yuxuan Qiao, Dongqin Liu, Hongchang Yang, Wei Zhou, Songlin Hu · Dec 18, 2025 · Citations: 0
Evaluation of six state-of-the-art LLMs reveals pervasive risk: the average Overall Leakage Rate reaches 62.11% with an H-Score of only 52.90%.
- Adaptation of Agentic AI: A Survey of Post-Training, Memory, and Skills
Pengcheng Jiang, Jiacheng Lin, Zhiyi Shi, Zifeng Wang, Luxi He · Dec 18, 2025 · Citations: 0
Pairwise Preference Tool Use
Large language model (LLM) agents are moving beyond prompting alone.
- A Domain-Adapted Pipeline for Structured Information Extraction from Police Incident Announcements on Social Media
Mengfan Shen, Kangqi Song, Xindi Wang, Wei Jia, Tao Wang · Dec 18, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Social Story Frames: Contextual Reasoning about Narrative Intent and Reception
Joel Mire, Maria Antoniak, Steven R. Wilson, Zexin Ma, Achyutarama R. Ganti · Dec 17, 2025 · Citations: 0
- Stepwise Think-Critique: A Unified Framework for Robust and Interpretable LLM Reasoning
Jiaqi Xu, Cuiling Lan, Xuejin Chen, Yan Lu · Dec 17, 2025 · Citations: 0
- Learning continuous state of charge dependent thermal decomposition kinetics for Li-ion cathodes using Kolmogorov-Arnold Chemical Reaction Neural Networks (KA-CRNNs)
Benjamin C. Koenig, Sili Deng · Dec 17, 2025 · Citations: 0
- Physics-driven human-like working memory outperforms digital networks in dynamic vision
Jingli Liu, Huannan Zheng, Bohao Zou, Kezhou Yang · Dec 17, 2025 · Citations: 0
- Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections
Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn · Dec 17, 2025 · Citations: 0
- The Moralization Corpus: Frame-Based Annotation and Analysis of Moralizing Speech Acts across Diverse Text Genres
Maria Becker, Mirko Sommer, Lars Tapken, Yi Wan Teh, Bruno Brocai · Dec 17, 2025 · Citations: 0
Moralizations are pragmatically complex and often implicit, posing significant challenges for both human annotators and NLP systems.
- MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang · Dec 17, 2025 · Citations: 0
- Imitation Game: Reproducing Deep Learning Bugs Leveraging an Intelligent Agent
Mehil B Shah, Mohammad Masudur Rahman, Foutse Khomh · Dec 17, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Jun Zhang, Teng Wang, Yuying Ge, Yixiao Ge, Xinhao Li · Dec 16, 2025 · Citations: 0
We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks with strict quality criteria.
- A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Sohom Dey, Christian Matek · Dec 16, 2025 · Citations: 0
Deep learning methods could assist pathologists by extracting diagnostic information from routinely available HE-stained slides directly, yet comprehensive benchmarks for lymphoma subtyping on multicenter data are lacking.
- Dual-objective Language Models: Training Efficiency Without Overfitting
David Samuel, Lucas Georges Gabriel Charpentier · Dec 16, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- CSyMR: Benchmarking Compositional Music Information Retrieval in Symbolic Music Reasoning
Boyang Wang, Yash Vishe, Xin Xu, Zachary Novack, Xunyi Jiang · Dec 16, 2025 · Citations: 0
- GRAFT: Grid-Aware Load Forecasting with Multi-Source Textual Alignment and Fusion
Fangzhou Lin, Guoshun He, Zhenyu Guo, Zhe Huang, Jinsong Tao · Dec 16, 2025 · Citations: 0
- RePo: Language Models with Context Re-Positioning
Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat · Dec 16, 2025 · Citations: 0
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Systematic Analysis of Biases in Large Language Models
Xulang Zhang, Rui Mao, Erik Cambria · Dec 16, 2025 · Citations: 0
- Olmo 3
Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl · Dec 15, 2025 · Citations: 0
- Towards Interactive Intelligence for Digital Humans
Yiyi Cai, Xuangeng Chu, Xiwei Gao, Sitong Gong, Yifei Huang · Dec 15, 2025 · Citations: 0
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution.
- ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li · Dec 15, 2025 · Citations: 0
Extensive experiments on seven diverse benchmarks show that ReFusion not only overwhelmingly surpasses prior MDMs with a 34% performance gain and an over 18× speedup on average, but also bridges the performance gap to strong ARMs…
- NRR-Core: Non-Resolution Reasoning as a Computational Framework for Contextual Identity and Ambiguity Preservation
Kei Saito · Dec 15, 2025 · Citations: 0
In the narrow non-evaluative read adopted later in the series, the practical point is not that no judgment ever occurs, but that retained alternatives need not be implemented as repeated full branchwise comparative evaluation during…
- On the Effectiveness of Membership Inference in Targeted Data Extraction from Large Language Models
Ali Al Sahili, Ali Chehab, Razane Tajeddine · Dec 15, 2025 · Citations: 0
- What Makes an Ideal Quote? Recommending "Unexpected yet Rational" Quotations via Novelty
Bowei Zhang, Jin Xiao, Guanglei Yue, Qianyu He, Yanghua Xiao · Dec 15, 2025 · Citations: 0
A generative label agent first interprets each quotation and its surrounding context into multi-dimensional deep-meaning labels, enabling label-enhanced retrieval.
- GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training
Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi · Dec 15, 2025 · Citations: 0
- Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection
Xuwei Tan, Yao Ma, Xueru Zhang · Dec 15, 2025 · Citations: 0
Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions.
- Revisiting the Reliability of Language Models in Instruction-Following
Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei · Dec 15, 2025 · Citations: 0