- AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents
Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu · Feb 15, 2026 · Citations: 0
Expert Verification Simulation Env Long Horizon
While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem.
- Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
Chi-Pin Huang, Yunze Man, Zhiding Yu, Min-Hung Chen, Jan Kautz · Jan 14, 2026 · Citations: 0
Pairwise Preference Simulation Env Long Horizon
Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher, driven by a preference-guided objective to align manipulation trajectories that transfers both linguistic and visual planning capabilities for embodie
- TherapyProbe: Generating Design Knowledge for Relational Safety in Mental Health Chatbots Through Adversarial Simulation
Joydeep Chandra, Satyam Kumar Navneet, Yong Zhang · Feb 26, 2026 · Citations: 0
Expert Verification Simulation Env Multi Agent
As mental health chatbots proliferate to address the global treatment gap, a critical question emerges: How do we design for relational safety, the quality of interaction patterns that unfold across conversations rather than the correctness…
- Embodied Task Planning via Graph-Informed Action Generation with Large Language Model
Xiang Li, Ning Yan, Masood Mortazavi · Jan 29, 2026 · Citations: 0
Simulation Env Long Horizon
We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture.
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents
Huanyao Zhang, Jiepeng Zhou, Bo Li, Bowen Zhou, Yanzhe Shan · Feb 13, 2026 · Citations: 0
Automatic Metrics Simulation Env Web Browsing
Multimodal large language models (MLLMs), equipped with increasingly advanced planning and tool-use capabilities, are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments.
- From Pixels to Policies: Reinforcing Spatial Reasoning in Language Models for Content-Aware Layout Design
Sha Li, Stefano Petrangeli, Yu Shen, Xiang Chen · Feb 14, 2026 · Citations: 0
Critique Edit Simulation Env
We introduce LaySPA, a reinforcement learning framework that equips large language models (LLMs) with explicit and interpretable spatial reasoning for content-aware graphic layout design.
- FrameRef: A Framing Dataset and Simulation Testbed for Modeling Bounded Rational Information Health
Victor De Lima, Jiqun Liu, Grace Hui Yang · Feb 17, 2026 · Citations: 0
Human Eval Simulation Env Long Horizon
Within this framework, we construct framing-sensitive agent personas by fine-tuning language models with framing-conditioned loss attenuation, inducing targeted biases while preserving overall task competence.
- Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore · Feb 23, 2026 · Citations: 0
Red Team Simulation Env
Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue.
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Bingyang Zheng, Cheng Wang · Jan 15, 2026 · Citations: 0
Simulation Env Long Horizon
The advancement of artificial intelligence toward agentic science is currently bottlenecked by the challenge of ultra-long-horizon autonomy, the ability to sustain strategic coherence and iterative correction over experimental cycles spanni
- World-Model-Augmented Web Agents with Action Correction
Zhouzhou Shen, Xueyu Hu, Xiyun Li, Tianqing Fang, Juncheng Li · Feb 17, 2026 · Citations: 0
LLM As Judge Simulation Env Multi Agent
To address these challenges, we propose WAC, a web agent that integrates model collaboration, consequence simulation, and feedback-driven action refinement.
- The Trinity of Consistency as a Defining Principle for General World Models
Jingxuan Wei, Siyuan Li, Yuhang Xu, Zheng Sun, Junjie Jiang · Feb 26, 2026 · Citations: 0
Simulation Env Long Horizon
To complement this conceptual framework, we introduce CoW-Bench, a benchmark centered on multi-frame reasoning and generation scenarios.
- Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng · Feb 26, 2026 · Citations: 0
Simulation Env Long Horizon
Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks.
- SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards
Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env Long Horizon
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- Self-Correcting VLA: Online Action Refinement via Sparse World Imagination
Chenyv Liu, Wentao Tan, Lei Zhu, Fengling Li, Jingjing Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Reinforcement learning enhances physical grounding through exploration yet typically relies on external reward signals that remain isolated from the agent's internal states.
- ARLArena: A Unified Framework for Stable Agentic Reinforcement Learning
Xiaoxuan Wang, Han Zhang, Haixin Wang, Yidan Shi, Ruoyan Li · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
Agentic reinforcement learning (ARL) has rapidly gained attention as a promising paradigm for training agents to solve complex, multi-step interactive tasks.
- Multimodal Multi-Agent Empowered Legal Judgment Prediction
Zhaolu Kang, Junhao Gong, Qingxi Chen, Hao Zhang, Jiaxin Liu · Jan 19, 2026 · Citations: 0
Simulation Env Multi Agent
Furthermore, we build JurisMM, a large dataset with over 100,000 recent Chinese judicial records, including both text and multimodal video-text data, enabling comprehensive evaluation.
- LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
Yue Yang, Shuo Cheng, Yu Fang, Homanga Bharadhwaj, Mingyu Ding · Feb 25, 2026 · Citations: 0
Simulation Env Long Horizon
We introduce a 21-task simulation benchmark consisting of two challenging suites: LIBERO-Long++ and Ultra-Long.
- Efficient Hierarchical Any-Angle Path Planning on Multi-Resolution 3D Grids
Victor Reijgwart, Cesar Cadena, Roland Siegwart, Lionel Ott · Feb 24, 2026 · Citations: 0
Simulation Env Long Horizon
Hierarchical, multi-resolution volumetric mapping approaches are widely used to represent large and complex environments as they can efficiently capture their occupancy and connectivity information.
- KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models
Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang · Jan 30, 2026 · Citations: 0
Automatic Metrics Simulation Env
Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation.
- Cooperative-Competitive Team Play of Real-World Craft Robots
Rui Zhao, Xihui Li, Yizheng Zhang, Yuzhen Liu, Zhong Zhang · Feb 24, 2026 · Citations: 0
Simulation Env Multi Agent
Multi-agent deep Reinforcement Learning (RL) has made significant progress in developing intelligent game-playing agents in recent years.
- Architecting AgentOS: From Token-Level Context to Emergent System-Level Intelligence
ChengYou Li, XiaoDong Liu, XiangBao Meng, XinYu Zhao · Feb 24, 2026 · Citations: 0
Simulation Env Multi Agent
The paradigm of Large Language Models is undergoing a fundamental transition from static inference engines to dynamic autonomous cognitive systems. While current research primarily focuses on scaling context windows or optimizing prompt engi
- Contextual Safety Reasoning and Grounding for Open-World Robots
Zachary Ravichandran, David Snyder, Alexander Robey, Hamed Hassani, Vijay Kumar · Feb 23, 2026 · Citations: 0
Simulation Env Web Browsing
Traditional safety approaches enforce fixed constraints in user-specified contexts, limiting their ability to handle the open-ended contextual variability of real-world deployment.
- MALLVI: A Multi-Agent Framework for Integrated Generalized Robotics Manipulation
Iman Ahmadi, Mehrshad Taji, Arad Mahdinezhad Kashani, AmirHossein Jadidi, Saina Kashani · Feb 18, 2026 · Citations: 0
Simulation Env Multi Agent
MALLVI presents a Multi-Agent Large Language and Vision framework that enables closed-loop, feedback-driven robotic manipulation.
- OR-Agent: Bridging Evolutionary Search and Structured Research for Automated Algorithm Discovery
Qi Liu, Ruochen Hao, Can Li, Wanjing Ma · Feb 14, 2026 · Citations: 0
Simulation Env Multi Agent
We present OR-Agent, a configurable multi-agent research framework designed for automated exploration in rich experimental environments.
- Counterfactual Simulation Training for Chain-of-Thought Faithfulness
Peter Hase, Christopher Potts · Feb 24, 2026 · Citations: 0
Automatic Metrics Simulation Env
In this paper, we introduce a training method called Counterfactual Simulation Training (CST), which aims to improve CoT faithfulness by rewarding CoTs that enable a simulator to accurately predict a model's outputs over counterfactual…
- Context-Aware Mapping of 2D Drawing Annotations to 3D CAD Features Using LLM-Assisted Reasoning for Manufacturing Automation
Muhammad Tayyab Khan, Lequn Chen, Wenhe Feng, Seung Ki Moon · Feb 20, 2026 · Citations: 0
Automatic Metrics Simulation Env
When deterministic scoring cannot resolve an ambiguity, the system escalates to multimodal and constrained large-language-model reasoning, followed by a single human-in-the-loop (HITL) review step.
- LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure
Jaehong Cho, Hyunmin Choi, Guseul Heo, Jongse Park · Feb 26, 2026 · Citations: 0
Simulation Env
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager, Simon Münker, Alistair Plum, Achim Rettinger · Feb 26, 2026 · Citations: 0
Simulation Env
This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior.
- Dynamic Personality Adaptation in Large Language Models via State Machines
Leon Pielage, Ole Hätscher, Mitja Back, Bernhard Marschall, Benjamin Risse · Feb 25, 2026 · Citations: 0
Simulation Env
This work demonstrates the feasibility of modular, personality-adaptive architectures for education, customer support, and broader human-computer interaction.
- Scalable Kernel-Based Distances for Statistical Inference and Integration
Masha Naslidnyk · Feb 25, 2026 · Citations: 0
Simulation Env
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Knowledge-Driven Approach to Music Segmentation, Music Source Separation and Cinematic Audio Source Separation
Chun-wei Ho, Sabato Marco Siniscalchi, Kai Li, Chin-Hui Lee · Feb 25, 2026 · Citations: 0
Simulation Env
Evaluation on simulation data shows that score-guided learning achieves very good music segmentation and separation results.
- POMDPPlanners: Open-Source Package for POMDP Planning
Yaacov Pariente, Vadim Indelman · Feb 24, 2026 · Citations: 0
Simulation Env
We present POMDPPlanners, an open-source Python package for empirical evaluation of Partially Observable Markov Decision Process (POMDP) planning algorithms.
- PreScience: A Benchmark for Forecasting Scientific Contributions
Anirudh Ajith, Amanpreet Singh, Jay DeYoung, Nadav Kunievsky, Austin C. Kozlowski · Feb 24, 2026 · Citations: 0
Simulation Env
We introduce PreScience -- a scientific forecasting benchmark that decomposes the research process into four interdependent generative tasks: collaborator prediction, prior work selection, contribution generation, and impact prediction.
- InterviewSim: A Scalable Framework for Interview-Grounded Personality Simulation
Yu Li, Pranav Narayanan Venkit, Yada Pruksachatkun, Chien-Sheng Wu · Feb 23, 2026 · Citations: 0
Simulation Env
Existing evaluation approaches rely on demographic surveys, personality questionnaires, or short AI-led interviews as proxies, but lack direct assessment against what individuals actually said.
- Reasoning Capabilities of Large Language Models. Lessons Learned from General Game Playing
Maciej Świechowski, Adam Żychowski, Jacek Mańdziuk · Feb 22, 2026 · Citations: 0
Simulation Env
The main results indicate that three of the evaluated models generally perform well across most experimental settings, with performance degradation observed as the evaluation horizon increases (i.e., with a higher number of game steps).
- NIMMGen: Learning Neural-Integrated Mechanistic Digital Twins with LLMs
Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang · Feb 20, 2026 · Citations: 0
Simulation Env
Recent work has explored LLM-based agentic frameworks to automatically construct mechanistic models from data; however, existing problem settings substantially oversimplify real-world conditions, leaving it unclear whether LLM-generated…
- Neural Synchrony Between Socially Interacting Language Models
Zhining Zhang, Wentao Zhu, Chi Han, Yizhou Wang, Heng Ji · Feb 19, 2026 · Citations: 0
Simulation Env
Neuroscience has uncovered a fundamental mechanism of our social nature: human brain activity becomes synchronized with others in many social contexts involving interaction.
- ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz · Feb 18, 2026 · Citations: 0
Simulation Env
We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap.
- SimToolReal: An Object-Centric Policy for Zero-Shot Dexterous Tool Manipulation
Kushal Kedia, Tyler Ga Wei Lum, Jeannette Bohg, C. Karen Liu · Feb 18, 2026 · Citations: 0
Simulation Env
Abstract shows limited direct human-feedback or evaluation-protocol detail; use as adjacent methodological context.
- A Comparative Analysis of Social Network Topology in Reddit and Moltbook
Yiming Zhu, Gareth Tyson, Pan Hui · Feb 14, 2026 · Citations: 0
Simulation Env
Recent advances in agent-mediated systems have enabled a new paradigm of social network simulation, where AI agents interact with human-like autonomy.
- Transforming Science Learning Materials in the Era of Artificial Intelligence
Xiaoming Zhai, Kent Crippen · Feb 8, 2026 · Citations: 0
Simulation Env
However, these innovations also raise critical ethical and pedagogical concerns, including issues of algorithmic bias, data privacy, transparency, and the need for human oversight.
- How Well Can LLM Agents Simulate End-User Security and Privacy Attitudes and Behaviors?
Yuxuan Li, Leyang Li, Hao-Ping Lee, Sauvik Das · Feb 6, 2026 · Citations: 0
Simulation Env
A growing body of research assumes that large language model (LLM) agents can serve as proxies for how people form attitudes toward and behave in response to security and privacy (S&P) threats.
- DCoPilot: Generative AI-Empowered Policy Adaptation for Dynamic Data Center Operations
Minghao Li, Ruihang Wang, Rui Tan, Yonggang Wen · Feb 2, 2026 · Citations: 0
Simulation Env
However, manually designing piecewise deep reinforcement learning (DRL) agents cannot keep pace with frequent dynamics shifts and service-level agreement (SLA) changes of an evolving DC.