- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
- Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We study whether phone-use agents respect privacy while completing benign mobile tasks.
- CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks
Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu · Mar 19, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly…
- LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0
Simulation Env Multi Agent
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
- Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li · Apr 1, 2026 · Citations: 0
Automatic Metrics Multi Agent
In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
- LRC-WeatherNet: LiDAR, RADAR, and Camera Fusion Network for Real-time Weather-type Classification in Autonomous Driving
Nour Alhuda Albashir, Lars Pernickel, Danial Hamoud, Idriss Gouigah, Eren Erdal Aksoy · Mar 23, 2026 · Citations: 0
Automatic Metrics Web Browsing
Autonomous vehicles face major perception and navigation challenges in adverse weather such as rain, fog, and snow, which degrade the performance of LiDAR, RADAR, and RGB camera sensors.
- Effective Strategies for Asynchronous Software Engineering Agents
Jiayi Geng, Graham Neubig · Mar 23, 2026 · Citations: 0
Automatic Metrics Long Horizon
Inspired by these collaboration primitives, we introduce Centralized Asynchronous Isolated Delegation (CAID), a structured multi-agent coordination paradigm grounded in three core SWE primitives: centralized task delegation, asynchronous…
- Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song · Mar 23, 2026 · Citations: 0
Automatic Metrics Multi Agent
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…