- CounselReflect: A Toolkit for Auditing Mental-Health Dialogues
Yahan Li, Chaohao Du, Zeyang Li, Christopher Chun Kuizon, Shupeng Cheng · Mar 31, 2026 · Citations: 0
Rubric Rating Expert Verification Human Eval Web Browsing
The system integrates two families of evaluation signals: (i) 12 model-based metrics produced by task-specific predictors, and (ii) rubric-based metrics that extend coverage via a literature-derived library (69 metrics) and user-defined…
- PRBench: End-to-end Paper Reproduction in Physics Research
Shi Qiu, Junyi Deng, Yiwei Deng, Haoran Dong, Jieyu Fu · Mar 29, 2026 · Citations: 0
Rubric Rating Expert Verification Automatic Metrics Simulation Env
We introduce PRBench, a benchmark of 30 expert-curated tasks spanning 11 subfields of physics.
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0
Critique Edit Simulation Env Long Horizon
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirements or revising goals, during mid-task execution…
- Paper Reconstruction Evaluation: Evaluating Presentation and Hallucination in AI-written Papers
Atsuyuki Miyai, Mashiro Toyooka, Zaiying Zhao, Kenta Watanabe, Toshihiko Yamasaki · Apr 1, 2026 · Citations: 0
Rubric Rating Automatic Metrics
We introduce Paper Reconstruction Evaluation (PaperRecon), an evaluation framework in which an overview (overview.md) is created from an existing paper, after which an agent generates a full paper based on the overview and minimal…
- Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang · Apr 1, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
We study whether phone-use agents respect privacy while completing benign mobile tasks.
- LUDOBENCH: Evaluating LLM Behavioural Decision-Making Through Spot-Based Board Game Scenarios in Ludo
Ojas Jain, Dhruv Kumar · Apr 7, 2026 · Citations: 0
Simulation Env Multi Agent
We introduce LudoBench, a benchmark for evaluating LLM strategic reasoning in Ludo, a stochastic multi-agent board game whose dice mechanics, piece capture, safe-square navigation, and home-path progression introduce meaningful planning…
- Dynamically Acquiring Text Content to Enable the Classification of Lesser-known Entities for Real-world Tasks
Fahmida Alam, Ellen Riloff · Apr 24, 2026 · Citations: 0
Expert Verification Automatic Metrics
We propose a novel text acquisition method that leverages both web and large language models (LLMs).
- QED-Nano: Teaching a Tiny Model to Prove Hard Theorems
LM-Provers, Yuxiao Qu, Amrith Setlur, Jasper Dekoninck, Edward Beeching · Apr 6, 2026 · Citations: 0
Rubric Rating Automatic Metrics
To support further research on open mathematical reasoning, we release the full QED-Nano pipeline, including the QED-Nano and QED-Nano-SFT models, the FineProofs-SFT and FineProofs-RL datasets, and the training and evaluation code.
- The Ultimate Tutorial for AI-driven Scale Development in Generative Psychometrics: Releasing AIGENIE from its Bottle
Lara Russell-Lasalandra, Hudson Golino, Luis Eduardo Garrido, Alexander P. Christensen · Mar 30, 2026 · Citations: 0
Critique Edit Tool Use
Psychological scale development has traditionally required extensive expert involvement, iterative revision, and large-scale pilot testing before psychometric evaluation can begin.
- Navigating Large-Scale Document Collections: MuDABench for Multi-Document Analytical QA
Zhanli Li, Yixuan Cao, Lvzhou Luo, Ping Luo · Apr 24, 2026 · Citations: 0
Automatic Metrics Multi Agent
We present MuDABench, a benchmark for multi-document analytical QA, where questions require extracting and synthesizing information across numerous documents to perform quantitative analysis.
- How Large Language Models Balance Internal Knowledge with User and Document Assertions
Shuowei Li, Haoxin Li, Wenda Chu, Yi Fang · Apr 24, 2026 · Citations: 0
Pairwise Preference Simulation Env
A model's ability to reliably process these sources is key to system safety.
- S0 Tuning: Zero-Overhead Adaptation of Hybrid Recurrent-Attention Models
Jack Young · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Using roughly 48 execution-verified HumanEval training solutions, tuning a single initial state matrix per recurrent layer, with zero inference overhead, outperforms LoRA by +10.8 pp (p < 0.001) on HumanEval.
- QDTraj: Exploration of Diverse Trajectory Primitives for Articulated Objects Robotic Manipulation
Mathilde Kappel, Mahdi Khoramshahi, Louis Annabi, Faiz Ben Amar, Stéphane Doncieux · Apr 24, 2026 · Citations: 0
Simulation Env Long Horizon
To do so, we propose a method based on Quality-Diversity algorithms that leverages sparse reward exploration in order to generate a set of diverse and high-performing trajectory primitives for a given manipulation task.
- Training-Free Dynamic Upcycling of Expert Language Models
Eros Fanì, Oğuzhan Ersoy · Mar 31, 2026 · Citations: 0
Expert Verification
To address these issues, we introduce Dynamic Upcycling MoE (DUME), a novel approach that reuses dense experts trained on different domains to construct a unified MoE model.
- SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He · Apr 24, 2026 · Citations: 0
Red Team Automatic Metrics
To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial…
- Courtroom-Style Multi-Agent Debate with Progressive RAG and Role-Switching for Controversial Claim Verification
Masnun Nuha Chowdhury, Nusrat Jahan Beg, Umme Hunny Khan, Syed Rifat Raiyan, Md Kamrul Hasan · Mar 30, 2026 · Citations: 0
Automatic Metrics Multi Agent
We propose a courtroom-style multi-agent framework, PROClaim, that reformulates verification as a structured, adversarial deliberation.
- SkillX: Automatically Constructing Skill Knowledge Bases for Agents
Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang · Apr 6, 2026 · Citations: 0
Automatic Metrics Long Horizon
Learning from experience is critical for building capable large language model (LLM) agents, yet prevailing self-evolving paradigms remain inefficient: agents learn in isolation, repeatedly rediscover similar behaviors from limited…
- Agent Q-Mix: Selecting the Right Action for LLM Multi-Agent Systems through Reinforcement Learning
Eric Hanchen Jiang, Levina Li, Rui Sun, Xiao Liang, Yubei Li · Apr 1, 2026 · Citations: 0
Automatic Metrics Multi Agent
In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem.
- Learning to Predict Future-Aligned Research Proposals with Language Models
Heng Wang, Pengcheng Jiang, Jiashuo Sun, Zhiyi Shi, Haofei Yu · Mar 28, 2026 · Citations: 0
Human Eval Automatic Metrics
Across Llama-3.1 and Qwen2.5 models, future-aligned tuning improves future alignment over unaligned baselines (up to +10.6% overall FAS), and domain-expert human evaluation corroborates improved proposal quality.
- Paper Circle: An Open-source Multi-agent Research Discovery and Analysis Framework
Komal Kumar, Aman Chadha, Salman Khan, Fahad Shahbaz Khan, Hisham Cholakkal · Apr 7, 2026 · Citations: 0
Automatic Metrics Multi Agent
Recent advances in multi-agent large language models (LLMs) have demonstrated strong potential for understanding user intent and are being trained to utilize various tools.
- Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory:…
- TRIMS: Trajectory-Ranked Instruction Masked Supervision for Diffusion Language Models
Lingjie Chen, Ruizhong Qiu, Yuyu Fan, Yanjun Zhao, Hanghang Tong · Apr 1, 2026 · Citations: 0
Automatic Metrics Long Horizon
Experiments on LLaDA and Dream across math and coding benchmarks show that TRIMS significantly improves the accuracy-parallelism trade-off over both standard MDLM training and train-free acceleration baselines, while achieving competitive…
- Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
Michael Cuccarese · Apr 7, 2026 · Citations: 0
Demonstrations
This paper presents epistemic blinding in the context of an agentic system that uses large language models to reason across multiple biological datasets for drug target prioritization.
- The Detection-Extraction Gap: Models Know the Answer Before They Can Say It
Hanyang Wang, Mingxuan Zhu · Apr 8, 2026 · Citations: 0
Automatic Metrics Tool Use
Across five model configurations, two families, and three benchmarks, we find that 52–88% of chain-of-thought tokens are produced after the answer is recoverable from a partial prefix.
- AgentGL: Towards Agentic Graph Learning with LLMs via Reinforcement Learning
Yuanfu Sun, Kang Li, Dongzhe Fan, Jiajin Liu, Qiaoyu Tan · Apr 7, 2026 · Citations: 0
Automatic Metrics Tool Use
To bridge this gap, we introduce Agentic Graph Learning (AGL), a paradigm that reframes graph learning as an interleaved process of topology-aware navigation and LLM-based inference.
- From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu · Apr 2, 2026 · Citations: 0
Automatic Metrics Web Browsing
Extensive evaluations across 1.5B–14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance.
- Oblivion: Self-Adaptive Agentic Memory Control through Decay-Driven Activation
Ashish Rana, Chia-Chien Hung, Qumeng Sun, Julian Martin Kunkel, Carolin Lawrence · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
Human memory adapts through selective forgetting: experiences become less accessible over time but can be reactivated by reinforcement or contextual cues.
- Hierarchical Chain-of-Thought Prompting: Enhancing LLM Reasoning Performance and Efficiency
Xingshuai Huang, Derek Li, Bahareh Nikpour, Parsa Omidi · Mar 31, 2026 · Citations: 0
Automatic Metrics Long Horizon
Extensive evaluations across diverse LLMs and mathematical reasoning benchmarks show that Hi-CoT consistently improves average accuracy by 6.2% (up to 61.4% on certain models and tasks) while reducing reasoning trace length by 13.9%…
- SSG: Logit-Balanced Vocabulary Partitioning for LLM Watermarking
Chenxi Gu, Xiaoning Du, John Grundy · Apr 24, 2026 · Citations: 0
Pairwise Preference
A crucial step in the KGW method is random vocabulary partitioning, which enables adjustments to token selection based on specific preferences.
- Removing Sandbagging in LLMs by Training with Weak Supervision
Emil Ryd, Henning Bartsch, Julian Stastny, Joe Benton, Vivek Hebbar · Apr 23, 2026 · Citations: 0
Demonstrations
As AI systems begin to automate complex tasks, supervision increasingly relies on weaker models or limited human oversight that cannot fully verify output quality.
- From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection
Hongxu Zhou · Apr 7, 2026 · Citations: 0
Critique Edit
While structured feedback can mitigate this issue, existing approaches often rely on externally trained critics or symbolic tools, reducing agent autonomy.
- The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Andrew Ang, Nazym Azimbayev, Andrey Kim · Apr 2, 2026 · Citations: 0
Critique Edit
Agentic AI shifts the investor's role from analytical execution to oversight.
- Revision or Re-Solving? Decomposing Second-Pass Gains in Multi-LLM Pipelines
Jingjie Ning, Xueqi Li, Chengyu Yu · Apr 1, 2026 · Citations: 0
Critique Edit
We evaluate this design across two model pairs on three benchmarks spanning knowledge-intensive MCQ and competitive programming.
- Understanding Teacher Revisions of Large Language Model-Generated Feedback
Conrad Borchers, Luiz Rodrigues, Newarney Torrezão da Costa, Cleon Xavier, Rafael Ferreira Mello · Mar 29, 2026 · Citations: 0
Critique Edit Rlaif Or Synthetic Feedback
First, we find that teachers accept AI feedback without modification in about 80% of cases, while edited feedback tends to be significantly longer and subsequently shortened by teachers.