- AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling
Liang Ding · Mar 22, 2026 · Citations: 0
Demonstrations Human EvalLlm As Judge
LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
- PubMed Reasoner: Dynamic Reasoning-based Retrieval for Evidence-Grounded Biomedical Question Answering
Yiqing Zhang, Xiaozhong Liu, Fabricio Murai · Mar 28, 2026 · Citations: 0
Expert Verification Llm As JudgeAutomatic Metrics
In this context, we introduce PubMed Reasoner, a biomedical QA agent composed of three stages: self-critic query refinement evaluates MeSH terms for coverage, alignment, and redundancy to enhance PubMed queries based on partial (metadata)…
- When Users Change Their Mind: Evaluating Interruptible Agents in Long-Horizon Web Navigation
Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou · Apr 1, 2026 · Citations: 0
Critique Edit Simulation Env
As LLM agents transition from short, static problem solving to executing complex, long-horizon tasks in dynamic environments, the ability to handle user interruptions, such as adding requirement or revising goals, during mid-task execution…
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners
Harman Singh, Xiuyu Li, Kusha Sareen, Monishwaran Maheswaran, Sijun Tan · Mar 4, 2026 · Citations: 0
Pairwise Preference Automatic Metrics
On code generation (LiveCodeBench, CodeContests, SWE-Bench) and math reasoning (AIME, HMMT) benchmarks, V_1-Infer improves Pass@1 by up to 10% over pointwise verification and outperforms recent test-time scaling methods while being…
- How Reliable is Language Model Micro-Benchmarking?
Gregory Yauney, Shahzaib Saqib Warraich, Swabha Swayamdipta · Oct 9, 2025 · Citations: 0
Pairwise Preference Automatic Metrics
We introduce a meta-evaluation measure for micro-benchmarking which investigates how well a micro-benchmark can rank two models as a function of their performance difference on the full benchmark.
- Go-Browse: Training Web Agents with Structured Exploration
Apurva Gandhi, Graham Neubig · Jun 4, 2025 · Citations: 0
Simulation Env
To address this, we propose Go-Browse, a method for automatically collecting diverse and realistic web agent data at scale through structured exploration of web environments.
- KLong: Training LLM Agent for Extremely Long-horizon Tasks
Yue Liu, Yingwei Ma, Yibo Miao, Yanhao Li, Yuchong Xie · Feb 19, 2026 · Citations: 0
Rubric Rating
Then, we introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics.
- Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents
Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu · Feb 15, 2026 · Citations: 0
Simulation Env
The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…
- Learning When to Sample: Confidence-Aware Self-Consistency for Efficient LLM Chain-of-Thought Reasoning
Juming Xiong, Kevin Guo, Congning Ni, Chao Yan, Katherine Brown · Mar 9, 2026 · Citations: 0
Automatic Metrics
Recent self-consistency-based approaches further improve accuracy but require sampling and aggregating multiple reasoning trajectories, leading to substantial additional computational overhead.
- SWE-Protégé: Learning to Selectively Collaborate With an Expert Unlocks Small Language Models as Software Engineering Agents
Patrick Tser Jern Kon, Archana Pradeep, Ang Chen, Alexander P. Ellis, Warren Hunt · Feb 25, 2026 · Citations: 0
Automatic Metrics
Our approach combines supervised fine-tuning on expert-augmented trajectories with agentic reinforcement learning that explicitly discourages degenerative looping and unproductive expert collaboration.
- D-COT: Disciplined Chain-of-Thought Learning for Efficient Reasoning in Small Language Models
Shunsuke Ubukata · Feb 25, 2026 · Citations: 0
Automatic Metrics
In this study, we propose Disciplined Chain-of-Thought (D-CoT), a novel framework that enforces a structured reasoning process using control tags -- such as <TEMP_LOW> for fact-checking and <TEMP_HIGH> for multi-perspective exploration --…
- Structurally Aligned Subtask-Level Memory for Software Engineering Agents
Kangning Shen, Jingyuan Zhang, Chenxi Sun, Wencong Zeng, Yang Yue · Feb 25, 2026 · Citations: 0
Automatic Metrics
Large Language Models (LLMs) have demonstrated significant potential as autonomous software engineering (SWE) agents.
- R-WoM: Retrieval-augmented World Model For Computer-use Agents
Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee · Oct 13, 2025 · Citations: 0
Simulation Env
Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration.
- Cross-Context Verification: Hierarchical Detection of Benchmark Contamination through Session-Isolated Analysis
Tae-Eun Song · Mar 23, 2026 · Citations: 0
Automatic Metrics
LLM coding benchmarks face a credibility crisis: widespread solution leakage and test quality issues undermine SWE-bench Verified, while existing detection methods--paraphrase consistency, n-gram overlap, perplexity analysis--never directly…
- Confidence-Driven Multi-Scale Model Selection for Cost-Efficient Inference
Bo-Wei Chen, Chung-Chi Chen, An-Zi Yen · Feb 25, 2026 · Citations: 0
Automatic Metrics
Experiments on the Massive Multitask Language Understanding (MMLU) benchmark show that our approach achieves accuracy comparable to the largest model while reducing computational costs by 20\% to 40\%.
- Inducing Epistemological Humility in Large Language Models: A Targeted SFT Approach to Reducing Hallucination
Cem Uluoglakci, Tugba Taskaya Temizel · Mar 18, 2026 · Citations: 0
Pairwise Preference
We also release HypoTermQA-Enhanced, a benchmark for hallucination tendency strengthened through multiple validations.
- Long Grounded Thoughts: Synthesizing Visual Problems and Reasoning Chains at Scale
David Acuna, Chao-Han Huck Yang, Yuntian Deng, Jaehun Jung, Ximing Lu · Nov 7, 2025 · Citations: 0
Pairwise Preference
We introduce a framework able to synthesize vision-centric problems spanning diverse levels of complexity, and the resulting dataset with over 1M high-quality problems including: reasoning traces, preference data, and instruction prompts…
- Modeling LLM Unlearning as an Asymmetric Two-Task Learning Problem
Zeguan Xiao, Siqing Li, Yong Wang, Xuetao Wei, Jian Yang · Apr 16, 2026 · Citations: 0
- WebXSkill: Skill Learning for Autonomous Web Agents
Zhaoyang Wang, Qianhui Wu, Xuchao Zhang, Chaoyun Zhang, Wenlin Yao · Apr 14, 2026 · Citations: 0
- Hidden Measurement Error in LLM Pipelines Distorts Annotation, Evaluation, and Benchmarking
Solomon Messing · Apr 13, 2026 · Citations: 0
- SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions
Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel · Apr 9, 2026 · Citations: 0
- Dead Weights, Live Signals: Feedforward Graphs of Frozen Language Models
Marcus Armstrong, Navid Ayoobi, Arjun Mukherjee · Apr 9, 2026 · Citations: 0
- Activation Steering for Aligned Open-ended Generation without Sacrificing Coherence
Niklas Herbster, Martin Zborowski, Alberto Tosato, Gauthier Gidel, Tommaso Tosato · Apr 9, 2026 · Citations: 0
- Sensitivity-Positional Co-Localization in GQA Transformers
Manoj Chandrashekar Rao · Apr 9, 2026 · Citations: 0
- Symbiotic-MoE: Unlocking the Synergy between Generation and Understanding
Xiangyue Liu, Zijian Zhang, Miles Yang, Zhao Zhong, Liefeng Bo · Apr 9, 2026 · Citations: 0
- Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi · Apr 1, 2026 · Citations: 0
- Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana · Mar 26, 2026 · Citations: 0
- Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Michael Hardy, Joshua Gilbert, Benjamin Domingue · Mar 26, 2026 · Citations: 0
- Lie to Me: How Faithful Is Chain-of-Thought Reasoning in Reasoning Models?
Richard J. Young · Mar 23, 2026 · Citations: 0
- AdaRubric: Task-Adaptive Rubrics for LLM Agent Evaluation
Liang Ding · Mar 22, 2026 · Citations: 0
- FailureMem: A Failure-Aware Multimodal Framework for Autonomous Software Repair
Ruize Ma, Yilei Jiang, Shilin Zhang, Zheng Ma, Yi Feng · Mar 18, 2026 · Citations: 0
- Are Large Language Models Truly Smarter Than Humans?
Eshwar Reddy M, Sourav Karmakar · Mar 17, 2026 · Citations: 0
- daVinci-Env: Open SWE Environment Synthesis at Scale
Dayuan Fu, Shenyu Wu, Yunze Wu, Zerui Peng, Yaxing Huang · Mar 13, 2026 · Citations: 0
- AI Planning Framework for LLM-Based Web Agents
Orit Shahnovsky, Rotem Dror · Mar 13, 2026 · Citations: 0
- NeuroLoRA: Context-Aware Neuromodulation for Parameter-Efficient Multi-Task Adaptation
Yuxin Yang, Haoran Zhang, Mingxuan Li, Jiachen Xu, Ruoxi Shen · Mar 12, 2026 · Citations: 0
- In-Context Environments Induce Evaluation-Awareness in Language Models
Maheep Chaudhary · Mar 4, 2026 · Citations: 0
- Qwen3-Coder-Next Technical Report
Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng · Feb 28, 2026 · Citations: 0
- SWE-rebench V2: Language-Agnostic SWE Task Collection at Scale
Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Alexander Golubev · Feb 27, 2026 · Citations: 0
- Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu · Feb 8, 2026 · Citations: 0
- WebArbiter: A Principle-Guided Reasoning Process Reward Model for Web Agents
Yao Zhang, Shijie Tang, Zeyu Li, Zhen Han, Volker Tresp · Jan 29, 2026 · Citations: 0
- LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
Badri N. Patro, Vijay S. Agneeswaran · Jan 20, 2026 · Citations: 0
- Latent Self-Consistency for Reliable Majority-Set Selection in Short- and Long-Answer Reasoning
Jungsuk Oh, Jay-Yoon Lee · Aug 25, 2025 · Citations: 0
- Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents
Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune · May 29, 2025 · Citations: 0
- Beyond Final Code: A Process-Oriented Error Analysis of Software Development Agents in Real-World GitHub Scenarios
Zhi Chen, Wei Ma, Lingxiao Jiang · Mar 16, 2025 · Citations: 0