
AgentHER: Hindsight Experience Replay for LLM Agent Trajectory Relabeling

Liang Ding · Mar 22, 2026 · Citations: 0

Data freshness

  • Extraction: Fresh
  • Metadata refreshed: Apr 7, 2026, 2:54 PM (Recent)
  • Extraction refreshed: Apr 10, 2026, 5:08 AM (Fresh)
  • Extraction source: Persisted extraction
  • Confidence: 0.80

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience. We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation. The key insight is simple: a trajectory that fails goal A is often a correct demonstration for some achievable alternative goal B. AgentHER realises this idea through a four-stage pipeline -- failure classification, outcome extraction, LLM-guided prompt relabeling with confidence gating, and data packaging -- that converts discarded failures into high-quality SFT, DPO, and ShareGPT training data, with both zero-cost rule-based and LLM-judge implementations. On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations. Gains are consistent from 1.5B to 72B parameters (+5.8-9.2 pp) and compound under iterative redeployment (+2.1 pp over additional rounds). Human evaluation confirms 97.7% relabeling precision under multi-judge verification.
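The abstract's four-stage pipeline can be sketched in miniature. Everything below -- the `Trajectory` shape, the helper names, and the 0.8 confidence gate -- is an illustrative assumption for exposition, not the paper's actual implementation:

```python
# Illustrative sketch of HER-style trajectory relabeling for LLM agents.
# All names, fields, and the 0.8 gate are assumptions, not AgentHER's code.
from dataclasses import dataclass

@dataclass
class Trajectory:
    goal: str            # the goal the agent was originally given
    steps: list[str]     # the action/observation log
    success: bool        # did the agent achieve the original goal?

def extract_achieved_outcome(traj: Trajectory) -> str:
    """Stage 2: infer what the trajectory actually accomplished.

    A rule-based version might read the final environment state; an
    LLM-judge version would prompt a model to describe the outcome.
    """
    return f"reach the state after: {traj.steps[-1]}"

def relabel(traj: Trajectory, confidence: float, gate: float = 0.8):
    """Stages 1, 3, 4: classify, relabel with confidence gating, package.

    A failed trajectory for goal A becomes a *successful* demonstration
    for the goal B it actually achieved -- the core HER insight.
    """
    if traj.success:
        return traj                # successes pass through unchanged
    if confidence < gate:          # stage 3: drop low-confidence relabels
        return None
    new_goal = extract_achieved_outcome(traj)
    return Trajectory(goal=new_goal, steps=traj.steps, success=True)

failed = Trajectory(goal="book a flight",
                    steps=["open site", "search hotels"], success=False)
demo = relabel(failed, confidence=0.95)   # now a valid demo for a new goal
```

In a real pipeline the relabeled demonstrations would then be serialized into SFT, DPO, or ShareGPT formats; the gating step is what the abstract's 97.7% relabeling precision figure is verifying.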

HFEPX Relevance Assessment

This paper has strong direct human-feedback and evaluation protocol signal and is suitable as a primary eval pipeline reference.

Best use

Primary benchmark and eval reference

Use if you need

A benchmark-and-metrics comparison anchor.

Main weakness

No major weakness surfaced.

Trust level

High

Eval-Fit Score

80/100 • High

Use this as a primary source when designing or comparing eval protocols.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows extraction state, confidence band, and data source so you can decide whether to trust it directly or validate from full text.
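As a rough sketch of what one such provenance record might look like (all field names and enum values here are assumptions for illustration, not the explorer's actual schema):

```python
# Hypothetical shape of one field-provenance record on this page.
# Names and enums are assumptions for illustration only.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass(frozen=True)
class FieldProvenance:
    name: str                                    # e.g. "Human Feedback Types"
    state: Literal["strong", "weak", "missing"]  # extraction state
    confidence: Literal["High", "Medium", "Low"] # confidence band
    source: str                                  # e.g. "Persisted extraction"
    evidence: Optional[str] = None               # supporting snippet, if any

    def trust_directly(self) -> bool:
        """Trust the field as-is only when evidence backs a confident extraction."""
        return (self.state == "strong"
                and self.confidence == "High"
                and self.evidence is not None)

qc = FieldProvenance("Quality Controls", "missing", "Low",
                     "Persisted extraction")   # no evidence -> validate manually
```

The `trust_directly` rule mirrors the page's guidance: strong, high-confidence, evidenced fields can be used for triage directly; anything else should be validated against the full text.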

Human Feedback Types

strong

Demonstrations

Confidence: High · Source: Persisted extraction (evidenced)

Directly usable for protocol triage.

Evidence snippet: On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching baseline performance with only 50% of successful demonstrations.

Evaluation Modes

strong

Human Eval, LLM-as-Judge, Simulation Env

Confidence: High · Source: Persisted extraction (evidenced)

Includes extracted eval setup.

Evidence snippet: LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience.

Quality Controls

missing

Not reported

Confidence: Low · Source: Persisted extraction (missing)

No explicit QC controls found.

Benchmarks / Datasets

strong

WebArena, ToolBench

Confidence: High · Source: Persisted extraction (evidenced)

Useful for quick benchmark comparison.

Evidence snippet: LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience.

Reported Metrics

strong

Precision, Pass@1, Cost

Confidence: High · Source: Persisted extraction (evidenced)

Useful for evaluation criteria comparison.

Evidence snippet: LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely discarded, wasting the dominant source of collected experience.

Rater Population

missing

Unknown

Confidence: Low · Source: Persisted extraction (missing)

Rater source not explicitly reported.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Demonstrations
  • Rater population: Unknown
  • Unit of annotation: Trajectory
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Human Eval, LLM-as-Judge, Simulation Env
  • Agentic eval: Long Horizon, Web Browsing
  • Quality controls: Not reported
  • Confidence: 0.80
  • Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

WebArena, ToolBench

Reported Metrics

precision, pass@1, cost

Research Brief

Deterministic synthesis

LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely… HFEPX signals include Demonstrations, Human Eval, LLM-as-Judge with confidence 0.80. Updated from current HFEPX corpus.

Generated Apr 10, 2026, 5:08 AM · Grounded in abstract + metadata only

Key Takeaways

  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to…

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: WebArena, ToolBench.
  • Validate metric comparability (precision, pass@1, cost).
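When validating metric comparability, note that reported pass@1 numbers are often computed with the unbiased combinatorial estimator popularized by the Codex evaluation (Chen et al., 2021): pass@k = 1 - C(n-c, k) / C(n, k) for n sampled attempts with c successes. A minimal sketch (the example numbers are illustrative, not taken from the paper):

```python
# Unbiased pass@k estimator over n sampled attempts with c successes
# (the combinatorial form from the Codex evaluation, Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a size-k draw (without replacement) from n
    attempts, of which c succeeded, contains at least one success."""
    if n - c < k:
        return 1.0   # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# pass@1 reduces to the plain success rate c / n:
rate = pass_at_k(n=20, c=11, k=1)   # 0.55
```

Papers that instead report a single greedy-decoding run as "pass@1" are not directly comparable to sampled estimates, which is exactly the kind of mismatch this checklist item is meant to catch.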

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.
  • On WebArena (Zhou et al., 2024) and ToolBench (Qin et al., 2024), AgentHER improves over success-only SFT by +7.1-11.7 pp across four model families (GPT-4o, Qwen2.5-72B/7B, LLaMA-3.1-8B), while achieving 2x data efficiency -- matching…

Why It Matters For Eval

  • LLM agents fail on the majority of real-world tasks -- GPT-4o succeeds on fewer than 15% of WebArena navigation tasks and below 55% pass@1 on ToolBench (Zhou et al., 2024; Qin et al., 2024) -- yet every failed trajectory is routinely…
  • We introduce AgentHER, a framework that recovers this lost training signal by adapting the Hindsight Experience Replay (HER; Andrychowicz et al., 2017) principle to natural-language agent trajectories for offline data augmentation.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Demonstrations

  • Pass: Evaluation mode is explicit

    Detected: Human Eval, LLM-as-Judge, Simulation Env

  • Gap: Quality control reporting appears to be missing

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: WebArena, ToolBench

  • Pass: Metric reporting is present

    Detected: precision, pass@1, cost
