
Efficient Agent Training for Computer Use

Yanheng He, Jiahe Jin, Pengfei Liu · May 20, 2025 · Citations: 0

Abstract

Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations. Starting with just 312 human-annotated computer use trajectories, we further augment them by synthesizing diverse alternative action decisions with Claude 3.7 Sonnet. Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released. By integrating robust human computer use skills with automated AI data synthesis capabilities, our method not only brought substantial improvements over training on human trajectories alone, but also significantly surpassed direct distillation from Claude 3.7 Sonnet. Code, data and models are available at https://github.com/GAIR-NLP/PC-Agent-E
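The augmentation step described above — enriching each state in a human demonstration with synthesized alternative action decisions — can be sketched as follows. This is a minimal illustration of the general idea only, not the paper's implementation: the `Step` dataclass and the `propose_alternatives` stub are hypothetical names, and in the actual framework the alternatives would come from querying Claude 3.7 Sonnet rather than a local placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One state-action pair from a human-annotated computer use trajectory."""
    observation: str                 # e.g. a screenshot or accessibility-tree dump
    human_action: str                # the action the human annotator actually took
    alternatives: list[str] = field(default_factory=list)  # synthesized actions

def propose_alternatives(observation: str, k: int = 3) -> list[str]:
    """Placeholder for querying a strong model (the paper uses Claude 3.7
    Sonnet) for k plausible alternative action decisions at this state."""
    return [f"alt_action_{i}<{observation}>" for i in range(k)]

def augment_trajectory(trajectory: list[Step], k: int = 3) -> list[Step]:
    """Enrich every step of a human trajectory with k synthesized alternative
    action decisions, multiplying the supervision per demonstration."""
    for step in trajectory:
        step.alternatives = propose_alternatives(step.observation, k)
    return trajectory
```

Under this framing, 312 human trajectories yield far more training signal than the raw demonstrations alone, since each step carries several candidate decisions rather than a single human action.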

HFEPX Relevance Assessment

This paper has direct human-feedback and/or evaluation protocol signal and is likely useful for eval pipeline design.

Eval-Fit Score

50/100 • Medium

Useful as a secondary reference; validate protocol details against neighboring papers.

Human Feedback Signal

Detected

Evaluation Signal

Detected

HFEPX Fit

High-confidence candidate

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Demonstrations
  • Rater population: Unknown
  • Unit of annotation: Trajectory
  • Expertise required: Coding
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: None extracted
  • Agentic eval: Long Horizon
  • Quality controls: Not reported
  • Confidence: 0.60
  • Flags: None

Protocol And Measurement Signals

Benchmarks / Datasets

WindowsAgentArena

Reported Metrics

No metric terms were extracted from the available abstract.

Research Brief

Deterministic synthesis

Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents. HFEPX signals include Demonstrations, Long Horizon with confidence 0.60. Updated from current HFEPX corpus.

Generated Mar 5, 2026, 3:20 AM · Grounded in abstract + metadata only

Key Takeaways

  • Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents.
  • We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: WindowsAgentArena.
  • Verify metric definitions before comparing against your eval pipeline.

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Scaling up high-quality trajectory data has long been a critical bottleneck for developing human-like computer use agents.
  • We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
  • Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released.

Why It Matters For Eval

  • We introduce PC Agent-E, an efficient agent training framework that significantly reduces reliance on large-scale human demonstrations.
  • Trained on these enriched trajectories, our PC Agent-E model achieved a remarkable 141% relative improvement, and even surpassed Claude 3.7 Sonnet by 10% in relative terms on WindowsAgentArena-V2, an improved benchmark we also released.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Demonstrations

  • Gap: Evaluation mode is explicit

    No clear evaluation mode extracted.

  • Gap: Quality control reporting appears

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: WindowsAgentArena

  • Gap: Metric reporting is present

    No metric terms extracted.

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
