
Measuring AI Ability to Complete Long Software Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan · Mar 18, 2025 · Citations: 0

Data freshness

  • Extraction: Fresh
  • Metadata refreshed: Feb 25, 2026, 2:20 AM (Stale)
  • Extraction refreshed: Apr 13, 2026, 6:32 AM (Fresh)
  • Extraction source: Persisted extraction
  • Confidence: 0.80

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
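The 50% time horizon is easiest to see with a worked example: fit a curve of model success probability against the (log) time humans need for each task, then read off the task length where predicted success crosses 50%. The sketch below does this with a logistic fit over hypothetical data and applies the abstract's roughly seven-month doubling time as a naive five-year extrapolation; it is an illustration of the metric's definition, not the paper's estimation code.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical data: how long each task takes a human (minutes) and whether
    # the AI model completed it (1 = success, 0 = failure).
    human_minutes = np.array([2, 5, 10, 20, 45, 90, 180, 360, 720, 1440], dtype=float)
    model_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

    def logistic(log2_minutes, midpoint, slope):
        # P(success) as a function of log2(human minutes).
        return 1.0 / (1.0 + np.exp(-slope * (log2_minutes - midpoint)))

    x = np.log2(human_minutes)
    (midpoint, slope), _ = curve_fit(logistic, x, model_success, p0=[np.log2(60.0), -1.0])

    # The 50% time horizon is the task length at which the fitted curve crosses 0.5,
    # i.e. the logistic midpoint converted back from log2-minutes.
    horizon_minutes = 2.0 ** midpoint
    print(f"50% time horizon: {horizon_minutes:.0f} minutes")

    # Naive extrapolation of the ~7-month doubling time over 5 years (60 months).
    projected = horizon_minutes * 2.0 ** (60.0 / 7.0)
    print(f"Projected horizon after 5 years: {projected / 60.0:.0f} hours of human work")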

HFEPX Relevance Assessment

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding implementation strategy.

  • Best use: Secondary protocol comparison source
  • Use if you need: A benchmark-and-metrics comparison anchor.
  • Main weakness: No major weakness surfaced.
  • Trust level: High
  • Eval-Fit Score: 65/100 • Medium (useful as a secondary reference; validate protocol details against neighboring papers)
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: Moderate-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows its extraction state, confidence band, and data source so you can decide whether to trust it directly or validate it against the full text.
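As a rough illustration of how these per-field signals could drive a trust-or-validate decision, the sketch below encodes one possible rule: trust a field outright only when it is evidenced with high confidence, and route everything else to full-text validation. The record layout and function names are hypothetical, not part of the hub's data model.

    from dataclasses import dataclass

    @dataclass
    class FieldProvenance:
        # Hypothetical record mirroring the per-field display: extraction state,
        # confidence band, and data source.
        name: str
        state: str        # "strong" (evidenced) or "missing"
        confidence: str   # "High", "Medium", or "Low"
        source: str       # e.g. "Persisted extraction"

    def triage(field: FieldProvenance) -> str:
        # Return a rough handling decision for one protocol field.
        if field.state == "missing":
            return "treat as unreported; check the full text"
        if field.state == "strong" and field.confidence == "High":
            return "trust directly"
        return "use cautiously; validate against the full text"

    # Two of the fields listed below, expressed as records.
    for f in (
        FieldProvenance("Human Feedback Types", "strong", "High", "Persisted extraction"),
        FieldProvenance("Quality Controls", "missing", "Low", "Persisted extraction"),
    ):
        print(f"{f.name}: {triage(f)}")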

Human Feedback Types

  • State: strong · Value: Expert Verification
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Directly usable for protocol triage.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Evaluation Modes

  • State: strong · Value: Automatic Metrics
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Includes extracted eval setup.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Quality Controls

  • State: missing · Value: Not reported
  • Confidence: Low · Source: Persisted extraction (missing)
  • Note: No explicit QC controls found.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Benchmarks / Datasets

  • State: strong · Value: RE-Bench
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Useful for quick benchmark comparison.
  • Evidence snippet: We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks.

Reported Metrics

  • State: strong · Value: Success rate
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Useful for evaluation criteria comparison.
  • Evidence snippet: This is the time humans typically take to complete tasks that AI models can complete with 50% success rate.

Rater Population

  • State: strong · Value: Domain Experts
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Helpful for staffing comparability.
  • Evidence snippet: We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Expert Verification
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Tool Use
  • Quality controls: Not reported
  • Confidence: 0.80
  • Flags: None
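Read together, the Human Data Lens and Evaluation Lens above amount to one flat protocol record. A minimal sketch, assuming hypothetical key names, shows how such a record could be collected and its unresolved entries flagged for full-text follow-up.

    # Hypothetical flat record combining the two lens blocks above.
    lens = {
        "uses_human_feedback": "Yes",
        "feedback_types": "Expert Verification",
        "rater_population": "Domain Experts",
        "unit_of_annotation": "Unknown",
        "expertise_required": "General",
        "evaluation_modes": "Automatic Metrics",
        "agentic_eval": "Tool Use",
        "quality_controls": "Not reported",
        "extraction_confidence": 0.80,
    }

    # Anything still Unknown or Not reported should be checked against the paper itself.
    needs_followup = [key for key, value in lens.items() if value in ("Unknown", "Not reported")]
    print("Validate from full text:", needs_followup)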

Protocol And Measurement Signals

  • Benchmarks / Datasets: RE-Bench
  • Reported Metrics: success rate

Research Brief

Deterministic synthesis

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. HFEPX signals include Expert Verification, Automatic Metrics, and Tool Use, with confidence 0.80. Updated from the current HFEPX corpus.

Generated Apr 13, 2026, 6:32 AM · Grounded in abstract + metadata only

Key Takeaways

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: RE-Bench.
  • Validate metric comparability (success rate).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.
  • This is the time humans typically take to complete tasks that AI models can complete with 50% success rate.

Why It Matters For Eval

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears absent

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: RE-Bench

  • Pass: Metric reporting is present

    Detected: success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
