
Measuring AI Ability to Complete Long Software Tasks

Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, Lawrence Chan · Mar 18, 2025 · Citations: 0

Data freshness

  • Extraction: Fresh
  • Metadata refreshed: Feb 25, 2026, 2:20 AM (Stale)
  • Extraction refreshed: Apr 13, 2026, 6:32 AM (Fresh)
  • Extraction source: Persisted extraction
  • Confidence: 0.80

Check recency before relying on this page for active eval decisions. Use stale pages as context and verify against current hub results.

Abstract

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
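The 50% time horizon is easiest to see with a worked example: fit a curve of model success probability against the (log) time humans need for each task, then read off the task length where predicted success crosses 50%. The sketch below does this with a logistic fit over hypothetical data and applies the abstract's roughly seven-month doubling time as a naive five-year extrapolation; it is an illustration of the metric's definition, not the paper's estimation code.

    import numpy as np
    from scipy.optimize import curve_fit

    # Hypothetical data: how long each task takes a human (minutes) and whether
    # the AI model completed it (1 = success, 0 = failure).
    human_minutes = np.array([2, 5, 10, 20, 45, 90, 180, 360, 720, 1440], dtype=float)
    model_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0], dtype=float)

    def logistic(log2_minutes, midpoint, slope):
        # P(success) as a function of log2(human minutes).
        return 1.0 / (1.0 + np.exp(-slope * (log2_minutes - midpoint)))

    x = np.log2(human_minutes)
    (midpoint, slope), _ = curve_fit(logistic, x, model_success, p0=[np.log2(60.0), -1.0])

    # The 50% time horizon is the task length at which the fitted curve crosses 0.5,
    # i.e. the logistic midpoint converted back from log2-minutes.
    horizon_minutes = 2.0 ** midpoint
    print(f"50% time horizon: {horizon_minutes:.0f} minutes")

    # Naive extrapolation of the ~7-month doubling time over 5 years (60 months).
    projected = horizon_minutes * 2.0 ** (60.0 / 7.0)
    print(f"Projected horizon after 5 years: {projected / 60.0:.0f} hours of human work")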

HFEPX Relevance Assessment

This paper has useful evaluation signal, but protocol completeness is partial; pair it with related papers before deciding implementation strategy.

  • Best use: Secondary protocol comparison source
  • Use if you need: A benchmark-and-metrics comparison anchor.
  • Main weakness: No major weakness surfaced.
  • Trust level: High
  • Eval-Fit Score: 65/100 • Medium (useful as a secondary reference; validate protocol details against neighboring papers)
  • Human Feedback Signal: Detected
  • Evaluation Signal: Detected
  • HFEPX Fit: Moderate-confidence candidate
  • Extraction confidence: High

Field Provenance & Confidence

Each key protocol field shows its extraction state, confidence band, and data source so you can decide whether to trust it directly or validate it against the full text.
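As a rough illustration of how these per-field signals could drive a trust-or-validate decision, the sketch below encodes one possible rule: trust a field outright only when it is evidenced with high confidence, and route everything else to full-text validation. The record layout and function names are hypothetical, not part of the hub's data model.

    from dataclasses import dataclass

    @dataclass
    class FieldProvenance:
        # Hypothetical record mirroring the per-field display: extraction state,
        # confidence band, and data source.
        name: str
        state: str        # "strong" (evidenced) or "missing"
        confidence: str   # "High", "Medium", or "Low"
        source: str       # e.g. "Persisted extraction"

    def triage(field: FieldProvenance) -> str:
        # Return a rough handling decision for one protocol field.
        if field.state == "missing":
            return "treat as unreported; check the full text"
        if field.state == "strong" and field.confidence == "High":
            return "trust directly"
        return "use cautiously; validate against the full text"

    # Two of the fields listed below, expressed as records.
    for f in (
        FieldProvenance("Human Feedback Types", "strong", "High", "Persisted extraction"),
        FieldProvenance("Quality Controls", "missing", "Low", "Persisted extraction"),
    ):
        print(f"{f.name}: {triage(f)}")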

Human Feedback Types

  • State: strong · Value: Expert Verification
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Directly usable for protocol triage.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Evaluation Modes

  • State: strong · Value: Automatic Metrics
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Includes extracted eval setup.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Quality Controls

  • State: missing · Value: Not reported
  • Confidence: Low · Source: Persisted extraction (missing)
  • Note: No explicit QC controls found.
  • Evidence snippet: Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.

Benchmarks / Datasets

  • State: strong · Value: RE-Bench
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Useful for quick benchmark comparison.
  • Evidence snippet: We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks.

Reported Metrics

  • State: strong · Value: Success rate
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Useful for evaluation criteria comparison.
  • Evidence snippet: This is the time humans typically take to complete tasks that AI models can complete with 50% success rate.

Rater Population

  • State: strong · Value: Domain Experts
  • Confidence: High · Source: Persisted extraction (evidenced)
  • Note: Helpful for staffing comparability.
  • Evidence snippet: We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks.

Human Data Lens

  • Uses human feedback: Yes
  • Feedback types: Expert Verification
  • Rater population: Domain Experts
  • Unit of annotation: Unknown
  • Expertise required: General
  • Extraction source: Persisted extraction

Evaluation Lens

  • Evaluation modes: Automatic Metrics
  • Agentic eval: Tool Use
  • Quality controls: Not reported
  • Confidence: 0.80
  • Flags: None
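Read together, the Human Data Lens and Evaluation Lens above amount to one flat protocol record. A minimal sketch, assuming hypothetical key names, shows how such a record could be collected and its unresolved entries flagged for full-text follow-up.

    # Hypothetical flat record combining the two lens blocks above.
    lens = {
        "uses_human_feedback": "Yes",
        "feedback_types": "Expert Verification",
        "rater_population": "Domain Experts",
        "unit_of_annotation": "Unknown",
        "expertise_required": "General",
        "evaluation_modes": "Automatic Metrics",
        "agentic_eval": "Tool Use",
        "quality_controls": "Not reported",
        "extraction_confidence": 0.80,
    }

    # Anything still Unknown or Not reported should be checked against the paper itself.
    needs_followup = [key for key, value in lens.items() if value in ("Unknown", "Not reported")]
    print("Validate from full text:", needs_followup)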

Protocol And Measurement Signals

  • Benchmarks / Datasets: RE-Bench
  • Reported Metrics: success rate

Research Brief

Deterministic synthesis

Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. HFEPX signals include Expert Verification, Automatic Metrics, and Tool Use, with confidence 0.80. Updated from the current HFEPX corpus.

Generated Apr 13, 2026, 6:32 AM · Grounded in abstract + metadata only

Key Takeaways

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.

Researcher Actions

  • Compare its human-feedback setup against pairwise and rubric hubs.
  • Cross-check benchmark overlap: RE-Bench.
  • Validate metric comparability (success rate).

Caveats

  • Generated from title, abstract, and extracted metadata only; full-paper implementation details are not parsed.
  • Extraction confidence is probabilistic and should be validated for critical decisions.

Research Summary

Contribution Summary

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.
  • This is the time humans typically take to complete tasks that AI models can complete with 50% success rate.

Why It Matters For Eval

  • Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear.
  • To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon.

Researcher Checklist

  • Pass: Human feedback protocol is explicit

    Detected: Expert Verification

  • Pass: Evaluation mode is explicit

    Detected: Automatic Metrics

  • Gap: Quality control reporting appears absent

    No calibration/adjudication/IAA control explicitly detected.

  • Pass: Benchmark or dataset anchors are present

    Detected: RE-Bench

  • Pass: Metric reporting is present

    Detected: success rate

Related Papers

Papers are ranked by protocol overlap, extraction signal alignment, and semantic proximity.
