AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Q: How reproducible is "AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science"?

Estimated time to first reproduction: a few days. Risk flags: (1) no repository-level reproducibility signals are currently available; (2) the estimate assumes artifact-level reproduction, and full training reproduction may require additional paper details. Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.

An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding

Published: Mar 19, 2026

No direct implementation yet

Evidence: Inferred

Domain fit: AI-core

Verified repos: 0

Core AI workload signals detected from paper context and implementation/artifact evidence.

Time to first repro: a few days

2 risk flags


Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS

Technical details

Canonical key: arxiv-2603.19005

Cache status: Fresh

Generated at: Jun 18, 2026, 2:57 AM

Artifact coverage: direct

HF provider: ok (public)

PWC source used: No

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

context only

Benchmarks: missing


Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Implementation Evidence Summary

Confidence: low

No direct maintained repository implementation was found, but paper-linked Hugging Face artifacts are available.
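
As a starting point for working with the paper-linked artifact, the sketch below derives the dataset repository id from the URL given in the abstract and, optionally, downloads a snapshot. The helper names dataset_repo_id and download_snapshot are hypothetical. The download step assumes the third-party huggingface_hub package is installed and that the dataset stays public; the layout of the downloaded files is not described on this page and should be inspected rather than assumed.

```python
from urllib.parse import urlparse

# Dataset URL as given in the paper abstract.
DATASET_URL = "https://huggingface.co/datasets/lainmn/AgentDS"


def dataset_repo_id(url: str) -> str:
    """Extract '<owner>/<name>' from a Hugging Face dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"not a Hugging Face dataset URL: {url!r}")
    return f"{parts[1]}/{parts[2]}"


def download_snapshot(url: str, local_dir: str) -> str:
    """Download all files of the dataset repo and return the local path.

    Assumption: huggingface_hub is installed. It is imported lazily so the
    URL helper above still works without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=dataset_repo_id(url),
        repo_type="dataset",
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Only parses the URL; call download_snapshot(...) explicitly to fetch data.
    print(dataset_repo_id(DATASET_URL))
```

Keeping the download behind an explicit call avoids network access until the reproduction environment is ready.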

Reproduction Risks

Estimate assumes artifact-level reproduction; full training reproduction may require additional paper details.

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 2 refs, 1 link.

Utility signals: depth 60/100, grounding 58/100, status medium.

Implementation Comparison

Top 1 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

AnLuo1/AgentDS-Evaluation-Artifacts

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Limited

Matched via arXiv identifier search

Stars: 1
Last push: May 22, 2026 (27d ago)

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup
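
The three flags above can be re-checked against a local clone of a candidate repository. The sketch below is a heuristic under stated assumptions, not the method this page uses: it treats only GitHub Actions workflow files as CI, looks for tags in the local .git directory, and checks a few common Docker file names. The function name repo_risk_flags is hypothetical.

```python
from pathlib import Path


def repo_risk_flags(repo: Path) -> list[str]:
    """Return the risk flags that apply to a local clone at `repo`."""
    flags = []

    # Assumption: CI means at least one GitHub Actions workflow file.
    workflows = repo / ".github" / "workflows"
    has_ci = workflows.is_dir() and any(workflows.glob("*.y*ml"))
    if not has_ci:
        flags.append("No CI pipeline detected")

    # Tags appear as loose refs or as entries in packed-refs.
    tags_dir = repo / ".git" / "refs" / "tags"
    packed = repo / ".git" / "packed-refs"
    has_tags = (tags_dir.is_dir() and any(tags_dir.iterdir())) or (
        packed.is_file() and "refs/tags/" in packed.read_text(errors="ignore")
    )
    if not has_tags:
        flags.append("No tagged releases")

    # Assumption: a Docker setup lives at the repository root.
    docker_names = ("Dockerfile", "docker-compose.yml", "docker-compose.yaml", "compose.yaml")
    if not any((repo / name).exists() for name in docker_names):
        flags.append("No Docker setup")

    return flags


if __name__ == "__main__":
    # Checks the current directory; pass a clone path to inspect a repository.
    print(repo_risk_flags(Path(".")))
```

A repository can fail these checks and still be reproducible, so the result is best read as a prompt for manual review.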

Implementation Status

No verified maintained repo

There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.

Use the paper-linked Hugging Face release as the starting artifact, then reconstruct training and evaluation settings from the paper.
No direct maintained implementation was found. Use the paper PDF and citation graph to design a baseline reproduction.
Track assumptions and missing details in an experiment log before coding.
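
One lightweight way to keep the experiment log mentioned above is a small structured record per assumption, exported as JSON Lines. This is a minimal sketch using only the Python standard library; the class names, field names, and the example entries are hypothetical placeholders and do not describe details of the paper.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class LogEntry:
    topic: str            # what the assumption is about
    assumption: str       # the working assumption, stated explicitly
    source: str           # where the detail came from, or "missing"
    resolved: bool = False
    logged_on: str = field(default_factory=lambda: date.today().isoformat())


class ExperimentLog:
    def __init__(self) -> None:
        self.entries: list[LogEntry] = []

    def add(self, topic: str, assumption: str, source: str = "missing") -> LogEntry:
        entry = LogEntry(topic, assumption, source)
        self.entries.append(entry)
        return entry

    def open_items(self) -> list[LogEntry]:
        """Entries that still need confirmation against the paper or artifacts."""
        return [e for e in self.entries if not e.resolved]

    def to_jsonl(self) -> str:
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self.entries)


if __name__ == "__main__":
    log = ExperimentLog()
    # Placeholder entries for illustration only.
    log.add("evaluation metric", "placeholder: confirm the per-challenge metric in the paper")
    splits = log.add("data splits", "placeholder: use the splits shipped with the dataset",
                     source="Hugging Face release")
    splits.resolved = True
    print(len(log.open_items()), "open item(s)")
    print(log.to_jsonl())
```

Recording the source of each detail makes it easy to see which settings were taken from the paper and which were guessed.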
