CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Q: What is the best open-source implementation of "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark"?

The best-maintained implementation is siegelz/core-bench, with 77 stars on GitHub. Confidence: High. Reproducibility: Moderate.

Q: How reproducible is "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark"?

Estimated time to first reproduction: a few hours. Risk flags: no CI workflows detected. Start with siegelz/core-bench and validate the setup instructions in its README.

Q: What framework is used to implement "CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark"?

No specific framework was detected for the primary implementation.

Published: Sep 1, 2024

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 1

Top repo stars: 77

The paper appears to be method- or tooling-adjacent to AI workflows, with partial ecosystem coverage.

Framework: none

Time to first repro: a few hours

1 risk flag


Technical details

Canonical key: arxiv-2409.11363

Cache status: Fresh

Generated at: Jun 25, 2026, 8:33 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence


Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.

CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark focuses on agentic tool use.

Use This Implementation Because…

Confidence: high

siegelz/core-bench is the strongest maintained implementation based on ranking signals. License is declared (MIT). Dependency/environment manifests are present.


Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 90/100, grounding 85/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

siegelz/core-bench

best maintained

Maintenance: Stale risk

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 77
Last push: Nov 23, 2025 (214d ago)

Detected: Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

princeton-pli/hal-harness

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Community adoption signal (305 stars)

Stars: 305
Last push: Jun 23, 2026 (2d ago)

Detected: CI, Dependencies

Risk flags

No tagged releases
No Docker setup
Low confidence match

EnvCommons/CoreBench-Hard

alternative

Maintenance: Recently updated

Confidence: Low

Reproducibility: Limited

Matched via arXiv identifier search

Stars: 0
Last push: Mar 29, 2026 (88d ago)

Detected: Dockerfile, Dependencies

Risk flags

No CI pipeline detected
No tagged releases
Low confidence match
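
The "(Nd ago)" figures in the comparison can be checked with plain date arithmetic. The sketch below assumes the reference date is the page's "Last checked" value (Jun 25, 2026); the helper name days_since is hypothetical and is not part of any listed repository.

```python
from datetime import date


def days_since(last_push: date, checked_on: date) -> int:
    """Whole days elapsed between a repository's last push and the check date."""
    return (checked_on - last_push).days


# Dates as listed on this page; the check date is assumed to be "Last checked".
checked_on = date(2026, 6, 25)
last_pushes = {
    "siegelz/core-bench": date(2025, 11, 23),
    "princeton-pli/hal-harness": date(2026, 6, 23),
    "EnvCommons/CoreBench-Hard": date(2026, 3, 29),
}

for repo, pushed in last_pushes.items():
    print(f"{repo}: {days_since(pushed, checked_on)}d ago")
```

Under that assumption the three values come out as 214, 2, and 88 days, matching the figures listed for each repository.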

Best implementation now

siegelz/core-bench

Confidence: High

Reproducibility: Moderate


Stars: 77

Forks: 8

Last push: Nov 23, 2025

License: MIT

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Community adoption signal (77 stars)

License ✓

CI –

Deps ✓

Docker –

Selected siegelz/core-bench as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: Jun 25, 2026

Dependencies pinned, manual setup needed

· siegelz/core-bench has requirements.txt but requires manual environment setup.
· Last push was 214 days ago — expect possible dependency version conflicts.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.
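
The readiness signals above (license, CI, dependency manifest, Docker) can be re-checked locally after cloning. This is a minimal sketch, not the page's actual detection logic: the function name readiness_signals and the specific file names it looks for are assumptions chosen for illustration.

```python
from pathlib import Path


def readiness_signals(repo_dir: str) -> dict[str, bool]:
    """Report which common reproducibility files exist in a local checkout.

    The file names checked here are assumptions, not the page's own rules.
    """
    root = Path(repo_dir)
    workflows = root / ".github" / "workflows"
    manifests = ("requirements.txt", "environment.yml", "pyproject.toml")
    return {
        # Any top-level file whose name starts with LICENSE
        "license": any(root.glob("LICENSE*")),
        # At least one .yml/.yaml file under .github/workflows
        "ci": workflows.is_dir() and any(workflows.glob("*.y*ml")),
        # At least one dependency or environment manifest
        "deps": any((root / name).is_file() for name in manifests),
        # A top-level Dockerfile
        "docker": (root / "Dockerfile").is_file(),
    }
```

Calling readiness_signals("core-bench") on a fresh clone returns a dictionary of booleans that can be compared against the License, CI, Deps, and Docker indicators listed for the repository.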


Quick start

git clone https://github.com/siegelz/core-bench.git
cd core-bench
pip install -r requirements.txt

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (3)

These repositories had low-confidence matching signals and are hidden by default.

princeton-pli/hal-harness

Confidence: Low

Stars: 305

EnvCommons/CoreBench-Hard

Confidence: Low

Stars: 0

GeneralReasoning/env-corebench-easy

Confidence: Low

Stars: 0

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2409.11363 CORE-Bench AI Agents

Datasets

arxiv:2409.11363 CORE-Bench dataset

Spaces

arxiv:2409.11363 CORE-Bench demo

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.


Research context

Tasks

Agentic tool use

Methods

Agentic systems

Domains

AI Agents

