What is the best open-source implementation of "PostTrainBench: Can LLM Agents Automate LLM Post-Training?"?

The best-maintained implementation is aisa-group/PostTrainBench, with 308 stars on GitHub. Confidence: medium. Reproducibility: limited.

How reproducible is "PostTrainBench: Can LLM Agents Automate LLM Post-Training?"?

Estimated time to first reproduction: a few days. Risk flags: no CI workflows detected; dependency manifest is missing. Start with aisa-group/PostTrainBench and validate the setup instructions in its README.

Are there pretrained models available for "PostTrainBench: Can LLM Agents Automate LLM Post-Training?"?

No models linked directly to the paper were found. 3 related Hugging Face models were surfaced instead; the top result is deepseek-ai/deepseek-llm-7b-chat with 58,704 downloads.

PostTrainBench: Can LLM Agents Automate LLM Post-Training?

Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko

Published: Mar 9, 2026

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 2

Top repo stars: 308

Core AI workload signals detected from paper context and implementation/artifact evidence.

Time to first repro: a few days

2 risk flags

AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.

Technical details

Canonical key: arxiv-2603.08640

Cache status: Fresh

Generated at: May 3, 2026, 9:17 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

LLM status: ready

LLM model: openai/gpt-5.1-20251113

LLM generated: May 2, 2026, 5:21 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_6], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_15], evidencePack.paperSections[id=paper_table_3], evidencePack.paperSections[id=paper_caption_13], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], researcherSummary.hardwareNotes[0], summary.hasReliableImplementation, summary.visibleRepoCount, researcherSummary.implementationRecommendation, guidance.riskFlags[0]

implementation starting point

Benchmarks: thin evidence

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Instruction tuning

Qwen3-4B-Base

GSM8K

74.4

Source: paper fulltext

Instruction tuning

Qwen3-1.7B-Base

AIME 2025

5.3

Source: paper fulltext

Instruction tuning

GSM8K

Accuracy

5.2

Source: paper fulltext

Instruction tuning

HumanEval

Accuracy

20.4

Source: paper fulltext

Benchmark evidence drill-down

4 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Model	Benchmark	Metric	Value	Source	Evidence refs
Instruction tuning	Qwen3-4B-Base	GSM8K	not stated	74.4	paper-derived	No explicit refs
Instruction tuning	Qwen3-1.7B-Base	AIME 2025	not stated	5.3	paper-derived	No explicit refs
Instruction tuning	not stated	GSM8K	Accuracy	5.2	paper-derived	No explicit refs
Instruction tuning	not stated	HumanEval	Accuracy	20.4	paper-derived	No explicit refs
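
The audit step suggested above can be sketched in a few lines of Python. This is only an illustration: the `Finding` record and the `audit` helper are hypothetical names, and splitting each row into a model and a benchmark is an interpretive assumption, since the listing does not label those fields consistently. The values are transcribed from the four findings as shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    """One benchmark finding; None marks a field the listing does not state."""
    task: str
    model: Optional[str]
    benchmark: str
    metric: Optional[str]
    value: float
    evidence_refs: Optional[str]


# Values transcribed from the findings table; the model/benchmark split is assumed.
FINDINGS = [
    Finding("Instruction tuning", "Qwen3-4B-Base", "GSM8K", None, 74.4, None),
    Finding("Instruction tuning", "Qwen3-1.7B-Base", "AIME 2025", None, 5.3, None),
    Finding("Instruction tuning", None, "GSM8K", "Accuracy", 5.2, None),
    Finding("Instruction tuning", None, "HumanEval", "Accuracy", 20.4, None),
]


def audit(finding: Finding) -> List[str]:
    """Return the reasons a finding should be checked against the paper."""
    problems = []
    if finding.model is None:
        problems.append("model not stated")
    if finding.metric is None:
        problems.append("metric not stated")
    if finding.evidence_refs is None:
        problems.append("no explicit evidence refs")
    return problems


for f in FINDINGS:
    print(f.benchmark, f.value, audit(f))
```

Every row is flagged for missing evidence refs, and the two GSM8K values (74.4 and 5.2) cannot be compared until the model behind the second one is confirmed in the paper.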

Use This Implementation Because…

Confidence: medium

aisa-group/PostTrainBench is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. License is declared (MIT).

Reproduction Risks

No CI workflows detected
Dependency manifest is missing

Hardware Notes

We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU).
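
When reproducing locally, one way to hold a run to the same wall-clock limit is sketched below. This is not the benchmark's own harness: `run_with_budget` and `BUDGET_SECONDS` are hypothetical names, and the command passed in is whatever training or agent entry point you choose. The sketch only enforces elapsed time; it does not check that the machine has a single H100.

```python
import subprocess
from typing import List, Optional

# 10 hours, taken from the compute constraint stated in the paper.
BUDGET_SECONDS = 10 * 60 * 60


def run_with_budget(cmd: List[str], budget_seconds: float = BUDGET_SECONDS) -> Optional[int]:
    """Run cmd and return its exit code, or None if the budget ran out.

    subprocess.run kills the child process when the timeout expires,
    so an over-budget run does not keep using the GPU.
    """
    try:
        completed = subprocess.run(cmd, timeout=budget_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode
```

A return value of `None` means the run exceeded the budget and should be treated as unfinished, not as a failed training script.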

Evidence disclosure

Evidence graph: 4 refs, 4 links.

Utility signals: depth 100/100, grounding 95/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

aisa-group/PostTrainBench

best maintained

Maintenance: Active

Confidence: Medium

Reproducibility: Limited

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 308
Last push: Apr 27, 2026 (6d ago)

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

josancamon19/PostTrainBench

alternative

Maintenance: Active

Confidence: Medium

Reproducibility: Strong

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 0
Last push: Apr 26, 2026 (7d ago)

CI · Dependencies

Risk flags

No tagged releases
No Docker setup

eerstar/LLM-Agent-paper-daily

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 0
Last push: May 3, 2026

CI · Dependencies

Risk flags

No tagged releases
No Docker setup
Low confidence match

Best implementation now

aisa-group/PostTrainBench

Confidence: Medium

Reproducibility: Limited

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

Stars: 308

Forks: 33

Last push: Apr 27, 2026

License: MIT

Matched via arXiv identifier search

Strong overlap with paper title keywords

Community adoption signal (308 stars)

License ✓

CI –

Deps –

Docker –

Selected aisa-group/PostTrainBench as the strongest maintained implementation for new work.
Repository activity is recent (last push: Apr 27, 2026).

Reproduction readiness

Major Work

Time to first repro: a few days

Last checked: May 3, 2026

No dependency manifest — manual reconstruction required

· aisa-group/PostTrainBench has no requirements.txt, environment.yml, pyproject.toml, or Dockerfile.
· You will need to reverse-engineer dependencies from import statements in the source code.
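
A first pass at recovering dependencies from import statements can be automated with Python's standard `ast` module. The sketch below is a starting point under stated assumptions: `top_level_imports` and `third_party_candidates` are hypothetical helper names, the repository is assumed to be cloned locally, and `sys.stdlib_module_names` requires Python 3.10 or newer.

```python
import ast
import sys
from pathlib import Path
from typing import List, Set


def top_level_imports(source: str) -> Set[str]:
    """Collect the top-level package names imported by one Python source string."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import (from . import x), which is local code.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def third_party_candidates(repo_root: str) -> List[str]:
    """List imported names under repo_root that are not standard-library modules."""
    found = set()
    for path in Path(repo_root).rglob("*.py"):
        try:
            found |= top_level_imports(path.read_text(encoding="utf-8", errors="ignore"))
        except SyntaxError:
            continue  # skip files that do not parse under this interpreter
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(found - set(stdlib))
```

The result is a list of import names, not installable package names: the repository's own modules will appear in it, some imports differ from their PyPI names (for example, `yaml` is installed as PyYAML), and versions still have to be pinned by hand. Dependencies invoked from shell scripts are not covered.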

Additional implementations

Official

No additional official repositories detected.

Community

josancamon19/PostTrainBench

Confidence: Medium

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

Stars: 0

Last push: Apr 26, 2026

License: MIT

Possible but unverified matches (1)

These repositories had low-confidence matching signals and are hidden by default.

eerstar/LLM-Agent-paper-daily

Confidence: Low

Stars: 0

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

deepseek-ai/deepseek-llm-7b-chat

Curated Related

Downloads: 58,704

Likes: 222

deepseek-ai/deepseek-llm-7b-base

Curated Related

Downloads: 40,562

Likes: 141

BAAI/llm-embedder

Curated Related

Downloads: 32,241

Likes: 128