implementation starting point
Benchmarks: thin evidence
Time to repro: a few days
2 risk flags

Results & Benchmarks

Freshness tier: hot
Direct + Inferred Evidence
Instruction tuning · Qwen3-4B-Base · GSM8K · 74.4 · Source: paper fulltext
Instruction tuning · Qwen3-1.7B-Base · AIME 2025 · 5.3 · Source: paper fulltext
Instruction tuning · GSM8K · Accuracy · 5.2 · Source: paper fulltext
Instruction tuning · HumanEval · Accuracy · 20.4 · Source: paper fulltext

Benchmark evidence drill-down

4 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task | Model | Dataset | Metric | Value | Source | Evidence refs
Instruction tuning | Qwen3-4B-Base | GSM8K | not stated | 74.4 | paper-derived | No explicit refs
Instruction tuning | Qwen3-1.7B-Base | AIME 2025 | not stated | 5.3 | paper-derived | No explicit refs
Instruction tuning | not stated | GSM8K | Accuracy | 5.2 | paper-derived | No explicit refs
Instruction tuning | not stated | HumanEval | Accuracy | 20.4 | paper-derived | No explicit refs
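
The audit step described above can be sketched in a few lines of Python. This is a minimal illustration, not part of any listed repository: the `Finding` record and the `audit` helper are hypothetical names, and reading the first two rows as "model, then dataset" is an assumption based on the values shown. The sketch flags each finding that lacks a model, a metric, or explicit evidence refs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    """One benchmark row from the drill-down table (hypothetical structure)."""
    task: str
    dataset: str
    value: float
    model: Optional[str] = None
    metric: Optional[str] = None
    evidence_refs: list = field(default_factory=list)


def audit(finding):
    """Return a list of gaps that should be resolved before trusting the row."""
    issues = []
    if finding.model is None:
        issues.append("model not stated")
    if finding.metric is None:
        issues.append("metric not stated")
    if not finding.evidence_refs:
        issues.append("no explicit evidence refs")
    return issues


# Values copied from the table; the model/dataset split is an assumption.
FINDINGS = [
    Finding("Instruction tuning", "GSM8K", 74.4, model="Qwen3-4B-Base"),
    Finding("Instruction tuning", "AIME 2025", 5.3, model="Qwen3-1.7B-Base"),
    Finding("Instruction tuning", "GSM8K", 5.2, metric="Accuracy"),
    Finding("Instruction tuning", "HumanEval", 20.4, metric="Accuracy"),
]

if __name__ == "__main__":
    for f in FINDINGS:
        print(f.dataset, f.value, audit(f))
```

Every row is flagged at least once, which is consistent with the "thin evidence" label at the top of the page: none of the four findings carries an explicit evidence ref.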

AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities.

Use This Implementation Because…

Confidence: medium

aisa-group/PostTrainBench is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. License is declared (MIT).

Open aisa-group/PostTrainBench

Reproduction Risks

  • No CI workflows detected
  • Dependency manifest is missing

Hardware Notes

We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU).
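
When reproducing under the same constraint, it can help to enforce the 10-hour wall-clock limit mechanically. The sketch below is an assumption about how one might do that locally, not a description of how PostTrainBench enforces its budget; `run_with_budget` is a hypothetical helper built on the standard library's `subprocess.run` timeout.

```python
import subprocess
import sys

# 10 hours, the budget quoted from the paper.
BUDGET_SECONDS = 10 * 60 * 60


def run_with_budget(cmd, budget_seconds=BUDGET_SECONDS):
    """Run cmd and return True if it exited within the budget.

    Returns False if the deadline passed first; subprocess.run kills the
    child process in that case before raising TimeoutExpired.
    """
    try:
        subprocess.run(cmd, timeout=budget_seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True


if __name__ == "__main__":
    # Placeholder command; substitute the real training entry point.
    finished = run_with_budget([sys.executable, "-c", "print('training stub')"])
    print("finished within budget:", finished)
```

Note that the timeout only terminates the direct child process. A launcher that spawns its own workers would need extra process-group handling, which this sketch leaves out.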

Evidence disclosure

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_6], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_15], evidencePack.paperSections[id=paper_table_3], evidencePack.paperSections[id=paper_caption_13], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], researcherSummary.hardwareNotes[0], summary.hasReliableImplementation, summary.visibleRepoCount, researcherSummary.implementationRecommendation, guidance.riskFlags[0]

Evidence graph: 4 refs, 4 links.

Utility signals: depth 100/100, grounding 95/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

aisa-group/PostTrainBench
best maintained
Maintenance: Active
Confidence: Medium
Reproducibility: Limited

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 308
Last push: Apr 27, 2026 (6d ago)

Risk flags

  • No CI pipeline detected
  • No tagged releases
  • No Docker setup

Maintenance: Active
Confidence: Medium
Reproducibility: Strong

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 0
Last push: Apr 26, 2026 (7d ago)
CI · Dependencies

Risk flags

  • No tagged releases
  • No Docker setup

Maintenance: Active
Confidence: Low
Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 0
Last push: May 3, 2026 (1d ago)
CI · Dependencies

Risk flags

  • No tagged releases
  • No Docker setup
  • Low confidence match

Best implementation now

aisa-group/PostTrainBench
Confidence: Medium
Reproducibility: Limited

Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

Stars: 308
Forks: 33
Last push: Apr 27, 2026
License: MIT
Matched via arXiv identifier search
Strong overlap with paper title keywords
Community adoption signal (308 stars)
License ✓
CI –
Deps –
Docker –
  • Selected aisa-group/PostTrainBench as the strongest maintained implementation for new work.
  • Repository activity is within the last 24 months.

Reproduction readiness

Major Work
Time to first repro: days
Last checked: May 3, 2026

Hardware requirements

  • One H100 GPU with a 10-hour compute budget, as stated in the paper.

No dependency manifest — manual reconstruction required

  • aisa-group/PostTrainBench has no requirements.txt, environment.yml, pyproject.toml, or Dockerfile.
  • You will need to reverse-engineer dependencies from import statements in the source code.
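
One way to start that reconstruction is to scan the repository's Python files for imports. The sketch below is a rough starting point under stated assumptions: it only looks at `.py` files, the helper names `top_level_imports` and `third_party_candidates` are hypothetical, and it relies on `sys.stdlib_module_names`, which exists only on Python 3.10 and later (on older versions standard-library modules are not filtered out).

```python
import ast
import sys
from pathlib import Path


def top_level_imports(source):
    """Return the set of top-level module names imported by Python source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 skips relative imports such as "from . import utils".
            names.add(node.module.split(".")[0])
    return names


def third_party_candidates(repo_root):
    """List imported names that are neither standard library nor local modules."""
    root = Path(repo_root)
    py_files = list(root.rglob("*.py"))
    # Treat every .py file stem and every package directory name as local.
    local = {p.stem for p in py_files}
    local |= {p.parent.name for p in py_files if p.name == "__init__.py"}
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    found = set()
    for path in py_files:
        try:
            found |= top_level_imports(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that do not parse as Python
    return sorted(found - stdlib - local)


if __name__ == "__main__":
    repo = sys.argv[1] if len(sys.argv) > 1 else "."
    print("\n".join(third_party_candidates(repo)))
```

The output is a list of import names, not installable package names or versions. Some differ from their PyPI distribution (`yaml` is provided by PyYAML, `sklearn` by scikit-learn), so each entry still needs to be mapped and pinned by hand.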

Additional implementations

Official

No additional official repositories detected.

Community

  • Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours

    Stars: 0
    Last push: Apr 26, 2026
    License: MIT

These repositories had low-confidence matching signals and are hidden by default.

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Research context

Tasks

Instruction tuning, Agentic tool use

Methods

Transformer

Domains

Large Language Models, AI Agents

