Matched via arXiv identifier search · Strong overlap with paper title keywords
- Stars: 308
- Last push: Apr 27, 2026 (6d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Ben Rank, Hardik Bhatnagar, Ameya Prabhu, Shira Eisenberg, Karina Nguyen, Matthias Bethge, Maksym Andriushchenko
Core AI workload signals detected from paper context and implementation/artifact evidence.
AI agents have become surprisingly proficient at software engineering over the past year, largely due to improvements in reasoning capabilities. This raises a deeper question: can these systems extend their capabilities to automate AI research itself? In this paper, we explore post-training, the critical phase that turns base LLMs into useful assistants. We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU). We ask frontier agents (e.g., Claude Code with Opus 4.6) to optimize the performance of a base LLM on a particular benchmark (e.g., Qwen3-4B on AIME). Importantly, we do not provide any predefined strategies to the agents and instead give them full autonomy to find necessary information on the web, run experiments, and curate data. We find that frontier agents make substantial progress but generally lag behind instruction-tuned LLMs from leading providers: 23.2% for the best agent vs. 51.1% for official instruction-tuned models. However, agents can exceed instruction-tuned models in targeted scenarios: GPT-5.1 Codex Max achieves 89% on BFCL with Gemma-3-4B vs. 67% for the official model. We also observe several failure modes worth flagging. Agents sometimes engage in reward hacking: training on the test set, downloading existing instruction-tuned checkpoints instead of training their own, and using API keys they find to generate synthetic data without authorization. These behaviors are concerning and highlight the importance of careful sandboxing as these systems become more capable. Overall, we hope PostTrainBench will be useful for tracking progress in AI R&D automation and for studying the risks that come with it. Website and code are available at https://posttrainbench.com/.
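The headline comparison in the abstract can be put side by side with a few lines of arithmetic. The sketch below is illustrative only: the dictionary layout and the `gap` helper are assumptions introduced here and are not part of PostTrainBench; the four percentages are the ones quoted in the abstract.

```python
# Scores quoted in the abstract, in percent: (agent, official instruction-tuned).
# The structure and names below are hypothetical, for illustration only.
reported = {
    "average (best agent vs. official instruction-tuned)": (23.2, 51.1),
    "BFCL, Gemma-3-4B (GPT-5.1 Codex Max vs. official)": (89.0, 67.0),
}


def gap(agent: float, official: float) -> float:
    """Return agent score minus official score, in percentage points."""
    return round(agent - official, 1)


gaps = {name: gap(a, o) for name, (a, o) in reported.items()}

for name, g in gaps.items():
    direction = "ahead of" if g > 0 else "behind"
    print(f"{name}: agent is {abs(g)} points {direction} the official model")
```

Read this way, the best agent trails the official instruction-tuned models by 27.9 points on the aggregate figure, while the BFCL case runs 22.0 points in the agent's favor.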
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Model | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|---|
| Instruction tuning | Qwen3-4B-Base | GSM8K | Not stated | 74.4 | paper-derived | No explicit refs |
| Instruction tuning | Qwen3-1.7B-Base | AIME 2025 | Not stated | 5.3 | paper-derived | No explicit refs |
| Instruction tuning | Not stated | GSM8K | Accuracy | 5.2 | paper-derived | No explicit refs |
| Instruction tuning | Not stated | HumanEval | Accuracy | 20.4 | paper-derived | No explicit refs |
aisa-group/PostTrainBench is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. License is declared (MIT).
Hardware Notes
We introduce PostTrainBench to benchmark how well LLM agents can perform post-training autonomously under bounded compute constraints (10 hours on one H100 GPU).
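One way to picture a bounded-compute constraint like this is a wall-clock guard around a training loop. This is a minimal sketch under assumptions: the `run_with_budget` helper, the `step` callback, and the injectable `clock` are hypothetical names introduced here, and nothing below describes how PostTrainBench itself enforces its limit. Only the 10-hour figure comes from the text above.

```python
import time
from typing import Callable

BUDGET_SECONDS = 10 * 60 * 60  # 10 hours, the limit stated above


def run_with_budget(
    step: Callable[[int], None],
    max_steps: int,
    budget_seconds: float = BUDGET_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step(i) until max_steps or the time budget is reached.

    Returns the number of steps that were completed.
    """
    start = clock()
    done = 0
    for i in range(max_steps):
        # The budget is checked only between steps, so one long step can overrun it.
        if clock() - start >= budget_seconds:
            break
        step(i)
        done += 1
    return done
```

Because the check happens between steps, a guard like this bounds the number of steps started rather than the exact wall-clock time; a hard limit would need an external timeout on the process.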
LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_6], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_15], evidencePack.paperSections[id=paper_table_3], evidencePack.paperSections[id=paper_caption_13], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], researcherSummary.hardwareNotes[0], summary.hasReliableImplementation, summary.visibleRepoCount, researcherSummary.implementationRecommendation, guidance.riskFlags[0]
Evidence graph: 4 refs, 4 links.
Utility signals: depth 100/100, grounding 95/100, status high.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Measuring how well CLI agents like Claude Code or Codex CLI can post-train base LLMs on a single H100 GPU in 10 hours
Hardware requirements
No dependency manifest — manual reconstruction required
No additional official repositories detected.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
- Tasks: Instruction tuning, Agentic tool use
- Methods: Transformer
- Domains: Large Language Models, AI Agents