Results & Benchmarks
| Task | Dataset / Model | Metric | Value |
|---|---|---|---|
| Natural language processing | Codex-S-12B | pass@1 | 32.2 |
| Natural language processing | Codex-D-12B | pass@1 | 20.3 |
| Natural language processing | HumanEval | BLEU | 0.8 |
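For readers unfamiliar with the pass@1 metric in the table, the following is a minimal sketch of the unbiased pass@k estimator described in the paper: for each problem, generate n samples, count the c samples that pass the unit tests, compute 1 - C(n-c, k) / C(n, k), and average over problems. The function names `pass_at_k` and `mean_pass_at_k` are illustrative, not taken from the repository, and the sketch assumes the table values were computed this way and reported as percentages.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem
    c: number of those samples that pass all unit tests
    k: the k in pass@k
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples fail, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


if __name__ == "__main__":
    # Hypothetical (n, c) counts for three problems, for illustration only.
    example = [(10, 3), (10, 0), (10, 10)]
    print(f"pass@1 = {100 * mean_pass_at_k(example, k=1):.1f}%")
```

With k = 1 the estimator reduces to c / n per problem, so pass@1 is simply the average fraction of passing samples.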
Best Implementation
Code for the paper "Evaluating Large Language Models Trained on Code"
Stars: 3.2k, Forks: 442, Last updated: Jan 2025, License: MIT
License ✓
CI –
Deps ✓
Docker –
- Selected openai/human-eval as the strongest maintained implementation for new work.
- Ships a dependency or environment manifest (Deps ✓).
- Repository activity is within the last 24 months.
Reproduction Path
1. Start with openai/human-eval and validate the setup instructions in its README.
2. Reproduce the baseline result with the provided defaults before modifying any hyperparameters.
3. Log exact dependency versions and the runtime environment for reproducibility.
Time to first repro: a few hours
No CI workflows detected
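The last reproduction step, logging dependency versions and the runtime environment, can be sketched with the Python standard library alone. This is one possible approach and not part of openai/human-eval; the function names and the `environment.json` filename are assumptions chosen for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment() -> dict:
    """Collect interpreter, OS, and installed-package versions."""
    packages = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken or missing metadata
            packages.add(f"{name}=={dist.version}")
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": sorted(packages),
    }


def write_environment(path: str = "environment.json") -> dict:
    """Write the captured environment to a JSON file (hypothetical filename)."""
    env = capture_environment()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(env, f, indent=2)
    return env


if __name__ == "__main__":
    env = write_environment()
    print(f"Recorded {len(env['packages'])} packages for Python {platform.python_version()}")
```

Storing this file next to each set of evaluation results makes it easier to tell whether a later difference in scores comes from a code change or from a changed environment.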
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.
Curated Related