Maintained implementation available | none

Evaluating Large Language Models Trained on Code

July 1, 2021 | arXiv: 2107.03374

1 repo | 3,185 stars | ~a few hours to reproduce

Abstract

Task	Dataset / Model	Metric	Value
Natural language processing	Codex-S-12B	pass@1	32.2
Natural language processing	Codex-D-12B	pass@1	20.3
Natural language processing	HumanEval	BLEU	0.8
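
The pass@1 figures above are instances of the pass@k metric. The following is a minimal sketch, assuming the unbiased estimator described in the paper: for a problem with n generated samples of which c pass the unit tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k), then averaged over problems. The function names `pass_at_k` and `mean_pass_at_k` and the sample counts in the usage lines are illustrative assumptions, not code or data from openai/human-eval.

```python
import math
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Assumed formula: 1 - C(n - c, k) / C(n, k).
    """
    if n <= 0 or not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require n > 0, 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates over (n, c) pairs."""
    estimates = [pass_at_k(n, c, k) for n, c in results]
    if not estimates:
        raise ValueError("no problems given")
    return sum(estimates) / len(estimates)


# Hypothetical counts for three problems: (samples drawn, samples passing).
example_results = [(10, 3), (10, 0), (10, 10)]
example_score = mean_pass_at_k(example_results, k=1)
```

For k = 1 the estimator reduces to c / n per problem, so the score is simply the average fraction of passing samples.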

Code for the paper "Evaluating Large Language Models Trained on Code"

3.2k stars | 442 forks | updated Jan 2025 | MIT license

License ✓

CI –

Deps ✓

Docker –

Selected openai/human-eval as the strongest maintained implementation for new work.
The repository includes a dependency or environment manifest.
Repository activity is within the last 24 months.

1. Start with openai/human-eval and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
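
Step 3 can be sketched with the standard library alone. This is one possible approach, not part of openai/human-eval: the helper name `capture_environment` is hypothetical, and the package list (`numpy`, `tqdm`, `fire`) is an assumption about what the harness depends on, so replace it with whatever the repository's manifest actually lists.

```python
import json
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def capture_environment(packages: Iterable[str]) -> Dict[str, object]:
    """Collect interpreter, OS, and installed package versions in one record."""
    versions: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of failing.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Assumed package list; adjust to match the repository's requirements.
    snapshot = capture_environment(["numpy", "tqdm", "fire"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Saving this output next to each evaluation run makes it easier to trace a later discrepancy back to a changed dependency or interpreter version.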

Time to first repro: a few hours
No CI workflows detected

No additional verified repositories beyond the primary recommendation.

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Curated Related