What is the best open-source implementation of "OJBench: A Competition Level Code Benchmark For Large Language Models"?

The best maintained implementation is he-ren/ojbench with 28 stars on GitHub. Confidence: high. Reproducibility: Moderate.

What framework is used to implement "OJBench: A Competition Level Code Benchmark For Large Language Models"?

No specific framework was detected for the primary implementation.

OJBench: A Competition Level Code Benchmark For Large Language Models

Q: How reproducible is "OJBench: A Competition Level Code Benchmark For Large Language Models"?

Estimated time to first reproduction: a few hours. Risk flags: No CI workflows detected. Start with he-ren/ojbench and validate setup instructions in README.

Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu

Published: Jun 19, 2025

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 1

Top repo stars: 28

Core AI workload signals detected from paper context and implementation/artifact evidence.

Framework: none

Time to first repro: a few hours

1 risk flag

Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.

Technical details

Canonical key: arxiv-2506.16395

Cache status: Stale (SWR served)

Generated at: Mar 8, 2026, 3:12 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

LLM status: ready

LLM model: openai/gpt-5.1-20251113

LLM generated: Mar 8, 2026, 3:13 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_8], evidencePack.paperSections[id=paper_table_2], evidencePack.paperSections[id=paper_caption_4], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], researcherSummary.benchmarkSnapshot[2]

Researcher verdict

Useful paper, but implementation path is weak

implementation starting point

Benchmark trust: grounded evidence

This page is best used as an implementation starting point. A concrete repo path exists, but the overall evidence is not strong enough yet to treat it as a plug-and-play baseline.

Why this page is still worth reading

Benchmark findings give you an audit trail for validation before picking an implementation path.
A concrete repository path exists via he-ren/ojbench, so this page can act as a practical starting point.
Reproduction risks are surfaced explicitly, which helps decide whether the paper is worth immediate prototyping.

Benchmark trust

Concrete benchmark findings are present and can be audited against the extracted evidence.

Use this page as

Use this page to start from the best available repo path, but validate benchmark claims separately before treating it as a trusted baseline.

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

Task	Model	Metric	Value	Source
Natural language processing	o4-mini(low)	pass@1	63.70	paper fulltext
Natural language processing	o4-mini	pass@1	33.30	paper fulltext
Natural language processing	Gemini-2.5-pro	pass@1	65.90	paper fulltext
Natural language processing	Qwen2.5-Coder-7B	Pass Rate @.	4.74	paper fulltext

Benchmark evidence drill-down

4 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset / model	Metric	Value	Source	Evidence refs
Model comparisons show that state-of-the-art reasoning-oriented LLMs obtain subs	OJBench	pass@1	63.70	llm-grounded	paper.abstract, evidencePack.paperSections[id=paper_table_2], evidencePack.paperSections[id=paper_caption_4]
Natural language processing	o4-mini(low)	pass@1	63.70	llm-grounded	researcherSummary.benchmarkSnapshot[0]
Natural language processing	o4-mini	pass@1	33.30	llm-grounded	researcherSummary.benchmarkSnapshot[1]
Natural language processing	Gemini-2.5-pro	pass@1	65.90	llm-grounded	researcherSummary.benchmarkSnapshot[2]
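
The pass@1 values above are percentages of problems solved on a single attempt. As a rough sketch of how such a figure can be computed, the block below uses the unbiased pass@k estimator popularized by the HumanEval benchmark. Whether the OJBench authors use this exact estimator or a plain average over single samples is an assumption here, and the per-problem counts in `toy_results` are invented purely for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k: int) -> float:
    """Mean pass@k over (n, c) pairs, reported as a percentage."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Hypothetical (samples, correct) counts for four problems; not paper data.
toy_results = [(8, 8), (8, 4), (8, 0), (8, 2)]
print(benchmark_pass_at_k(toy_results, k=1))
```

For k = 1 the estimator reduces to the fraction of correct samples per problem, so a score such as 63.70 can be read as the average per-problem success rate under this assumption.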

Use This Implementation Because…

Confidence: high

he-ren/ojbench is the strongest maintained implementation based on ranking signals. License is declared (AGPL-3.0). Dependency/environment manifests are present.

Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 90/100, grounding 85/100, status high.

Implementation Comparison

Top 1 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

he-ren/ojbench

best maintained

Maintenance: Active

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 28
Last push: Feb 28, 2026 (8d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup
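
Flags like these can be re-checked against a local clone. The sketch below is one hypothetical way to do so with the Python standard library. The file names it looks for (`.github/workflows`, `Dockerfile`, `requirements.txt`, and so on) are common conventions assumed here, not the rules this page used, and tagged releases are left out because checking them needs Git or the GitHub API.

```python
import tempfile
from pathlib import Path


def repo_signals(repo_dir) -> dict:
    """Report which common reproducibility files exist in a checked-out repo."""
    root = Path(repo_dir)
    workflows = root / ".github" / "workflows"
    # A CI setup is assumed to mean at least one YAML file under .github/workflows.
    has_ci = workflows.is_dir() and any(
        p.suffix in {".yml", ".yaml"} for p in workflows.iterdir()
    )
    docker_files = ("Dockerfile", "docker-compose.yml", "compose.yaml")
    dep_files = ("requirements.txt", "pyproject.toml", "setup.py", "environment.yml")
    return {
        "ci": has_ci,
        "docker": any((root / name).exists() for name in docker_files),
        "deps": any((root / name).exists() for name in dep_files),
        "license": any(root.glob("LICENSE*")),
    }


# Demo on a throwaway directory standing in for a real clone.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "requirements.txt").write_text("numpy\n")
    (Path(tmp) / "LICENSE").write_text("AGPL-3.0\n")
    demo = repo_signals(tmp)

print(demo)
```

Pointing `repo_signals` at an actual checkout of he-ren/ojbench would show whether the flags listed here still hold at the time of reproduction.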

What is known right now

Concise audit mode

This page is not strong enough for a full AI-written research brief yet, so the summary is reduced to what is evidenced, what is missing, and what to do next.

What is known

Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities.
Benchmark anchor: Natural language processing on o4-mini(low) using pass@1.
Implementation candidate: he-ren/ojbench.

What is missing

Benchmark evidence is not yet strong enough to treat the LLM brief as fully researcher-ready.

What to do next

Start with he-ren/ojbench and validate setup instructions in README.
Reproduce the baseline result with the provided defaults before modifying hyperparameters.
Log exact dependency versions and runtime environment for reproducibility.
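
For the logging step, a minimal sketch of an environment snapshot using only the Python standard library is shown below. The function name `snapshot_environment` and the JSON layout are hypothetical choices, not something the repository prescribes.

```python
import json
import platform
from importlib import metadata


def snapshot_environment() -> dict:
    """Collect interpreter, OS, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": packages,
    }


snapshot = snapshot_environment()
# Sorted keys keep the dump stable, which makes diffs between runs readable.
report = json.dumps(snapshot, indent=2, sort_keys=True)
print(report[:400])
```

Saving `report` next to each set of results makes it possible to tell later whether a score difference came from the model or from a changed dependency.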

Best implementation now

he-ren/ojbench

Confidence: High

Reproducibility: Moderate

He-Ren/OJBench

Stars: 28

Forks: 1

Last push: Feb 28, 2026

License: AGPL-3.0

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Community adoption signal (28 stars)

License ✓

CI –

Deps ✓

Docker –

Selected he-ren/ojbench as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction path

Direct

Follow the direct implementation path

1. Start with he-ren/ojbench and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.

Time to first repro: a few hours

No CI workflows detected