Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
- Stars: 28
- Last push: Feb 28, 2026 (8d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu
Core AI workload signals detected from paper context and implementation/artifact evidence.
Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these capabilities, particularly at the competitive level. To bridge this gap, we introduce OJBench, a novel and challenging benchmark designed to assess the competitive-level code reasoning abilities of LLMs. OJBench comprises 232 programming competition problems from NOI and ICPC, providing a more rigorous test of models' reasoning skills. We conducted a comprehensive evaluation using OJBench on 37 models, including both closed-source and open-source models, reasoning-oriented and non-reasoning-oriented models. Our results indicate that even state-of-the-art reasoning-oriented models, such as o4-mini and Gemini-2.5-pro-exp, struggle with highly challenging competition-level problems. This highlights the significant challenges that models face in competitive-level code reasoning.
Researcher verdict
This page is best used as an implementation starting point. A concrete repo path exists, but the overall evidence is not strong enough yet to treat it as a plug-and-play baseline.
Why this page is still worth reading
Benchmark trust
Concrete benchmark findings are present and can be audited against the extracted evidence.
Use this page as
Use this page to start from the best available repo path, but validate benchmark claims separately before treating it as a trusted baseline.
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|
| Model comparisons show that state-of-the-art reasoning-oriented LLMs obtain subs | OJBench | pass@1 | 63.70 | llm-grounded | paper.abstract, evidencePack.paperSections[id=paper_table_2], evidencePack.paperSections[id=paper_caption_4] |
| Natural language processing | o4-mini(low) | pass@1 | 63.70 | llm-grounded | researcherSummary.benchmarkSnapshot[0] |
| Natural language processing | o4-mini | pass@1 | 33.30 | llm-grounded | researcherSummary.benchmarkSnapshot[1] |
| Natural language processing | Gemini-2.5-pro | pass@1 | 65.90 | llm-grounded | researcherSummary.benchmarkSnapshot[2] |
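When auditing the pass@1 values above, it can help to recompute the metric from raw per-problem results. This page does not say how the reported numbers were calculated, so the sketch below assumes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems and expressed as a percentage. The function names and the sample counts are hypothetical and are not taken from the paper or the repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: budget.
    Assumption: this is the estimator being audited; the page does not confirm it.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over (n, c) pairs, reported as a percentage."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)


# Made-up example: four problems, eight samples each.
example_results = [(8, 4), (8, 0), (8, 8), (8, 2)]
example_score = benchmark_pass_at_k(example_results, k=1)
```

For k = 1 the estimator reduces to c / n per problem, so a recomputed score that differs from the table by more than rounding suggests a different sampling setup or a different metric definition.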
he-ren/ojbench is the strongest maintained implementation based on ranking signals. License is declared (AGPL-3.0). Dependency/environment manifests are present.
LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_8], evidencePack.paperSections[id=paper_table_2], evidencePack.paperSections[id=paper_caption_4], researcherSummary.benchmarkSnapshot[0], researcherSummary.benchmarkSnapshot[1], researcherSummary.benchmarkSnapshot[2]
Evidence graph: 3 refs, 3 links.
Utility signals: depth 90/100, grounding 85/100, status high.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
This page is not strong enough for a full AI-written research brief yet, so the summary is reduced to what is evidenced, what is missing, and what to do next.
What is known
What is missing
What to do next
He-Ren/OJBench
Follow the direct implementation path
Start with he-ren/ojbench and validate setup instructions in README.
Reproduce the baseline result with the provided defaults before modifying hyperparameters.
Log exact dependency versions and runtime environment for reproducibility.
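The last step, logging dependency versions and the runtime environment, can be sketched with the Python standard library alone. This is a minimal illustration, not tooling from he-ren/ojbench: the function names and the output file name `environment_snapshot.json` are hypothetical, and it assumes the reproduction runs in a Python environment.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot() -> dict:
    """Collect interpreter, OS, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:
            packages[name] = version
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Sorted case-insensitively so two snapshots diff cleanly.
        "packages": dict(sorted(packages.items(), key=lambda kv: kv[0].lower())),
    }


def write_snapshot(path: str = "environment_snapshot.json") -> dict:
    """Write the snapshot as JSON next to the run outputs (hypothetical path)."""
    snapshot = environment_snapshot()
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(snapshot, handle, indent=2)
    return snapshot
```

Calling `write_snapshot()` once before the baseline run and again after any dependency change gives two files that can be compared directly when a result fails to reproduce.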
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
- Datasets
- Spaces
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
- Tasks: Natural language processing
- Methods: Transformer
- Domains: Natural Language Processing
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Jump to Paper2Code search queries derived from this paper's research context.