Official implementation from Papers with Code · Repository link is mentioned in the paper metadata
- Stars: 39
- Last push: Feb 9, 2026 (71d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Yongchao Chen, Yueying Liu, Junwei Zhou, Yilun Hao, Jingquan Wang, Yang Zhang, Na Li, Chuchu Fan
Core AI workload signals detected from paper context and implementation/artifact evidence.
Practical guidance on training Large Language Models (LLMs) to leverage Code Interpreter across diverse tasks remains lacking. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. Unlike prior RL + tool-use efforts focused on narrow domains such as math or retrieval, we curate 144 diverse reasoning and planning tasks and show that training a general-purpose Code Interpreter across them presents significant challenges due to task heterogeneity and scarcity of effective samples. To address this, we introduce a multi-stage curriculum learning approach that partitions training samples by measured improvement potential. The RL training prioritizes samples with higher potential and gradually shifts to lower-potential ones, increasing the average RL gains from merely +3.4% to +9.3% across Qwen-2.5 models (3/7/14B). Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%). Notably, R1-CI-14B also exhibits emergent self-checking behavior through code generation. Datasets, Codes, and Models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
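The curriculum idea in the abstract (partition samples by improvement potential, train on high-potential samples first) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` class, the `potential` field, and `build_curriculum` are hypothetical names, and the abstract does not say how improvement potential is measured, so the score is assumed to be precomputed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    sample_id: str
    # Hypothetical precomputed score; the abstract does not define how
    # "improvement potential" is measured.
    potential: float


def build_curriculum(samples: List[Sample], num_stages: int) -> List[List[Sample]]:
    """Split samples into at most `num_stages` stages, highest potential first."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if not samples:
        return []
    # Sort so that the earliest stage holds the highest-potential samples.
    ordered = sorted(samples, key=lambda s: s.potential, reverse=True)
    # Ceiling division keeps stages roughly equal in size.
    stage_size = -(-len(ordered) // num_stages)
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


if __name__ == "__main__":
    demo = [Sample("a", 0.1), Sample("b", 0.9), Sample("c", 0.5), Sample("d", 0.3)]
    for index, stage in enumerate(build_curriculum(demo, num_stages=2)):
        print(index, [s.sample_id for s in stage])
```

An RL loop would then iterate over the stages in order, so that training starts on the samples assumed to offer the most room for improvement and shifts to lower-potential ones later.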
Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.
yongchao98/r1-code-interpreter is the strongest maintained implementation based on ranking signals.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence graph: 3 refs, 3 links.
Utility signals: depth 85/100, grounding 75/100, status high.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Strong overlap with paper title keywords
R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Hardware requirements
No dependency manifest — manual reconstruction required
No additional official repositories detected.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
No trustworthy model matches right now.
No trustworthy demo spaces right now.
Tasks
Retrieval / indexing
Methods
Reinforcement learning
Domains
Natural Language Processing, Large Language Models, Information Retrieval