TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Q: What is the best open-source implementation of "TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents"?

The best maintained implementation is kokolerk/TCOD with 21 stars on GitHub. Confidence: medium. Reproducibility: Strong.

Q: How reproducible is "TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents"?

Estimated time to first reproduction: a few hours. No blocking risk flags were identified, though the repository has no tagged releases and no Docker setup. Start with kokolerk/TCOD and validate the setup instructions in its README.

Q: Are there pretrained models available for "TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents"?

No paper-linked pretrained model was found. One curated related Hugging Face model surfaced: StanShareAI/qwen-stanshare-ft-curriculum-v1, with 239 downloads. Its connection to TCOD is not verified.

Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng

Published: Apr 27, 2026

Best maintained implementation now

Evidence: Direct

Domain fit: AI-adjacent

Verified repos: 1

Top repo stars: 21

Time to first repro: a few hours

No risk flags

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
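
The abstract describes a curriculum that limits the trajectory depth exposed to the student and widens it from short to long during training. A minimal sketch of that idea follows. It is not taken from the paper or from kokolerk/TCOD: the linear schedule, the `DepthCurriculum` class, its parameter names, and the choice to keep the first turns of a rollout are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthCurriculum:
    """Hypothetical linear short-to-long schedule (an assumption, not the paper's)."""

    min_turns: int     # depth exposed at the start of training (assumed value)
    max_turns: int     # full episode horizon (assumed value)
    warmup_steps: int  # training steps over which the depth grows (assumed value)

    def depth_at(self, step: int) -> int:
        """Return the maximum number of turns exposed to the student at `step`."""
        if self.warmup_steps <= 0 or step >= self.warmup_steps:
            return self.max_turns
        frac = max(step, 0) / self.warmup_steps
        return self.min_turns + int(frac * (self.max_turns - self.min_turns))


def truncate_trajectory(turns: list, depth: int) -> list:
    """Keep only the first `depth` turns of an on-policy rollout."""
    return turns[:depth]


curriculum = DepthCurriculum(min_turns=2, max_turns=10, warmup_steps=100)
rollout = [f"turn-{i}" for i in range(10)]  # placeholder turns, not real data

for step in (0, 50, 100):
    exposed = truncate_trajectory(rollout, curriculum.depth_at(step))
    print(step, len(exposed))  # prints "0 2", then "50 6", then "100 10"
```

In an actual training loop, the distillation loss would be computed only on the exposed turns. How the paper selects or schedules those turns may differ from this sketch.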

Technical details

Canonical key: arxiv-2604.24005

Cache status: Fresh

Generated at: Apr 30, 2026, 2:50 PM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: No

LLM status: ready

LLM model: openai/gpt-5.1-20251113

LLM generated: Apr 30, 2026, 5:25 AM

LLM content type: researcher_benchmark_brief

HF policy: hf-relevance-v27

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_table_1], evidencePack.paperSections[id=paper_caption_2], evidencePack.paperSections[id=paper_caption_3], evidencePack.paperSections[id=paper_caption_4], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_caption_10], evidencePack.paperSections[id=paper_table_5], researcherSummary.implementationRecommendation, guidance.riskFlags[0], guidance.riskFlags[1], researcherSummary.hardwareNotes[0], researcherSummary.timeToFirstMeaningfulRun, paper.title, summary.hasReliableImplementation

implementation starting point

Benchmarks: thin evidence

Time to repro: a few hours

Results & Benchmarks

Direct + Inferred Evidence

Multi-turn agent success rate improvement over vanilla OPD

ALFWorld, WebShop, ScienceWorld

Success rate gain

up to 18 points

Source: llm grounded

Benchmark evidence drill-down

1 finding

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset	Metric	Value	Source	Evidence refs
Multi-turn agent success rate improvement over vanilla OPD	ALFWorld, WebShop, ScienceWorld	Success rate gain	up to 18 points	llm-grounded	No explicit refs

Use This Implementation Because…

Confidence: medium

kokolerk/TCOD is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. CI workflows are present. License is declared (Apache-2.0).

Reproduction Risks

No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Evidence disclosure

Evidence graph: 4 refs, 4 links.

Utility signals: depth 55/100, grounding 85/100, status medium.

Implementation Comparison

Top 2 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

kokolerk/TCOD

best maintained

Maintenance: Active

Confidence: Medium

Reproducibility: Strong

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 21
Last push: Apr 29, 2026

CI · Dependencies

Risk flags

No tagged releases
No Docker setup

Valiant-Cat/hfpaper

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 0
Last push: Apr 29, 2026

Risk flags

No tagged releases
No Docker setup
Dependency manifest missing

Best implementation now

kokolerk/TCOD

Confidence: Medium

Reproducibility: Strong

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Stars: 21

Forks: 1

Last push: Apr 29, 2026

License: Apache-2.0

Matched via arXiv identifier search

Strong overlap with paper title keywords

License ✓

CI ✓

Deps ✓

Docker –

Selected kokolerk/TCOD as the strongest maintained implementation for new work.
Includes CI workflow signals.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction readiness

Ready to Run

Time to first repro: a few hours

Last checked: Apr 30, 2026

Ready to reproduce

· Clone kokolerk/TCOD and install dependencies from pyproject.toml.
· CI pipeline detected; automated checks appear to be configured.
· Last pushed Apr 29, 2026.

Quick start

git clone https://github.com/kokolerk/TCOD.git
cd TCOD
pip install -e .

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (1)

These repositories had low-confidence matching signals and are hidden by default.

Valiant-Cat/hfpaper

Confidence: Low

Stars: 0

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

StanShareAI/qwen-stanshare-ft-curriculum-v1

Curated Related

Downloads: 239

Likes: 1

Datasets

ServiceNow-AI/Curriculum_DPO_preferences

Curated Related

Downloads: 36

Likes: 5

Updated: Nov 13, 2024

Spaces

hra/Curriculum-BabyAGI

Curated Related

Likes: 5

Research context

Tasks

Agentic tool use

Methods

Agentic systems

Domains

AI Agents
