Matched via arXiv identifier search · Strong overlap with paper title keywords
- Stars: 21
- Last push: Apr 29, 2026 (2d ago)
Risk flags
- No tagged releases
- No Docker setup
Jiaqi Wang, Wenhao Zhang, Weijie Shi, Yaliang Li, James Cheng
On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students. While effective on static single-turn tasks, its behavior in multi-turn agent settings remains underexplored. In this work, we identify a key limitation of vanilla OPD in such settings, which we term Trajectory-Level KL Instability. Specifically, we observe that KL divergence increases together with a drop in success rate, and even after convergence, the KL remains high, leading to unstable training. This instability arises from inter-turn error compounding: as errors accumulate, the student is driven beyond the teacher's effective support, rendering the supervision signal unreliable. To address this, we propose TCOD (Temporal Curriculum On-Policy Distillation), a simple yet effective framework that controls the trajectory depth exposed to the student and progressively expands it from short to long with a curriculum schedule. Experimental results across four student-teacher pairs on three multi-turn agent benchmarks (ALFWorld, WebShop, ScienceWorld) show that TCOD mitigates KL escalation and enhances KL stability throughout training, improving agent performance by up to 18 points over vanilla OPD. Further evaluations show that TCOD can even surpass the teacher's performance and generalize to tasks on which the teacher fails.
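The curriculum idea in the abstract can be illustrated with a minimal Python sketch. It is not taken from the kokolerk/TCOD repository. The abstract states only that trajectory depth is expanded progressively from short to long; the linear schedule, the KL direction (student relative to teacher), the function names (`max_turns_at_step`, `truncate_trajectory`, `token_kl`), and every parameter value here are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def max_turns_at_step(step: int, total_steps: int,
                      start_turns: int, final_turns: int) -> int:
    """Hypothetical linear curriculum over the turn budget.

    Returns how many turns of a rollout the student is exposed to at a
    given training step, growing from start_turns to final_turns.
    The paper's actual schedule may differ.
    """
    if total_steps <= 1:
        return final_turns
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start_turns + round(frac * (final_turns - start_turns))


def truncate_trajectory(turns: Sequence, budget: int) -> List:
    """Keep only the first `budget` turns of a multi-turn rollout."""
    return list(turns[:budget])


def token_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher) for a single next-token distribution.

    Assumes both inputs are probability vectors over the same vocabulary
    and that the teacher assigns non-zero mass wherever the student does.
    """
    return sum(p * math.log(p / q)
               for p, q in zip(student, teacher) if p > 0.0)


if __name__ == "__main__":
    # Illustrative values only: 101 training steps, budget from 2 to 30 turns.
    for step in (0, 25, 50, 75, 100):
        print(step, max_turns_at_step(step, 101, 2, 30))
```

Under these assumptions, a training loop would call `max_turns_at_step` once per update, truncate each student rollout to that budget, and average `token_kl` over the retained tokens, so early updates only see shallow trajectories where the student is less likely to have drifted from states the teacher handles well.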
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|
| Multi-turn agent success rate improvement over vanilla OPD | ALFWorld, WebShop, ScienceWorld | Success rate gain | up to 18 points | llm-grounded | No explicit refs |
kokolerk/TCOD is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. CI workflows are present. License is declared (Apache-2.0).
LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_table_1], evidencePack.paperSections[id=paper_caption_2], evidencePack.paperSections[id=paper_caption_3], evidencePack.paperSections[id=paper_caption_4], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_caption_10], evidencePack.paperSections[id=paper_table_5], researcherSummary.implementationRecommendation, guidance.riskFlags[0], guidance.riskFlags[1], researcherSummary.hardwareNotes[0], researcherSummary.timeToFirstMeaningfulRun, paper.title, summary.hasReliableImplementation
Evidence graph: 4 refs, 4 links.
Utility signals: depth 55/100, grounding 85/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents
Ready to reproduce
Quick start
git clone https://github.com/kokolerk/TCOD.git
cd TCOD
pip install -e .
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.
- Tasks: Agentic tool use
- Methods: Agentic systems
- Domains: AI Agents