implementation starting point
Benchmarks: thin evidence
Time to repro: a few hours

Results & Benchmarks

Direct + Inferred Evidence
Task: Multi-turn agent success rate improvement over vanilla OPD
Datasets: ALFWorld, WebShop, ScienceWorld
Metric: Success rate gain
Value: up to 18 points
Source: LLM-grounded

Benchmark evidence drill-down

1 finding

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

| Task | Dataset | Metric | Value | Source | Evidence refs |
|------|---------|--------|-------|--------|---------------|
| Multi-turn agent success rate improvement over vanilla OPD | ALFWorld, WebShop, ScienceWorld | Success rate gain | up to 18 points | LLM-grounded | No explicit refs |
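
The "up to 18 points" figure reads most naturally as a difference in percentage points over the vanilla OPD baseline rather than a relative percentage increase. The page does not say so explicitly, so treat that reading as an assumption. The sketch below shows how the two interpretations differ; the function names and the success rates are hypothetical and are not results from the paper.

```python
def gain_in_points(baseline_pct: float, improved_pct: float) -> float:
    """Absolute gain, in percentage points."""
    return improved_pct - baseline_pct


def relative_gain_pct(baseline_pct: float, improved_pct: float) -> float:
    """Relative gain, as a percentage of the baseline."""
    if baseline_pct == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return 100.0 * (improved_pct - baseline_pct) / baseline_pct


# Hypothetical numbers for illustration only; NOT results from the paper.
baseline = 50.0  # success rate (%) of a vanilla OPD student
improved = 68.0  # success rate (%) of the improved student

points = gain_in_points(baseline, improved)
relative = relative_gain_pct(baseline, improved)
print(f"{points:.1f} points, {relative:.1f}% relative")
```

When auditing the finding against the paper's tables, check which of the two quantities is reported and on which of the three benchmarks the maximum occurs.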

On-policy distillation (OPD) has shown strong potential for transferring reasoning ability from frontier or domain-specific models to smaller students.

Use This Implementation Because…

Confidence: medium

kokolerk/TCOD is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. CI workflows are present. License is declared (Apache-2.0).

Reproduction Risks

  • No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Evidence disclosure

LLM evidence refs: paper.abstract, evidencePack.paperSections[id=paper_table_1], evidencePack.paperSections[id=paper_caption_2], evidencePack.paperSections[id=paper_caption_3], evidencePack.paperSections[id=paper_caption_4], evidencePack.paperSections[id=paper_14], evidencePack.paperSections[id=paper_caption_10], evidencePack.paperSections[id=paper_table_5], researcherSummary.implementationRecommendation, guidance.riskFlags[0], guidance.riskFlags[1], researcherSummary.hardwareNotes[0], researcherSummary.timeToFirstMeaningfulRun, paper.title, summary.hasReliableImplementation

Evidence graph: 4 refs, 4 links.

Utility signals: depth 55/100, grounding 85/100, status medium.

Implementation Comparison

Top 2 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

kokolerk/TCOD
best maintained
Maintenance: Active
Confidence: Medium
Reproducibility: Strong

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 21
Last push: Apr 29, 2026
Signals: CI, Dependencies

Risk flags

  • No tagged releases
  • No Docker setup

Valiant-Cat/hfpaper
alternative
Maintenance: Active
Confidence: Low
Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 0
Last push: Apr 29, 2026
Signals: CI

Risk flags

  • No tagged releases
  • No Docker setup
  • Dependency manifest missing

Best implementation now

kokolerk/TCOD
Confidence: Medium
Reproducibility: Strong

TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents

Stars: 21
Forks: 1
Last push: Apr 29, 2026
License: Apache-2.0
Matched via arXiv identifier search
Strong overlap with paper title keywords
License ✓
CI ✓
Deps ✓
Docker –
  • Selected kokolerk/TCOD as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Includes dependency/environment manifest signals.
  • Repository activity is within the last 24 months.

Reproduction readiness

Ready to Run
Time to first repro: hours
Last checked: Apr 30, 2026

Ready to reproduce

  • Clone kokolerk/TCOD and install dependencies from pyproject.toml.
  • CI workflows detected, which suggests automated checks are in place.
  • Last updated Apr 29, 2026.

Quick start

git clone https://github.com/kokolerk/TCOD.git
cd TCOD
pip install -e .

Additional implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face artifacts

No direct paper-linked artifacts were found.

Research context

Tasks: Agentic tool use
Methods: Agentic systems
Domains: AI Agents

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
