What is the best open-source implementation of "Self-Instruct: Aligning Language Models with Self-Generated Instructions"?

The best-maintained implementation is tatsu-lab/stanford_alpaca, with 30,252 stars on GitHub. Confidence: high. Reproducibility: moderate.

What framework is used to implement "Self-Instruct: Aligning Language Models with Self-Generated Instructions"?

The primary implementation uses PyTorch.

Self-Instruct: Aligning Language Models with Self-Generated Instructions

How reproducible is "Self-Instruct: Aligning Language Models with Self-Generated Instructions"?

Estimated time to first reproduction: a few hours. Risk flags: No CI workflows detected. Start with tatsu-lab/stanford_alpaca and validate setup instructions in README.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi

Published: Dec 20, 2022

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 2

Top repo stars: 30,252

Core AI workload signals detected from paper context and implementation/artifact evidence.

Framework: PyTorch

Time to first repro: a few hours

1 risk flag


Large "instruction-tuned" language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying our method to the vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on Super-NaturalInstructions, on par with the performance of InstructGPT-001, which was trained with private user data and human annotations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001. Self-Instruct provides an almost annotation-free method for aligning pre-trained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning. Our code and data are available at https://github.com/yizhongw/self-instruct.
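The "filters invalid or similar ones" step in the abstract can be illustrated with a small sketch. It assumes similarity is measured with a ROUGE-L F-score over whitespace tokens and that a candidate is dropped when it scores 0.7 or higher against any instruction already kept. The threshold, the tokenization, and the function names (`lcs_length`, `rouge_l_f1`, `filter_new_instructions`) are assumptions for illustration, not taken from the abstract or from either repository.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-score on lowercased whitespace tokens (assumed tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def filter_new_instructions(
    pool: List[str], candidates: List[str], threshold: float = 0.7
) -> List[str]:
    """Keep candidates that are non-empty and not too similar to the pool.

    The 0.7 default is an assumed threshold; tune it for your own data.
    """
    kept: List[str] = []
    for cand in candidates:
        text = cand.strip()
        if not text:
            continue  # invalid: empty generation
        # Compare against the seed pool and against candidates kept so far.
        if any(rouge_l_f1(text, seen) >= threshold for seen in pool + kept):
            continue  # too similar to something already in the pool
        kept.append(text)
    return kept


if __name__ == "__main__":
    seed_pool = ["Translate the sentence into French."]
    generated = [
        "Translate the sentence into French.",
        "Translate the sentence into German.",
        "Write a haiku about autumn.",
        "",
    ]
    print(filter_new_instructions(seed_pool, generated))
```

In a full bootstrapping loop the kept instructions would be appended to the pool before the next generation round, so later candidates are also compared against them.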

Technical details

Canonical key: arxiv-2212.10560

Cache status: Fresh

Generated at: May 12, 2026, 6:01 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

LLM status: not_generated

LLM model: n/a

LLM generated: Unknown

LLM content type: n/a

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence


Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

Instruction tuning · T5-LM · ROUGE-L: 25.7 (source: paper fulltext)

Instruction tuning · GPT3 · ROUGE-L: 6.8 (source: paper fulltext)

Instruction tuning · T0 · ROUGE-L: 33.1 (source: paper fulltext)

Benchmark evidence drill-down

3 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Model	Metric	Value	Source	Evidence refs
Instruction tuning	T5-LM	ROUGE-L	25.7	paper-derived	No explicit refs
Instruction tuning	GPT3	ROUGE-L	6.8	paper-derived	No explicit refs
Instruction tuning	T0	ROUGE-L	33.1	paper-derived	No explicit refs


Use This Implementation Because…

Confidence: high

tatsu-lab/stanford_alpaca is the strongest maintained implementation based on ranking signals. License is declared (Apache-2.0). Dependency/environment manifests are present.


Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 90/100, grounding 85/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

tatsu-lab/stanford_alpaca

best maintained

Maintenance: Stale

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 30,252
Last push: Jul 17, 2024 (664d ago)

Dependencies

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases

yizhongw/self-instruct

historical official

Maintenance: Stale

Confidence: High

Reproducibility: Moderate

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 4,600
Last push: Mar 27, 2023 (1142d ago)

Dependencies

Risk flags

No push in 12+ months
No CI pipeline detected
No tagged releases

camel-ai/camel

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Strong

Community adoption signal (16919 stars)

Stars: 16,919
Last push: May 11, 2026 (1d ago)

CI · Releases · Dependencies

Risk flags

No Docker setup
Low confidence match

Best implementation now

tatsu-lab/stanford_alpaca

Confidence: High

Reproducibility: Moderate

Code and documentation to train Stanford's Alpaca models, and generate the data.

Stars: 30,252

Forks: 3,995

Last push: Jul 17, 2024

License: Apache-2.0

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Matched via arXiv identifier search

Community adoption signal (30252 stars)

License ✓

CI –

Deps ✓

Docker –

Selected tatsu-lab/stanford_alpaca as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.
Official repository is preserved separately as historical context.

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

yizhongw/self-instruct

Stars: 4,600

Last push: Mar 27, 2023

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: May 12, 2026

Dependencies pinned, manual setup needed

· tatsu-lab/stanford_alpaca has requirements.txt but requires manual environment setup.
· Last push was 664 days ago — expect possible dependency version conflicts.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.
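Because the environment is set up by hand and the repository has not been pushed to recently, it can be useful to compare what requirements.txt asks for with what is actually installed before training. The following is a minimal sketch using only the standard library. It handles simple `name<spec>` lines only, the helper names (`parse_requirement`, `installed_version`, `report`) are hypothetical, and the sample lines in the usage block are illustrative rather than the repository's actual requirements.

```python
import re
from importlib import metadata
from typing import List, Optional, Tuple


def parse_requirement(line: str) -> Optional[Tuple[str, str]]:
    """Split a simple requirements.txt line into (package, version spec).

    Returns None for blank lines, comments, and pip options such as "-r".
    """
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("-"):
        return None
    match = re.match(r"^([A-Za-z0-9_.\-]+)(?:\[[^\]]*\])?\s*(.*)$", line)
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def installed_version(package: str) -> Optional[str]:
    """Version of an installed distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(requirements_text: str) -> List[Tuple[str, str, str]]:
    """List (package, requested spec, installed version) for each line."""
    rows = []
    for line in requirements_text.splitlines():
        parsed = parse_requirement(line)
        if parsed is None:
            continue
        name, spec = parsed
        rows.append(
            (name, spec or "(unpinned)", installed_version(name) or "not installed")
        )
    return rows


if __name__ == "__main__":
    # Illustrative lines only; read the real file with open("requirements.txt").
    sample = "numpy\ntorch>=1.13  # example pin\n"
    for name, spec, found in report(sample):
        print(f"{name:20} wants {spec:12} found {found}")
```

The sketch only reports versions side by side. It does not evaluate whether an installed version satisfies a specifier, which is left to pip itself.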


Quick start

git clone https://github.com/tatsu-lab/stanford_alpaca.git
cd stanford_alpaca
pip install -r requirements.txt
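Once the environment installs, a quick sanity check is to see how a single instruction record would be turned into a training prompt. The sketch below assumes records with `instruction`, `input`, and `output` fields and an Alpaca-style prompt template. The field names and the template wording are assumptions to verify against the repository, and `build_prompt` is a hypothetical helper, not a function from the repository.

```python
from typing import Dict

# Assumed Alpaca-style templates; confirm the exact wording in the repository.
PROMPT_WITH_INPUT = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:"
)
PROMPT_NO_INPUT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_prompt(record: Dict[str, str]) -> str:
    """Format one record into a prompt, omitting the input section if empty."""
    instruction = record["instruction"].strip()
    extra_input = record.get("input", "").strip()
    if extra_input:
        return PROMPT_WITH_INPUT.format(instruction=instruction, input=extra_input)
    return PROMPT_NO_INPUT.format(instruction=instruction)


if __name__ == "__main__":
    example = {"instruction": "Add the numbers.", "input": "2 and 3", "output": "5"}
    print(build_prompt(example))
    print(example["output"])  # the target the model is finetuned to produce
```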

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (8)

These repositories had low-confidence matching signals and are hidden by default.

Showing top 6 by score. 2 additional low-confidence matches are hidden.

camel-ai/camel · Confidence: Low · Stars: 16,919

facico/chinese-vicuna · Confidence: Low · Stars: 4,126

beomi/koalpaca · Confidence: Low · Stars: 1,578

databrickslabs/dolly · Confidence: Low · Stars: 10,792

daniel-furman/sft-demos · Confidence: Low · Stars: 78

fsoft-ai4code/codecapybara · Confidence: Low · Stars: 172

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2212.10560 Self-Instruct Self-Generated

Datasets

arxiv:2212.10560 Self-Instruct dataset Instruction tuning dataset

Spaces

arxiv:2212.10560 Self-Instruct demo Instruction tuning demo

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.


Research context

Tasks

Instruction tuning

Methods

Transformer

Domains

Natural Language Processing, Large Language Models

Evaluation & Human Feedback Data

Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.


Explore Similar Papers

Jump to Paper2Code search queries derived from this paper's research context.

Instruction tuning Transformer Natural Language Processing Large Language Models
