What is the best open-source implementation of "Transformer in Transformer"?

The best maintained implementation is huawei-noah/CV-backbones with 4,416 stars on GitHub. Confidence: high. Reproducibility: Limited.

How reproducible is "Transformer in Transformer"?

Estimated time to first reproduction: a few days. Risk flags: License metadata missing, No CI workflows detected, Dependency manifest is missing. Start with huawei-noah/CV-backbones and validate setup instructions in README.

What framework is used to implement "Transformer in Transformer"?

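The primary implementation uses PyTorch: the paper's abstract points to huawei-noah/CV-Backbones for the PyTorch code and to a separate MindSpore port.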

Transformer in Transformer

Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, Yunhe Wang

Published: Feb 27, 2021

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 2

Top repo stars: 4,416

Core AI workload signals detected from paper context and implementation/artifact evidence.

Framework: PyTorch

Time to first repro: a few days

3 risk flags

arXiv PDF

Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches is also essential for building visual transformers with high performance, and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16×16) as "visual sentences" and propose to further divide them into smaller patches (e.g., 4×4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an 81.5% top-1 accuracy on the ImageNet, which is about 1.7% higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones, and the MindSpore code is available at https://gitee.com/mindspore/models/tree/master/research/cv/TNT.
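As a rough illustration of the sentence/word split described in the abstract, the Python sketch below counts how many "visual sentences" and "visual words" an image yields and splits one patch into its words. Only the 16×16 and 4×4 sizes come from the abstract; the 224×224 RGB input size and the function names `tnt_patch_counts` and `split_into_words` are assumptions made for illustration, and this is not the authors' code.

```python
# Sketch of the patch bookkeeping described in the abstract.
# ASSUMPTION: 224x224 RGB input; only the 16x16 / 4x4 sizes are from the abstract.

def tnt_patch_counts(image_size=224, sentence_size=16, word_size=4, channels=3):
    """Count visual sentences, words per sentence, and raw values per word."""
    if image_size % sentence_size or sentence_size % word_size:
        raise ValueError("sizes must divide evenly")
    sentences_per_side = image_size // sentence_size
    words_per_side = sentence_size // word_size
    return {
        "sentences": sentences_per_side ** 2,
        "words_per_sentence": words_per_side ** 2,
        "values_per_word": word_size * word_size * channels,
    }


def split_into_words(patch, word_size=4):
    """Split a square patch (a list of rows) into word_size x word_size blocks.

    Blocks are returned in row-major order (left to right, then top to bottom).
    """
    n = len(patch)
    words = []
    for top in range(0, n, word_size):
        for left in range(0, n, word_size):
            block = [row[left:left + word_size] for row in patch[top:top + word_size]]
            words.append(block)
    return words


counts = tnt_patch_counts()
print(counts)

# A toy single-channel 16x16 "visual sentence" whose pixels are numbered 0..255.
sentence = [[r * 16 + c for c in range(16)] for r in range(16)]
words = split_into_words(sentence)
print(len(words), words[0][0])
```

Under these assumptions each image gives 196 sentences of 16 words each, which is why attention restricted to the words inside one sentence stays cheap: every inner attention call sees only 16 tokens.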

Technical details

Canonical key: arxiv-2103.00112

Cache status: Fresh

Generated at: Jun 18, 2026, 7:53 AM

Artifact coverage: direct

HF provider: ok (token)

PWC source used: Yes

HF policy: hf-relevance-v27

implementation starting point

Benchmarks: thin evidence


Results & Benchmarks

Freshness tier: cold

Direct + Inferred Evidence

Image classification

ImageNet

Top-1 Accuracy

81.5

Source: paper fulltext

Language modeling

ImageNet

54.7

Source: paper fulltext

Object detection

COCO

58.1

Source: paper fulltext

Benchmark evidence drill-down

3 findings

Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.

Task	Dataset	Metric	Value	Source	Evidence refs
Image classification	ImageNet	Top-1 Accuracy	81.5	paper-derived	No explicit refs
Language modeling	ImageNet	AP	54.7	paper-derived	No explicit refs
Object detection	COCO	AP	58.1	paper-derived	No explicit refs


Use This Implementation Because…

Confidence: high

huawei-noah/CV-backbones is the strongest maintained implementation based on ranking signals.


Reproduction Risks

License metadata missing
No CI workflows detected
Dependency manifest is missing

Hardware Notes

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Evidence disclosure

Evidence graph: 3 refs, 3 links.

Utility signals: depth 100/100, grounding 85/100, status high.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

huawei-noah/CV-backbones

best maintained

Maintenance: Stale

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Repository link is mentioned in the paper metadata

Stars: 4,416
Last push: Mar 15, 2025 (460d ago)

Releases

Risk flags

No push in 12+ months
No CI pipeline detected
No Docker setup

huawei-noah/CV-Backbones

historical official

Maintenance: Stale

Confidence: High

Reproducibility: Limited

Official implementation from Papers with Code · Community adoption signal (4416 stars)

Stars: 4,416
Last push: Mar 15, 2025 (460d ago)

Releases

Risk flags

No push in 12+ months
No CI pipeline detected
No Docker setup

PaddlePaddle/PaddleClas

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Strong

Community adoption signal (5817 stars)

Stars: 5,817
Last push: Jun 16, 2026 (2d ago)

CI · Releases · Dependencies

Risk flags

No Docker setup
Low confidence match

Best implementation now

huawei-noah/CV-backbones

Confidence: High

Reproducibility: Limited

Efficient AI Backbones including GhostNet, TNT and MLP, developed by Huawei Noah's Ark Lab.

Stars: 4,416

Forks: 736

Last push: Mar 15, 2025

Official implementation from Papers with Code

Repository link is mentioned in the paper metadata

Community adoption signal (4416 stars)

License –

CI –

Deps –

Docker –

Selected huawei-noah/CV-backbones as the strongest maintained implementation for new work.
Repository activity is within the last 24 months.
Official repository is preserved separately as historical context.
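The staleness figures quoted on this page can be checked with plain date arithmetic. The sketch below uses the last-push date (Mar 15, 2025) and the check date (Jun 18, 2026) shown on this page; treating 12 and 24 months as 365 and 730 days is an assumption, and `days_since` is a hypothetical helper name.

```python
from datetime import date


def days_since(last_push, checked):
    """Whole days between the last push and the date the page was checked."""
    return (checked - last_push).days


last_push = date(2025, 3, 15)   # "Last push" shown on this page
checked = date(2026, 6, 18)     # "Last checked" shown on this page

age = days_since(last_push, checked)
# ASSUMPTION: 12 months ~ 365 days, 24 months ~ 730 days.
stale_12m = age > 365
within_24m = age <= 730

print(age, stale_12m, within_24m)
```

Both statements on the page hold at once: the repository has had no push in more than 12 months, yet its last activity still falls inside the 24-month window.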

Historical official implementation

Preserved for provenance. Not recommended as the default path for new builds.

huawei-noah/CV-Backbones

Stars: 4,416

Last push: Mar 15, 2025

Reproduction readiness

Major Work

Time to first repro: days

Last checked: Jun 18, 2026

Hardware requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

No dependency manifest — manual reconstruction required

· huawei-noah/CV-backbones has no requirements.txt, environment.yml, pyproject.toml, or Dockerfile.
· You will need to reverse-engineer dependencies from import statements in the source code.
· Last push was 460 days ago.
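One way to start reconstructing the missing dependency list is to scan the source tree for import statements. The sketch below uses only the Python standard library; the function names `top_level_imports` and `third_party_candidates` are hypothetical, and the result is only a list of candidates, since import names do not always match package names on PyPI and the repository's own modules will also appear.

```python
import ast
import sys
from pathlib import Path


def top_level_imports(source):
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 marks a relative import (from . import x); skip those.
            if node.level == 0 and node.module:
                modules.add(node.module.split(".")[0])
    return modules


def third_party_candidates(repo_dir):
    """Collect imported module names under repo_dir that are not in the stdlib."""
    found = set()
    for path in Path(repo_dir).rglob("*.py"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
            found |= top_level_imports(text)
        except (SyntaxError, ValueError):
            continue  # skip files that do not parse as Python 3
    # sys.stdlib_module_names exists on Python 3.10+; older versions filter nothing.
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(found - set(stdlib))


example = "import os\nimport torch.nn as nn\nfrom timm.models import x\nfrom . import y\n"
print(sorted(top_level_imports(example)))
```

Calling `third_party_candidates` on a local clone gives a starting list to pin by hand; versions still have to be inferred from the repository's age and README.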


Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (6)

These repositories had low-confidence matching signals and are hidden by default.

Repository	Confidence	Stars
PaddlePaddle/PaddleClas	Low	5,817
open-mmlab/mmclassification	Low	3,843
lucidrains/transformer-in-transformer	Low	306
Rishit-dagli/Transformer-in-Transformer	Low	43
NZ99/transformer_in_transformer_flax	Low	21
mindspore-ai/models	Low	365

Hugging Face artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches derived from the paper title and method context:

Models

arxiv:2103.00112 Transformer in Transformer Transformer

Datasets

arxiv:2103.00112 Image classification dataset Transformer benchmark

Spaces

arxiv:2103.00112 Image classification demo Transformer gradio

Tip: start with models, then check datasets/spaces if you need evaluation data or demos.

Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.


Research context

Tasks

Image classification

Methods

Transformer

Domains

Computer vision

