Maintained implementation available · jax · Pretrained models available

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai +7 more

October 22, 2020 · arXiv: 2010.11929

1 repo · 12,409 stars · ~a few days to reproduce

Abstract

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure tra...
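
To make the "16x16 words" in the title concrete, the following is a minimal sketch of the arithmetic behind treating an image as a sequence of patches. It assumes non-overlapping square patches and a 224x224 RGB input; the input size and the helper name `patch_sequence_shape` are illustrative assumptions, not taken from the text above.

```python
def patch_sequence_shape(height, width, channels=3, patch=16):
    """Return (num_patches, patch_dim) for non-overlapping square patches.

    num_patches is the sequence length the Transformer would see;
    patch_dim is the length of each flattened patch vector.
    """
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    num_patches = (height // patch) * (width // patch)
    patch_dim = patch * patch * channels
    return num_patches, patch_dim


# Assumed example input: a 224x224 RGB image with 16x16 patches.
print(patch_sequence_shape(224, 224))  # (196, 768)
```

Under these assumptions, each image becomes 196 tokens of 768 values each, which is what lets a standard Transformer consume it like a sentence.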

Results & Benchmarks

Benchmark data is not yet available for this paper.

Hardware Requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

google-research/vision_transformer

12.4k 1.5k Mar 2026 Apache-2.0

License ✓

CI ✓

Deps –

Docker –

Selected google-research/vision_transformer as the strongest maintained implementation for new work.
Includes CI workflow signals.
Repository activity is within the last 24 months.

Reproduction Path

1. Start with google-research/vision_transformer and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
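
The logging step can be sketched with the Python standard library alone. This is one possible approach, not something the repository prescribes: the helper name `snapshot_environment` and the default package list are assumptions chosen for illustration, and packages that are not installed are recorded as `None` rather than raising.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages=("jax", "jaxlib", "flax", "numpy")):
    """Collect interpreter, OS, and package versions as a plain dict.

    The default package names are an assumed, illustrative list;
    pass whatever your run actually depends on.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Emit JSON so the snapshot can be stored next to the run's results.
    print(json.dumps(snapshot_environment(), indent=2, sort_keys=True))
```

Saving this output alongside each run gives a record to diff against when a later reproduction attempt behaves differently.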

Time to first repro: a few days · Dependency manifest is missing

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Curated Related

Falconsai/nsfw_image_detection
40.0M 1.0k

Research Context