OpenTrain AI
Maintained implementation available | PyTorch | Pretrained models available

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit +1 more

June 18, 2021 | arXiv: 2106.10270
3 repos | 36,616 stars | ~a few hours to reproduce
Abstract

Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when...

Results & Benchmarks

Task | Dataset | Metric | Value
Image classification | CIFAR-100 | Accuracy | 100

Best Implementation

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more

Stars: 36.6k | Forks: 5.1k | Last push: Apr 2026 | License: Apache-2.0
Signals: License, CI, Deps, Docker
  • Selected rwightman/pytorch-image-models as the strongest maintained implementation for new work.
  • Includes CI workflow signals.
  • Includes dependency/environment manifest signals.
  • Repository activity is within the last 24 months.

Reproduction Path

  1. Start with rwightman/pytorch-image-models and validate the setup instructions in its README.

  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.

  3. Log exact dependency versions and the runtime environment for reproducibility.
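
The logging step above can be sketched with the Python standard library alone. This is a minimal illustration, not part of either repository: the helper name capture_environment and the package list (torch, torchvision, timm, numpy) are assumptions, so substitute whatever the run actually imports.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Return a JSON-serializable snapshot of interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; adjust to match the actual training setup.
    snapshot = capture_environment(["torch", "torchvision", "timm", "numpy"])
    print(json.dumps(snapshot, indent=2, sort_keys=True))
```

Saving this output next to each run's results makes it easier to trace a later accuracy difference back to a changed library version rather than a changed hyperparameter.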

Time to first repro: a few hours

No repository-level red flags were detected, but paper-specific preprocessing and hyperparameter details may still be under-specified.

Additional Implementations

Official

  • Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.

    Stars: 3.4k | Forks: 220 | Last push: May 2025 | License: Apache-2.0

Community

No additional community repositories detected yet.

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Research Context