MetaFormer is Actually What You Need for Vision
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
Transformers have shown great potential in computer vision tasks. A common belief is their attention-based token mixer module contributes most to their competence. However, recent works show the attention-based module in transformers can be replaced by spatial MLPs and the resulted models still perform quite well. Based on this observation, we hypothesize that the general architecture of the transformers, instead of ...
the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed as PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1 % top-1 accuracy, surpassing well-tuned vision transformer/MLP-like baselines DeiT-B/ResMLP-B24 by 0.3%/1.1% accuracy with 35%/52% fewer parameters and 49%/61% fewer MACs. The effectiveness of Pool-Former verifies our hypothesis and urges us to initiate the concept of “MetaFormer”, a general architecture abstracted from transformers without specifying the token mixer. Based on the extensive experiments, we argue that MetaFormer is the key player in achieving superior results for recent transformer and MLP-like models on vision tasks. This work calls for more future research dedicated to improving MetaFormer instead of focusing on the token mixer modules. Additionally, our proposed PoolFormer could serve as a starting baseline for future MetaFormer architecture design.
Results & Benchmarks
Some benchmark signal exists in the extracted evidence, but it is not structured strongly enough yet for a confident benchmark decision.
Transformers have shown great potential in computer vision tasks.
Implementation Evidence Summary
sail-sg/poolformer is the closest maintained adjacent implementation (Strong overlap with paper title keywords). It is not paper-verified; validate algorithm and evaluation setup against the paper before trusting reported metrics. Community adoption signal: 1363 GitHub stars.
Reproduction Risks
- Adjacent implementations are not paper-verified
- Recommended repository is adjacent and not paper-verified.
- Adjacent implementation match confidence is low.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence disclosure
Evidence graph: 3 refs, 3 links.
Utility signals: depth 70/100, grounding 75/100, status medium.
Implementation Status
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
- No maintained paper-verified implementation was found; start with the closest related repositories below.
- Compare repo methods against the paper equations/algorithm before trusting metrics.
- Create a minimal baseline implementation from the paper and use adjacent repos as references.
Reproduction readiness
Hardware requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
No verified implementation available
- · No maintained repository has been identified for this paper. Check adjacent implementations or HF artifacts below.
Framework baselines
- Hugging Face Transformers training guide
Modern transformer training baseline.
- PyTorch nn.Transformer docs
Reference transformer building block implementation.
Closest related implementations
These are not paper-verified. Use them as reference points when no direct implementation is available.
- sail-sg/poolformerAdjacentConfidence: LowStars: 1,363
Strong overlap with paper title keywords
Hugging Face artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
Research context
1,132
Citations
94
References
Tasks
Computer science, Neuroscience, Cognitive Neuroscience, Life Sciences
Methods
Transformer
Domains
Computer vision, Artificial intelligence
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.
Open in HFEPXExplore Similar Papers
Jump to Paper2Code search queries derived from this paper's research context.
Related papers
-
Search on Paper2Code
An Object Detection and Pose Estimation Approach for Position Based Visual Servoing (2017) Semantic similarity
-
Search on Paper2Code
Self-monitoring to improve robustness of 3D object tracking for robotics (2011) Semantic similarity
-
Search on Paper2Code
Foreground object segmentation from binocular stereo video (2005) Semantic similarity
-
Search on Paper2Code
6-DOF object localization by combining monocular vision and robot arm kinematics (2017) Semantic similarity
-
Search on Paper2Code
Tracking in 3D: Image Variability Decomposition for Recovering Object Pose and Illumination (1999) Semantic similarity
-
Search on Paper2Code
Hand-eye calibration using a single image and robotic picking up using images lacking in contrast (2020) Semantic similarity
Need human evaluators for your AI research? Scale annotation with expert AI Trainers.