Matched via arXiv identifier search
- Stars: 0
- Last push: Jun 15, 2026 (5d ago)
Risk flags
- No tagged releases
- No Docker setup
- Low confidence match
Yining Li, Dongchen Han, Zeyu Liu, Hanyi Wang, Yulin Wang, Gao Huang
Core AI workload signals detected from paper context and implementation/artifact evidence.
While linear-complexity attention mechanisms offer a promising alternative to Softmax attention for overcoming the quadratic bottleneck, training such models from scratch remains prohibitively expensive. Inheriting weights from pretrained Transformers provides an appealing shortcut, yet the fundamental representational gap between Softmax and linear attention prevents effective weight transfer. In this work, we address this conversion challenge from two perspectives: architectural alignment and representational alignment. We identify Test-Time Training (TTT) as a linear-complexity architecture whose two-layer dynamic formulation is structurally aligned with Softmax attention, enabling direct inheritance of pretrained attention weights. To further align representational properties, including key shift-invariance and locality, we introduce key instance normalization and a lightweight locality enhancement module. We validate our approach by linearizing Stable Diffusion 3.5 and introduce SD3.5-T$^5$ (Transformer To Test Time Training). With only 1 hour of fine-tuning on 4$\times$H20 GPUs, SD3.5-T$^5$ achieves comparable text-to-image quality to the fine-tuned Softmax model, while accelerating inference by 1.32$\times$ and 1.47$\times$ at 1K and 2K resolutions. Code is available at https://github.com/LeapLabTHU/Transformer-to-TTT.
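The abstract names key shift-invariance as a Softmax property that needs to be aligned. The sketch below illustrates that idea under stated assumptions: that shift-invariance means the output is unchanged when the same vector is added to every key, and that key instance normalization standardizes each key channel across tokens. The `elu + 1` linear attention is a generic stand-in, not the paper's TTT layer, and every function name here is hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, values):
    """Return sum_i weights[i] * values[i] for a list of value vectors."""
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def softmax_attention(queries, keys, values):
    """Softmax attention; cost grows quadratically with the token count."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        logits = [scale * dot(q, k) for k in keys]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        out.append(weighted_sum([e / total for e in exps], values))
    return out

def feature_map(x):
    """elu(x) + 1, a common positive feature map (an assumption here)."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]

def linear_attention(queries, keys, values):
    """Generic kernelized linear attention; NOT the paper's TTT layer."""
    fk = [feature_map(k) for k in keys]
    d, dv = len(keys[0]), len(values[0])
    # Key/value summaries are built once; their size does not depend on token count.
    kv = [[sum(fk[i][a] * values[i][b] for i in range(len(keys)))
           for b in range(dv)] for a in range(d)]
    ksum = [sum(f[a] for f in fk) for a in range(d)]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = dot(fq, ksum)
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / norm
                    for b in range(dv)])
    return out

def instance_norm_keys(keys, eps=1e-6):
    """Standardize each key channel across tokens (assumed form of key instance norm)."""
    n, d = len(keys), len(keys[0])
    means = [sum(k[j] for k in keys) / n for j in range(d)]
    stds = [math.sqrt(sum((k[j] - means[j]) ** 2 for k in keys) / n)
            for j in range(d)]
    return [[(k[j] - means[j]) / (stds[j] + eps) for j in range(d)] for k in keys]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

random.seed(0)
n_tokens, dim = 6, 4
q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
k = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
shift = [3.0, -2.0, 1.5, 0.5]  # the same offset added to every key
k_shifted = [[x + s for x, s in zip(row, shift)] for row in k]

diff_softmax = max_abs_diff(softmax_attention(q, k, v),
                            softmax_attention(q, k_shifted, v))
diff_linear = max_abs_diff(linear_attention(q, k, v),
                           linear_attention(q, k_shifted, v))
diff_linear_normed = max_abs_diff(
    linear_attention(q, instance_norm_keys(k), v),
    linear_attention(q, instance_norm_keys(k_shifted), v))
print(diff_softmax, diff_linear, diff_linear_normed)
```

In this toy setup the Softmax output ignores the shared key offset because it adds the same constant to every logit in a row. The generic linear attention does not ignore it until the keys are normalized across tokens.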
Audit each benchmark finding before selecting an implementation path. Evidence refs map to the disclosure section below.
| Task | Dataset | Metric | Value | Source | Evidence refs |
|---|---|---|---|---|---|
| Classification | ImageNet | Top-1 Accuracy | 300 | paper-derived | No explicit refs |
| Image classification | ImageNet | Accuracy | 69.52 | paper-derived | No explicit refs |
| Image classification | TTT (no locality) | Accuracy | 69.25 | paper-derived | No explicit refs |
| Image classification | + CPE(x) | Accuracy | 69.64 | paper-derived | No explicit refs |
| Image classification | + DWC(v) | Accuracy | 70.47 | paper-derived | No explicit refs |
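One way to start the audit is a plausibility check on the extracted values. The sketch below assumes accuracy metrics are percentages in the range 0 to 100; the row labels are lightly normalized copies of the table, and `flag_implausible` is a hypothetical helper, not part of any tool referenced on this page.

```python
# Rows copied from the table above as (task, row label, metric, value).
rows = [
    ("Classification", "ImageNet", "Top-1 Accuracy", 300.0),
    ("Image classification", "ImageNet", "Accuracy", 69.52),
    ("Image classification", "TTT (no locality)", "Accuracy", 69.25),
    ("Image classification", "+ CPE(x)", "Accuracy", 69.64),
    ("Image classification", "+ DWC(v)", "Accuracy", 70.47),
]

def flag_implausible(table_rows, low=0.0, high=100.0):
    """Return accuracy rows whose value falls outside the assumed percent range."""
    return [
        row for row in table_rows
        if "accuracy" in row[2].lower() and not (low <= row[3] <= high)
    ]

flagged = flag_implausible(rows)
for task, label, metric, value in flagged:
    print(f"Check against the paper: {task} / {label} / {metric} = {value}")
```

A flagged row only indicates that the extracted number cannot be a percentage as labeled; the correct value has to be confirmed against the paper itself.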
ZacharyMeng/PolaFormer is the closest maintained adjacent implementation (Matches contextual method/domain keyword: transformer). It is not paper-verified; validate algorithm and evaluation setup against the paper before trusting reported metrics. Community adoption signal: 89 GitHub stars.
Hardware Notes
Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Evidence graph: 3 refs, 3 links.
Utility signals: depth 100/100, grounding 85/100, status high.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Matched via arXiv identifier search · Strong overlap with paper title keywords
There is no verified maintained implementation yet. Use this baseline plan to decide whether to prototype now or defer.
Hardware requirements
No verified implementation available
Framework baselines
Modern transformer training baseline.
Reference transformer building block implementation.
Practical baseline for diffusion model reproduction.
These are not paper-verified. Use them as reference points when no direct implementation is available.
Matches contextual method/domain keyword: transformer
No additional official repositories detected.
[ICML 2026] Official repository of Linearizing Vision Transformer with Test-Time Training
These repositories had low-confidence matching signals and are hidden by default.
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Models
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Tasks
Image classification
Methods
Transformer, Diffusion
Domains
Computer vision
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.