Matched via arXiv identifier search · Strong overlap with paper title keywords
- Stars: 858
- Last push: Apr 3, 2026 (59d ago)
Risk flags
- No CI pipeline detected
- No tagged releases
- No Docker setup
Tianyuan Yuan, Zibin Dong, Yicheng Liu, Hang Zhao
Paper appears method- or tooling-adjacent to AI workflows with partial ecosystem coverage.
World Action Models (WAMs) have emerged as a promising alternative to Vision-Language-Action (VLA) models for embodied control because they explicitly model how visual observations may evolve under action. Most existing WAMs follow an imagine-then-execute paradigm, incurring substantial test-time latency from iterative video denoising, yet it remains unclear whether explicit future imagination is actually necessary for strong action performance. In this paper, we ask whether WAMs need explicit future imagination at test time, or whether their benefit comes primarily from video modeling during training. We disentangle the role of video modeling during training from explicit future generation during inference by proposing Fast-WAM, a WAM architecture that retains video co-training during training but skips future prediction at test time. We further instantiate several Fast-WAM variants to enable a controlled comparison of these two factors. Across these variants, we find that Fast-WAM remains competitive with imagine-then-execute variants, while removing video co-training causes a much larger performance drop. Empirically, Fast-WAM achieves competitive results with state-of-the-art methods both on simulation benchmarks (LIBERO and RoboTwin) and real-world tasks, without embodied pretraining. It runs in real time with 190ms latency, over 4× faster than existing imagine-then-execute WAMs. These results suggest that the main value of video prediction in WAMs may lie in improving world representations during training rather than generating future observations at test time. Project page: https://yuantianyuan01.github.io/FastWAM/
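The split described in the abstract (video co-training kept during training, future prediction skipped at test time) can be pictured with a toy sketch. This is an illustration only and not the Fast-WAM architecture: the class `ToyWorldActionModel`, its scalar "heads", the `video_weight` parameter, and the mean-squared-error losses are all hypothetical stand-ins chosen to keep the example self-contained.

```python
# Toy illustration only: NOT the Fast-WAM architecture.
# Every name, the scalar "heads", and the loss weighting are hypothetical.
from dataclasses import dataclass
from typing import List

Vector = List[float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


@dataclass
class ToyWorldActionModel:
    action_scale: float = 0.5  # stand-in for action-head parameters
    video_scale: float = 2.0   # stand-in for video-head parameters

    def encode(self, observation: Vector) -> Vector:
        # Shared representation used by both heads (identity in this toy).
        return list(observation)

    def action_head(self, features: Vector) -> Vector:
        return [self.action_scale * f for f in features]

    def video_head(self, features: Vector) -> Vector:
        return [self.video_scale * f for f in features]

    def training_loss(
        self,
        observation: Vector,
        target_action: Vector,
        target_future: Vector,
        video_weight: float = 1.0,
    ) -> float:
        # Training: both objectives share the same encoded features.
        features = self.encode(observation)
        action_loss = mse(self.action_head(features), target_action)
        video_loss = mse(self.video_head(features), target_future)
        return action_loss + video_weight * video_loss

    def act(self, observation: Vector) -> Vector:
        # Test time: only the action path runs; no future frames are generated.
        return self.action_head(self.encode(observation))
```

In this sketch the video head contributes only to `training_loss`, so dropping it at inference changes the cost of `act` but not its output, which mirrors the comparison the abstract sets up.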
No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.
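yuantianyuan01/FastWAM is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. A license is declared, but its identifier is NOASSERTION, meaning no standard license was recognized, so check the repository's license file before reuse. Dependency/environment manifests are present.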
Evidence graph: 3 refs, 3 links.
Utility signals: depth 55/100, grounding 75/100, status medium.
Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.
Official codebase for Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Dependencies pinned, manual setup needed
Quick start
git clone https://github.com/yuantianyuan01/FastWAM.git
cd FastWAM
pip install -e .
No benchmark numbers could be verified, so you will not be able to validate reproduction correctness against published numbers.
No additional verified repositories beyond the primary recommendation.
These repositories had low-confidence matching signals and are hidden by default.
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches derived from the paper title and method context:
Datasets
Spaces
Tip: start with models, then check datasets/spaces if you need evaluation data or demos.
Direct artifact matches are currently sparse. Use targeted Hugging Face searches to quickly locate candidate models, datasets, and demos.
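A minimal sketch of how such targeted searches could be assembled, using only the Python standard library. The helper name `build_search_urls` is hypothetical, and the `search` query parameter on the Hugging Face listing pages is an assumption about the site's URL layout rather than a documented API, so verify the links in a browser.

```python
from urllib.parse import urlencode

# Listing pages on the Hugging Face Hub. The "search" query parameter
# used below is an assumption about how these pages accept a query.
HF_LISTING_PAGES = {
    "models": "https://huggingface.co/models",
    "datasets": "https://huggingface.co/datasets",
    "spaces": "https://huggingface.co/spaces",
}


def build_search_urls(query: str) -> dict:
    """Return one search URL per artifact type for the given query."""
    encoded = urlencode({"search": query})
    return {kind: f"{base}?{encoded}" for kind, base in HF_LISTING_PAGES.items()}


if __name__ == "__main__":
    # Queries derived from the paper title and method context.
    for query in ("Fast-WAM", "World Action Model"):
        for kind, url in build_search_urls(query).items():
            print(f"{kind}: {url}")
```

Following the tip above, open the models link first, then the datasets and spaces links if evaluation data or demos are needed.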
- Tasks: Scientific computing
- Methods: Transformer
- Domains: Computer vision
Evaluation & Human Feedback Data
Open this paper in HFEPX to review benchmark signals, evaluation modes, and human-feedback protocol context.