Visual Planning: Let's Think Only with Images
Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang +2 more
Abstract
Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reason...
Results & Benchmarks
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Reinforcement learning | xxx - Direct | EM | 68.6 |
| Reinforcement learning | xxx - w/ Coordinates | EM | 74.4 |
| Reinforcement learning | xxx - w/ ASCII | EM | 73.1 |
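The EM column most likely denotes exact match, reported as a percentage: a prediction scores only if it equals the reference in full. A minimal sketch of that computation follows, assuming each prediction and reference is a sequence of action labels. The function name, the action strings, and the data are hypothetical and do not come from the paper or its repository.

```python
from typing import Sequence


def exact_match(predicted: Sequence[Sequence[str]],
                reference: Sequence[Sequence[str]]) -> float:
    """Return the percentage of examples whose prediction equals the reference exactly."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one example is required")
    # An example counts only if every step matches, in order, with no extras.
    hits = sum(list(p) == list(r) for p, r in zip(predicted, reference))
    return 100.0 * hits / len(reference)


# Hypothetical action sequences for four planning episodes.
refs = [["up", "left"], ["down"], ["right", "right"], ["up"]]
preds = [["up", "left"], ["down"], ["right", "left"], ["up"]]
print(exact_match(preds, refs))  # 75.0
```

Partial credit is deliberately absent here: the third episode gets its first step right but still scores zero.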
Hardware Requirements
- Expect multiple days of setup and compute for a meaningful reproduction, based on current guidance.
Best Implementation
[ICLR 2026 Oral] Visual Planning: Let's Think Only with Images
- Selected yix8/visualplanning as the strongest maintained implementation for new work.
- Repository activity is within the last 24 months.
Reproduction Path
1. Start with yix8/visualplanning and validate the setup instructions in its README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and the runtime environment for reproducibility.
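The last step, logging dependency versions and the runtime environment, can be sketched with the Python standard library alone. This is a generic illustration, not a script from yix8/visualplanning. The function names and the output file name `environment_snapshot.json` are assumptions made for the example.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages=None):
    """Return a JSON-serializable snapshot of the interpreter, OS, and package versions.

    If `packages` is None, every installed distribution is recorded;
    otherwise only the named ones (None marks a package that is not installed).
    """
    installed = {}
    if packages is None:
        for dist in metadata.distributions():
            name = dist.metadata.get("Name")
            if name:
                installed[name] = dist.version
    else:
        for name in packages:
            try:
                installed[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed[name] = None
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(installed.items(), key=lambda kv: kv[0].lower())),
    }


def write_environment(path="environment_snapshot.json", packages=None):
    """Write the snapshot to `path` and return it."""
    snapshot = capture_environment(packages)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(snapshot, fh, indent=2)
    return snapshot
```

Calling `write_environment()` once before each run and storing the file next to the run's outputs keeps results traceable to the environment that produced them. GPU driver and accelerator library versions are not captured by this sketch and would need to be recorded separately.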
Additional Implementations
Official
No additional official repositories detected.
Community
- yix8/VisualPlanning (Confidence: low)
  [ICLR 2026 Oral] Visual Planning: Let's Think Only with Images
  Stars: 321, Forks: 11, Last push: Feb 2026, License: MIT
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.