Maintained implementation available | PyTorch | Pretrained models available

Visual Planning: Let's Think Only with Images

Yi Xu, Chengzu Li, Han Zhou, Xingchen Wan, Caiqi Zhang +2 more

May 16, 2025 | arXiv: 2505.11409

2 repos | 321 stars | ~a few days to reproduce

Abstract

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reason...

Results & Benchmarks

Task	Dataset	Metric	Value
Reinforcement learning	xxx - Direct	EM	68.6
Reinforcement learning	xxx - w/ Coordinates	EM	74.4
Reinforcement learning	xxx - w/ ASCII	EM	73.1
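
The table reports EM without expanding the abbreviation. Assuming it denotes exact match (the percentage of predictions that are identical to their reference), a minimal sketch of the computation looks like the following. The function name `exact_match` and the sample action sequences are hypothetical and are not taken from the paper or its repository.

```python
def exact_match(predictions, references):
    """Return the percentage of predictions identical to their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A prediction only counts when the whole sequence matches, not a prefix.
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Hypothetical action sequences, purely for illustration.
preds = [["up", "right"], ["left"], ["down", "down"]]
refs = [["up", "right"], ["right"], ["down", "down"]]
score = exact_match(preds, refs)
print(f"EM = {score:.1f}")
```

Under this reading, a value such as 68.6 would mean that 68.6% of evaluated items were matched exactly.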

Hardware Requirements

Expect multi-day setup/compute for meaningful reproduction based on current guidance.

Best Implementation

yix8/visualplanning

[ICLR 2026 Oral] Visual Planning: Let's Think Only with Images

Stars: 321 | Forks: 11 | Last push: Feb 2026 | License: MIT

License ✓

CI –

Deps –

Docker –

Selected yix8/visualplanning as the strongest maintained implementation for new work.
Repository activity is within the last 24 months.

Reproduction Path

1. Start with yix8/visualplanning and validate the setup instructions in its README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and the runtime environment for reproducibility.
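
Because the page flags a missing dependency manifest, the logging step may need to be done by hand. The following is a minimal sketch using only the Python standard library. The helper name `snapshot_environment` and the package list are assumptions for illustration, not something the repository provides.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Collect interpreter, OS, and installed-package versions as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Placeholder package list; replace it with whatever the README requires.
    env = snapshot_environment(["torch", "numpy"])
    print(json.dumps(env, indent=2, sort_keys=True))
```

Saving this JSON next to each run's outputs makes it easier to tell later whether a changed result came from the code or from the environment.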

Time to first repro: a few days | No CI workflows detected | Dependency manifest is missing

Additional Implementations

Official

No additional official repositories detected.

Community

yix8/VisualPlanning | Confidence: low
[ICLR 2026 Oral] Visual Planning: Let's Think Only with Images
Stars: 321 | Forks: 11 | Last push: Feb 2026 | License: MIT

Hugging Face Artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts.

Curated Related

Research Context