VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu +2 more
Abstract
Large vision-language models exhibit inherent capabilities to handle diverse visual perception tasks. In this paper, we introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks within a shared model. Specifically, by designing a unified reward mechanism and multi-object cognitive learning strategies, VisionReasoner enhances its reasoning capabilities to analyze v...
Results & Benchmarks
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Classification | Qwen2.5-1.5B | Accuracy | 46.3 |
Best Implementation
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
- Selected hiyouga/easyr1 as the strongest maintained implementation for new work.
- Includes CI workflow signals.
- Includes dependency/environment manifest signals.
- Repository activity is within the last 24 months.
Reproduction Path
1. Start with hiyouga/easyr1 and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
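The logging step can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not part of hiyouga/easyr1: the helper name `capture_environment` and the package names passed to it (`torch`, `transformers`) are assumptions chosen for the example, and the actual list should come from the repository's own dependency manifest.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Return a JSON-serializable snapshot of the interpreter, OS, and package versions.

    `packages` is an iterable of distribution names to look up. Names that are
    not installed are recorded as None instead of raising an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # distribution is not installed
    return {
        "python": sys.version,
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    # Hypothetical package list; replace with the names from the project's manifest.
    snapshot = capture_environment(["torch", "transformers"])
    print(json.dumps(snapshot, indent=2))
```

Saving this output alongside each run's results makes it easier to tell later whether a difference in metrics comes from a changed hyperparameter or from a changed dependency.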
Additional Implementations
Official
- dvlab-research/Seg-Zero (Confidence: low)
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
Stars: 621 | Forks: 29 | Last push: Jan 2026 | License: Apache-2.0
Community
- JIA-Lab-research/Seg-Zero (Confidence: low)
Project Page For "Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement"
Stars: 621 | Forks: 29 | Last push: Jan 2026 | License: Apache-2.0
Hugging Face Artifacts
No direct paper-linked artifacts were found. Showing strongest curated related artifacts.