Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi +16 more
Abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications requi...
Best Implementation
Code for the Molmo2 Vision-Language Model
- Selected allenai/molmo2 as the strongest maintained implementation for new work.
- Includes dependency/environment manifest signals.
- Repository activity is within the last 24 months.
Reproduction Path
- 1
Start with allenai/molmo2 and validate setup instructions in README.
- 2
Reproduce the baseline result with the provided defaults before modifying hyperparameters.
- 3
Log exact dependency versions and runtime environment for reproducibility.
Additional Implementations
Official
No additional official repositories detected.
Community
- harpreetsahota204/fiftyone_video_workshopConfidence: low
Materials for Workshop: Exploring Video Datasets with FiftyOne and Vision-Language Models
Stars: 5Forks: 2Last push: Feb 2026
Hugging Face Artifacts
No trustworthy direct or curated related Hugging Face artifacts were found yet.
Continue with targeted Hugging Face searches: