Maintained implementation available

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi +16 more

January 15, 2026 | arXiv: 2601.10611

2 repos | 485 stars | ~a few hours to reproduce

Abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications requi...

Best Implementation

allenai/molmo2

Code for the Molmo2 Vision-Language Model

Stars: 485 | Forks: 33 | Last push: Mar 2026 | License: Apache-2.0

License ✓

CI –

Deps ✓

Docker ✓

Selected allenai/molmo2 as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction Path

1. Start with allenai/molmo2 and validate setup instructions in README.
2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
3. Log exact dependency versions and runtime environment for reproducibility.
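
The last step above (logging dependency versions and the runtime environment) could be sketched as follows. This is a minimal, generic illustration using only the Python standard library. It is not taken from allenai/molmo2, and the function names and the output filename `environment_snapshot.json` are hypothetical choices made for this example.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect interpreter, OS, and installed-package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


def save_environment(path="environment_snapshot.json"):
    """Write the snapshot to a JSON file (hypothetical filename) and return it."""
    snapshot = capture_environment()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2)
    return snapshot
```

Calling `save_environment()` once before the baseline run and keeping the resulting file next to the results makes it easier to tell later whether a changed number comes from a changed dependency. GPU driver and CUDA versions are not captured by this sketch and would need to be recorded separately.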

Time to first repro: a few hours | No CI workflows detected

Additional Implementations

Official

No additional official repositories detected.

Community

harpreetsahota204/fiftyone_video_workshop | Confidence: low
Materials for Workshop: Exploring Video Datasets with FiftyOne and Vision-Language Models
Stars: 5 | Forks: 2 | Last push: Feb 2026

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Continue with targeted Hugging Face searches:

models

arxiv:2601.10611 Vision-Language Molmo2

datasets

arxiv:2601.10611 Vision-Language dataset Video understanding / reasoning dataset

spaces

arxiv:2601.10611 Vision-Language demo Video understanding / reasoning demo

Research Context