OpenTrain AI
Maintained implementation available

Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi +16 more

January 15, 2026
arXiv: 2601.10611
2 repos
485 stars
~a few hours to reproduce

Abstract

Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications requi...

Best Implementation

Code for the Molmo2 Vision-Language Model

485 stars | 33 | Mar 2026 | Apache-2.0
  • Selected allenai/molmo2 as the strongest maintained implementation for new work.
  • Includes dependency/environment manifest signals.
  • Repository activity is within the last 24 months.

Reproduction Path

  1. Start with allenai/molmo2 and validate the setup instructions in its README.
  2. Reproduce the baseline result with the provided defaults before modifying hyperparameters.
  3. Log exact dependency versions and the runtime environment for reproducibility.
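
The logging step above can be sketched with the Python standard library alone. This is a minimal illustration, not part of allenai/molmo2: the function names `capture_environment` and `write_snapshot`, the output file name, and the package list (`torch`, `transformers`, `numpy`) are all assumptions chosen for the example. Substitute whatever the repository's dependency manifest actually lists.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment(packages):
    """Return a JSON-serializable snapshot of the interpreter, OS, and package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly instead of silently skipping them.
            versions[name] = None
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


def write_snapshot(path, packages):
    """Write the environment snapshot to `path` as indented JSON and return it."""
    snapshot = capture_environment(packages)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, sort_keys=True)
    return snapshot


if __name__ == "__main__":
    # Hypothetical package list; replace with the project's real dependencies.
    write_snapshot("environment_snapshot.json", ["torch", "transformers", "numpy"])
```

Storing the snapshot next to each run's results makes it possible to compare environments later if a reproduction diverges from the baseline.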

Time to first repro: a few hours
No CI workflows detected

Additional Implementations

Official

No additional official repositories detected.

Community

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.