OpenTrain AI
No verified implementation yet

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Inclusion AI, Bowen Ma, Cheng Zou, ChengKun Du +71 more

October 28, 2025
arXiv: 2510.24821
0 repos
~a few days to reproduce

Abstract

We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence acr...

Summary

Ming-Flash-Omni is built on a sparse Mixture-of-Experts variant of Ling-Flash-2.0 with 100B total parameters, of which only 6.1B are active per token, and it provides a unified architecture for vision, speech, and language. The benchmark evidence on this page is limited to a single result: the average score of the smaller Ming-Lite-Omni model on StreamingMultiturnBench. The reproduction guidance focuses on implementation viability and concrete risk controls.

Key Contributions

  • Ming-Flash-Omni is built as a sparse Mixture-of-Experts variant of Ling-Flash-2.0 with 100B total parameters, of which only 6.1B are active per token, enabling a unified architecture for vision, speech, and language.
  • The Ming-Flash-Omni model is designed as a single unified system that supports multimodal perception and generation, including vision-language understanding, text-to-image generation, image editing, and contextual.
  • Ming-Flash-Omni supports joint, continuous generation of speech, sound, and music, extending its speech capabilities beyond ASR to multimodal audio synthesis.
  • The vision component of Ming-Flash-Omni introduces generative semantic segmentation aimed at competitive standalone segmentation performance and improved spatial control and editing consistency.
  • Ming-Flash-Omni is designed for multi-turn interactions where it can seamlessly switch among different multimodal tasks within a single session.
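To make the "100B total, 6.1B active per token" figure concrete, here is a minimal sketch of sparse expert routing in plain Python. It is an illustration only: the expert count, the value of `k`, the router scores, and the helper names `active_fraction` and `top_k_route` are assumptions for this example, and this page does not state how Ming-Flash-Omni actually routes tokens.

```python
import math


def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Return the share of parameters that take part in one token's forward pass."""
    return active_params_b / total_params_b


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax-normalize their weights.

    Generic top-k gating, not the paper's confirmed routing scheme.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - max_logit) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Figures taken from the summary: 100B total parameters, 6.1B active per token.
print(f"{active_fraction(100.0, 6.1):.1%}")  # 6.1%

# Hypothetical router scores for 8 experts; only 2 are evaluated for this token.
weights = top_k_route([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2)
print(sorted(weights))  # [1, 4]
```

Only the selected experts run for a given token, which is why per-token compute tracks the active parameter count rather than the total.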

Reproducibility Notes

  • Estimate is based on paper-only reproduction flow.

Results & Benchmarks

Task              Dataset          Metric     Value
Computer vision   Ming-Lite-Omni   Accuracy   44.63

Hardware Requirements

  • Expect multi-day setup/compute for meaningful reproduction based on current guidance.
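As a rough aid for sizing hardware, the sketch below estimates the memory needed just to hold 100B parameters at a few common numeric precisions. It is a back-of-the-envelope calculation under stated assumptions: the precisions listed are illustrative and not taken from the paper, and the helper `weight_memory_gb` is hypothetical. Activations, KV cache, optimizer state, and any modality-specific encoders or decoders are not counted.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    1e9 parameters times bytes per parameter, divided by 1e9 bytes per GB,
    reduces to params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


# Assumed precisions; the page does not say which one the model uses.
for name, nbytes in [("fp32", 4), ("bf16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gb(100.0, nbytes):.0f} GB")
```

In a typical Mixture-of-Experts deployment all experts must stay loaded even though only 6.1B parameters are active per token, so the total parameter count, not the active count, usually drives memory planning.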

Best Implementation

Maintained implementation evidence is not confirmed for this paper yet.

Use the Implementation Status and Reproduction Path sections below for the current action plan.

Reproduction Path

Follow this baseline workflow to decide if this paper is worth immediate prototyping.

  1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
  2. Track assumptions and missing details in an experiment log before coding.
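One lightweight way to keep the experiment log from step 2 is a small structured record that can be saved as JSON next to the code. The sketch below is an assumption about what such a log might look like; the class names, fields, and example entries are hypothetical and are not prescribed by this page or by the paper.

```python
import json
from dataclasses import asdict, dataclass, field

VALID_KINDS = ("assumption", "missing_detail")


@dataclass
class LogEntry:
    kind: str            # "assumption" or "missing_detail"
    topic: str           # short label, e.g. "routing"
    note: str            # what was assumed or what the paper leaves open
    resolved: bool = False


@dataclass
class ExperimentLog:
    paper_id: str
    entries: list[LogEntry] = field(default_factory=list)

    def add(self, kind: str, topic: str, note: str) -> LogEntry:
        """Append an entry, rejecting unknown kinds so the log stays searchable."""
        if kind not in VALID_KINDS:
            raise ValueError(f"kind must be one of {VALID_KINDS}")
        entry = LogEntry(kind, topic, note)
        self.entries.append(entry)
        return entry

    def open_items(self) -> list[LogEntry]:
        """Entries that still need to be checked against the paper or authors."""
        return [e for e in self.entries if not e.resolved]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Example entries; the notes are illustrative placeholders.
log = ExperimentLog(paper_id="arXiv:2510.24821")
log.add("missing_detail", "routing",
        "Expert count and experts-per-token are not stated on this page.")
log.add("assumption", "benchmark",
        "Treat 44.63 as the reference value until checked against the paper.")
print(len(log.open_items()), "open items")
```

Writing entries before any code exists makes it easier to see later which deviations from the paper were deliberate and which were guesses.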

Time to first repro: a few days. Estimate is based on paper-only reproduction flow.

Additional Implementations

No additional verified repositories beyond the primary recommendation.

Hugging Face Artifacts

No trustworthy direct or curated related Hugging Face artifacts were found yet.

Research Context