Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation
Inclusion AI, Bowen Ma, Cheng Zou, ChengKun Du +71 more
Abstract
We propose Ming-Flash-Omni, an upgraded version of Ming-Omni, built upon a sparser Mixture-of-Experts (MoE) variant of Ling-Flash-2.0 with 100 billion total parameters, of which only 6.1 billion are active per token. This architecture enables highly efficient scaling (dramatically improving computational efficiency while significantly expanding model capacity) and empowers stronger unified multimodal intelligence acr...
Summary
Ming-Flash-Omni is built upon a sparse Mixture-of-Experts variant of Ling-Flash-2.0 with 100B total parameters, of which only 6.1B are active per token, enabling a unified architecture for vision, speech, and language. This page includes benchmark evidence for the smaller Ming-Lite-Omni model on StreamingMultiturnBench. Reproduction guidance focuses on implementation viability and concrete risk controls.
Key Contributions
- Ming-Flash-Omni is built upon a sparse Mixture-of-Experts variant of Ling-Flash-2.0 with 100B total parameters, of which only 6.1B are active per token, enabling a unified architecture for vision, speech, and language.
- The Ming-Flash-Omni model is designed as a single unified system that supports multimodal perception and generation, including vision-language understanding, text-to-image generation, image editing, and contextual.
- Ming-Flash-Omni supports joint, continuous generation of speech, sound, and music, extending its speech capabilities beyond ASR to multimodal audio synthesis.
- The vision component of Ming-Flash-Omni introduces generative semantic segmentation aimed at competitive standalone segmentation performance and improved spatial control and editing consistency.
- Ming-Flash-Omni is designed for multi-turn interactions where it can seamlessly switch among different multimodal tasks within a single session.
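The sparse-activation idea in the first contribution can be illustrated with a minimal Python sketch. Only the 100B and 6.1B figures come from the paper. The top-k softmax router, the expert count, the value of k, and the scalar toy experts are assumptions chosen for illustration; they are not taken from the Ming-Flash-Omni implementation.

```python
import math
from typing import Callable, Dict, List, Sequence

# Figures quoted in the paper summary (billions of parameters).
TOTAL_PARAMS_B = 100.0
ACTIVE_PARAMS_B = 6.1

# Hypothetical toy settings, NOT taken from the paper.
NUM_EXPERTS = 8
TOP_K = 2


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of all parameters that take part in one token's forward pass."""
    return active_b / total_b


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: Sequence[float], k: int) -> Dict[int, float]:
    """Keep the k highest-scoring experts and renormalize their gate weights.

    Experts that are not selected are never evaluated, which is what keeps
    the per-token compute far below the total parameter count.
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(
    x: float,
    experts: Sequence[Callable[[float], float]],
    router_logits: Sequence[float],
    k: int,
) -> float:
    """Gate-weighted sum over the selected experts only."""
    gates = route_top_k(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


if __name__ == "__main__":
    # Toy experts: expert i simply multiplies its input by i.
    experts = [lambda x, scale=i: scale * x for i in range(NUM_EXPERTS)]
    # Fixed router scores for one token (a real router would compute these).
    logits = [0.1, 2.0, -1.0, 1.5, 0.0, -0.5, 0.3, 0.7]
    gates = route_top_k(logits, TOP_K)
    share = active_fraction(TOTAL_PARAMS_B, ACTIVE_PARAMS_B)
    print(f"active parameter share: {share:.1%}")
    print(f"experts evaluated: {sorted(gates)} of {NUM_EXPERTS}")
    print(f"output: {moe_forward(1.0, experts, logits, TOP_K):.3f}")
```

In an actual MoE layer the router scores are typically produced per token by a learned projection and the experts are feed-forward networks; the fixed numbers above only show how selecting a few experts keeps the active share small.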
Reproducibility Notes
- The reproduction estimate is based on a paper-only reproduction flow.
Results & Benchmarks
| Task | Dataset | Metric | Value |
|---|---|---|---|
| Computer vision | Ming-Lite-Omni | Accuracy | 44.63 |
Hardware Requirements
- Expect multi-day setup/compute for meaningful reproduction based on current guidance.
Best Implementation
No maintained implementation has been confirmed for this paper yet.
Use the Implementation Status and Reproduction Path sections below for the current action plan.
Reproduction Path
Follow this baseline workflow to decide if this paper is worth immediate prototyping.
1. Use the paper and benchmark evidence to scope a baseline reproduction plan.
2. Track assumptions and missing details in an experiment log before coding.
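One way to keep the experiment log mentioned in the workflow above is a small structured record, sketched here in Python. The class names, fields, and the sample entry are hypothetical; the paper and this page do not prescribe any log format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class LogEntry:
    """One assumption or missing detail noted before coding starts."""
    topic: str             # area the entry concerns, e.g. "expert routing"
    detail: str            # the assumption made or the detail that is missing
    source: str = "paper"  # where the gap or assumption was noticed
    resolved: bool = False


@dataclass
class ExperimentLog:
    """Hypothetical log container for a paper-only reproduction attempt."""
    paper: str
    entries: List[LogEntry] = field(default_factory=list)

    def add(self, topic: str, detail: str, source: str = "paper") -> LogEntry:
        entry = LogEntry(topic, detail, source)
        self.entries.append(entry)
        return entry

    def open_items(self) -> List[LogEntry]:
        """Entries that still need an answer before prototyping."""
        return [e for e in self.entries if not e.resolved]

    def to_json(self) -> str:
        """Serialize the whole log so it can be versioned alongside code."""
        return json.dumps(asdict(self), indent=2)


if __name__ == "__main__":
    log = ExperimentLog(paper="Ming-Flash-Omni")
    log.add(
        "expert routing",
        "Expert count and routing top-k are not stated in the summary on this page.",
    )
    print(f"open items: {len(log.open_items())}")
    print(log.to_json())
```

Marking an entry as `resolved` once the paper or a verified repository answers it keeps the list of open items short and makes the remaining risks visible before any compute is spent.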
Additional Implementations
No additional verified repositories beyond the primary recommendation.
Hugging Face Artifacts
No trustworthy Hugging Face artifacts, whether directly linked or curated as related, have been found yet.
Continue with targeted Hugging Face searches: