What is the best open-source implementation of "Bernini: Latent Semantic Planning for Video Diffusion"?

The best-maintained implementation is bytedance/Bernini, with 890 stars on GitHub. Confidence: medium. Reproducibility: Moderate.

How reproducible is "Bernini: Latent Semantic Planning for Video Diffusion"?

Estimated time to first reproduction: a few hours. Risk flag: no CI workflows detected. Start with bytedance/Bernini and validate the setup instructions in its README.

Are there pretrained models available for "Bernini: Latent Semantic Planning for Video Diffusion"?

No paper-linked pretrained models were found. The one Hugging Face result, facebook/mask2former-swin-large-cityscapes-semantic (143,265 downloads), is a curated related artifact, not a Bernini checkpoint.

Bernini: Latent Semantic Planning for Video Diffusion

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

Published: May 21, 2026

Best maintained implementation now

Evidence: Direct

Domain fit: AI-core

Verified repos: 1

Top repo stars: 890

Core AI workload signals detected from paper context and implementation/artifact evidence.

Time to first repro: a few hours

1 risk flag

Multimodal large language models (MLLMs) and diffusion models have each reached remarkable maturity: MLLMs excel at reasoning over heterogeneous multimodal inputs with strong semantic grounding, while diffusion models synthesize images and videos with photorealistic fidelity. We argue that these two families can be unified through a simple division of labor: MLLMs perform semantic planning, while diffusion models render pixels from high-level semantic guidance and low-level visual features. Building on this idea, we propose Bernini, a unified framework for video generation and editing. An MLLM-based planner predicts the target semantic representation directly in the ViT embedding space, and a DiT-based renderer synthesizes pixels conditioned on this plan, augmented by text features and, for editing, source VAE features for detail preservation. Because semantics serve as the interface, the planner and renderer can be trained separately and only lightly co-trained, preserving the pretrained strengths of both components while keeping training efficient. To better handle multiple visual inputs, we introduce Segment-Aware 3D Rotary Positional Embedding (SA-3D RoPE), and further incorporate chain-of-thought reasoning in the planner to better transfer understanding into generation. Bernini achieves state-of-the-art performance across a wide range of video generation and editing benchmarks, with the MLLM's pretrained understanding translating into strong generalization on challenging editing tasks.
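The division of labor described in the abstract can be pictured as a small data-flow sketch. Everything below is hypothetical: the names `Conditioning`, `build_conditioning`, and `task_of`, and the list-of-floats stand-in for tensors, are illustrative assumptions and not identifiers from bytedance/Bernini. The sketch shows only which inputs reach the renderer: the planner's semantic plan, text features, and, for editing, source VAE features.

```python
from dataclasses import dataclass
from typing import List, Optional

Features = List[List[float]]  # toy stand-in for a (tokens x dim) tensor


@dataclass
class Conditioning:
    """Inputs handed to the renderer (hypothetical field names)."""
    semantic_plan: Features                # planner output, in ViT embedding space
    text_features: Features                # features derived from the text prompt
    source_vae: Optional[Features] = None  # source latents, present for editing only


def build_conditioning(plan, text, source_vae=None):
    """Bundle renderer inputs; the semantic plan is always required."""
    if not plan:
        raise ValueError("renderer needs a non-empty semantic plan")
    return Conditioning(plan, text, source_vae)


def task_of(cond):
    """Editing is the case that carries source VAE features."""
    return "editing" if cond.source_vae is not None else "generation"


# Toy inputs: one token with two dimensions per feature set.
gen = build_conditioning(plan=[[0.1, 0.2]], text=[[0.3, 0.4]])
edit = build_conditioning(plan=[[0.1, 0.2]], text=[[0.3, 0.4]],
                          source_vae=[[0.5, 0.6]])
print(task_of(gen), task_of(edit))
```

Keeping the semantic plan as the only mandatory field mirrors the abstract's point that semantics are the interface between planner and renderer; how the real model fuses these inputs is not specified on this page.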

Technical details

Canonical key: arxiv-2605.22344

Cache status: Fresh

Generated at: Jun 19, 2026, 9:06 PM

Artifact coverage: direct

implementation starting point

Benchmarks: missing

Time to repro: a few hours

1 risk flag

Results & Benchmarks

Freshness tier: hot

Direct + Inferred Evidence

No concrete benchmark grounding is available yet. Treat the page as context or an implementation starting point only.

Use This Implementation Because…

Confidence: medium

bytedance/Bernini is the best available implementation candidate based on ranking signals, but recommendation confidence is not yet high. License is declared (Apache-2.0). Dependency/environment manifests are present.

Reproduction Risks

No CI workflows detected

Evidence disclosure

Evidence graph: 4 refs, 4 links.

Utility signals: depth 55/100, grounding 85/100, status medium.

Implementation Comparison

Top 3 paths

Compare maintenance quality, reproducibility coverage, and evidence confidence before choosing a reproduction baseline.

bytedance/Bernini

best maintained

Maintenance: Active

Confidence: Medium

Reproducibility: Moderate

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 890
Last push: Jun 12, 2026 (8d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup

neuregex/ComfyUI-BerniniR

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Strong

Matched via arXiv identifier search · Strong overlap with paper title keywords

Stars: 11
Last push: Jun 13, 2026 (7d ago)

CI, Dependencies

Risk flags

No tagged releases
No Docker setup
Low confidence match

RH-RunningHub/ComfyUI-RH-Bernini

alternative

Maintenance: Active

Confidence: Low

Reproducibility: Moderate

Matched via arXiv identifier search

Stars: 14
Last push: Jun 3, 2026 (17d ago)

Dependencies

Risk flags

No CI pipeline detected
No tagged releases
No Docker setup
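One way to compare the three candidates above is to sort them on the labels the page reports. The sketch below is an assumption about how such a comparison could be done, not the page's actual ranking logic: the ordinal scales, the `Candidate` class, and the tie-breaking order (confidence, then reproducibility, then stars) are all hypothetical. The values are copied from the listings above.

```python
from dataclasses import dataclass

# Ordinal scales for the labels seen on this page (the ordering is an assumption).
CONFIDENCE = {"Low": 0, "Medium": 1}
REPRODUCIBILITY = {"Moderate": 0, "Strong": 1}


@dataclass(frozen=True)
class Candidate:
    repo: str
    confidence: str
    reproducibility: str
    stars: int


def rank(candidates):
    """Sort best-first by confidence, then reproducibility, then stars."""
    return sorted(
        candidates,
        key=lambda c: (CONFIDENCE[c.confidence],
                       REPRODUCIBILITY[c.reproducibility],
                       c.stars),
        reverse=True,
    )


candidates = [
    Candidate("neuregex/ComfyUI-BerniniR", "Low", "Strong", 11),
    Candidate("RH-RunningHub/ComfyUI-RH-Bernini", "Low", "Moderate", 14),
    Candidate("bytedance/Bernini", "Medium", "Moderate", 890),
]

for c in rank(candidates):
    print(c.repo, c.confidence, c.reproducibility, c.stars)
```

Putting match confidence first means a repository with a stronger reproducibility label but a low-confidence match does not outrank the primary recommendation, which is consistent with how the page presents bytedance/Bernini.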

Best implementation now

bytedance/Bernini

Confidence: Medium

Reproducibility: Moderate

Bernini is a unified framework for video generation and editing that combines an MLLM-based semantic planner with a DiT-based renderer.

Stars: 890

Forks: 72

Last push: Jun 12, 2026

License: Apache-2.0

Matched via arXiv identifier search

Strong overlap with paper title keywords

Community adoption signal (890 stars)

License ✓

CI –

Deps ✓

Docker –

Selected bytedance/Bernini as the strongest maintained implementation for new work.
Includes dependency/environment manifest signals.
Repository activity is within the last 24 months.

Reproduction readiness

Setup Required

Time to first repro: a few hours

Last checked: Jun 19, 2026

Dependencies pinned, manual setup needed

· bytedance/Bernini has pyproject.toml but requires manual environment setup.
· No Dockerfile — you will set up the environment manually.
· No CI pipeline — test coverage is unknown.
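The readiness signals listed above (license, CI, dependency manifest, Docker) can be re-checked on a local clone. This is a minimal sketch under stated assumptions: the function name `readiness_signals` is hypothetical, and the file names it looks for (`LICENSE*`, `.github/workflows/*.yml`, `pyproject.toml`, `requirements.txt`, `environment.yml`, `Dockerfile`) are common conventions, not necessarily the exact rules this page uses.

```python
from pathlib import Path

# Manifest names treated as "dependencies declared" (an assumed convention).
MANIFESTS = ("pyproject.toml", "requirements.txt", "environment.yml")


def readiness_signals(repo_dir):
    """Report which setup-related files exist in a local repository clone."""
    root = Path(repo_dir)
    workflows = root / ".github" / "workflows"
    # CI counts as present only if the workflows folder holds a YAML file.
    has_ci = workflows.is_dir() and any(
        p.suffix in {".yml", ".yaml"} for p in workflows.iterdir()
    )
    return {
        "license": any(root.glob("LICENSE*")),
        "ci": has_ci,
        "deps": any((root / name).is_file() for name in MANIFESTS),
        "docker": (root / "Dockerfile").is_file(),
    }


if __name__ == "__main__":
    # Assumes the repository was cloned into ./Bernini.
    print(readiness_signals("Bernini"))
```

Running it against a fresh clone is a quick way to confirm whether the CI and Docker gaps reported here still hold, since the page was last checked on Jun 19, 2026.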

Quick start

git clone https://github.com/bytedance/Bernini.git
cd Bernini
pip install -e .
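Because the repository ships no Dockerfile, an isolated virtual environment is a reasonable way to run the quick start. The sketch below only builds and prints the commands; it is an assumption about a sensible setup, not the repository's documented procedure. The function name `setup_commands` and the `.venv` directory name are hypothetical, and the README may require extra steps (for example a specific Python or CUDA version) that this page does not list.

```python
import sys
from pathlib import Path


def setup_commands(repo_url, env_dir=".venv"):
    """Return (argv, cwd) pairs: clone, create a venv, editable install."""
    name = repo_url.rstrip("/").rsplit("/", 1)[-1]
    if name.endswith(".git"):
        name = name[: -len(".git")]
    # Virtual environments keep executables in Scripts/ on Windows, bin/ elsewhere.
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    env_python = str(Path.cwd() / name / env_dir / bin_dir / "python")
    return [
        (["git", "clone", repo_url], None),
        ([sys.executable, "-m", "venv", env_dir], name),
        ([env_python, "-m", "pip", "install", "-e", "."], name),
    ]


if __name__ == "__main__":
    url = "https://github.com/bytedance/Bernini.git"
    for argv, cwd in setup_commands(url):
        # Printing only; pass argv and cwd to subprocess.run to execute.
        print(f"[{cwd or '.'}]", " ".join(argv))
```

Calling pip through the environment's own interpreter avoids installing into the system Python by accident.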

No benchmark numbers could be verified. You will not be able to validate reproduction correctness against published numbers.

Additional implementations

No additional verified repositories beyond the primary recommendation.

Possible but unverified matches (4)

These repositories had only low-confidence matching signals.

neuregex/ComfyUI-BerniniR

Confidence: Low

Stars: 11

RH-RunningHub/ComfyUI-RH-Bernini

Confidence: Low

Stars: 14

RH-RunningHub/ComfyUI-RH-Bernini-Full

Confidence: Low

Stars: 2

gluttony-10/Bernini_glut

Confidence: Low

Stars: 1

Hugging Face artifacts

No direct paper-linked artifacts were found. Showing strongest curated related artifacts for faster exploration.

Models

facebook/mask2former-swin-large-cityscapes-semantic

Curated Related

Downloads: 143,265

Likes: 38

Datasets

No trustworthy dataset matches right now.

Spaces

No trustworthy demo spaces right now.

Research context

Tasks

None detected

Methods

Diffusion

Domains

Computer vision, Natural Language Processing
