Repo · Tencent Hunyuan · published Mar 31, 2026

Tencent-Hunyuan/OmniWeaving

Python

Description: Official Implementation of OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning

Language: Python

License: NOASSERTION

Stars: 878

Forks: 27

Open issues: 3

Created: 2026-03-31T11:10:21Z

Pushed: 2026-04-11T08:19:27Z

Default branch: main

Fork: no

Archived: no

README:

OmniWeaving

🔥🔥🔥 News

  • 📌 OmniWeaving is developed by the HunyuanVideo team and uses the latest [HunyuanVideo-1.5](https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5) as its backbone. If you find our work useful, please consider giving this repository a star and citing our paper.
  • 🚀 April 3, 2026: We release the code and model weights of OmniWeaving.

  • 🏃‍♂️ April 3, 2026: We release the IntelligentVBench.
  • 📖 Mar 26, 2026: We release the OmniWeaving paper on arXiv.
  • 👋 Mar 25, 2026: We release the webpage of OmniWeaving.

📑 Open-source Plan

  • OmniWeaving
  • [✅] Inference Code
  • [✅] Model Checkpoints
  • [✅] Training Data Construction Code
  • [✅] Training Example Code
  • IntelligentVBench
  • [✅] Test cases
  • [✅] Evaluation Code

📋 Table of Contents

  • [🔥🔥🔥 News](#news)
  • [📑 Open-source Plan](#open-source-plan)
  • [📖 Abstract](#abstract)
  • [🏗 Model Architecture](#model-architecture)
  • [🚀 Supported Tasks](#supported-tasks)
  • [🛠 Preparation](#preparation)
  • [🔑 Inference](#inference)
  • [🗂 Training Data Construction](#training-data-construction)
  • [🎓 Training](#training)
  • [📊 Evaluation on IntelligentVBench](#evaluation-on-intelligentvbench)
  • [🎬 Qualitative Examples](#examples)
  • [📚 Citation](#citation)
  • [🙏 Acknowledgements](#acknowledgements)

📖 Abstract

We propose OmniWeaving, an omni-level video generation model featuring powerful multimodal composition and reasoning-informed capabilities. By leveraging a massive-scale pretraining dataset that encompasses diverse compositional and reasoning-augmented scenarios, OmniWeaving learns to temporally bind interleaved text, multi-image, and video inputs while acting as an intelligent agent to infer complex user intentions for sophisticated video creation. Furthermore, we introduce IntelligentVBench, the first comprehensive benchmark designed to rigorously assess next-level intelligent unified video generation. Extensive experiments demonstrate that OmniWeaving achieves SoTA performance among open-source unified models.

🏗 Model Architecture

Following the paper, OmniWeaving is built as an integrated MLLM + MMDiT + VAE framework for unified free-form video generation. The MLLM serves as the semantic parser for interleaved text, images, and video inputs, mapping them into a high-level semantic space and forwarding its hidden states through an MLP connector. The VAE acts as the visual tokenizer, compressing visual inputs into low-level latents, while the MMDiT uses these semantic conditions together with latent noise to generate semantically aligned, high-fidelity videos.
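
The data flow described above can be sketched roughly as follows. This is a toy illustration in plain Python, not the repository's implementation: every function name (`mllm_encode`, `mlp_connector`, `vae_encode`, `mmdit_denoise`, `generate`), the vector sizes, and the arithmetic are placeholders chosen only to show which component feeds which. How the VAE latents enter the MMDiT is also an assumption here; the sketch simply averages them into the denoising target.

```python
import random
from typing import List, Tuple

Vector = List[float]


def mllm_encode(segments: List[str], hidden_dim: int = 4) -> List[Vector]:
    """Stand-in MLLM: one hidden state per interleaved input segment."""
    return [[float(len(seg) % 7)] * hidden_dim for seg in segments]


def mlp_connector(hidden: List[Vector], out_dim: int = 3) -> List[Vector]:
    """Stand-in connector: map each hidden state to the MMDiT width."""
    return [[sum(h) / len(h)] * out_dim for h in hidden]


def vae_encode(frames: List[Vector], factor: int = 2) -> List[Vector]:
    """Stand-in VAE encoder: compress each frame by average pooling."""
    latents = []
    for frame in frames:
        chunks = [frame[i:i + factor] for i in range(0, len(frame), factor)]
        latents.append([sum(c) / len(c) for c in chunks])
    return latents


def mmdit_denoise(semantic: List[Vector], visual: List[Vector],
                  noise: List[Vector], steps: int = 4) -> List[Vector]:
    """Stand-in MMDiT: pull noisy latents toward a value derived from the conditions."""
    values = [x for v in semantic + visual for x in v]
    target = sum(values) / len(values) if values else 0.0
    latents = [list(v) for v in noise]
    for _ in range(steps):
        # Each step halves the distance between every latent value and the target.
        latents = [[x + 0.5 * (target - x) for x in v] for v in latents]
    return latents


def generate(segments: List[str], frames: List[Vector], num_latents: int = 2,
             latent_dim: int = 3, seed: int = 0
             ) -> Tuple[List[Vector], List[Vector], List[Vector]]:
    """Wire the pieces together: MLLM -> connector, VAE, then MMDiT on noise."""
    rng = random.Random(seed)
    semantic = mlp_connector(mllm_encode(segments))
    visual = vae_encode(frames)
    noise = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
             for _ in range(num_latents)]
    return semantic, visual, mmdit_denoise(semantic, visual, noise)
```

Calling `generate(["a cat runs", "<image>"], [[0.0, 1.0, 2.0, 3.0]])` returns the projected semantic conditions, the compressed visual latents, and the denoised latents, which makes the three roles easy to inspect separately.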

On top of this framework, we introduce two improvements tailored for advanced reasoning and composition.

  • (1) Activating Thinking Mode of the MLLM: Direct MLLM encoding of interleaved visual-text inputs often yields semantic ambiguity due to weak intra-correlations and unclear video creation intents. We elevate the MLLM from a passive feature extractor to an active reasoner. By activating the thinking mode to generate intermediate reasoning steps, it autonomously deduces a semantically precise, enhanced prompt. The hidden states of this enhanced prompt are then forwarded alongside the original MLLM features to condition the MMDiT, effectively bridging the cognitive gap between abstract user intent and pixel-level generation.
  • (2) Hidden States DeepStacking: Compositional video generation involving multiple subjects or intricate scenes often relies on both low- and high-level semantic representations. Drawing inspiration from the DeepStacking mechanism in Qwen3-VL, we extract hidden states from a broader range of intermediate MLLM layers to capture a rich semantic spectrum spanning from fine-grained details to high-level abstractions. An MLP connector projects these multi-level features into the MMDiT embedding space. These projected features are then directly added to the corresponding hidden states within the first three layers of the MMDiT conditioning branch, effectively injecting multi-granular semantic guidance into the generative process.
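
The DeepStacking injection in (2) can be illustrated with a minimal sketch. It is written under several assumptions that the text does not settle: the tapped MLLM layer indices are hypothetical, the connector is reduced to a single linear map, and the projected features are added to the hidden states entering each of the first conditioning layers. The names `project`, `add`, and `deepstack_condition` are invented for this example.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Tokens = List[Vector]  # one vector per token


def project(tokens: Tokens, weight: List[Vector]) -> Tokens:
    """Linear stand-in for the MLP connector: (tokens x d_in) @ (d_in x d_out)."""
    d_out = len(weight[0])
    return [[sum(t[i] * weight[i][j] for i in range(len(t))) for j in range(d_out)]
            for t in tokens]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum of two token sequences of equal shape."""
    return [[x + y for x, y in zip(u, v)] for u, v in zip(a, b)]


def deepstack_condition(
    mllm_hidden: Dict[int, Tokens],            # MLLM layer index -> hidden states
    tap_layers: Sequence[int],                 # tapped MLLM layers (hypothetical choice)
    connector: List[Vector],                   # projection into the MMDiT width
    cond_tokens: Tokens,                       # input to the conditioning branch
    cond_layers: Sequence[Callable[[Tokens], Tokens]],
) -> Tokens:
    """Add one projected MLLM level to each of the first len(tap_layers) layers."""
    if len(tap_layers) > len(cond_layers):
        raise ValueError("more tapped levels than conditioning layers")
    injected = [project(mllm_hidden[i], connector) for i in tap_layers]
    h = cond_tokens
    for depth, layer in enumerate(cond_layers):
        if depth < len(injected):
            h = add(h, injected[depth])  # inject this semantic level
        h = layer(h)
    return h
```

With three tapped levels, an identity connector, and identity layers, the output is simply the sum of the three levels on top of the original conditioning tokens, which is a quick way to check that each level lands in exactly one of the first three layers.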

🚀 Supported Tasks

OmniWeaving is flexible in its input and output configurations, supporting a wide range of unified video generation tasks:

| Task | Input Type | Output | Description |
|---|---|---|---|
| Text-to-Video (T2V) | Text 📝 | Video 🎬 | Generating a video from text prompts. |
| First-Frame-to-Video (I2V) | Image 🖼 + Text 📝 | Video 🎬 | Generating a video based on the first frame. |
| Key-Frames-to-Video | 2 × Images 🖼 + Text 📝 | Video 🎬 | Generating a video conditioned on start and end frames. |
| Video-to-Video Editing | Video 🎬 + Text 📝 | Video 🎬 | Instruction-based video manipulation and stylization. |
| Reference-to-Video | Image 🖼 + Text 📝 | Video 🎬 | Single-subject reference-driven video generation. |
| Compositional Multi-Image-to-Video | 2–4 × Images 🖼 + Text 📝 | Video 🎬 | Multi-subject compositional video generation. |
| Text-Image-Video-to-Video | Video 🎬 + Image 🖼 + Text 📝 | Video 🎬 | Generating a video conditioned on text, image, and video inputs. |
| Reasoning-Augmented Video Generation | Image(s) 🖼 + Text 📝 | Reasoning 💭 + Video 🎬 | Reasoning over user intent before generating the video. |
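
The input requirements in the task list above can be captured in a small lookup structure, which is handy for validating a request before launching a long generation job. The sketch below is only an illustration: the task keys (`"t2v"`, `"multi_image"`, and so on), the `TaskSpec` class, and `check_inputs` are hypothetical names, not identifiers or flags used by the repository. The upper bound on images for the reasoning task is left open because the list only says "Image(s)".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class TaskSpec:
    min_images: int
    max_images: Optional[int]  # None means no upper bound is stated
    needs_video: bool


# Keys are hypothetical labels, not identifiers used by the repository.
TASKS: Dict[str, TaskSpec] = {
    "t2v": TaskSpec(0, 0, False),
    "i2v": TaskSpec(1, 1, False),
    "key_frames": TaskSpec(2, 2, False),
    "video_edit": TaskSpec(0, 0, True),
    "reference": TaskSpec(1, 1, False),
    "multi_image": TaskSpec(2, 4, False),
    "text_image_video": TaskSpec(1, 1, True),
    "reasoning": TaskSpec(1, None, False),
}


def check_inputs(task: str, prompt: str, num_images: int = 0,
                 has_video: bool = False) -> List[str]:
    """Return a list of problems; an empty list means the inputs fit the task."""
    if task not in TASKS:
        return [f"unknown task: {task}"]
    spec = TASKS[task]
    problems: List[str] = []
    if not prompt.strip():
        problems.append("a text prompt is required")  # every task takes text
    if num_images < spec.min_images:
        problems.append(f"expected at least {spec.min_images} image(s), got {num_images}")
    if spec.max_images is not None and num_images > spec.max_images:
        problems.append(f"expected at most {spec.max_images} image(s), got {num_images}")
    if has_video != spec.needs_video:
        problems.append("a video input is required" if spec.needs_video
                        else "no video input is expected")
    return problems
```

For example, `check_inputs("multi_image", "two dogs play", num_images=3)` returns an empty list, while passing five images to the same task returns a single message about the image count.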

🛠 Preparation

Step 1: Clone the Repository

git clone https://github.com/Tencent-Hunyuan/OmniWeaving
cd OmniWeaving

Step 2: Install Dependencies

OmniWeaving is built upon HunyuanVideo-1.5, so dependency installation follows the same procedure. First, install the basic dependencies:

pip install -r requirements.txt

Additionally, install the attention libraries as needed (we use Flash Attention in practice):

  • Flash Attention: Install for faster inference and reduced GPU memory consumption. See Flash Attention
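
After installation, a quick way to confirm that Flash Attention is visible to the Python environment is to look for its module without importing it. The sketch below relies on the package's import name being `flash_attn`; the helper names and the `"default"` label are placeholders for this example, and how OmniWeaving itself selects an attention backend is not shown in this excerpt.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module with this name can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def pick_attention_backend() -> str:
    # "flash_attn" is the import name of the Flash Attention package;
    # "default" is a placeholder label for whatever fallback is available.
    return "flash_attn" if has_module("flash_attn") else "default"


if __name__ == "__main__":
    print("attention backend:", pick_attention_backend())
```

Using `find_spec` avoids triggering a heavy import (and any CUDA initialization) just to test for availability.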
