Repo: Meituan (LongCat), published Oct 25, 2025

meituan-longcat/LongCat-Video

Python



Language: Python

License: MIT

Stars: 4273

Forks: 669

Open issues: 67

Created: 2025-10-25T06:49:49Z

Pushed: 2026-05-27T02:51:41Z

Default branch: main

Fork: no

Archived: no

README:

LongCat-Video

Model Introduction

We introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* generation tasks. It particularly excels at efficient, high-quality long video generation and represents our first step toward world models.

Key Features

  • 🌟 Unified architecture for multiple tasks: LongCat-Video unifies *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* tasks within a single video generation framework, natively supporting all of them with one model and delivering consistently strong performance on each.
  • 🌟 Long video generation: LongCat-Video is natively pretrained on *Video-Continuation* tasks, which enables it to produce minutes-long videos without color drift or quality degradation.
  • 🌟 Efficient inference: LongCat-Video generates 720p, 30 fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further improves efficiency, particularly at high resolutions.
  • 🌟 Strong performance with multi-reward RLHF: LongCat-Video is optimized with multi-reward Group Relative Policy Optimization (GRPO), and comprehensive evaluations on both internal and public benchmarks show that it achieves performance comparable to leading open-source video generation models as well as the latest commercial solutions.
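
Block Sparse Attention is only named above, not specified. As a rough illustration of the general idea (not LongCat-Video's actual kernel), the NumPy sketch below scores blocks of queries against blocks of keys using mean-pooled representatives and lets each query block attend only to its `top_k` highest-scoring key blocks. The function names, `block_size`, and `top_k` are hypothetical, and it assumes self-attention with equal query and key lengths. For readability it still computes the full score matrix and then masks it; a real block-sparse kernel gets its speedup by never computing the skipped blocks.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; masked entries stay -inf -> 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_attention(q, k, v):
    # Standard scaled dot-product attention over all token pairs.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def block_sparse_attention(q, k, v, block_size, top_k):
    """Each query block attends only to its top_k key blocks (illustrative sketch)."""
    n, d = q.shape
    if n % block_size != 0:
        raise ValueError("sequence length must be a multiple of block_size")
    num_blocks = n // block_size
    # Coarse scores: one mean-pooled representative per query block and key block.
    q_rep = q.reshape(num_blocks, block_size, d).mean(axis=1)
    k_rep = k.reshape(num_blocks, block_size, d).mean(axis=1)
    block_scores = q_rep @ k_rep.T  # shape (num_blocks, num_blocks)
    # Select the top_k key blocks for every query block.
    keep = np.argsort(-block_scores, axis=1)[:, :top_k]
    block_mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    np.put_along_axis(block_mask, keep, True, axis=1)
    # Expand the block-level mask to token resolution.
    token_mask = block_mask.repeat(block_size, axis=0).repeat(block_size, axis=1)
    # Dense scores are masked here for clarity only.
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(token_mask, scores, -np.inf)
    return softmax(scores) @ v, block_mask
```

With `top_k` equal to the number of blocks, the result matches `dense_attention`; smaller values keep only `top_k` key blocks per query block, and longer token sequences (higher resolutions) leave more blocks to skip.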

For more details, please refer to the comprehensive ***LongCat-Video Technical Report***.
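
GRPO replaces a learned value baseline with a group-relative one: several samples are generated for the same prompt, each is scored, and each sample's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch below shows that normalization when several rewards are involved. How LongCat-Video actually combines its rewards is not described here, so the weighted sum and the reward names (`visual_quality`, `motion_quality`, `text_alignment`) are assumptions for illustration only.

```python
from statistics import fmean, pstdev


def group_relative_advantages(group_rewards, weights, eps=1e-6):
    """GRPO-style advantages for one group of samples drawn from the same prompt.

    group_rewards: one dict per sampled video, mapping reward name -> score.
    weights: reward name -> weight (assumption: rewards are merged by weighted sum).
    """
    # Assumption: merge the multiple reward signals into one scalar per sample.
    combined = [sum(w * r[name] for name, w in weights.items()) for r in group_rewards]
    mean = fmean(combined)
    std = pstdev(combined)
    # Advantage = reward relative to the group, scaled by the group's spread.
    return [(c - mean) / (std + eps) for c in combined]
```

Because the baseline is the group mean, the advantages of one group sum to roughly zero: samples better than their siblings are reinforced and worse ones are penalized, without training a separate critic.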


🔥 Latest News!!

  • May 21, 2026: 🚀 We release ***LongCat-Video-Avatar-1.5***, an upgraded open-source framework for audio-driven human video generation. Version 1.5 replaces Wav2Vec2 with Whisper-Large for more accurate lip synchronization, achieves production-ready physical plausibility and temporal stability with robust long-video generation, generalizes to anime, animals, and complex real-world conditions, supports both single-stream and multi-stream audio inputs, and reduces inference to 8 steps via step distillation. [***code*** | 🤗 ***weights*** | ***project page***]
  • Dec 16, 2025: 🚀 We are excited to announce the release of ***LongCat-Video-Avatar***, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including *Audio-Text-to-Video*, *Audio-Text-Image-to-Video*, and *Video Continuation* with seamless compatibility for both *single-stream* and *multi-stream* audio inputs. The release includes our ***Technical Report***, ***inference code***, 🤗 ***model weights***, and ***project page***.
  • Oct 25, 2025: 🚀 We've released LongCat-Video, a foundational video generation model. The technical report and model weights are available via the ***LongCat-Video Technical Report*** and 🤗 ***Hugging Face***!

Quick Start

Installation

Clone the repo:

git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video

Install dependencies:

# create conda environment
conda create -n longcat-video python=3.10
conda activate longcat-video

# install torch (configure according to your CUDA version)
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# install flash-attn-2
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1

# install other requirements
pip install -r requirements.txt

# install longcat-video-avatar requirements
conda install -c conda-forge librosa
conda install -c conda-forge ffmpeg
pip install -r requirements_avatar.txt
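
Before the first run, it can help to confirm that the main dependencies actually resolved. The sketch below uses only the standard library (plus `torch`, if present); the package list mirrors the commands above, and `check_environment` is a hypothetical helper, not part of the repository.

```python
import importlib.util
import shutil
from importlib.metadata import PackageNotFoundError, version

# Packages taken from the installation commands above.
DEFAULT_PACKAGES = ("torch", "torchvision", "torchaudio", "flash_attn", "librosa")


def check_environment(packages=DEFAULT_PACKAGES):
    """Report installed versions; None means the package cannot be imported."""
    report = {}
    for name in packages:
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not importable in this environment
            continue
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[name] = "installed (version unknown)"
    if report.get("torch"):
        import torch  # only imported when torch was requested and found

        report["cuda_available"] = torch.cuda.is_available()
        report["torch_cuda"] = torch.version.cuda
    # ffmpeg is installed as a binary, so look it up on PATH instead.
    report["ffmpeg"] = shutil.which("ffmpeg")
    return report
```

Calling `print(check_environment())` inside the `longcat-video` environment should show `torch` at 2.6.0+cu124, `torch_cuda` at 12.4, and a non-empty `ffmpeg` path if the steps above succeeded.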

FlashAttention-2 is enabled by default in the model config. Once FlashAttention-3 or xformers is installed, you can switch to it by editing the model config ("./weights/LongCat-Video/dit/config.json").
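
The README does not name the config key that selects the attention backend, so the sketch below does not guess one. It loads the DiT config and reports every entry whose key, or an enclosing key, mentions `attn` or `attention`, which shows which setting to edit. `load_config` and `find_attention_settings` are hypothetical helpers.

```python
import json
from pathlib import Path

# Config path from the note above.
CONFIG_PATH = Path("./weights/LongCat-Video/dit/config.json")


def load_config(path=CONFIG_PATH):
    return json.loads(Path(path).read_text())


def find_attention_settings(config, prefix="", matched=False):
    """Collect entries whose key, or an enclosing key, mentions attention."""
    found = {}
    for key, value in config.items():
        path = f"{prefix}{key}"
        hit = matched or "attn" in key.lower() or "attention" in key.lower()
        if isinstance(value, dict):
            # Recurse into nested sections, remembering whether a parent matched.
            found.update(find_attention_settings(value, prefix=path + ".", matched=hit))
        elif hit:
            found[path] = value
    return found
```

After downloading the weights, `print(find_attention_settings(load_config()))` lists the candidate keys with dotted paths for nested sections.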

Model Download

| Model | Description | Download Link |
| --- | --- | --- |
| LongCat-Video | foundational video generation | 🤗 Hugging Face |
| LongCat-Video-Avatar | single- and multi-character audio-driven video generation (wav2vec2) | 🤗 Hugging Face |
| LongCat-Video-Avatar-1.5 | upgraded avatar model with Whisper-large-v3 audio encoder, distillation-based fast inference | 🤗 Hugging Face |

Download models using huggingface-cli:

pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar --local-dir ./weights/LongCat-Video-Avatar
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5
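
The same downloads can be scripted from Python with `huggingface_hub.snapshot_download`, which is part of the `huggingface_hub` package installed above. The repository IDs and target directories below mirror the CLI commands; `plan_downloads` and `download` are hypothetical helper names.

```python
from pathlib import Path

# Repository IDs and local folder names mirror the huggingface-cli commands above.
MODELS = {
    "LongCat-Video": "meituan-longcat/LongCat-Video",
    "LongCat-Video-Avatar": "meituan-longcat/LongCat-Video-Avatar",
    "LongCat-Video-Avatar-1.5": "meituan-longcat/LongCat-Video-Avatar-1.5",
}


def plan_downloads(names, root="./weights"):
    """Return (repo_id, local_dir) pairs without touching the network."""
    return [(MODELS[name], str(Path(root) / name)) for name in names]


def download(names, root="./weights"):
    # Imported lazily so planning works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, local_dir in plan_downloads(names, root):
        snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

For example, `download(["LongCat-Video"])` fetches only the base model into `weights/LongCat-Video`, which is all the Text-to-Video and Image-to-Video demos need.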

Run Text-to-Video

# Single-GPU inference
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile

Run Image-to-Video

# Single-GPU inference
torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_image_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
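
The single- and multi-GPU commands differ only in `--nproc_per_node` and `--context_parallel_size`. A small builder keeps them consistent; it assumes, as in both examples, that the context-parallel size equals the number of GPUs launched. `build_command` is hypothetical, and only the two script names shown above are included, since the Video-Continuation command is truncated in this excerpt.

```python
# Script names taken from the Text-to-Video and Image-to-Video commands above.
SCRIPTS = {
    "text_to_video": "run_demo_text_to_video.py",
    "image_to_video": "run_demo_image_to_video.py",
}


def build_command(task, num_gpus=1, checkpoint_dir="./weights/LongCat-Video", enable_compile=True):
    """Build a torchrun argument list matching the README's demo commands."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    cmd = ["torchrun"]
    if num_gpus > 1:
        cmd.append(f"--nproc_per_node={num_gpus}")
    cmd.append(SCRIPTS[task])
    if num_gpus > 1:
        # Assumption: context parallelism spans all launched processes.
        cmd.append(f"--context_parallel_size={num_gpus}")
    cmd.append(f"--checkpoint_dir={checkpoint_dir}")
    if enable_compile:
        cmd.append("--enable_compile")
    return cmd
```

Passing the returned list to `subprocess.run(cmd, check=True)` launches the demo; `build_command("text_to_video", num_gpus=2)` reproduces the two-GPU Text-to-Video command above exactly.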

Run Video-Continuation

# Single-GPU inference
torchrun…

