meituan-longcat/LongCat-Video
Language: Python
License: MIT
Stars: 4273
Forks: 669
Open issues: 67
Created: 2025-10-25T06:49:49Z
Pushed: 2026-05-27T02:51:41Z
Default branch: main
Fork: no
Archived: no
README:
LongCat-Video
Model Introduction
We introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models.
Key Features
- 🌟 Unified architecture for multiple tasks: LongCat-Video unifies *Text-to-Video*, *Image-to-Video*, and *Video-Continuation* tasks within a single video generation framework. It natively supports all these tasks with a single model and consistently delivers strong performance across each individual task.
- 🌟 Long video generation: LongCat-Video is natively pretrained on *Video-Continuation* tasks, enabling it to produce minutes-long videos without color drifting or quality degradation.
- 🌟 Efficient inference: LongCat-Video generates $720p$, $30fps$ videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions.
- 🌟 Strong performance with multi-reward RLHF: Powered by multi-reward Group Relative Policy Optimization (GRPO), comprehensive evaluations on both internal and public benchmarks demonstrate that LongCat-Video achieves performance comparable to leading open-source video generation models as well as the latest commercial solutions.
For more details, please refer to the comprehensive ***LongCat-Video Technical Report***.
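As a rough illustration of the group-relative idea behind GRPO mentioned in the feature list, the sketch below normalizes reward scores within a group of samples generated from the same prompt. It is not the LongCat-Video training code: the weighted-sum combination of rewards, the reward names, and the helpers `combine_rewards` and `group_relative_advantages` are all assumptions made for this example.

```python
from statistics import mean, pstdev


def combine_rewards(rewards, weights):
    """Weighted sum of several reward scores for one sample.

    Assumption: a plain weighted sum; the actual multi-reward scheme may differ.
    """
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one group: (score - group mean) / group std."""
    mu = mean(scores)
    sigma = pstdev(scores)
    # eps keeps the division defined when every sample in the group scores the same.
    return [(s - mu) / (sigma + eps) for s in scores]


if __name__ == "__main__":
    # Hypothetical reward names and values for three videos sampled from one prompt.
    weights = {"visual_quality": 1.0, "motion_quality": 1.0, "text_alignment": 1.0}
    group = [
        {"visual_quality": 0.6, "motion_quality": 0.4, "text_alignment": 0.7},
        {"visual_quality": 0.8, "motion_quality": 0.5, "text_alignment": 0.6},
        {"visual_quality": 0.3, "motion_quality": 0.7, "text_alignment": 0.5},
    ]
    scores = [combine_rewards(sample, weights) for sample in group]
    print(group_relative_advantages(scores))
```

Samples that score above the group mean receive a positive advantage and samples below it a negative one, so no separate value network is needed to provide a baseline.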
🎥 Teaser Video
🔥 Latest News!!
- May 21, 2026: 🚀 We release ***LongCat-Video-Avatar-1.5***, an upgraded open-source framework for audio-driven human video generation. v1.5 replaces Wav2Vec2 with Whisper-Large for more accurate lip synchronization, achieves production-ready physical rationality and temporal stability with robust long-video generation, generalizes to stylized domains (anime, animals, complex real-world conditions), supports both single-stream and multi-stream audio inputs, and accelerates inference to 8 steps via step distillation. [***code*** | 🤗 ***weights*** | ***project page*** ]
- Dec 16, 2025: 🚀 We are excited to announce the release of ***LongCat-Video-Avatar***, a unified model that delivers expressive and highly dynamic audio-driven character animation, supporting native tasks including *Audio-Text-to-Video*, *Audio-Text-Image-to-Video*, and *Video Continuation* with seamless compatibility for both *single-stream* and *multi-stream* audio inputs. The release includes our ***Technical Report***, ***inference code***, 🤗 ***model weights***, and ***project page***.
- Oct 25, 2025: 🚀 We've released LongCat-Video, a foundational video generation model. Tech report and models are available at ***LongCat-Video Technical Report*** and 🤗 ***Huggingface***!
Quick Start
Installation
Clone the repo:
```shell
git clone --single-branch --branch main https://github.com/meituan-longcat/LongCat-Video
cd LongCat-Video
```
Install dependencies:
```shell
# create conda environment
conda create -n longcat-video python=3.10
conda activate longcat-video

# install torch (configure according to your CUDA version)
pip install torch==2.6.0+cu124 torchvision==0.21.0+cu124 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# install flash-attn-2
pip install ninja
pip install psutil
pip install packaging
pip install flash_attn==2.7.4.post1

# install other requirements
pip install -r requirements.txt

# install longcat-video-avatar requirements
conda install -c conda-forge librosa
conda install -c conda-forge ffmpeg
pip install -r requirements_avatar.txt
```
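After installing, it can help to confirm that the environment matches the pinned versions. The following is a minimal sketch that is not part of the repository; `EXPECTED` and `check_versions` are names introduced here, and it assumes the packages are registered under the same distribution names used in the `pip install` commands.

```python
from importlib import metadata

# Version pins copied from the install commands above.
# The "+cu124" local tag is dropped because it depends on the CUDA build you chose.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "flash_attn": "2.7.4.post1",
}


def check_versions(expected=EXPECTED, lookup=metadata.version):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pin in expected.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Compare only the public version, ignoring a local tag such as "+cu124".
        ok = installed is not None and installed.split("+")[0] == pin
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:12s} pinned={EXPECTED[name]:12s} installed={installed} [{status}]")
```

A mismatch is not necessarily an error, since the torch pin is meant to be adapted to your CUDA version, but it is a useful first thing to check when the FlashAttention build fails.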
FlashAttention-2 is enabled in the model config by default; you can also change the model config ("./weights/LongCat-Video/dit/config.json") to use FlashAttention-3 or xformers once installed.
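The README does not name the config fields that select the attention backend, so the sketch below makes no guess about them. It only loads the config file and lists top-level entries whose key looks attention-related, which is a reasonable first step before editing by hand. The helper `find_attention_keys` and its search terms are assumptions for this example.

```python
import json
from pathlib import Path

# Path quoted in the README; adjust it if your weights live elsewhere.
CONFIG_PATH = Path("./weights/LongCat-Video/dit/config.json")


def find_attention_keys(config, needles=("attn", "attention", "xformers")):
    """Return the top-level entries whose key mentions an attention-related word."""
    return {
        key: value
        for key, value in config.items()
        if any(needle in key.lower() for needle in needles)
    }


if __name__ == "__main__":
    config = json.loads(CONFIG_PATH.read_text())
    for key, value in find_attention_keys(config).items():
        print(f"{key} = {value!r}")
```

If nothing is printed, the relevant setting may be nested or named differently, in which case reading the full file is the safer route.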
Model Download
| Models | Description | Download Link |
| --- | --- | --- |
| LongCat-Video | foundational video generation | 🤗 Huggingface |
| LongCat-Video-Avatar | single- and multi-character audio-driven video generation (wav2vec2) | 🤗 Huggingface |
| LongCat-Video-Avatar-1.5 | upgraded avatar model with Whisper-large-v3 audio encoder, distillation-based fast inference | 🤗 Huggingface |
Download models using huggingface-cli:
```shell
pip install "huggingface_hub[cli]"
huggingface-cli download meituan-longcat/LongCat-Video --local-dir ./weights/LongCat-Video
huggingface-cli download meituan-longcat/LongCat-Video-Avatar --local-dir ./weights/LongCat-Video-Avatar
huggingface-cli download meituan-longcat/LongCat-Video-Avatar-1.5 --local-dir ./weights/LongCat-Video-Avatar-1.5
```
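The same downloads can also be scripted from Python with `snapshot_download` from `huggingface_hub`. This is a sketch rather than something the repository ships; the `MODELS` mapping and the `download_models` helper are introduced here, with repository IDs and target directories copied from the commands above.

```python
# Repository IDs and target directories copied from the huggingface-cli commands.
MODELS = {
    "meituan-longcat/LongCat-Video": "./weights/LongCat-Video",
    "meituan-longcat/LongCat-Video-Avatar": "./weights/LongCat-Video-Avatar",
    "meituan-longcat/LongCat-Video-Avatar-1.5": "./weights/LongCat-Video-Avatar-1.5",
}


def download_models(models=MODELS, downloader=None):
    """Download each repository into its local directory and return the paths."""
    if downloader is None:
        # Imported lazily so the mapping can be inspected without huggingface_hub.
        from huggingface_hub import snapshot_download as downloader
    return [
        downloader(repo_id=repo_id, local_dir=local_dir)
        for repo_id, local_dir in models.items()
    ]


if __name__ == "__main__":
    for path in download_models():
        print("downloaded to", path)
```

Trim `MODELS` to the entries you need, since each repository is a large download.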
Run Text-to-Video
```shell
# Single-GPU inference
torchrun run_demo_text_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_text_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
```
Run Image-to-Video
```shell
# Single-GPU inference
torchrun run_demo_image_to_video.py --checkpoint_dir=./weights/LongCat-Video --enable_compile

# Multi-GPU inference
torchrun --nproc_per_node=2 run_demo_image_to_video.py --context_parallel_size=2 --checkpoint_dir=./weights/LongCat-Video --enable_compile
```
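The single-GPU and multi-GPU commands differ only in two flags, so a small wrapper can build either form. The sketch below is a hypothetical helper, not part of the repository. It uses only the flags shown above and assumes that `--context_parallel_size` should equal `--nproc_per_node`, as it does in the two-GPU examples; other combinations are untested here.

```python
import shlex


def build_torchrun_command(script, checkpoint_dir, num_gpus=1, enable_compile=True):
    """Build the torchrun argument list for one of the demo scripts."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    cmd = ["torchrun"]
    if num_gpus > 1:
        # Multi-GPU runs in the README launch one process per GPU...
        cmd.append(f"--nproc_per_node={num_gpus}")
    cmd.append(script)
    if num_gpus > 1:
        # ...and set the context-parallel size to the same number (assumption).
        cmd.append(f"--context_parallel_size={num_gpus}")
    cmd.append(f"--checkpoint_dir={checkpoint_dir}")
    if enable_compile:
        cmd.append("--enable_compile")
    return cmd


if __name__ == "__main__":
    for script in ("run_demo_text_to_video.py", "run_demo_image_to_video.py"):
        for gpus in (1, 2):
            cmd = build_torchrun_command(script, "./weights/LongCat-Video", gpus)
            print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run` to launch the job from a larger script.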
Run Video-Continuation
# Single-GPU inference torchrun…
Excerpt shown — open the source for the full document.
Notability
notability 6.0/10: High-star video repo from Meituan