Fork · DeepInfra · published Oct 21, 2024 · seen 5d

deepinfra/Pyramid-Flow

forked from jy0205/Pyramid-Flow

Description: Code of Pyramidal Flow Matching for Efficient Video Generative Modeling

License: MIT

Stars: 0

Forks: 0

Open issues: 0

Created: 2024-10-21T12:27:40Z

Pushed: 2024-10-21T12:30:01Z

Default branch: main

Fork: yes

Parent repository: jy0205/Pyramid-Flow

Archived: no

README:

This is the official repository for Pyramid Flow, a training-efficient Autoregressive Video Generation method based on Flow Matching. By training only on open-source datasets, it can generate high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.

[Demo videos: 10s, 768p, 24fps; 5s, 768p, 24fps; image-to-video]

News

  • COMING SOON ⚡️⚡️⚡️ Training code for both the Video VAE and DiT; New model checkpoints trained from scratch.

> We are training Pyramid Flow from scratch to fix human structure issues related to the currently adopted SD3 initialization and hope to release it in the next few days.

  • 2024.10.13 ✨✨✨ [Multi-GPU inference](#3-multi-gpu-inference) and [CPU offloading](#cpu-offloading) are supported. Use it with less than 8GB of GPU memory, with great speedup on multiple GPUs.
  • 2024.10.11 🤗🤗🤗 Hugging Face demo is available. Thanks @multimodalart for the commit!
  • 2024.10.10 🚀🚀🚀 We release the technical report, project page and model checkpoint of Pyramid Flow.

Introduction

![motivation](assets/motivation.jpg)

Existing video diffusion models operate at full resolution, spending a lot of computation on very noisy latents. By contrast, our method harnesses the flexibility of flow matching (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023) to interpolate between latents of different resolutions and noise levels, allowing for simultaneous generation and decompression of visual content with better computational efficiency. The entire framework is end-to-end optimized with a single DiT (Peebles & Xie, 2023), generating high-quality 10-second videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.
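
For orientation, the basic flow matching objective that the cited works build on can be written as follows. This is the generic linear-interpolation form, not the pyramidal variant itself: the symbols $x_0$ (noise), $x_1$ (data latent) and $v_\theta$ (the network) are notation chosen here for illustration, and the technical report is the place to look for the exact per-stage interpolation between resolutions.

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1, \qquad t \in [0, 1], \\
\mathcal{L}(\theta) &= \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\| v_\theta(x_t, t) - (x_1 - x_0) \bigr\|^2 .
\end{align}
```

Under this reading, the efficiency argument in the paragraph above amounts to letting the early, noisy part of the path live at a lower resolution than the final one.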

Installation

We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2, and we are actively working to support a wider range of versions.

git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow

# create env using conda
conda create -n pyramid python==3.8.10
conda activate pyramid
pip install -r requirements.txt

Then, you can download the model directly from Hugging Face. We provide model checkpoints for both 768p and 384p video generation. The 384p checkpoint supports 5-second video generation at 24 FPS, while the 768p checkpoint supports up to 10-second video generation at 24 FPS.

from huggingface_hub import snapshot_download

model_path = 'PATH' # The local directory to save downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-sd3", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')
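
After the download finishes, a quick sanity check can confirm which variants are present locally. The sketch below is not part of the repository: `list_variants` is a hypothetical helper, and it assumes that each `model_variant` name used later (for example `diffusion_transformer_768p`) corresponds to a subdirectory of the same name inside the checkpoint directory.

```python
from pathlib import Path


def list_variants(model_path):
    """Return the names of subdirectories that look like DiT variants.

    Assumption (not confirmed by this README): every `model_variant` value
    maps to a subdirectory of the downloaded checkpoint with the same name.
    """
    root = Path(model_path)
    if not root.is_dir():
        # Nothing downloaded yet, or the path is wrong.
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and p.name.startswith("diffusion_transformer_")
    )
```

Calling `list_variants(model_path)` with the same `model_path` used for `snapshot_download` returns an empty list when the directory is missing, which is a cheap way to catch a wrong path before loading the model.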

Usage

1. Quick start with Gradio

To get started, first install Gradio, set your model path in app.py (at #L32), and then run on your local machine:

python app.py

The Gradio demo will open in a browser. Thanks to @tpc2233 for the commit; see #48 for details.

Or, try it out effortlessly on Hugging Face Space 🤗 created by @multimodalart. Due to GPU limits, this online demo can only generate 25 frames (export at 8FPS or 24FPS). Duplicate the space to generate longer videos.

2. Inference Code

To use our model, please follow the inference code in video_generation_demo.ipynb. We further simplify it into the following two-step procedure. First, load the downloaded model:

import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16  # Use bf16 (fp16 is not supported yet)

model = PyramidDiTForVideoGeneration(
    'PATH',                                      # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',  # or 'diffusion_transformer_384p'
)

model.vae.enable_tiling()
# model.vae.to("cuda")
# model.dit.to("cuda")
# model.text_encoder.to("cuda")

# If you are not using sequential offloading below, uncomment the lines above ^
model.enable_sequential_cpu_offload()
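
The choice between the commented-out `.to("cuda")` lines and sequential CPU offloading can be wrapped in one small function. This is a minimal sketch rather than repository code: `place_model` and its `offload` and `device` parameters are hypothetical names, and it only uses the calls that already appear in the snippet above.

```python
def place_model(model, offload=True, device="cuda"):
    """Prepare a loaded model for inference.

    offload=True  -> sequential CPU offloading (lower GPU memory use).
    offload=False -> move the VAE, DiT and text encoder to `device`.
    """
    model.vae.enable_tiling()
    if offload:
        model.enable_sequential_cpu_offload()
    else:
        # Equivalent to uncommenting the three `.to("cuda")` lines above.
        model.vae.to(device)
        model.dit.to(device)
        model.text_encoder.to(device)
    return model
```

With enough GPU memory, `place_model(model, offload=False)` keeps every component on the GPU; otherwise the default follows the offloading path shown above.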

Then, you can try text-to-video generation on your own prompts:

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=16,                   # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,        # Guidance for the first frame; set it to 7 for the 384p variant
        video_guidance_scale=5.0,  # Guidance for the other video latents
        output_type="pil",
        save_memory=True,          # If you have enough GPU memory, set it to `False` to improve VAE decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
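
The comment on `temp` gives two durations but not the rule behind them. The sketch below uses an assumed mapping that is not stated in this README: the first latent decodes to one frame and each further latent to 8 frames. The helper names and the `temporal_factor` parameter are hypothetical. The rule does reproduce the quoted durations at 24 FPS, and it gives 25 frames at `temp=4`, the same frame count quoted for the online demo.

```python
def frames_for_temp(temp, temporal_factor=8):
    """Number of output frames for `temp` latent steps.

    Assumption: frames = (temp - 1) * temporal_factor + 1.
    """
    if temp < 1:
        raise ValueError("temp must be >= 1")
    return (temp - 1) * temporal_factor + 1


def duration_seconds(temp, fps=24):
    """Approximate clip length in seconds at the given frame rate."""
    return frames_for_temp(temp) / fps


for temp in (16, 31):
    # Prints: 16 121 5.04  and  31 241 10.04
    print(temp, frames_for_temp(temp), round(duration_seconds(temp), 2))
```

If the actual temporal compression differs, only `temporal_factor` needs to change.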

As an autoregressive model, our model also supports (text-conditioned) image-to-video generation:

image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((1280, 768))
prompt = "FPV flying over the Great Wall"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames =…

Excerpt shown — open the source for the full document.

Notability

notability 1.0/10

Routine fork, no new content.