zai-org/SCAIL-2
Python
Description: Official Implementation of SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning
Language: Python
License: Apache-2.0
Stars: 177
Forks: 7
Open issues: 2
Created: 2026-05-28T11:38:02Z
Pushed: 2026-06-10T01:50:21Z
Default branch: wan-scail2
Fork: no
Archived: no
README: SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning
This repository contains the official implementation of SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning. The code covers inference for the SCAIL-2 model, an open-source model for end-to-end character animation.
🔎 Motivation and Results
SCAIL-1 identified the key bottlenecks that keep character animation from reaching production quality: how to represent the pose and how to inject it. However, reliance on an intermediate pose representation still limits the model on complex motion and generalizable identity. We define this issue as over-reliance on intermediates.
As intermediates, skeleton maps suffer from inherent ambiguity in complex scenarios. They also restrict the driving source to exocentric human movement and therefore cannot handle driving sources such as animals. Character replacement and multi-character animation suffer from similar issues: state-of-the-art methods use inpainting masks, but such masks are still a form of intermediate, which limits their applicability and bounds their performance.
To bypass intermediate pose representations, we use several off-the-shelf models, including SCAIL-Preview, Wan-Animate, and MoCha, to synthesize 60K motion pairs. A Unified Motion Transfer Interface, containing two types of masking channels and a dedicated RoPE design, lets us train on all of this data. We utilize reserve driving, so that the model can learn capabilities beyond those of the source models. From the data composition and the training recipe, the final model yields emergent capabilities. For example, it supports cross-identity replacement and animal-driving scenarios, and it supports more advanced control intermediates such as SAM3D-Body's mesh rendering in a zero-shot manner.
🚀 Getting Started
Checkpoints Download
| ckpts | Download Link | Notes |
|-------|---------------|-------|
| SCAIL-2 | 🤗 Hugging Face 🤖 ModelScope | Trained with mixed resolutions and fps. End-to-end driving supports both 512p and 704p. Pose-driven mode performs better at 704p. For other resolutions, H and W must both be divisible by 32 (e.g. 704*1280). |
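The divisibility rule in the Notes column can be checked before launching a job. The following is a minimal sketch, not part of the repository: `check_resolution` and `snap_down` are hypothetical helper names, and rounding down (rather than to the nearest multiple) is an assumption made for illustration.

```python
def check_resolution(height: int, width: int, multiple: int = 32) -> bool:
    """Return True when both dimensions are divisible by `multiple`."""
    return height % multiple == 0 and width % multiple == 0


def snap_down(value: int, multiple: int = 32) -> int:
    """Round `value` down to a multiple of `multiple` (never below one multiple)."""
    if value <= 0:
        raise ValueError("value must be positive")
    return max(multiple, (value // multiple) * multiple)


if __name__ == "__main__":
    # 704*1280 is the example resolution given in the table.
    print(check_resolution(704, 1280))
    # 720 is not divisible by 32, so it is snapped down.
    print(snap_down(720), snap_down(1280))
```

Snapping down keeps the output no larger than the requested size; whether cropping or resizing is the right way to reach that size depends on the input pipeline.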
Use the following command to download the model weights (the Wan VAE and T5 modules are integrated into this checkpoint for convenience):
hf download zai-org/SCAIL-2
The files should be organized like:
SCAIL-2/
├── Wan2.1_VAE.pth
├── model
│   ├── 1
│   │   └── fsdp2_rank_0000_checkpoint.pt
│   └── latest
└── umt5-xxl
    ├── ...
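A quick way to confirm the download matches this layout is to test for each listed entry. This is a minimal sketch under stated assumptions: `missing_checkpoint_files` is a hypothetical helper, and it checks only the entries named in the tree above (the contents of `umt5-xxl` are elided in the README, so only the directory itself is checked).

```python
from pathlib import Path

# Relative paths copied from the layout above; nothing beyond it is assumed.
EXPECTED = (
    "Wan2.1_VAE.pth",
    "model/1/fsdp2_rank_0000_checkpoint.pt",
    "model/latest",
    "umt5-xxl",
)


def missing_checkpoint_files(root, expected=EXPECTED):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("/path/to/SCAIL-2")
    print("layout OK" if not missing else f"missing: {missing}")
```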
The model weights are intended for the sat branch. To use them in the wan branch, convert them to safetensors format:
python convert.py --scail-dir /path/to/SCAIL-2 --save-path /path/to/SCAIL-2.safetensors
Environment Setup
Use Python 3.10 through 3.12 (both inclusive).
pip install -r requirements.txt
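The version range can be verified from inside the interpreter before installing anything. A minimal sketch, assuming only the 3.10 to 3.12 range stated above; `python_version_supported` is a hypothetical name, not a function from this repository.

```python
import sys

MIN_VERSION = (3, 10)  # lowest supported (major, minor), per the README
MAX_VERSION = (3, 12)  # highest supported (major, minor), per the README


def python_version_supported(version=None):
    """Check a (major, minor, ...) tuple; defaults to the running interpreter."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return MIN_VERSION <= major_minor <= MAX_VERSION


if __name__ == "__main__":
    if not python_version_supported():
        raise SystemExit(f"Unsupported Python {sys.version.split()[0]}")
    print("Python version OK")
```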
Input Preparation
SCAIL-Pose contains the preprocessing code used to prepare SCAIL-2 inputs, including pose extraction, pose rendering, reference masks, and driving-video masks. It can prepare both animation inputs and character replacement inputs. The submodule should live under the project root:
SCAIL-2/
├── generate.py
├── examples/
├── SCAIL-Pose/
└── ...
After cloning this repository, initialize the submodule:
git submodule update --init --recursive
Enter the submodule and follow its environment setup. SCAIL-Pose recommends an OpenMMLab/MMPose environment, then installing its own requirements:
cd SCAIL-Pose
pip install -r requirements.txt
Download the pose-preprocessing weights into SCAIL-Pose/pretrained_weights. The required layout is:
pretrained_weights/
├── nlf_l_multi_0.3.2.torchscript
└── DWPose/
    ├── dw-ll_ucoco_384.onnx
    └── yolox_l.onnx
For SCAIL-2 animation, SCAIL-Pose provides an all-in-one preprocessing entrypoint:
# Recommended end-to-end mode: rendered_v2.mp4 is the driving video copy,
# and the mask video is generated from SAM3 masks.
python NLFPoseExtract/process_animation_aio.py --subdir /path/to/input --e2e_mode

# Pose-driven mode: runs NLF + DWPose and writes a skeleton render.
python NLFPoseExtract/process_animation_aio.py --subdir /path/to/input
For character replacement, use:
python NLFPoseExtract/process_replacement.py --subdir /path/to/input

# If the driving video has multiple people and only one should be replaced:
python NLFPoseExtract/process_replacement.py --subdir /path/to/input --matchnearest
The preprocessing outputs are written back to the example folder and can be passed to generate.py as --image, --mask_image, --pose, and --mask_video.
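When preprocessing many samples, the invocations above can be assembled programmatically. The sketch below is illustrative only: `preprocess_command` is a hypothetical helper, it uses just the script paths and flags shown above (`--subdir`, `--e2e_mode`, `--matchnearest`), and it assumes each flag applies only to the script the README shows it with.

```python
def preprocess_command(subdir, task="animation", e2e_mode=False, match_nearest=False):
    """Build the argument list for one SCAIL-Pose preprocessing run."""
    if task == "animation":
        if match_nearest:
            raise ValueError("--matchnearest is only shown for replacement")
        cmd = ["python", "NLFPoseExtract/process_animation_aio.py",
               "--subdir", str(subdir)]
        if e2e_mode:
            cmd.append("--e2e_mode")  # end-to-end mode instead of pose-driven
    elif task == "replacement":
        if e2e_mode:
            raise ValueError("--e2e_mode is only shown for animation")
        cmd = ["python", "NLFPoseExtract/process_replacement.py",
               "--subdir", str(subdir)]
        if match_nearest:
            cmd.append("--matchnearest")  # replace one person among several
    else:
        raise ValueError(f"unknown task: {task!r}")
    return cmd


if __name__ == "__main__":
    print(" ".join(preprocess_command("/path/to/input", e2e_mode=True)))
```

The returned list can be handed to `subprocess.run(cmd, check=True, cwd="SCAIL-Pose")`, since the script paths are relative to the submodule directory.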
🦾 Usage
Input Preparation
generate.py runs one SCAIL-2 inference job from four local input files:
examples/001/
├── ref.jpg               # reference character image
├── ref_mask.jpg          # foreground mask of the reference image
├── rendered_v2.mp4       # driving / pose video consumed by --pose
└── rendered_mask_v2.mp4  # per-frame driving mask consumed by --mask_video
The paths passed to --image, --mask_image, --pose, and --mask_video must exist. The script checks them before loading the image/video data.
For animation mode, --pose can be an end-to-end driving video or a pose-rendered video, depending on how the sample was prepared. --mask_video should be the corresponding per-frame foreground/control mask. For replacement mode, pass --replace_flag and provide the replacement-region mask through --mask_video.
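The flag-to-file mapping described above can be captured in a small command builder. This is a sketch under stated assumptions, not the repository's own tooling: `generate_command` is a hypothetical helper, `--replace_flag` is assumed to be a boolean switch that takes no value, and any other arguments `generate.py` may require (for example a checkpoint path) are left out because this excerpt does not show them.

```python
import shlex


def generate_command(image, mask_image, pose, mask_video, prompt, replace=False):
    """Build the generate.py argument list from the four input files."""
    cmd = [
        "python", "generate.py",
        "--image", str(image),            # reference character image
        "--mask_image", str(mask_image),  # foreground mask of the reference
        "--pose", str(pose),              # driving or pose-rendered video
        "--mask_video", str(mask_video),  # per-frame mask video
        "--prompt", prompt,               # describes the generated video
    ]
    if replace:
        cmd.append("--replace_flag")      # assumed switch for replacement mode
    return cmd


if __name__ == "__main__":
    cmd = generate_command(
        "examples/001/ref.jpg",
        "examples/001/ref_mask.jpg",
        "examples/001/rendered_v2.mp4",
        "examples/001/rendered_mask_v2.mp4",
        prompt="<description of the generated video>",
    )
    print(shlex.join(cmd))
```

No existence check is done here because, as noted above, the script itself verifies the four paths before loading any data.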
Prompt Semantics
For both animation and character replacement, --prompt should describe the generated video itself. It should not be an…