Repo: Tencent Hunyuan, published Mar 24, 2026

Tencent-Hunyuan/SAGE-GRPO

Python


Description: Official Implementation of SAGE-GRPO: Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Language: Python

License: NOASSERTION

Stars: 123

Forks: 2

Open issues: 2

Created: 2026-03-24T07:59:12Z

Pushed: 2026-04-02T06:08:46Z

Default branch: master

Fork: no

Archived: no

README:

Manifold-Aware Exploration for Reinforcement Learning in Video Generation

Figure 1. Illustration of SAGE-GRPO. (Left) (a.1) At higher noise regions, Euler-style discretization introduces extra energy (discretization error) beyond the true integral. (a.2) Our precise SDE removes unnecessary noise energy in high-noise regions, enabling more precise exploration and a better-learned data manifold. (Right) (b) Our method with improved exploration yields more stable and better-aligned generations compared with DanceGRPO, FlowGRPO, and CPS.

Highlights

We formulate GRPO for video generation as a manifold-constrained exploration problem:

Figure 2. Geometric interpretation of noise injection strategies. Conventional linear SDEs (red) inject exploration noise using first-order approximations, causing off-manifold drift and temporal jitter. Our Manifold-Aware SDE (blue) uses a logarithmic correction term so that exploration noise stays close to the flow trajectory and the video manifold.

  • Core Problem: We show that the ODE-to-SDE conversions used in existing video GRPO methods can inject excess noise in high-noise steps, which reduces rollout quality and makes reward-guided updates less reliable.
  • Micro-level: We constrain exploration with a *Precise Manifold-Aware SDE* and a *Gradient Norm Equalizer*, so that sampling noise stays manifold-consistent and updates are balanced across timesteps.
  • Macro-level: We constrain long-horizon exploration with a *Dual Trust Region* using moving anchors and step-wise constraints, so that the trust region tracks more manifold-consistent checkpoints and prevents drift.
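The excess-noise claim in the Core Problem bullet can be illustrated numerically. The sketch below assumes a hypothetical squared noise scale σ²(t) = a²·t/(1−t), with t = 1 as pure noise; this schedule and the function names are assumptions for illustration and are not taken from the repository. It compares the Euler-style estimate of the noise energy injected over one step, σ²(t_hi)·Δt, with the exact integral of σ² over the same step.

```python
import math


def sigma_sq(t, a=1.0):
    """Hypothetical squared noise scale; grows without bound as t -> 1."""
    return a * a * t / (1.0 - t)


def euler_noise_energy(t_hi, t_lo, a=1.0):
    """Euler-style estimate: sigma^2 at the noisier endpoint times the step size."""
    return sigma_sq(t_hi, a) * (t_hi - t_lo)


def exact_noise_energy(t_hi, t_lo, a=1.0):
    """Closed-form integral of a^2 * t / (1 - t) from t_lo to t_hi."""
    return a * a * ((t_lo - t_hi) + math.log((1.0 - t_lo) / (1.0 - t_hi)))


# Compare one high-noise step and one low-noise step of equal size.
for t_hi, t_lo in [(0.95, 0.90), (0.15, 0.10)]:
    euler = euler_noise_energy(t_hi, t_lo)
    exact = exact_noise_energy(t_hi, t_lo)
    print(f"t: {t_hi:.2f} -> {t_lo:.2f}  euler/exact = {euler / exact:.3f}")
```

Under this assumed schedule the exact integral contains a logarithm, and the Euler-style estimate overshoots it by a larger factor on the step near t = 1 than on the low-noise step, which is consistent with the qualitative picture in Figure 1.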

Abstract

Group Relative Policy Optimization (GRPO) methods for video generation, such as FlowGRPO, remain far less reliable than their counterparts for language models and images. This gap arises because video generation has a complex solution space, and the ODE-to-SDE conversion used for exploration can inject excess noise. The excess noise lowers rollout quality and makes reward estimates less reliable, which destabilizes post-training alignment.

To address this problem, we view the pre-trained model as defining a valid video data manifold and formulate the core problem as constraining exploration within the vicinity of this manifold, ensuring that rollout quality is preserved and reward estimates remain reliable.

We propose SAGE-GRPO (Stable Alignment via Exploration), which applies constraints at both micro and macro levels. At the micro level, we derive a *precise manifold-aware SDE* with a logarithmic curvature correction and introduce a *gradient norm equalizer* to stabilize sampling and updates across timesteps. At the macro level, we use a *dual trust region* with a periodic moving anchor and stepwise constraints so that the trust region tracks checkpoints that are closer to the manifold and limits long-horizon drift.
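For background, the group-relative advantage that GRPO-style methods build on can be sketched as follows. This is the generic GRPO normalization, not SAGE-GRPO's gradient norm equalizer or dual trust region, whose details are not given in this excerpt. The function name, the `eps` value, and the use of the population standard deviation are assumptions; implementations differ on these points.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so rollouts
    are scored relative to their siblings rather than on an absolute scale.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, see lead-in
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four rollouts of one prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.1, 0.8])
print(advantages)
```

Because the advantages depend only on differences within the group, noisy or low-quality rollouts distort the group mean and standard deviation, which is one way unreliable rewards can feed into the update.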

We evaluate SAGE-GRPO on HunyuanVideo-1.5 using VideoAlign as the reward model and observe consistent gains over previous methods in VQ (visual quality), MQ (motion quality), TA (text alignment), and visual metrics (CLIPScore, PickScore), demonstrating superior performance in both reward maximization and overall video quality.

Table of Contents

  • [Highlights](#highlights)
  • [Abstract](#abstract)
  • [Installation](#installation)
  • [Checkpoint Preparation](#checkpoint-preparation)
  • [Post-Training](#post-training)
  • [Key Training Parameters](#key-training-parameters)
  • [Recommended 64-GPU Default](#recommended-64-gpu-default)
  • [Visualization Gallery](#visualization-gallery)
  • [Acknowledgements](#acknowledgements)
  • [License](#license)
  • [Citation](#citation)

Installation

1. Clone the repository

git clone
cd SAGE-GRPO

2. Install Python dependencies

pip install -r requirements.txt

3. Download the reward model weights with the helper script

bash download_weights.sh

4. Download the remaining HunyuanVideo checkpoints

After running download_weights.sh, follow checkpoints-download.md to download the remaining weights: the base model, text encoder, and vision encoder.

Checkpoint Preparation

SAGE-GRPO expects both the HunyuanVideo-1.5 base checkpoints and the VideoReward reward model to be available under ./ckpts.

Useful references:

  • Base model documentation: README_HYVideo.md
  • Detailed checkpoint download instructions: checkpoints-download.md
  • Reward checkpoint helper: download_weights.sh

Expected Checkpoint Layout

ckpts/
├── assets
├── config.json
├── LICENSE
├── NOTICE
├── README.md
├── README_CN.md
├── scheduler
├── text_encoder
│   ├── byt5-small
│   ├── Glyph-SDXL-v2
│   └── llm
├── transformer
├── upsampler
├── vae
├── VideoReward
│   ├── checkpoint-11352
│   ├── model_config.json
│   └── README.md
└── vision_encoder
    └── siglip

If your local structure differs substantially from the above, training usually fails during model or reward initialization.
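A quick pre-flight check can catch a wrong layout before a long job is launched. The sketch below is not part of the repository: the helper name `missing_checkpoint_entries` and the choice of which entries to check are assumptions, and the paths are copied from the expected layout above (with siglip assumed to sit under vision_encoder).

```python
from pathlib import Path

# Paths copied from the expected layout; edit this list if your setup differs.
EXPECTED_ENTRIES = [
    "config.json",
    "scheduler",
    "text_encoder/byt5-small",
    "text_encoder/Glyph-SDXL-v2",
    "text_encoder/llm",
    "transformer",
    "upsampler",
    "vae",
    "VideoReward/checkpoint-11352",
    "VideoReward/model_config.json",
    "vision_encoder/siglip",
]


def missing_checkpoint_entries(root="./ckpts"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_ENTRIES if not (root / rel).exists()]
```

Calling `missing_checkpoint_entries()` from the repository root returns an empty list when every listed entry is present. It only checks that the paths exist, not that the weights inside them are complete or uncorrupted.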

Post-Training

Hardware Recommendation

| Requirement | Recommended |
| --- | --- |
| GPU memory | 80 GB per GPU |
| GPU count | 64 GPUs (8 nodes x 8) |
| OS | Linux |
| PyTorch | 2.6+ |

Single-node multi-GPU

For a single machine with 8 GPUs:

bash run_post_train.sh

This launches post_train.py with the default GRPO configuration via torchrun --nproc_per_node=8.

Multi-node multi-GPU

For multi-node training:

bash run_post_train_multinode.sh

The multi-node entry internally calls:

bash scripts/post_train/pdsh_train.sh "scripts/post_train/train_grpo.sh"

Before starting, edit or export the node list and the rendezvous-related environment variables that your cluster launcher expects.

Key Training Parameters

Distributed Training

The three most important distributed-training knobs are sp_size, batch_size, and num_generations. The data-parallel degree is derived from the world size and sp_size:

dp_degree = world_size / sp_size

These values must also satisfy a validity constraint:

(batch_size * dp_degree) % num_generations == 0
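The two rules above can be checked before launching a job. The sketch below is a hypothetical helper, not a function from the repository; it assumes world_size is the total number of GPUs across all nodes and that sp_size must evenly divide it, as the parameter table states.

```python
def check_grpo_layout(world_size, sp_size, batch_size, num_generations):
    """Validate the divisibility rules and return the resulting dp_degree."""
    if world_size % sp_size != 0:
        raise ValueError(
            f"sp_size={sp_size} must evenly divide world_size={world_size}"
        )
    dp_degree = world_size // sp_size
    if (batch_size * dp_degree) % num_generations != 0:
        raise ValueError(
            f"batch_size * dp_degree = {batch_size * dp_degree} is not "
            f"divisible by num_generations={num_generations}"
        )
    return dp_degree
```

For the recommended 64-GPU setup with the table defaults, `check_grpo_layout(64, 8, 2, 4)` returns 8, since 64 / 8 = 8 and 2 * 8 = 16 is divisible by 4.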

| Parameter | Default | Description |
| --- | --- | --- |
| sp_size | 8 | Sequence parallel degree. Must evenly divide world_size. |
| batch_size | 2 | Per-rank video micro-batch size. |
| num_generations | 4 | Number of…
