Tencent-Hunyuan/HY-SOAR
Description: HY-SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models
Language: Python
License: NOASSERTION
Stars: 629
Forks: 64
Open issues: 0
Created: 2026-04-16T06:34:05Z
Pushed: 2026-04-21T16:45:31Z
Default branch: main
Fork: no
Archived: no
README:
Beyond SFT and RL: Self-Correction during Generation without Reward Models, Preference Labels, or Negative Samples.
🔥 News
- April 2026: 🎉 HY-SOAR is open source. The training and evaluation code is publicly available.
🗂️ Contents
- [🔥 News](#-news)
- [📖 Introduction](#-introduction)
- [✨ Key Features](#-key-features)
- [🖼 Showcases](#-showcases)
- [📑 Open-Source Plan](#-open-source-plan)
- [🛠 Environment Setup](#-environment-setup)
- [🎯 Reward Preparation](#-reward-preparation)
- [🚀 Usage](#-usage)
- [📊 Evaluation](#-evaluation)
- [🧾 Data Format](#-data-format)
- [📚 Citation](#-citation)
- [🙏 Acknowledgement](#-acknowledgement)
---
📖 Introduction
HY-SOAR (Self-Correction for Optimal Alignment and Refinement) is a reward-free post-training method for rectified-flow diffusion models. It targets exposure bias in the denoising trajectory: standard SFT trains the denoiser on ideal forward-noising states from real data, while inference conditions on states produced by the model's own earlier predictions. Once an early denoising step drifts, later steps must recover from states that were not directly optimized, so errors can compound across the trajectory.
Instead of waiting for a terminal reward after a full rollout, SOAR teaches the model to correct its own trajectory errors at the timestep where they occur. Given a clean latent $z_0$, noise endpoint $z_1$, and condition $c$, SOAR:
1. Samples an on-trajectory noisy state and performs one stop-gradient CFG rollout step with the current model.
2. Re-noises the resulting off-trajectory state toward the same noise endpoint $z_1$ to create auxiliary states.
3. Supervises the denoiser with the analytical correction target $v_{\mathrm{corr}} = (z_{\sigma_{t'}} - z_0) / \sigma_{t'}$.
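A short derivation may help show how the correction target relates to the standard flow-matching velocity. It assumes the usual rectified-flow linear interpolation between $z_0$ and $z_1$, which matches the notation here but is not spelled out in this README. The hatted state $\hat z_{\sigma_{t'}}$ and the drift $\delta$ are symbols introduced only for this illustration.

```latex
\begin{aligned}
z_{\sigma} &= (1-\sigma)\, z_0 + \sigma\, z_1
  && \text{(on-trajectory state, assumed interpolation)} \\
\frac{z_{\sigma} - z_0}{\sigma} &= z_1 - z_0
  && \text{(standard flow-matching velocity)} \\
\hat z_{\sigma_{t'}} &= z_{\sigma_{t'}} + \delta
  && \text{(off-trajectory state with drift } \delta \text{)} \\
\frac{\hat z_{\sigma_{t'}} - z_0}{\sigma_{t'}} &= (z_1 - z_0) + \frac{\delta}{\sigma_{t'}}
  && \text{(correction target evaluated at the drifted state)}
\end{aligned}
```

Under these assumptions the target equals the flow-matching velocity when $\delta = 0$, and otherwise gains a term that cancels the drift: stepping from $\hat z_{\sigma_{t'}}$ by $-\sigma_{t'}$ along the target lands exactly on $z_0$.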
This gives SOAR an on-policy, dense, and reward-free training signal. The base objective subsumes standard SFT, while the auxiliary correction loss trains on nearby model-induced states, making SOAR a stronger first post-training stage that remains compatible with subsequent reward-based alignment.
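The three steps and the combined objective can be sketched in a few lines of dependency-free Python. This is a toy illustration, not the repository's training code: `model` is a hypothetical velocity predictor, the re-noising rule (linear interpolation toward $z_1$ with weight `renoise_w`) and the loss weight `aux_weight` are assumptions, and CFG and stop-gradient are only indicated in comments.

```python
def lerp(a, b, w):
    """Return (1 - w) * a + w * b element-wise for two equal-length lists."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def velocity_target(z, z0, sigma):
    """Analytical target (z - z0) / sigma, anchored to the clean latent z0."""
    return [(x - x0) / sigma for x, x0 in zip(z, z0)]


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def soar_states(z0, z1, model, sigma_t, sigma_next, renoise_w):
    """Build (state, sigma, target) triples for the base and auxiliary losses.

    `model(z, sigma)` is a hypothetical velocity predictor. In real training
    the rollout prediction would use CFG and be detached (stop-gradient).
    """
    # Step 1: on-trajectory state, then one Euler rollout step to sigma_next.
    z_t = lerp(z0, z1, sigma_t)
    v = model(z_t, sigma_t)
    z_roll = [x + (sigma_next - sigma_t) * vi for x, vi in zip(z_t, v)]
    # Step 2 (assumed form): re-noise toward the same noise endpoint z1.
    z_aux = lerp(z_roll, z1, renoise_w)
    sigma_aux = sigma_next + renoise_w * (1.0 - sigma_next)
    # Step 3: analytical targets for the base and auxiliary states.
    base = (z_t, sigma_t, velocity_target(z_t, z0, sigma_t))
    aux = (z_aux, sigma_aux, velocity_target(z_aux, z0, sigma_aux))
    return base, aux


def soar_loss(model, base, aux, aux_weight=1.0):
    """Flow-matching loss on the base state plus a weighted correction loss."""
    total = 0.0
    for weight, (z, sigma, target) in ((1.0, base), (aux_weight, aux)):
        total += weight * mse(model(z, sigma), target)
    return total
```

With a model that already predicts $z_1 - z_0$ everywhere, the rollout stays on the transport ray, the auxiliary target reduces to the flow-matching target, and both loss terms vanish. Setting `aux_weight=0.0` leaves only the base term, which is the sense in which the objective subsumes standard SFT.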
✨ Key Features
- 🧭 Exposure-Bias Correction: SOAR directly addresses the mismatch between ground-truth training states and model-induced inference states, the source of many compounding denoising failures.
- 🔁 On-Policy Off-Trajectory Supervision: Off-trajectory states are produced by the current model's own rollout, so the training distribution co-evolves with the model instead of staying fixed to the SFT data trajectory.
- 🎯 Reward-Free Dense Objective: SOAR requires no reward model, preference labels, or negative samples. It provides per-timestep correction supervision and avoids terminal-reward credit assignment.
- 📐 Geometric Correction Target: Re-noising uses the same noise endpoint as the base flow-matching pair, keeping auxiliary states near the original transport ray and yielding a concrete correction velocity anchored to $z_0$.
- 🔧 Compatible Post-Training Stage: The SOAR loss extends the standard flow-matching objective, so it can replace SFT as a stronger first post-training stage while remaining compatible with later RL alignment.
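The geometric claim in the feature list can be checked on a two-dimensional toy example. The sketch below compares re-noising a drifted state toward the same endpoint $z_1$ with re-noising toward a fresh noise draw, and measures each result's distance from the line through $z_0$ and $z_1$. The linear-interpolation form of re-noising and all numeric values are assumptions chosen for illustration.

```python
import math


def lerp(a, b, w):
    """Return (1 - w) * a + w * b element-wise for two equal-length lists."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def distance_to_ray(z, z0, z1):
    """Euclidean distance from z to the line through z0 and z1."""
    d = [b - a for a, b in zip(z0, z1)]
    r = [x - a for x, a in zip(z, z0)]
    t = sum(x * y for x, y in zip(r, d)) / sum(x * x for x in d)
    return math.sqrt(sum((x - t * y) ** 2 for x, y in zip(r, d)))


z0 = [0.0, 0.0]           # clean latent
z1 = [1.0, 0.0]           # noise endpoint of the base flow-matching pair
fresh_noise = [0.0, 1.0]  # hypothetical independent noise draw
drifted = [0.6, 0.2]      # state pushed off the ray by a rollout step

same_endpoint = lerp(drifted, z1, 0.5)
new_endpoint = lerp(drifted, fresh_noise, 0.5)

d_before = distance_to_ray(drifted, z0, z1)
d_same = distance_to_ray(same_endpoint, z0, z1)
d_fresh = distance_to_ray(new_endpoint, z0, z1)
```

In this toy setup, interpolating toward the shared endpoint halves the off-ray distance (`d_same < d_before`), while a fresh endpoint moves the state further from the ray (`d_fresh > d_before`).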
🖼 Showcases
Showcase 1: Aesthetic Reward Optimization
Comparison of SOAR vs Flow-GRPO vs SFT across training steps, optimizing for aesthetic quality on diverse prompts (historical scenes, fantasy art, character portraits).
Showcase 2: CLIPScore Reward Optimization
Comparison on design and poster generation prompts, optimizing for text-image alignment (CLIPScore). SOAR demonstrates stronger text rendering and compositional fidelity.
Showcase 3: WebUI / Design Generation
SOAR results on web UI and graphic design generation, showing accurate layout, typography, and visual hierarchy.
📑 Open-Source Plan
- HY-SOAR
- [x] Training code
- [x] Evaluation code
🛠 Environment Setup
Our implementation builds on the DiffusionNFT and Flow-GRPO codebases, so most of the environment setup carries over from them.
Clone this repository and install the packages:
    git clone https://github.com/Tencent-Hunyuan/HY-SOAR.git
    cd HY-SOAR
    conda create -n hy-soar python=3.10.16
    conda activate hy-soar
    pip install torch==2.6.0 torchvision==0.21.0 --index-url https://download.pytorch.org/whl/cu126
    pip install -e .
    export PYTHONPATH=$PWD/sora:$PYTHONPATH
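After installation, a quick check that the pinned packages resolved as expected can save debugging time later. The helper below is a hypothetical convenience script, not part of the repository. It uses only the standard library and ignores local version suffixes such as `+cu126`, which wheels from the CUDA index may report.

```python
from importlib import metadata

# Versions pinned in the install commands above.
PINNED = {"torch": "2.6.0", "torchvision": "0.21.0"}


def check_pins(pins=PINNED):
    """Return {package: problem} for packages that are missing or mismatched."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        # Strip a local suffix such as "+cu126" before comparing.
        if found.split("+")[0] != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems
```

Calling `check_pins()` inside the `hy-soar` environment returns an empty dict when both pins match.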
Base Model: The training script expects stabilityai/stable-diffusion-3.5-medium as the pretrained model. You need to accept the model license on Hugging Face and authenticate via huggingface-cli login before training.
🎯 Reward Preparation
Our supported reward models include GenEval, OCR, PickScore, ClipScore, HPSv2.1, Aesthetic, and ImageReward. Compared with Flow-GRPO, we additionally support HPSv2.1 and simplify GenEval so that it runs locally instead of on a remote server.
📦 Checkpoints Downloading
    mkdir reward_ckpts
    cd reward_ckpts
    # Aesthetic
    wget https://github.com/christophschuhmann/improved-aesthetic-predictor/raw/refs/heads/main/sac+logos+ava1-l14-linearMSE.pth
    # GenEval
    wget https://download.openmmlab.com/mmdetection/v2.0/mask2former/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco/mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth
    # ClipScore
    wget https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K/resolve/main/open_clip_pytorch_model.bin
    # HPSv2.1
    wget https://huggingface.co/xswu/HPSv2/resolve/main/HPS_v2.1_compressed.pt
    cd ..
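Large downloads sometimes fail silently, so it can be worth confirming that every checkpoint landed before running evaluation. The sketch below is a hypothetical helper, not part of the repository. The file names are taken from the download URLs above, and it assumes the files sit directly inside `reward_ckpts`.

```python
from pathlib import Path

# File names taken from the download URLs above.
EXPECTED = {
    "Aesthetic": "sac+logos+ava1-l14-linearMSE.pth",
    "GenEval": "mask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth",
    "ClipScore": "open_clip_pytorch_model.bin",
    "HPSv2.1": "HPS_v2.1_compressed.pt",
}


def missing_checkpoints(ckpt_dir="reward_ckpts"):
    """Return the reward names whose checkpoint file is absent or empty."""
    root = Path(ckpt_dir)
    missing = []
    for reward, filename in EXPECTED.items():
        path = root / filename
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(reward)
    return missing
```

Running `missing_checkpoints()` from the repository root returns an empty list once all four files are present and non-empty.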
🧪 Reward Environments
    # GenEval
    pip install -U openmim
    mim install mmengine
    git clone https://github.com/open-mmlab/mmcv.git
    cd mmcv; git checkout 1.x
    MMCV_WITH_OPS=1 FORCE_CUDA=1 pip install -e . -v
    cd ..
    git clone https://github.com/open-mmlab/mmdetection.git
    cd…
Notability: 6.0/10. Solid traction, notable but not frontier.