Repo · Tencent Hunyuan · published Jan 6, 2026

Tencent-Hunyuan/HY-Video-PRFL

Language: Python

License: NOASSERTION

Stars: 91

Forks: 2

Open issues: 0

Created: 2026-01-06T11:59:33Z

Pushed: 2026-01-13T06:19:01Z

Default branch: main

Fork: no

Archived: no

README: [Chinese documentation](./README_CN.md)

HY-Video-PRFL

Video generation models can both create and evaluate: we enable 14B models to complete full 720P × 81-frame post-training within 67 GB of VRAM, with a 1.5× speedup and a 56% improvement in motion quality over traditional methods.

![image](assets/teaser.jpg)

> **HY-Video-PRFL: Video Generation Models Are Good Latent Reward Models**

🔥🔥🔥 News!!

  • Dec 07, 2025: 👋 We release the training and inference code of HY-Video-PRFL.
  • Nov 26, 2025: 👋 We release the paper and project page. [Paper] [Project Page]

📑 Open-source Plan

  • HY-Video-PRFL
    • [x] Training and inference code for PAVRM
    • [x] Training and inference code for PRFL

📋 Table of Contents

  • [🔥🔥🔥 News!!](#-news)
  • [📑 Open-source Plan](#-open-source-plan)
  • [📖 Abstract](#-abstract)
  • [🏗️ Model Architecture](#-model-architecture)
  • [📊 Performance](#-performance)
  • [🎬 Case Show](#-case-show)
  • [📜 Requirements](#-requirements)
  • [🛠️ Installation](#-installation)
  • [🧱 Download Models](#-download-models)
  • [🎓 Training](#-training)
  • [🚀 Inference](#-inference)
  • [📝 Citation](#-citation)
  • [🙏 Acknowledgements](#-acknowledgements)

📖 Abstract

Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding.

HY-Video-PRFL introduces Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space. We demonstrate that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.

Key advantages:

  • ✅ Efficient latent-space optimization
  • ✅ Significant memory savings
  • ✅ 1.4× faster training compared to RGB ReFL
  • ✅ Better alignment with human preferences

🏗️ Model Architecture

![image](assets/method.png)

Traditional RGB ReFL relies on vision-language models designed for pixel-space inputs, requiring expensive VAE decoding and confining optimization to late-stage denoising steps.
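To give a rough sense of why decoding to pixels is expensive, the sketch below compares the number of values in a decoded RGB clip with the number in its latent. The 1280×720 frame size, the 8× spatial and 4× temporal compression, and the 16 latent channels are assumptions chosen to resemble common video VAEs; this README does not state them, and `video_tensor_sizes` is a hypothetical helper, not part of the repository.

```python
def video_tensor_sizes(frames=81, height=720, width=1280,
                       spatial_factor=8, temporal_factor=4, latent_channels=16):
    """Return (rgb_values, latent_values) for one clip.

    All compression factors are illustrative assumptions, not values
    taken from HY-Video-PRFL.
    """
    # Decoded clip: one value per frame, pixel, and RGB channel.
    rgb_values = frames * height * width * 3
    # Assumed causal video VAE: first frame kept, the rest compressed in time.
    latent_frames = 1 + (frames - 1) // temporal_factor
    latent_values = (latent_frames
                     * (height // spatial_factor)
                     * (width // spatial_factor)
                     * latent_channels)
    return rgb_values, latent_values


rgb, latent = video_tensor_sizes()
print(rgb, latent, round(rgb / latent, 1))  # 223948800 4838400 46.3
```

Under these assumptions the decoded clip holds roughly 46 times more values than its latent, before counting the activations of the VAE decoder and the pixel-space reward model that gradients would also have to pass through.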

Our PRFL approach leverages pre-trained video generation models as reward models in the noisy latent space. This enables:

  • Full-chain gradient backpropagation without VAE decoding
  • Early-stage supervision for motion dynamics and structure coherence
  • Substantial reductions in memory consumption and training time
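The idea of supervising an intermediate denoising step directly in latent space can be illustrated with a deliberately tiny toy. This is not the repository's training code: the latent is a single number, `rollout`, `latent_reward`, and `train` are hypothetical names, the reward is a hand-written stand-in for a latent reward model, and a central finite difference replaces backpropagation so the example runs with the standard library only.

```python
import random

NUM_STEPS = 10  # hypothetical number of denoising steps
TARGET = 2.0    # hypothetical "preferred" clean latent value


def rollout(theta, noise, stop_step):
    """Denoise from step NUM_STEPS down to stop_step; return the latent there."""
    x = noise
    for t in range(NUM_STEPS, stop_step, -1):
        x = x + (theta - x) / t  # toy generator: move a fraction 1/t toward theta
    return x


def latent_reward(x, step):
    """Toy stand-in for a latent reward model that is aware of the noise level."""
    clean_fraction = 1.0 - step / NUM_STEPS
    return -(x - TARGET * clean_fraction) ** 2


def train(seed=0, iterations=2000, lr=0.02, eps=1e-4):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iterations):
        noise = rng.gauss(0.0, 1.0)
        # Supervise at any step, early or late; no decoding to pixels is needed.
        stop_step = rng.randrange(NUM_STEPS)
        # Central finite difference stands in for backpropagation here.
        r_plus = latent_reward(rollout(theta + eps, noise, stop_step), stop_step)
        r_minus = latent_reward(rollout(theta - eps, noise, stop_step), stop_step)
        grad = (r_plus - r_minus) / (2 * eps)
        theta += lr * grad  # gradient ascent on the reward
    return theta
```

Calling `train()` should return a value close to `TARGET`. The point of the toy is only the control flow: the reward is evaluated on a noisy latent at a randomly chosen step, so early steps receive a training signal too, whereas a pixel-space reward would need a nearly finished sample and a decode first.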

📊 Performance

Quantitative Results

Our experiments show that PRFL substantially improves motion quality (+56.00 in dynamic degree and +21.52 in human anatomy, along with better alignment with human preferences) and delivers significant efficiency gains (at least 1.4× faster training and notable memory savings).

Text-to-Video Results

![image](assets/T2V_exp.png)

Image-to-Video Results

![image](assets/I2V_exp.png)

Efficiency Comparison

🎬 Case Show

Text-to-Video Generation

|480P Resolution|720P Resolution|
|---|---|
|Prompt: Two shirtless men with short dark hair are sparring in a dimly lit room. They are both wearing boxing gloves, one red and one black. One man is wearing white shorts while the other is wearing black shorts. There are several screens on the wall displaying images of buildings and people.|Prompt: A woman with fair skin, dark hair tied back, and wearing a light green t-shirt is visible against a gray background. She uses both hands to apply a white substance from below her eyes upward onto her face. Her mouth is slightly open as she spreads the cream.|
|Prompt: The woman has dark eyes and is holding a black smartphone to her ear with her right hand. She is typing on the keyboard of an open silver laptop computer with her left hand. Her fingers have blue nail polish. She is sitting in front of a window covered by sheer white curtains.|Prompt: A light-skinned man with short hair wearing a yellow baseball cap, plaid shirt, and blue overalls stands in a field of sunflowers. He holds a cut sunflower head in his left hand and touches it with his right index finger. Several other sunflowers are visible in the background, some facing away from the camera.|

Image-to-Video Generation

|480P Resolution|720P Resolution|
|---|---|
|Prompt: A monochromatic video capturing a cat's gaze into the camera|Prompt: A young boy is jumping in the mud|
|Prompt: A family of four eats fast food at a table.|Prompt: Normal speed, Medium shot, Eye level angle, Third person viewpoint, Static camera movement, Frame-within-frame composition, Shallow depth of field, Natural light, Cinematic style, Desaturated palette with slate blue, dusty rose, and dark wood tones color palette, Dramatic atmosphere. The scene is set on a patio or veranda, framed by a stone archway. In the back, there is a large, weathered wooden gate set into a stone wall. Six people are gathered on a stone patio in front of a large wooden gate. On the right, two men are seated at a dark wooden table. An older man in a grey traditional jacket holds a cane and gestures with his right hand while speaking. A younger man in a light grey suit sits beside him, listening. On the left side of the frame, a man in a dark suit stands with his back to the camera. Next to him, a woman in a pink patterned cheongsam and a woman in a grey skirt suit are standing close together, whispering. The women then turn and smile towards the men at the table. The man in the dark suit turns to face the group, revealing a newborn baby cradled in his arms, wrapped in a pink blanket. He takes a few steps forward, holding the baby. The women look at him and the infant. The older man at the table continues to talk, now gesturing towards the man with the baby. The man holding the baby looks down at the infant as he…|
