Model · Tencent Hunyuan · published Jun 11, 2026

tencent/Hy-Embodied-0.5-VLA-RoboTwin

Task: robotics · License: apache-2.0 · Params: 4.5B · Downloads: 210 · Likes: 7

📖 Abstract

We introduce Hy-Embodied-0.5-VLA (Hy-VLA) — an end-to-end Vision-Language-Action system that spans the full robot learning stack: data collection, model design, pre-training, supervised fine-tuning, RL post-training, and real-world deployment. Built on the Hy-Embodied-0.5 MoT backbone, Hy-VLA integrates a flow-matching action expert, a compact memory encoder for multi-frame history, and a delta-chunk action representation decoupled from embodiment-specific kinematics.

Powered by 10,000+ hours of high-fidelity UMI demonstrations collected via a custom fingertip interface with optical motion-capture, Hy-VLA achieves state-of-the-art results on the RoboTwin 2.0 benchmark (90.9% / 90.1% on Clean / Randomized) and demonstrates robust cross-embodiment transfer across four real-world robot platforms. Paired with FlowPRO preference optimization and an asynchronous inference framework, Hy-VLA establishes a scalable paradigm for continuous dexterous manipulation.

Overview

Hy-VLA-RoboTwin is the supervised fine-tuned (SFT) checkpoint of Hy-Embodied-0.5-VLA (Hy-VLA), an end-to-end Vision-Language-Action system built on the Hy-Embodied-0.5 MoT backbone. Fine-tuned on all 50 bimanual manipulation tasks in the RoboTwin 2.0 benchmark, it achieves 90.9% (Clean) / 90.1% (Randomized) average success rate — state-of-the-art among published VLA methods.

Architecture

Same as Hy-Embodied-0.5-VLA-UMI, with the following SFT-specific settings:

  • Video Encoder: K=6 frames (current + 5 historical), temporal-spatial attention enabled
  • Action Horizon: H=20 steps at a 3× downsampled rate, so each action chunk spans roughly 60 steps at the original control rate
  • All other architecture parameters identical to the pre-trained checkpoint
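
As a rough illustration of the K=6 history window, the sketch below keeps the current frame plus the five most recent ones in a fixed-size buffer. The class name `FrameHistory`, the padding at episode start (repeating the first frame), and the oldest-to-newest ordering are assumptions made for illustration; the model card does not say how the history slots are filled.

```python
from collections import deque

K = 6  # current frame + 5 historical frames, as in the settings above


class FrameHistory:
    """Hypothetical fixed-length frame buffer (not part of hy_vla)."""

    def __init__(self, num_frames=K):
        self.num_frames = num_frames
        self.frames = deque(maxlen=num_frames)

    def push(self, frame):
        # Assumption: at episode start, pad the history with the first frame.
        if not self.frames:
            self.frames.extend([frame] * self.num_frames)
        else:
            self.frames.append(frame)  # the deque drops the oldest frame itself

    def window(self):
        # Assumption: ordered oldest -> newest, so the current frame is last.
        return list(self.frames)


history = FrameHistory()
for t in range(8):
    history.push(f"frame_{t}")
print(history.window())
```

In a real pipeline each entry would be an image tensor, and the window would be stacked along the K axis before being passed to the policy.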

Training

| Property | Value |
|---|---|
| Data | RoboTwin 2.0: 50 tasks × 550 episodes (50 clean + 500 randomized) |
| Initialization | tencent/Hy-Embodied-0.5-VLA-UMI |
| Objective | Conditional flow matching |
| Global batch size | 128 |
| Learning rate | 5e-5 (warmup 1K → cosine decay to 5e-6 over 150K) |
| Optimizer | AdamW, bfloat16 mixed precision |
| Hardware | 32 GPUs (4 nodes × 8) |
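
The learning-rate entry can be read as a warmup followed by a cosine decay. The sketch below is one plausible interpretation, not the released training code: it assumes the warmup is linear from zero, that the 150K steps include the 1K warmup steps, and that the rate is held at 5e-6 afterwards. The function name `learning_rate` is hypothetical.

```python
import math

PEAK_LR = 5e-5
FINAL_LR = 5e-6
WARMUP_STEPS = 1_000
TOTAL_STEPS = 150_000


def learning_rate(step):
    """Linear warmup, then cosine decay (assumed shape of the schedule)."""
    if step < WARMUP_STEPS:
        # Assumption: warmup ramps linearly from 0 to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: the 150K total steps include the warmup phase.
    progress = min(1.0, (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for step in (0, 500, 1_000, 75_500, 150_000):
    print(step, f"{learning_rate(step):.2e}")
```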

Evaluation Performance (RoboTwin 2.0)

| Setting | Success Rate |
|---|---|
| Clean | 90.9% |
| Randomized | 90.1% |

Contents

tencent/Hy-Embodied-0.5-VLA-RoboTwin/
├── model.safetensors # Model weights
├── config.json # HyVLA configuration
├── tokenizer.json # Tokenizer for the VLM backbone
├── tokenizer_config.json
├── special_tokens_map.json
├── chat_template.jinja # Chat template for instruction formatting
├── preprocessor_config.json # Image preprocessing config
├── norm_stats.pkl # Pre-computed normalization statistics
└── LICENSE

Usage

Basic Loading

import torch
from huggingface_hub import snapshot_download
from hy_vla import HyVLA, HyVLAConfig

ckpt = snapshot_download("tencent/Hy-Embodied-0.5-VLA-RoboTwin")

config = HyVLAConfig.from_pretrained(ckpt)
policy = HyVLA.from_pretrained(ckpt, config=config)
policy.enable_video_encoder_if_needed()
policy = policy.to(device="cuda", dtype=torch.bfloat16).eval()

# (B, K, C, H, W); K=6 history slots
img = torch.zeros(1, 6, 3, 224, 224, device="cuda", dtype=torch.bfloat16)
# Normalized dual-arm EEF: [xyz(3) + rot6d(6) + gripper(1)] * 2
state = torch.zeros((1, config.max_state_dim), device="cuda", dtype=torch.bfloat16)
batch = {
    "observation.images.top_head": img,
    "observation.images.hand_left": img,
    "observation.images.hand_right": img,
    "observation.state": state,
    "task": ["pick up the bottle"],
}

with torch.no_grad():
    actions = policy.forward_evaluate(batch)["pred"]
actions = actions[..., : config.action_feature.shape[0]]
print(actions.shape)
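
The state comment above describes a 20-dimensional layout: per arm, a 3D position, a 6D rotation, and one gripper value. The sketch below shows one way such a vector could be unpacked and how a 6D rotation is commonly turned back into an orthonormal frame with Gram-Schmidt. It is an illustration only: the arm ordering, the helper names `split_state` and `rot6d_to_matrix`, and whether the resulting vectors are the columns or the rows of the rotation matrix are assumptions, and it operates on un-normalized values rather than the normalized tensor passed to the policy.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def rot6d_to_matrix(r6):
    """Gram-Schmidt: two 3-vectors -> three orthonormal basis vectors.

    Assumption: r6 = [a1 (3), a2 (3)]; the result lists b1, b2, b3.
    """
    a1, a2 = r6[:3], r6[3:6]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = _cross(b1, b2)  # completes a right-handed frame
    return [b1, b2, b3]


def split_state(state):
    """Split a 20-dim dual-arm state into per-arm fields (assumed order)."""
    arms = {}
    for i, name in enumerate(("arm_0", "arm_1")):
        chunk = state[i * 10:(i + 1) * 10]
        arms[name] = {
            "xyz": chunk[0:3],
            "rot6d": chunk[3:9],
            "gripper": chunk[9],
        }
    return arms


# Toy state: both arms at the same pose with an identity rotation.
one_arm = [0.1, 0.2, 0.3, 1, 0, 0, 0, 1, 0, 1.0]
arms = split_state(one_arm * 2)
print(rot6d_to_matrix(arms["arm_0"]["rot6d"]))
```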

RoboTwin Evaluation

To reproduce the benchmark results:

export ROBOTWIN_DIR=/path/to/RoboTwin
export CKPT_PATH=tencent/Hy-Embodied-0.5-VLA-RoboTwin

# Quick regression (6 tasks × 10 rollouts)
bash scripts/eval_robotwin_test.sh

# Full sweep (50 tasks × 100 rollouts, 8 GPUs)
bash scripts/eval_robotwin_full.sh

> Note: The eval scripts automatically symlink Hy-VLA/robotwin_eval/ to RoboTwin/policy/hy_vla, so RoboTwin's eval_policy.py can discover the Hy-VLA policy adapter without manual configuration.
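
For reading the results of a sweep, the sketch below shows how per-task success counts could be combined into a single average success rate. It assumes the reported number is the unweighted mean of per-task rates, which the model card does not state explicitly; the function name and the task names and counts are made up for illustration.

```python
def average_success_rate(successes_per_task, rollouts_per_task=100):
    """Mean of per-task success rates, in percent (assumed aggregation)."""
    rates = [
        100.0 * successes / rollouts_per_task
        for successes in successes_per_task.values()
    ]
    return sum(rates) / len(rates)


# Made-up counts for three hypothetical tasks, only to exercise the function.
example = {"task_a": 95, "task_b": 88, "task_c": 90}
print(f"{average_success_rate(example):.1f}%")
```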

Normalization Statistics

The checkpoint includes a pre-computed norm_stats.pkl derived from the RoboTwin 2.0 training data. When fine-tuning on a new robot platform, regenerate the statistics from your own data:

python scripts/compute_norm_hdf5.py \
    --csv /path/to/episodes.csv \
    --hdf5-dir /path/to/hdf5 \
    --output norm_stats.pkl
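
To make the role of these statistics concrete, the sketch below computes per-dimension statistics over a small set of vectors and uses them to normalize a new one. It assumes a mean and standard deviation (z-score) scheme and a plain dictionary layout; the actual contents of norm_stats.pkl and the statistic used by compute_norm_hdf5.py are not documented here and may differ (for example, quantile-based scaling). The function names are hypothetical.

```python
import pickle
import statistics


def compute_stats(rows):
    """Per-dimension mean and population standard deviation."""
    columns = list(zip(*rows))
    return {
        "mean": [statistics.fmean(c) for c in columns],
        "std": [statistics.pstdev(c) for c in columns],
    }


def normalize(vector, stats, eps=1e-6):
    # eps guards against division by zero for constant dimensions.
    return [
        (x - m) / (s + eps)
        for x, m, s in zip(vector, stats["mean"], stats["std"])
    ]


# Toy 2-dim "states"; real data would come from the HDF5 episodes.
rows = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
stats = compute_stats(rows)
blob = pickle.dumps(stats)  # round-trip through pickle, as a .pkl file would
print(normalize([2.0, 10.0], pickle.loads(blob)))
```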

📚 Citation

If you find Hy-VLA useful for your research, please cite:

@article{tencent2026hyembodied05vla,
  title={Hy-Embodied-0.5-VLA: From Vision-Language-Action Models to a Real-World Robot Learning Stack},
  author={Tencent Robotics X and Tencent Hy Team},
  journal={arXiv preprint arXiv:2606.14409},
  year={2026}
}

License

This model is released under Apache-2.0. The base model tencent/Hy-Embodied-0.5-VLA-UMI is also released under Apache-2.0.
