What does this model signal mean?

Tencent Hunyuan published tencent/HY-Embodied-0.5. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license other · 232 HF downloads · Tencent's embodied AI model for robotics, version 0.5. onlylabs links this event to 1 captured evidence page and 6 related model signals.

Tencent Hunyuan Model: tencent/HY-Embodied-0.5

Captured source

Hugging Face · huggingface.co/tencent/HY-Embodied-0.5

tencent/HY-Embodied-0.5 model card

published Apr 2, 2026 · seen Jun 6 · captured Jun 11 · http 200 · method plain · task image-text-to-text · license other · library transformers · params 3.8B · downloads 232 · likes 912

🔥 Updates

`[2026-04-09]` 🚀 We have released HY-Embodied-0.5, featuring the open-sourced HY-Embodied-0.5 MoT-2B weights on Hugging Face along with the official inference code!

📖 Abstract

We introduce HY-Embodied-0.5, a suite of foundation models tailored specifically for real-world embodied intelligence. To bridge the gap between general Vision-Language Models (VLMs) and the strict demands of physical agents, our models are engineered to excel in spatial-temporal visual perception and complex embodied reasoning (prediction, interaction, and planning).

The suite features an innovative Mixture-of-Transformers (MoT) architecture utilizing latent tokens for modality-specific computing, significantly enhancing fine-grained perception. It includes two primary variants: a highly efficient 2B model for edge deployment and a powerful 32B model for complex reasoning. Through a self-evolving post-training paradigm and large-to-small on-policy distillation, our compact MoT-2B outperforms state-of-the-art models of similar size across 16 benchmarks, while the 32B variant achieves frontier-level performance comparable to Gemini 3.0 Pro. Ultimately, HY-Embodied serves as a robust "brain" for Vision-Language-Action (VLA) pipelines, delivering compelling results in real-world physical robot control.

⭐️ Key Features

🧠 Evolved MoT Architecture: Designed for maximum efficiency without sacrificing visual acuity. The MoT-2B variant contains 4B total parameters but requires only 2.2B activated parameters during inference. By emphasizing modality-specific computing in the vision pathway, it achieves the high inference speed of a dense 2B model while delivering superior, fine-grained perceptual representations.
🔗 High-Quality Mixed Chain Reasoning: We introduce an advanced iterative, self-evolving post-training pipeline. By employing on-policy distillation, we successfully transfer the sophisticated step-by-step reasoning, planning, and high-quality "thinking" capabilities from our powerful 32B model directly to the compact 2B variant.
🌍 Large-Scale Embodied Pre-training: Grounded in a massive, specially curated dataset comprising >100 million embodied and spatial-specific data points. Trained on a corpus exceeding 200 billion tokens, the model develops a deep, native understanding of 3D spaces, physical object interactions, and agent dynamics.
🦾 Stronger VLA Application: Beyond standard academic benchmarks, HY-Embodied is engineered to be the core cognitive engine for physical robots. It seamlessly integrates into Vision-Language-Action (VLA) frameworks, acting as a highly robust and capable brain to drive high success rates in complex, real-world robotic control tasks.

📅 Plans

[x] Transformers Inference
[ ] vLLM Inference
[ ] Fine-tuning Code
[ ] Online Gradio Demo

🛠️ Dependencies and Installation

Prerequisites

🖥️ Operating System: Linux (recommended)
🐍 Python: 3.12+ (recommended and tested)
⚡ CUDA: 12.6
🔥 PyTorch: 2.8.0
🎮 GPU: NVIDIA GPU with CUDA support
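
The prerequisites above can be checked from Python before installing anything else. The following is a minimal sketch and not part of the official repository: the helper name `check_environment` is hypothetical, and the default threshold simply mirrors the Python version listed above. PyTorch is treated as optional so the script also runs on a machine where it is not installed yet.

```python
import platform
import sys


def check_environment(min_python=(3, 12)):
    """Report how the current machine compares with the listed prerequisites."""
    report = {
        "os": platform.system(),                      # "Linux" is the recommended OS
        "python": platform.python_version(),
        "python_ok": sys.version_info[:2] >= min_python,
        "torch": None,                                # filled in only if PyTorch imports
        "cuda_available": False,
        "cuda_build": None,                           # CUDA version PyTorch was built with
    }
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; return what is known so far.
        return report
    report["torch"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["cuda_build"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Comparing the printed `torch` and `cuda_build` values against PyTorch 2.8.0 and CUDA 12.6 is left to the reader, since other versions may work even though they are not the tested ones.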

Installation

1. Install the specific Transformers version required for this model:

pip install git+https://github.com/huggingface/transformers@9293856c419762ebf98fbe2bd9440f9ce7069f1a

> Note: We will merge the improvements into the Transformers main branch later.

2. Install other dependencies:

pip install -r requirements.txt

Quick Start

1. Clone the repository:

git clone https://github.com/Tencent-Hunyuan/HY-Embodied
cd HY-Embodied/

2. Install dependencies:

pip install -r requirements.txt

3. Run inference:

python inference.py

The example script demonstrates both single generation and batch generation capabilities.

Model Download

The code automatically downloads the model tencent/HY-Embodied-0.5 from Hugging Face Hub. Ensure you have sufficient disk space (8 GB) for the model weights.
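
The 8 GB figure is consistent with a back-of-the-envelope estimate, assuming the checkpoint stores its weights in bfloat16 (2 bytes per parameter). That assumption matches the `torch.bfloat16` dtype used when loading the model in the examples, but the model card does not state the storage format. The parameter counts are the ones quoted on this page (4B total for MoT-2B, 3.8B in the Hugging Face metadata); `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(num_params, bytes_per_param=2):
    """Approximate size of the raw weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as quoted on this page; bfloat16 storage is an assumption.
for label, params in [("4B total (model card)", 4.0e9),
                      ("3.8B (Hub metadata)", 3.8e9)]:
    print(f"{label}: ~{weight_footprint_gb(params):.1f} GB in bfloat16")
```

Actual disk usage also includes tokenizer and configuration files, and the memory needed for activations at inference time comes on top of the weights.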

Hardware Requirements

GPU: Recommended for optimal performance (NVIDIA GPU with at least 16GB VRAM)
CPU: Supported but slower
Memory: At least 16GB RAM recommended
Storage: 20GB+ free space for model and dependencies
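
The storage requirement can be verified with the standard library before the first download. This is a minimal sketch under stated assumptions: `has_free_space` is a hypothetical helper, the 20 GB default mirrors the figure above, and the path should point at whichever filesystem will hold the weights (for Hugging Face downloads this is the Hub cache, which defaults to a directory under `~/.cache/huggingface`).

```python
import shutil


def has_free_space(path=".", required_gb=20.0):
    """Return (ok, free_gb) for the filesystem that contains `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb, free_gb


if __name__ == "__main__":
    ok, free_gb = has_free_space(".")
    status = "enough" if ok else "not enough"
    print(f"{free_gb:.1f} GB free: {status} for the 20 GB recommendation")
```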

🚀 Quick Start with Transformers

Basic Inference Example

import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load model & processor
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load chat template if available
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Prepare input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "./figures/example.jpg"},
            {"type": "text", "text": "Describe the image in detail."},
        ],
    }
]

# Process and generate
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=THINKING_MODE,
).to(model.device)

with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=32768,
        use_cache=True,
        temperature=TEMPERATURE,
        do_sample=TEMPERATURE > 0,
    )

output_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
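
The example above passes `temperature=TEMPERATURE` unconditionally and derives `do_sample` from it. With `TEMPERATURE = 0` that requests greedy decoding while still passing a sampling parameter, which Transformers may flag with a warning depending on the version. One possible way to avoid this, sketched here with a hypothetical helper `build_generation_kwargs` that is not part of the official inference code, is to include `temperature` only when sampling is enabled.

```python
def build_generation_kwargs(temperature, max_new_tokens=32768):
    """Build keyword arguments for model.generate().

    Sampling parameters are included only when temperature > 0;
    otherwise greedy decoding is requested without them.
    """
    kwargs = {"max_new_tokens": max_new_tokens, "use_cache": True}
    if temperature > 0:
        kwargs["do_sample"] = True
        kwargs["temperature"] = temperature
    else:
        kwargs["do_sample"] = False
    return kwargs


if __name__ == "__main__":
    print(build_generation_kwargs(0.8))
    print(build_generation_kwargs(0.0))
```

Under this sketch, the generation call would become `model.generate(**inputs, **build_generation_kwargs(TEMPERATURE))`.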

Batch Inference

import os
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load model & processor
MODEL_PATH = "tencent/HY-Embodied-0.5"
DEVICE = "cuda"
THINKING_MODE = False
TEMPERATURE = 0.8

processor = AutoProcessor.from_pretrained(MODEL_PATH)

# Load chat template if available
chat_template_path = os.path.join(MODEL_PATH, "chat_template.jinja")
if os.path.exists(chat_template_path):
    with open(chat_template_path) as f:
        processor.chat_template = f.read()

model = AutoModelForImageTextToText.from_pretrained(MODEL_PATH, torch_dtype=torch.bfloat16)
model.to(DEVICE).eval()

# Batch Inference (multiple prompts at once)
messages_batch = [
    # Sample A: image + text
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image":...

Excerpt shown — open the source for the full document.

Notability

notability 5.0/10

Moderate traction, niche embodied model