Repo: OpenBMB (MiniCPM) · published Oct 13, 2025

OpenBMB/DeepThinkVLA

Python


Description: DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

Language: Python

License: MIT

Stars: 525

Forks: 48

Open issues: 3

Created: 2025-10-13T04:37:37Z

Pushed: 2026-04-16T10:43:05Z

Default branch: main

Fork: no

Archived: no

README:

🔥 DeepThinkVLA 🔥

Enhancing Reasoning Capability of Vision-Language-Action Models

🔗 Quick Links

  • [Overview](#overview)
  • [Highlights](#highlights)
  • [Architecture](#architecture)
  • [Embodied CoT Dataset](#embodied-cot-dataset)
  • [Training Pipeline](#training-pipeline)
  • [Performance](#performance)
  • [LIBERO Plus Zero-shot Evaluation](#-libero-zero-shot-evaluation)
  • [Qualitative Behavior](#qualitative-behavior)
  • [Setup](#setup)
  • [Data & Checkpoints](#data--checkpoints)
  • [Experiments](#experiments)
  • [Repository Structure](#repository-structure)
  • [Star History](#star-history)
  • [Acknowledgements](#acknowledgements)
  • [References](#references)

📰 News

📝 TODO

  • [x] LIBERO benchmark
  • [x] LIBERO Plus zero-shot evaluation
  • [ ] RobotWin benchmark
  • [ ] Real-world hardware experiments

🧠 Overview

DeepThinkVLA rethinks Vision-Language-Action (VLA) policies with explicit deliberation. Starting from the public pi0-FAST checkpoint, we refactor the policy into a 2.9B-parameter hybrid decoder that writes a reasoning trace before emitting action chunks. The accompanying paper combines embodied Chain-of-Thought (CoT) supervised fine-tuning with outcome-driven reinforcement learning, yielding a 97.0% average success rate across the LIBERO benchmark (Object 99.0, Spatial 96.6, Goal 96.4, Long 96.2). The hybrid architecture alone lifts success by 15.5 percentage points over a naive autoregressive CoT variant, and the RL refinement supplies a final +2.0-point boost on LIBERO-Long.

✨ Highlights

  • Hybrid attention decoder cleanly separates autoregressive reasoning from parallel action generation, closing the latency gap while keeping control precise.
  • Two-stage CoT data engine distills key frames with a cloud LVLM and scales to full trajectories via a fine-tuned local VLM.
  • Outcome-based RL with grouped credit assignment aligns the full think-act sequence and stabilizes updates with KL regularization to the SFT policy.
  • Masked-CoT (DeepThinkVLA) inference preserves accuracy (96.5% average SR) while running at 0.175x the latency of pi0-FAST (autoregressive), whereas random CoT quickly degrades performance (85.1%).

🏗️ Architecture

![Hybrid attention architecture](figs/fig2.png)

DeepThinkVLA inserts a reasoning segment between observations and actions. Reasoning tokens are generated autoregressively, after which the decoder switches to bidirectional attention to emit action vectors in parallel. This resolves the modality conflict that limits single-decoder baselines and enables efficient rollouts for downstream reinforcement learning.
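
The attention pattern described above can be sketched as a boolean mask. This is a minimal illustration under stated assumptions, not the repository's implementation: the function name `hybrid_attention_mask`, the token layout (observations, then reasoning, then actions), and the choice to treat the observation prefix causally are all hypothetical.

```python
def hybrid_attention_mask(n_obs, n_reason, n_action):
    """Return mask[i][j] == True when query token i may attend to key token j.

    Assumed layout (hypothetical): [observation | reasoning | action] tokens.
    """
    n_prefix = n_obs + n_reason
    total = n_prefix + n_action
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < n_prefix:
                # Observation and reasoning tokens: causal, so a token
                # sees only itself and earlier positions.
                mask[i][j] = j <= i
            else:
                # Action tokens: see the whole prefix and every other
                # action token, which allows parallel decoding.
                mask[i][j] = True
    return mask


# Print the mask for 2 observation, 3 reasoning, and 2 action tokens.
for row in hybrid_attention_mask(2, 3, 2):
    print("".join("1" if allowed else "0" for allowed in row))
```

The upper-left block is lower-triangular (autoregressive reasoning), while the rows for action tokens are fully populated (bidirectional action decoding).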

📦 Embodied CoT Dataset

![Two-stage CoT curation](figs/fig1.png)

A scalable annotation pipeline supplies paired reasoning/action traces:

  • Stage 1 isolates key frames via gripper-state heuristics, queries a cloud LVLM for high-quality CoT, and performs targeted human review.
  • Stage 2 fine-tunes a local VLM on those exemplars and auto-labels the remaining frames, applying schema and temporal checks to keep trajectories coherent.
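
The README does not spell out the gripper-state heuristic used in Stage 1. One plausible reading, sketched here purely as an assumption, is to mark a frame as a key frame whenever the gripper switches between open and closed. The function name `key_frame_indices`, the scalar gripper signal, and the 0.5 threshold are all hypothetical.

```python
def key_frame_indices(gripper_open, threshold=0.5):
    """Return frame indices where the gripper toggles between open and closed.

    gripper_open: per-frame gripper opening values (hypothetical scale 0..1).
    threshold: opening above this value counts as "open" (assumed value).
    """
    states = [value > threshold for value in gripper_open]
    keys = []
    for t in range(1, len(states)):
        # A change of state usually marks a grasp or a release.
        if states[t] != states[t - 1]:
            keys.append(t)
    return keys


# The gripper closes at frame 2 and reopens at frame 4.
print(key_frame_indices([1.0, 1.0, 0.2, 0.1, 0.9]))
```

Only the selected frames would then be sent to the cloud LVLM for CoT annotation, which keeps the expensive queries to a small fraction of each trajectory.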

🔄 Training Pipeline

![Two-stage training with RL alignment](figs/fig3.png)

Training proceeds in two stages:

  • SFT cold start: token-level cross-entropy teaches the hybrid decoder to produce well-formed CoT and aligned actions under causal/bidirectional masks.
  • Outcome-driven RL: Group Relative Policy Optimization (GRPO) standardizes sparse rewards inside task-conditioned batches, while a KL penalty to the SFT policy prevents drift. The RL stage adds +2.0 SR on LIBERO-Long and strengthens the causal link between thought and action.
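
The reward standardization and KL penalty in the RL stage can be sketched as follows. This is a simplified illustration, not the repository's training code: it uses a plain advantage-weighted log-likelihood instead of the clipped importance ratio that GRPO implementations typically use, and the function names, `eps`, and `beta` values are assumptions.

```python
import statistics


def group_standardized_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


def kl_regularized_objective(advantages, logp_policy, logp_sft, beta=0.01):
    """Advantage-weighted log-likelihood minus beta times a sampled KL estimate.

    logp_policy / logp_sft: sequence log-probabilities of each rollout under
    the current policy and the frozen SFT policy (hypothetical inputs).
    """
    n = len(advantages)
    pg_term = sum(a * lp for a, lp in zip(advantages, logp_policy)) / n
    # Monte Carlo estimate of KL(policy || SFT) from policy samples.
    kl_term = sum(lp - lr for lp, lr in zip(logp_policy, logp_sft)) / n
    return pg_term - beta * kl_term


# Sparse success/failure rewards for four rollouts of one task.
advantages = group_standardized_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts receive positive advantages and failed ones negative advantages, so the sparse outcome signal is spread over the full think-act sequence of each rollout.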

📊 Performance

![Effect of RL and architecture choices](figs/fig4.png)

  • DeepThinkVLA reaches a 97.0% average success rate across LIBERO, outperforming autoregressive, diffusion, and parallel-decoding baselines under the single-model protocol.
  • RL-over-SFT lifts LIBERO-Long from 94.2% to 96.2% without extra demonstrations, demonstrating recoveries on long-horizon tasks.
  • The hybrid decoder outperforms the naive autoregressive CoT variant by 15.5 points and keeps latency manageable; Masked-CoT inference preserves accuracy while running at 0.175x the latency of pi0-FAST.
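
As a quick check of the figures quoted above, the RL gain and the speedup implied by the relative latency can be computed directly. The variable names are illustrative; the numbers are the ones reported in this section.

```python
# LIBERO-Long success rates (%) before and after the RL stage.
sft_long, rl_long = 94.2, 96.2
rl_gain = round(rl_long - sft_long, 1)

# Masked-CoT latency relative to pi0-FAST, converted to a speedup factor.
relative_latency = 0.175
speedup = round(1 / relative_latency, 2)

print(f"RL gain on LIBERO-Long: +{rl_gain} points")
print(f"Implied speedup over pi0-FAST: {speedup}x")
```

A relative latency of 0.175x therefore corresponds to roughly a 5.7x speedup.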

🧪 LIBERO Plus Zero-shot Evaluation

We additionally report zero-shot transfer performance on LIBERO Plus:

  • Training: the model is trained only on the standard LIBERO dataset (no LIBERO Plus fine-tuning).
  • Evaluation: the trained model is directly evaluated on LIBERO Plus (zero-shot).
  • Eval scripts: we maintain a lightweight, standalone evaluation repo at `wadeKeith/DeepThinkVLA_libero_plus`.

Run (in the LIBERO Plus eval repo)

    python experiments/run_libero_plus_eval.py \
      --pretrained_checkpoint /path/to/deepthinkvla_libero_checkpoint \
      --num_images_in_input 2 \
      --task_suite_name libero_10 \
      --max_new_tokens 2048 \
      --swanlab_mode disabled

Or use the wrapper:

    bash eval.sh
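
If you prefer to launch the evaluation from Python, the command above can be assembled as an argument list. This is a convenience sketch, not part of the eval repo: the helper `build_eval_command` is hypothetical, and it uses only the script path and flags shown above. `libero_10` is the only suite name this README lists, so check the eval repo for other valid values.

```python
import shlex


def build_eval_command(checkpoint, task_suite="libero_10", max_new_tokens=2048):
    """Assemble the LIBERO Plus eval command as an argv list."""
    return [
        "python", "experiments/run_libero_plus_eval.py",
        "--pretrained_checkpoint", checkpoint,
        "--num_images_in_input", "2",
        "--task_suite_name", task_suite,
        "--max_new_tokens", str(max_new_tokens),
        "--swanlab_mode", "disabled",
    ]


cmd = build_eval_command("/path/to/deepthinkvla_libero_checkpoint")
# Show the command as a single shell-safe string for inspection.
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run(cmd, check=True)` from the root of the LIBERO Plus eval repo.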

Outputs

  • Logs: experiments/logs/
  • Rollout videos (if enabled): rollouts/

Zero-shot Results (LIBERO Plus)

The following numbers are zero-shot success rates (SR) on LIBERO Plus, evaluated with a DeepThinkVLA model trained only on LIBERO (no LIBERO Plus fine-tuning).

Breakdown by shift type

| Objects Layout | Language Instructions | Light Conditions | Camera Viewpoints | Robot Initial States | Background Textures | Sensor Noise | Total |
| -------------- | --------------------- | ---------------- | ----------------- | -------------------- | ------------------- | ------------ | ----- |
| 0.7993 | 0.845 | 0.900 | 0.885 | 0.405 | 0.753 | 0.944 | 0.790 |

Breakdown by task suite

| object | spatial | goal | 10 (Long) | Total |
| ------ | ------- | ----- | --------- | ----- |
| 0.840 | 0.879 | 0.697 | 0.746 | 0.790 |

🎬 Qualitative Behavior

![Reasoning-enabled recovery](figs/fig5.png) Deliberate…
