nvidia/4D-RGPT-8B
Captured source
source ↗Model Overview
Description:
4D-RGPT is a specialized multimodal large language model that improves region-level 4D (i.e., 3D + time) video understanding by distilling latent and explicit 4D perceptual signals (for example, depth and optical flow) from a frozen expert model into an NVILA-based student model. 4D-RGPT was developed by NVIDIA as part of the NVILA visual-language model family and introduces Perceptual 4D Distillation (P4D), Timestamp Positional Encoding (TPE), and the companion R4D-Bench benchmark for region-level 4D VQA.
This model is for research and development only.
License/Terms of Use:
Use of this model is governed by the CC-BY-NC-4.0 License.
Deployment Geography:
Global
Use Case:
Expected users are multimodal AI researchers, applied research teams, and developers studying video understanding, region grounding, 3D/4D reasoning, and physical AI. Representative use cases include region-level video question answering, model benchmarking, research on depth-and-time-aware MLLMs, and prototyping for domains such as robotics, autonomous driving, and industrial inspection.
Release Date:
Hugging Face [06/01/2026] via https://huggingface.co/nvidia/4D-RGPT-8B.
References(s):
- Paper: https://arxiv.org/abs/2512.17012
- GitHub: https://github.com/NVlabs/4D-RGPT
- Project page: https://www.ca-joe-yang.com/resource/projects/4D_RGPT/
- R4D-Bench: https://huggingface.co/datasets/nvidia/R4D-Bench
Model Architecture:
Architecture Type: Transformer
Network Architecture: NVILA-Lite-based MLLM using a SigLIP vision encoder, multimodal projector, and language model.
This model was developed based on: NVILA-Lite-based MLLM
Number of model parameters: 8.0\*10^9 for 4D-RGPT-8B
Describe design choices related to initialization techniques, hyperparameter tuning, regularization techniques, model optimization, damping, and training parameters: 4D-RGPT adds a lightweight training-only MLP 4D perception decoder (hidden size 2,560) with GELU activations, Xavier weight initialization, and zero bias initialization. Training begins from pretrained NVILA weights.Ttotal loss combines SFT, latent distillation, and explicit distillation with Timestamp Positional Encoding uses T=10,000.
Input(s):
Input Type(s): Image, Text, Video
Input Format(s):
- Image: RGB
- Text: String
- Video: .mp4
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
- Video: Three-Dimensional (3D)
Other Properties Related to Input: The model is designed for video-question answering with explicit temporal cues encoded through timestamps for sampled frames. The paper uses sampled frame timestamps for TPE and, for fair comparison on R4D-Bench, evaluates open-source models using 16 sampled frames. Region-level evaluation uses region prompts represented through Set-of-Marks (SoM) or region masks in benchmark workflows.
Output(s)
Output Type(s): Text
Output Format(s):
- Text: String
Output Parameters:
- Text: One-Dimensional (1D)
Other Properties Related to Output: Outputs are text answers for 3D/4D VQA tasks, commonly multiple-choice selections, short phrases, or short numeric answers. The paper focuses on accuracy benchmarks rather than production API formatting. This model is designed to run on NVIDIA GPU-accelerated systems; the public training setup uses NVIDIA A100-SXM4-80GB GPUs.
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- Not Applicable (N/A)- inference using NVILA
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere — A100-SXM4-80GB
Supported Operating System(s):
- [Linux]
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
This AI model can be embedded as an Application Programming Interface (API) call into the software environment described above.
Model Version(s):
- 4D-RGPT-8B — main paper model based on NVILA-Lite-8B; this is the primary reported configuration in the main results tables.
Training, Testing, and Evaluation Datasets:
Dataset Overview
Total Size: Approximately 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture, based on the paper-reported counts. This corresponds to approximately 2.06e5 unique visual items (about 190k images plus about 16.2k videos).
Total Number of Datasets: 4 training datasets.
General description of data processing: The training mixture comprises VSTI-Bench training data, the NuScenes portion of Wolf, RoboFAC, and SAT. For evaluation, this release reports results on the companion R4D-Bench benchmark, VLM4D-real, and VSTI-Bench.
Public Datasets
Training datasets:
- VSTI-Bench (training split): ~1.2k unique videos and ~130k QA pairs. Source videos are from ScanNet and ScanNet++.
- Wolf (NuScenes portion): ~5k unique videos and ~15k QA pairs derived from dense captions.
- RoboFAC: ~10k unique videos and ~65k conversations; simulated robotic-arm videos.
- SAT (training split): ~190k unique simulated images and ~170k QA pairs.
Evaluation datasets:
- R4D-Bench
- VLM4D-real
- VSTI-Bench
Training Dataset:
Data Modality:
- [Image]
- [Text]
- [Video]
Training Data Size:
Approx. 3.8e5 supervision examples / QA pairs / conversations across the disclosed training mixture.
Image Training Data Size
- Less than a Million Images
Text Training Data Size
- Less than a Billion Tokens
Video Training Data Size
- Less than 10,000 Hours
Data Collection Method by dataset
- Hybrid: Automatic/Sensors, Human, Synthetic
Labeling Method by dataset
- Hybrid: Human, Automated, Synthetic…
Excerpt shown — open the source for the full document.
Notability
notability 3.0/10Low traction; routine model release.