nvidia/dvlt
Model Overview
Description
Déjà View Looping Transformer (DVLT) is a feed-forward three-dimensional (3D) reconstruction model that takes unposed Red, Green, Blue (RGB) images or video and predicts per-pixel depth, ray maps (and thus 3D points), and per-view camera intrinsics/extrinsics in a single pass.
Novelty: A weight-tied looped transformer. Instead of stacking many distinct layers, a single shared block is applied for K refinement steps over a DINOv2-initialized per-view state, with each step conditioned on a continuous time interval (t_k, t_{k+1}) ⊂ [0, 1]. A single checkpoint exposes the iteration count K as an inference-time compute/quality knob without retraining separate models (released checkpoint valid for K ∈ [8, 16]).
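The weight-tied loop can be illustrated with a toy NumPy sketch. It is not the DVLT implementation: the shared block here is a single tanh layer standing in for the transformer block, the uniform time grid t_k = k / K is an assumption, and the names `time_grid`, `shared_block`, and `run_loop` are hypothetical. The only point is that one set of weights is reused at every step while the interval (t_k, t_{k+1}) changes.

```python
import numpy as np

def time_grid(num_steps: int) -> np.ndarray:
    """Uniform partition of [0, 1] into num_steps intervals (assumed schedule)."""
    return np.linspace(0.0, 1.0, num_steps + 1)

def shared_block(state, weight, t_start, t_end):
    """Toy stand-in for the shared block: a residual update scaled by the interval length."""
    dt = t_end - t_start
    return state + dt * np.tanh(state @ weight + t_start)

def run_loop(state, weight, num_steps):
    """Apply the same block (same weight) for num_steps refinement steps."""
    if not 8 <= num_steps <= 16:
        raise ValueError("the released checkpoint is stated to be valid for K in [8, 16]")
    ts = time_grid(num_steps)
    for k in range(num_steps):
        state = shared_block(state, weight, ts[k], ts[k + 1])
    return state

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))         # 4 tokens, 8 channels (toy sizes)
weight = rng.normal(size=(8, 8)) * 0.1   # one weight matrix shared by all steps

out_8 = run_loop(tokens, weight, num_steps=8)    # cheaper setting
out_16 = run_loop(tokens, weight, num_steps=16)  # more refinement steps, same weights
```

Changing `num_steps` changes only the number of iterations and the interval lengths, never the parameters, which is what allows a single checkpoint to serve several compute budgets.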
This model is for research and development only.
License/Terms of Use
Model. The model (checkpoints, learned weights, and configuration files) is released under the NVIDIA License: https://huggingface.co/nvidia/dvlt/blob/main/LICENSE.txt.
Source code. The accompanying source code is licensed separately. The repository is primarily licensed under the Apache License, Version 2.0 — see the LICENSE file at the repository root. Portions of the codebase derived from VGGT (Meta) are distributed under the VGGT License v1; the full text is provided in LICENSES/VGGT-LICENSE.txt. Full third-party attribution, per-file notices, and upstream license texts are collected in THIRD_PARTY_LICENSES.md.
Deployment Geography
Global
Use Case
Primary users:
- Computer Vision Researchers: For benchmarking multi-view 3D reconstruction, studying weight-tied / recurrent transformer architectures, and developing neural rendering pipelines.
- Augmented Reality/Virtual Reality & Robotics Engineers: For real-time simultaneous localization and mapping (SLAM), scene understanding, and navigation research prototypes.
- 3D Content Creators: For rapid conversion of unposed video/image collections into 3D assets.
Primary Use Cases:
- 3D Reconstruction: Fast, feed-forward estimation of dense per-pixel depth, ray maps, and per-view camera poses from unposed images or video, without iterative optimization.
- Structure-from-motion (SfM) Replacement: Accelerating initialization for 3D Gaussian Splatting and Neural Radiance Field (NeRF) training by replacing slow SfM pipelines (e.g., COLMAP).
- Compute-Adaptive Inference: A single checkpoint supports a range of recurrence step counts (8–16) at inference, letting downstream applications trade reconstruction quality for latency without retraining.
Release Date
GitHub 06/02/2026 via [URL TBD]
Hugging Face (HF) 06/02/2026 via https://huggingface.co/nvidia/dvlt
References
Quick Start
Install the code (preferably in a conda environment):
conda create -n dvlt python=3.12 && conda activate dvlt
conda install pytorch=2.5.1 torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge
pip install -e .[all]
Run feed-forward 3D reconstruction on a directory of images, a video, or an explicit list of frames:
import torch
from accelerate import Accelerator
from dvlt.model.dvlt.model import DVLT
from dvlt.util.preprocess import load_sequence, preprocess_images
accelerator = Accelerator(mixed_precision="bf16")
model = DVLT(img_size=504)
model.load_pretrained("nvidia/dvlt") # local dir, HTTPS URL, or HF Hub repo id
model.setup_test(accelerator)
# load_sequence accepts a directory, a single video, or an explicit list of files.
_, frames = load_sequence("path/to/scene_dir")
batch = preprocess_images(frames, img_size=504, patch_size=14, device=accelerator.device)
with torch.no_grad(), accelerator.autocast():
    predictions = model.predict(batch, accelerator)
cameras = predictions["cameras"][0] # Cameras object with shape [S]
extrinsics_c2w = cameras.camera_to_worlds # (S, 3, 4) — OpenCV convention [R | t]
intrinsics = cameras.get_intrinsics_matrices() # (S, 3, 3)
depths = predictions["depths"][0] # (S, H, W)
world_points = predictions["world_points"][0] # (S, H, W, 3)
Model Architecture
Architecture Type: Transformer
Network Architecture: Vision Transformer (ViT) with a single weight-tied looped block applied recurrently for K refinement steps.
Backbone: DINOv2 ViT-B (patch size 14).
Number of model parameters: 1.17 * 10^8
Input(s)
Input Type(s): Image, video
Input Format(s):
- Image: Red, Green, Blue (RGB)
- Video: .mov, .mp4 (decoded to frames)
Input Parameters:
- Image collection: Three-Dimensional (3D) — a sequence/stack of two-dimensional (2D) RGB images per scene.
- Video: Three-Dimensional (3D) — a sequence of two-dimensional (2D) frames per clip.
Other Properties Related to Input:
- Video is decoded to frames at the video's native frame rate.
- Training resolution: 504-pixel longest edge; sides padded to multiples of patch size 14.
- Number of views per scene at training time: V ∈ [2, 18]; inference supports a wider range (memory-bound).
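The resolution rule above (504-pixel longest edge, sides padded to multiples of 14) can be worked through with a small helper. This is a sketch under assumptions: `padded_size` is a hypothetical name, half-pixel values are rounded up, and the repository's `preprocess_images` may resize or pad differently.

```python
def padded_size(height, width, longest_edge=504, patch_size=14):
    """Return ((resized_h, resized_w), (padded_h, padded_w)) for one image.

    Step 1: scale so the longest edge equals `longest_edge` (integer rounding, half up).
    Step 2: round each side up to the next multiple of `patch_size`.
    """
    longest = max(height, width)
    resized_h = max(1, (2 * height * longest_edge + longest) // (2 * longest))
    resized_w = max(1, (2 * width * longest_edge + longest) // (2 * longest))
    padded_h = -(-resized_h // patch_size) * patch_size  # ceiling to a multiple of patch_size
    padded_w = -(-resized_w // patch_size) * patch_size
    return (resized_h, resized_w), (padded_h, padded_w)

# A 1080p frame: resized to 284 x 504, then the short side is padded to 294 (21 patches).
full_hd = padded_size(1080, 1920)
# A 4:3 frame: 378 x 504 is already a multiple of 14 on both sides, so no padding is needed.
vga = padded_size(480, 640)
```

With patch size 14, a 504-pixel side corresponds to 36 patches, so the padded 1080p example yields a 21 × 36 patch grid per view.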
Output(s)
Output Type(s):
- Per-pixel depth map
- Per-pixel ray map (origin + unnormalized direction)
- Per-view camera parameters (extrinsics, intrinsics)
Output Format(s):
- Depth map: scalar metric distance per pixel
- Ray map: 3D origin + 3D direction per pixel (6 channels)
- Point cloud (derived analytically as X = R^o + D · R^d, i.e. ray origin plus depth times ray direction): X, Y, Z per pixel
- Camera intrinsics: focal length, principal point
- Camera extrinsics: rotation matrix (camera-to-world), translation vector
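The point-cloud formula X = R^o + D · R^d from the list above can be applied directly to arrays. The sketch below uses NumPy with toy inputs; the function name `points_from_rays` and the array layout (H × W × 3 for origins and directions, H × W for depth) are assumptions for illustration, not the repository's API.

```python
import numpy as np

def points_from_rays(ray_origins, ray_directions, depth):
    """Per-pixel 3D points: origin plus depth times (unnormalized) direction."""
    # depth is (H, W); add a trailing axis so it broadcasts over the XYZ channels.
    return ray_origins + depth[..., None] * ray_directions

H, W = 2, 3
origins = np.zeros((H, W, 3))      # all rays start at the world origin
directions = np.zeros((H, W, 3))
directions[..., 2] = 1.0           # all rays point along +Z
depth = np.full((H, W), 2.5)

points = points_from_rays(origins, directions, depth)  # (H, W, 3), every Z equals 2.5
```

The direction is used as predicted, without renormalizing it, because the formula multiplies depth by the ray direction as given.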
Output Parameters:
- Depth / confidence: Two-Dimensional (2D) per view (H × W)
- Ray map / point cloud: Three-Dimensional (3D) per view (ray map: H × W × 6, i.e. H × W × 3 each for origin and direction; point cloud: H × W × 3)
- Camera intrinsics: Two-Dimensional (2D) (3 × 3)
- Camera extrinsics: Two-Dimensional (2D) (4 × 4)
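The Quick Start returns camera-to-world extrinsics as (S, 3, 4) matrices [R | t], while the list above gives 4 × 4. Assuming the 4 × 4 form is the same matrix with the row [0, 0, 0, 1] appended, and that R is a proper rotation, the conversion and the world-to-camera inverse can be sketched as follows. The helper names `to_homogeneous` and `invert_rigid` are hypothetical.

```python
import numpy as np

def to_homogeneous(c2w_3x4: np.ndarray) -> np.ndarray:
    """Append the row [0, 0, 0, 1] to each (3, 4) pose, giving (S, 4, 4)."""
    num_views = c2w_3x4.shape[0]
    bottom = np.zeros((num_views, 1, 4), dtype=c2w_3x4.dtype)
    bottom[:, 0, 3] = 1.0
    return np.concatenate([c2w_3x4, bottom], axis=1)

def invert_rigid(c2w_4x4: np.ndarray) -> np.ndarray:
    """Invert rigid transforms using [R | t]^-1 = [R^T | -R^T t]."""
    rot = c2w_4x4[:, :3, :3]
    trans = c2w_4x4[:, :3, 3:]
    rot_t = np.transpose(rot, (0, 2, 1))
    w2c = np.zeros_like(c2w_4x4)
    w2c[:, :3, :3] = rot_t
    w2c[:, :3, 3:] = -rot_t @ trans
    w2c[:, 3, 3] = 1.0
    return w2c

# Two toy poses: the identity, and a 90-degree rotation about Z with a translation.
poses = np.zeros((2, 3, 4))
poses[0, :, :3] = np.eye(3)
poses[1, :, :3] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
poses[1, :, 3] = [1.0, 2.0, 3.0]

c2w = to_homogeneous(poses)   # (2, 4, 4) camera-to-world
w2c = invert_rigid(c2w)       # (2, 4, 4) world-to-camera
```

The closed-form inverse avoids a general matrix inversion, but it is only valid when the rotation block is orthonormal.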
Other Properties Related to Output:
- One 3D point, one depth value, one ray, and one depth-confidence value are predicted per input pixel in each image.
- One set of camera parameters (intrinsics + extrinsics) is predicted per input image, expressed in the coordinate frame of the first view.
- Cameras are recovered from the predicted ray maps following Lin et al. 2025; a camera multi-layer perceptron…