Model: nvidia/dvlt (NVIDIA), published May 29, 2026

Task: image-to-3d | License: other | Library: dvlt | Parameters: 117M | Downloads: 319 | Likes: 37

Model Overview

Description

Déjà View Looping Transformer (DVLT) is a feed-forward three-dimensional (3D) reconstruction model that takes unposed Red, Green, Blue (RGB) images or video and predicts per-pixel depth, ray maps (and thus 3D points), and per-view camera intrinsics/extrinsics in a single pass.

Novelty: DVLT is a weight-tied looped transformer. Instead of stacking many distinct layers, it applies a single shared block for K refinement steps over a DINOv2-initialized per-view state, with each step conditioned on a continuous time interval (t_k, t_{k+1}) ⊂ [0, 1]. A single checkpoint therefore exposes the iteration count K as an inference-time compute/quality knob, with no need to retrain separate models (the released checkpoint is valid for K ∈ [8, 16]).
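
The looping idea can be illustrated with a toy sketch. This is not DVLT's implementation: the uniform split of [0, 1], the names `time_intervals`, `looped_refine`, and `toy_block`, and the scalar update rule are all assumptions chosen for illustration. The only point is that one block with one set of weights is reused at every step, and the step count can change without changing those weights.

```python
import math

def time_intervals(num_steps):
    """Split [0, 1] into num_steps equal intervals (uniform spacing is an assumption)."""
    return [(k / num_steps, (k + 1) / num_steps) for k in range(num_steps)]

def looped_refine(state, shared_block, num_steps):
    """Apply one shared block num_steps times, passing each step its time interval."""
    for t_start, t_end in time_intervals(num_steps):
        state = shared_block(state, t_start, t_end)
    return state

def toy_block(state, t_start, t_end, weight=0.5):
    """Hypothetical stand-in for the shared transformer block (same weight every step)."""
    dt = t_end - t_start
    return [x + dt * math.tanh(weight * x) for x in state]

initial_state = [0.2, -1.0, 0.7]  # stand-in for DINOv2-initialized features
coarse = looped_refine(initial_state, toy_block, num_steps=8)
fine = looped_refine(initial_state, toy_block, num_steps=16)
```

Both calls use the same `toy_block`; only the number of intervals covering [0, 1] differs.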

This model is for research and development only.

License/Terms of Use

Model. The model (checkpoints, learned weights, and configuration files) is released under the NVIDIA License: https://huggingface.co/nvidia/dvlt/blob/main/LICENSE.txt.

Source code. The accompanying source code is licensed separately. The repository is primarily licensed under the Apache License, Version 2.0 — see the LICENSE file at the repository root. Portions of the codebase derived from VGGT (Meta) are distributed under the VGGT License v1; the full text is provided in LICENSES/VGGT-LICENSE.txt. Full third-party attribution, per-file notices, and upstream license texts are collected in THIRD_PARTY_LICENSES.md.

Deployment Geography

Global

Use Case

Primary users:

  • Computer Vision Researchers: For benchmarking multi-view 3D reconstruction, studying weight-tied / recurrent transformer architectures, and developing neural rendering pipelines.
  • Augmented Reality/Virtual Reality & Robotics Engineers: For real-time simultaneous localization and mapping (SLAM), scene understanding, and navigation research prototypes.
  • 3D Content Creators: For rapid conversion of unposed video/image collections into 3D assets.

Primary Use Cases:

  • 3D Reconstruction: Fast, feed-forward estimation of dense per-pixel depth, ray maps, and per-view camera poses from unposed images or video, without iterative optimization.
  • Structure-from-motion (SfM) Replacement: Accelerating initialization for 3D Gaussian Splatting and Neural Radiance Field (NeRF) training by replacing slow SfM pipelines (e.g., COLMAP).
  • Compute-Adaptive Inference: A single checkpoint supports a range of recurrence step counts (8–16) at inference, letting downstream applications trade reconstruction quality for latency without retraining.
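
As a rough illustration of the quality/latency trade-off, the sketch below picks the largest step count in the supported range that fits a latency budget. The helper `pick_num_steps` and the timing numbers are hypothetical placeholders, not measurements of DVLT, and the source excerpt does not show how the step count is passed to the model.

```python
MIN_STEPS, MAX_STEPS = 8, 16  # range stated for the released checkpoint

def pick_num_steps(latency_budget_ms, fixed_cost_ms, cost_per_step_ms):
    """Largest supported step count whose estimated latency fits the budget.

    Assumes latency = fixed_cost_ms + steps * cost_per_step_ms (a simplification).
    Returns None when even MIN_STEPS would exceed the budget.
    """
    affordable = int((latency_budget_ms - fixed_cost_ms) // cost_per_step_ms)
    if affordable < MIN_STEPS:
        return None
    return min(affordable, MAX_STEPS)

# Placeholder timings for illustration only.
chosen_steps = pick_num_steps(latency_budget_ms=100, fixed_cost_ms=20, cost_per_step_ms=6)
```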

Release Date

GitHub 06/02/2026 via [URL TBD]

Hugging Face (HF) 06/02/2026 via https://huggingface.co/nvidia/dvlt

References

Quick Start

Install the code (preferably in a conda environment):

conda create -n dvlt python=3.12 && conda activate dvlt
conda install pytorch=2.5.1 torchvision pytorch-cuda=12.4 -c pytorch -c nvidia -c conda-forge
pip install -e ".[all]"

Run feed-forward 3D reconstruction on a directory of images, a video, or an explicit list of frames:

import torch
from accelerate import Accelerator

from dvlt.model.dvlt.model import DVLT
from dvlt.util.preprocess import load_sequence, preprocess_images

accelerator = Accelerator(mixed_precision="bf16")

model = DVLT(img_size=504)
model.load_pretrained("nvidia/dvlt") # local dir, HTTPS URL, or HF Hub repo id
model.setup_test(accelerator)

# load_sequence accepts a directory, a single video, or an explicit list of files.
_, frames = load_sequence("path/to/scene_dir")
batch = preprocess_images(frames, img_size=504, patch_size=14, device=accelerator.device)

with torch.no_grad(), accelerator.autocast():
    predictions = model.predict(batch, accelerator)

cameras = predictions["cameras"][0] # Cameras object with shape [S]
extrinsics_c2w = cameras.camera_to_worlds # (S, 3, 4) — OpenCV convention [R | t]
intrinsics = cameras.get_intrinsics_matrices() # (S, 3, 3)
depths = predictions["depths"][0] # (S, H, W)
world_points = predictions["world_points"][0] # (S, H, W, 3)
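
To inspect the result in a point-cloud viewer, the world points can be flattened and written to a PLY file. The helper below is a minimal sketch using only NumPy; `write_ascii_ply` is a hypothetical name and not part of the dvlt package. It assumes the points are already a NumPy array. If `world_points` is a torch tensor (the excerpt does not state the return type), convert it first, for example with `world_points.float().cpu().numpy()`.

```python
import numpy as np

def write_ascii_ply(path, points):
    """Write an array of XYZ points (any shape ending in 3) as an ASCII PLY file."""
    points = np.asarray(points, dtype=np.float32).reshape(-1, 3)
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    # comments="" stops savetxt from prefixing the header lines with "# ".
    np.savetxt(path, points, fmt="%.6f", header=header, comments="")
    return len(points)

# Stand-in with the same layout as world_points: (S, H, W, 3).
dummy_points = np.zeros((2, 4, 5, 3), dtype=np.float32)
num_written = write_ascii_ply("scene.ply", dummy_points)
```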

Model Architecture

Architecture Type: Transformer

Network Architecture: Vision Transformer (ViT) with a single weight-tied looped block applied recurrently for K refinement steps.

Backbone: DINOv2 ViT-B (patch size 14).

Number of model parameters: 1.17 * 10^8

Input(s)

Input Type(s): Image, video

Input Format(s):

  • Image: Red, Green, Blue (RGB)
  • Video: .mov, .mp4 (decoded to frames)

Input Parameters:

  • Image collection: Three-Dimensional (3D) — a sequence/stack of two-dimensional (2D) RGB images per scene.
  • Video: Three-Dimensional (3D) — a sequence of two-dimensional (2D) frames per clip.

Other Properties Related to Input:

  • Video is decoded to frames at the video's native frame rate.
  • Training resolution: 504-pixel longest edge; sides padded to multiples of patch size 14.
  • Number of views per scene at training time: V ∈ [2, 18]; inference supports a wider range (memory-bound).
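
The resolution rule above can be worked through numerically. The sketch below assumes an aspect-preserving resize to a 504-pixel longest edge followed by padding each side up to the next multiple of 14; the repository's `preprocess_images` may round or pad differently, and `padded_size` is a hypothetical helper.

```python
def padded_size(width, height, longest_edge=504, patch_size=14):
    """Resize so the longest edge equals longest_edge, then pad up to patch multiples."""
    longest = max(width, height)
    new_w = round(width * longest_edge / longest)
    new_h = round(height * longest_edge / longest)
    pad_w = -(-new_w // patch_size) * patch_size  # ceiling to a multiple of patch_size
    pad_h = -(-new_h // patch_size) * patch_size
    return pad_w, pad_h

width, height = padded_size(1000, 700)             # (504, 364)
patches_per_view = (width // 14) * (height // 14)  # 36 * 26 = 936
```

For a 1000 × 700 input, the short side scales to 352.8, rounds to 353, and pads to 364, giving a 36 × 26 patch grid per view.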

Output(s)

Output Type(s):

  • Per-pixel depth map
  • Per-pixel ray map (origin + unnormalized direction)
  • Per-view camera parameters (extrinsics, intrinsics)

Output Format(s):

  • Depth map: scalar metric distance per pixel
  • Ray map: 3D origin + 3D direction per pixel (6 channels)
  • Point cloud (derived analytically as X = R^o + D · R^d, i.e. ray origin plus depth times ray direction): X, Y, Z per pixel
  • Camera intrinsics: focal length, principal point
  • Camera extrinsics: rotation matrix (camera-to-world), translation vector
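
The point-cloud formula X = R^o + D · R^d is a per-pixel multiply-add. A minimal NumPy sketch follows; the channel order (origin in channels 0 to 2, direction in channels 3 to 5) and the function name `points_from_rays` are assumptions, since the excerpt only states that the ray map has 6 channels. Because the direction is unnormalized, the depth scales it as-is.

```python
import numpy as np

def points_from_rays(ray_map, depth):
    """Compute X = origin + depth * direction for every pixel.

    ray_map: (H, W, 6), origin in channels 0:3 and direction in 3:6 (assumed order).
    depth:   (H, W)
    """
    origins, directions = ray_map[..., :3], ray_map[..., 3:]
    return origins + depth[..., None] * directions

ray_map = np.zeros((2, 2, 6))
ray_map[..., 5] = 1.0          # every ray starts at the origin and points along +Z
depth = np.full((2, 2), 3.0)
points = points_from_rays(ray_map, depth)  # (2, 2, 3), each point at (0, 0, 3)
```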

Output Parameters:

  • Depth / confidence: Two-Dimensional (2D) per view (H × W)
  • Ray map: Three-Dimensional (3D) per view (H × W × 6: 3D origin plus 3D direction)
  • Point cloud: Three-Dimensional (3D) per view (H × W × 3)
  • Camera intrinsics: Two-Dimensional (2D) (3 × 3)
  • Camera extrinsics: Two-Dimensional (2D) (4 × 4)
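
The Quick Start comments give the camera-to-world extrinsics as (S, 3, 4) in [R | t] form, while this list gives 4 × 4. The two differ only by the constant bottom row [0, 0, 0, 1]. The sketch below shows the conversion, the closed-form inverse to world-to-camera, and a 3 × 3 intrinsics matrix built from focal length and principal point. The helper names are hypothetical, and zero skew is an assumption.

```python
import numpy as np

def to_homogeneous(c2w_3x4):
    """Append the row [0, 0, 0, 1] to a (3, 4) [R | t] matrix."""
    bottom = np.array([[0.0, 0.0, 0.0, 1.0]])
    return np.concatenate([c2w_3x4, bottom], axis=0)

def invert_rigid(c2w_4x4):
    """Invert a rigid transform: [R | t]^-1 = [R^T | -R^T t]."""
    rotation, translation = c2w_4x4[:3, :3], c2w_4x4[:3, 3]
    w2c = np.eye(4)
    w2c[:3, :3] = rotation.T
    w2c[:3, 3] = -rotation.T @ translation
    return w2c

def intrinsics_matrix(fx, fy, cx, cy):
    """Pinhole intrinsics with zero skew (an assumption)."""
    return np.array([[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]])

# Example pose: 90-degree rotation about Z plus a translation.
c2w = to_homogeneous(np.array([[0.0, -1.0, 0.0, 1.0],
                               [1.0,  0.0, 0.0, 2.0],
                               [0.0,  0.0, 1.0, 3.0]]))
w2c = invert_rigid(c2w)
```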

Other Properties Related to Output:

  • One 3D point, one depth value, one ray, and one depth-confidence value are predicted per input pixel in each image.
  • One set of camera parameters (intrinsics + extrinsics) is predicted per input image, expressed in the coordinate frame of the first view.
  • Cameras are recovered from the predicted ray maps following Lin et al. 2025; a camera multi-layer perceptron…
