nvidia/Cosmos3-Super-Image2Video
Cosmos 3: Omnimodal World Models for Physical AI
[Model Collection](https://huggingface.co/collections/nvidia/cosmos3) | [Code](https://github.com/nvidia/cosmos) | [White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf) | [Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)
NVIDIA Cosmos™ is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.
Model Overview: Cosmos3-Super-Image2Video
Description
Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.
This model is ready for commercial and non-commercial use.
Model Developer: NVIDIA
Model Versions
- Cosmos3-Nano:
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
- Cosmos3-Super:
  - Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
- Cosmos3-Nano-Policy-DROID:
  - Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
- Cosmos3-Super-Image2Video:
  - Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.
- Cosmos3-Super-Text2Image:
  - Given text input, generate high-fidelity images that are consistent with the provided description.
License
This model is released under the OpenMDW1.1 license.
Deployment Geography
Global
Use Case
Physical AI: Encompassing robotics, autonomous vehicles (AV), and smart space environments, including industrial and factory-scale applications.
Release Date
- Hugging Face: 05/31/2026 via https://huggingface.co/collections/nvidia/cosmos3
- GitHub: 05/31/2026 via https://github.com/nvidia/cosmos
Model Architecture
Architecture Type: Transformer
Network Architecture: Mixture-of-Transformers (MoT)
Cosmos3 is an omnimodal foundation model built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. During inference, text is generated through standard next-token autoregressive decoding, while non-text modalities, such as images, video, audio, and actions, are synthesized through iterative denoising. This unified architecture enables Cosmos3 to model heterogeneous modalities within a single framework while preserving the generation mechanism best suited to each modality.
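To make the split between the two generation mechanisms concrete, the toy sketch below routes each modality to the mechanism named in the paragraph above and contrasts the two loops. It is an illustration only, not the Cosmos3 implementation: `tower_for`, `decode_autoregressive`, `denoise_iteratively`, and the stand-in callables passed to them are hypothetical names that do not come from the Cosmos code base.

```python
from typing import Callable, List

# Modalities that the card says are synthesized through iterative denoising.
DIFFUSION_MODALITIES = {"image", "video", "audio", "action"}


def tower_for(modality: str) -> str:
    """Return which (hypothetically named) tower handles a modality."""
    if modality == "text":
        return "autoregressive"
    if modality in DIFFUSION_MODALITIES:
        return "diffusion"
    raise ValueError(f"unknown modality: {modality!r}")


def decode_autoregressive(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    eos: int,
    max_new_tokens: int,
) -> List[int]:
    """Next-token decoding: append one token at a time until EOS or the cap."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # stand-in for a model forward pass
        tokens.append(token)
        if token == eos:
            break
    return tokens


def denoise_iteratively(
    denoise_step: Callable[[List[float], int], List[float]],
    noise: List[float],
    num_steps: int,
) -> List[float]:
    """Iterative denoising: refine the whole sample over a fixed number of steps."""
    sample = list(noise)
    for step in reversed(range(num_steps)):
        sample = denoise_step(sample, step)  # stand-in for one denoising update
    return sample
```

The structural difference is that the first loop grows a sequence one discrete token at a time, while the second keeps the sample's shape fixed and refines every value at each step.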
This model was developed based on: Cosmos Framework
Number of trainable model parameters:
- Cosmos3-Nano: 16B
- Cosmos3-Super: 64B
- Cosmos3-Nano-Policy-DROID: 16B
- Cosmos3-Super-Image2Video: 64B
- Cosmos3-Super-Text2Image: 64B
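As a rough sizing aid, the parameter counts above translate into weight storage as parameters times bytes per parameter. The sketch below assumes that "B" means 10^9 parameters and that weights are held in bf16 (2 bytes each). This card does not state the checkpoint precision, and the estimate covers weights only, ignoring activations, caches, and framework overhead.

```python
# Parameter counts as listed in this card ("B" assumed to mean 10^9).
PARAMS = {
    "Cosmos3-Nano": 16_000_000_000,
    "Cosmos3-Super": 64_000_000_000,
    "Cosmos3-Nano-Policy-DROID": 16_000_000_000,
    "Cosmos3-Super-Image2Video": 64_000_000_000,
    "Cosmos3-Super-Text2Image": 64_000_000_000,
}

# Storage cost per parameter for common dtypes (assumption: the card does
# not say which precision the released weights use).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_gb(model: str, dtype: str = "bf16") -> float:
    """Estimate weight storage in GB (10^9 bytes), weights only."""
    return PARAMS[model] * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name in PARAMS:
        print(f"{name}: ~{weight_gb(name):.0f} GB in bf16")
```

Under these assumptions the 64B variants need about 128 GB for weights alone, so they would typically have to be sharded across several accelerators, while the 16B variants need about 32 GB.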
Input/Output Specifications
- Generator Input
  - Input Type(s): Text, Image, Video (with audio or without audio), Action Trajectory
  - Input Format(s):
    - Text: String
    - Image: jpg, png, jpeg, webp
    - Video (with or without audio): mp4
    - Action: json (1D list)
  - Input Parameters:
    - Text: One-dimensional (1D)
    - Image: Two-dimensional (2D)
    - Video: Three-dimensional (3D)
    - Audio: One-dimensional (1D)
    - Action trajectory: One-dimensional (1D)
  - Other Properties Related to Input:
    - For video inputs, we accept various resolutions, including 720p, 480p, and 256p.
    - When using input video with audio muxed into the video MP4 file, the audio should have 2 channels (stereo) and a 48 kHz sample rate.
    - Image and video inputs are RGB color (8 bits per channel, sRGB color space); grayscale inputs are not supported.
    - Action input is a per-frame sequence of robot/agent state or control values (e.g., joint positions, gripper state, camera pose). The full input is a 2D array shaped (T, D), where T is the number of frames and D is the embodiment-specific dimensionality listed below.
    - Input action is only supported for compatible embodiments, including general camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single Franka Panda arm with RobotiQ gripper (10D), dual Franka Panda arm with RobotiQ gripper (20D), Agibot (29D), UR (10D), Google robot (10D), WidowX 250 (10D), UMI (9D).
  - Input Size and Length limits:
    - Text: 4096 tokens
    - Image: 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16)
    - Video: 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16). Max number of frames = 5.
    - Audio: Max 0.5 second
    - Action: 16 – 400 video frames
- Generator Output
  - Output Type(s): Image, video, audio, action, text
  - Output Format(s):
    - Image: JPG
    - Video: MP4
    - Audio: Advanced Audio Coding (AAC) stream (muxed within the MP4)
    - Action: 1D list (.json)
    - Text: string
  - Output Parameters:
    - Image: Two-dimensional (2D)
    - Video: Three-dimensional (3D)
    - Audio: One-dimensional (1D)
    - Action: One-dimensional (1D)
    - Text: One-dimensional (1D)
  - Other Properties Related to Output:
    - The generated video is an MP4 file, with the resolution, frame rate, and duration specified in the input. The generated audio is encoded in AAC format, muxed into the video MP4 file with 2 channels (stereo) and a 48 kHz sample rate.
    - Video generation supports durations from 5 to 400 frames, with 189 frames as the default generation duration.
    - The…
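The limits listed above can be checked before a request is sent to the model. The following is a minimal pre-flight sketch, assuming the action input follows the (T, D) description given under "Other Properties Related to Input". The function names and the short embodiment keys (for example `franka_single`) are hypothetical labels invented for this example; they are not identifiers from the Cosmos code base, and only the numeric limits come from this card.

```python
from typing import Sequence, Tuple

# Per-frame action dimensionality D for each embodiment listed in this card.
# The dictionary keys are made-up labels for illustration.
ACTION_DIMS = {
    "camera_motion": 9,
    "autonomous_vehicle": 9,
    "egocentric_motion": 57,
    "franka_single": 10,
    "franka_dual": 20,
    "agibot": 29,
    "ur": 10,
    "google_robot": 10,
    "widowx_250": 10,
    "umi": 9,
}
SUPPORTED_RESOLUTIONS = ("256p", "480p", "720p")
SUPPORTED_ASPECT_RATIOS = ("16:9", "4:3", "1:1", "3:4", "9:16")
ACTION_FRAMES = (16, 400)      # allowed action length T, in video frames
OUTPUT_FRAMES = (5, 400)       # allowed generated video length, in frames
DEFAULT_OUTPUT_FRAMES = 189    # default generation duration


def validate_action(
    action: Sequence[Sequence[float]], embodiment: str
) -> Tuple[int, int]:
    """Check that an action trajectory is shaped (T, D) within the listed limits."""
    if embodiment not in ACTION_DIMS:
        raise ValueError(f"unsupported embodiment: {embodiment!r}")
    dim = ACTION_DIMS[embodiment]
    num_frames = len(action)
    low, high = ACTION_FRAMES
    if not low <= num_frames <= high:
        raise ValueError(f"action has {num_frames} frames, expected {low} to {high}")
    for index, row in enumerate(action):
        if len(row) != dim:
            raise ValueError(f"frame {index} has {len(row)} values, expected {dim}")
    return num_frames, dim


def validate_video_request(
    resolution: str, aspect_ratio: str, num_frames: int = DEFAULT_OUTPUT_FRAMES
) -> int:
    """Check resolution tier, aspect ratio, and output frame count."""
    if resolution not in SUPPORTED_RESOLUTIONS:
        raise ValueError(f"unsupported resolution: {resolution!r}")
    if aspect_ratio not in SUPPORTED_ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio!r}")
    low, high = OUTPUT_FRAMES
    if not low <= num_frames <= high:
        raise ValueError(f"num_frames={num_frames}, expected {low} to {high}")
    return num_frames
```

For example, `validate_action([[0.0] * 10 for _ in range(16)], "franka_single")` returns `(16, 10)`, while a 15-frame trajectory raises `ValueError` because it falls below the 16-frame minimum.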