What does this model signal mean?

NVIDIA published nvidia/Cosmos3-Super-Image2Video. This model signal is evidence of what shipped on model infrastructure and how the release is positioned. High-signal details: license other · 15.9K HF downloads · NVIDIA image-to-video model, moderate traction. onlylabs links this event to 1 captured evidence page and 6 related model signals.

NVIDIA Model: nvidia/Cosmos3-Super-Image2Video

Captured source

Hugging Face: huggingface.co/nvidia/Cosmos3-Super-Image2Video

nvidia/Cosmos3-Super-Image2Video model card

published May 21, 2026 · seen 5d · captured 11h · http 200 · method plain · task image-to-video · license other · library cosmos · params 65B · downloads 16k · likes 120

Cosmos 3: Omnimodal World Models for Physical AI

[Model Collection](https://huggingface.co/collections/nvidia/cosmos3) | [Code](https://github.com/nvidia/cosmos) | [White Paper](https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf) | [Website](https://research.nvidia.com/labs/cosmos-lab/cosmos3/)

NVIDIA Cosmos™ is a world foundation model platform designed to accelerate the development of Physical AI by enabling machines to understand, simulate, and interact with the physical world across robotics, autonomous driving, and smart space environments, including industrial and factory-scale applications.

Model Overview: Cosmos3-Super-Image2Video

Description

Cosmos3 is a collection of omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.

This model is ready for commercial and non-commercial use.

Model Developer: NVIDIA

Model Versions

Cosmos3-Nano:
Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.

Cosmos3-Super:
Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.

Cosmos3-Nano-Policy-DROID:
Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.

Cosmos3-Super-Image2Video:
Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.

Cosmos3-Super-Text2Image:
Given text input, generate high-fidelity images that are consistent with the provided description.

License

This model is released under the OpenMDW 1.1 license.

Deployment Geography

Global

Use Case

Physical AI: Encompassing robotics, autonomous vehicles (AV), and smart space environments, including industrial and factory-scale applications.

Release Date

Hugging Face: 05/31/2026 via https://huggingface.co/collections/nvidia/cosmos3
GitHub: 05/31/2026 via https://github.com/nvidia/cosmos

Model Architecture

Architecture Type: Transformer

Network Architecture: Mixture-of-Transformers (MoT)

Cosmos3 is an omnimodal foundation model built on a Mixture-of-Transformers (MoT) architecture consisting of two complementary transformer towers: an autoregressive transformer for discrete token generation and a diffusion transformer for continuous multimodal generation. During inference, text is generated through standard next-token autoregressive decoding, while non-text modalities, such as images, video, audio, and actions, are synthesized through iterative denoising. This unified architecture enables Cosmos3 to model heterogeneous modalities within a single framework while preserving generation mechanisms best suited to each modality.
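
The split between the two towers can be pictured as a dispatch on output modality. The following is a toy Python sketch of that idea only, not the Cosmos implementation: the names `Tower` and `tower_for` are hypothetical, and the mapping simply restates the paragraph above (text goes to autoregressive decoding, every other modality goes to iterative denoising).

```python
from enum import Enum


class Tower(Enum):
    """Hypothetical labels for the two transformer towers described above."""
    AUTOREGRESSIVE = "autoregressive"  # next-token decoding, used for text
    DIFFUSION = "diffusion"            # iterative denoising, used for non-text


# Non-text output modalities named in the model card excerpt.
DIFFUSION_MODALITIES = {"image", "video", "audio", "action"}


def tower_for(modality: str) -> Tower:
    """Return the generation mechanism the card associates with a modality."""
    name = modality.strip().lower()
    if name == "text":
        return Tower.AUTOREGRESSIVE
    if name in DIFFUSION_MODALITIES:
        return Tower.DIFFUSION
    raise ValueError(f"unknown output modality: {modality!r}")


if __name__ == "__main__":
    for m in ("text", "video", "action"):
        print(m, "->", tower_for(m).value)
```

For this image-to-video checkpoint the output modality is video, so under this picture generation would run through the diffusion path.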

This model was developed based on: Cosmos Framework

Number of trainable model parameters:

Cosmos3-Nano: 16B
Cosmos3-Super: 64B
Cosmos3-Nano-Policy-DROID: 16B
Cosmos3-Super-Image2Video: 64B
Cosmos3-Super-Text2Image: 64B

Input/Output Specifications

Generator Input
Input Type(s): Text, Image, Video (with audio or without audio), Action Trajectory
Input Format(s):
Text: String
Image: jpg, png, jpeg, webp
Video (with or without audio): mp4
Action: json (1D list)
Input Parameters:
Text: One-dimensional (1D)
Image: Two-dimensional (2D)
Video: Three-dimensional (3D)
Audio: One-dimensional (1D)
Action trajectory: One-dimensional (1D)
Other Properties Related to Input:
For video inputs, we accept various resolutions, including 720p, 480p, and 256p.
When using input video with audio muxed into the video MP4 file, the audio should have 2 channels (stereo) and a 48 kHz sample rate.
Image and video inputs are RGB color (8 bits per channel, sRGB color space); grayscale inputs are not supported.
Action input is a per-frame sequence of robot/agent state or control values (e.g., joint positions, gripper state, camera pose). The full input is a 2D array shaped (T, D), where T is the number of frames and D is the embodiment-specific dimensionality listed below.
Action input is supported only for compatible embodiments, including general camera motion (9D), autonomous vehicle (9D), egocentric motion (57D), single Franka Panda arm with Robotiq gripper (10D), dual Franka Panda arms with Robotiq grippers (20D), Agibot (29D), UR (10D), Google robot (10D), WidowX 250 (10D), UMI (9D).
Input Size and Length limits:
Text: 4096 tokens
Image: 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16)
Video: 256p, 480p, and 720p resolution at one of these aspect ratios (16:9, 4:3, 1:1, 3:4, 9:16). Max number of frames = 5.
Audio: Max 0.5 second
Action: 16 – 400 video frames
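
The input constraints listed above can be checked on the client side before a request is assembled. The following is a minimal Python sketch under stated assumptions: the function names `check_image_path` and `check_action`, and the embodiment keys in `ACTION_DIMS`, are hypothetical labels chosen for illustration and are not part of any Cosmos API. The accepted image extensions, the per-embodiment dimensionality D, and the 16 to 400 frame range are copied from the excerpt, and the trajectory is assumed to be held as a list of T rows of D values.

```python
from pathlib import Path

# Values copied from the model card excerpt; the dictionary keys are
# hypothetical identifiers, not names used by the Cosmos code base.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
ACTION_DIMS = {
    "camera_motion": 9,
    "autonomous_vehicle": 9,
    "egocentric_motion": 57,
    "franka_single": 10,
    "franka_dual": 20,
    "agibot": 29,
    "ur": 10,
    "google_robot": 10,
    "widowx_250": 10,
    "umi": 9,
}
ACTION_MIN_FRAMES, ACTION_MAX_FRAMES = 16, 400


def check_image_path(path):
    """Return the lowercased extension if it is an accepted image format."""
    ext = Path(path).suffix.lower()
    if ext not in IMAGE_EXTENSIONS:
        raise ValueError(f"unsupported image format: {ext or '(none)'}")
    return ext


def check_action(embodiment, action):
    """Validate a (T, D) trajectory given as a list of T rows; return (T, D)."""
    expected_d = ACTION_DIMS.get(embodiment)
    if expected_d is None:
        raise ValueError(f"unsupported embodiment: {embodiment!r}")
    t = len(action)
    if not ACTION_MIN_FRAMES <= t <= ACTION_MAX_FRAMES:
        raise ValueError(f"T={t} is outside {ACTION_MIN_FRAMES}..{ACTION_MAX_FRAMES}")
    for i, row in enumerate(action):
        if len(row) != expected_d:
            raise ValueError(f"row {i} has {len(row)} values, expected {expected_d}")
    return t, expected_d


if __name__ == "__main__":
    trajectory = [[0.0] * 9 for _ in range(16)]  # 16 frames of 9D UMI actions
    print(check_image_path("first_frame.png"), check_action("umi", trajectory))
```

Resolution and aspect-ratio checks are left out on purpose, because the excerpt does not give the exact pixel dimensions behind labels such as 480p at 4:3.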
Generator Output
Output Type(s): Image, video, audio, action, text
Output Format(s):
Image: JPG
Video: MP4
Audio: Advanced Audio Coding (AAC) stream (muxed within the MP4)
Action: 1D list (.json)
Text: string
Output Parameters:
Image: Two-dimensional (2D)
Video: Three-dimensional (3D)
Audio: One-dimensional (1D)
Action: One-dimensional (1D)
Text: One-dimensional (1D)
Other Properties Related to Output:
The generated video is an MP4 file, with the resolution, frame rate, and duration specified in the input. The generated audio is encoded in AAC format, muxed into the video MP4 file with 2 channels (stereo) and a 48 kHz sample rate.
Video generation supports durations from 5 to 400 frames, with 189 frames as the default generation duration.
The…
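
The frame-count rule for generated video (5 to 400 frames, 189 by default) and its relation to clip length can be sketched as follows. This is an illustrative Python helper, not a Cosmos API: `resolve_num_frames` and `duration_seconds` are hypothetical names, and the frame rate used in the usage lines is an assumed value, since the excerpt says only that frame rate is specified in the input.

```python
# Limits copied from the model card excerpt.
MIN_FRAMES, MAX_FRAMES, DEFAULT_FRAMES = 5, 400, 189


def resolve_num_frames(requested=None):
    """Return the frame count to generate, falling back to the card's default."""
    if requested is None:
        return DEFAULT_FRAMES
    if isinstance(requested, bool) or not isinstance(requested, int):
        raise TypeError("frame count must be an integer")
    if not MIN_FRAMES <= requested <= MAX_FRAMES:
        raise ValueError(f"{requested} frames is outside {MIN_FRAMES}..{MAX_FRAMES}")
    return requested


def duration_seconds(num_frames, fps):
    """Clip length in seconds for a given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


if __name__ == "__main__":
    frames = resolve_num_frames()          # no request, so the default applies
    print(frames, duration_seconds(frames, 24))  # 24 fps is an assumed example rate
```

At an assumed 24 frames per second, the default of 189 frames works out to 189 / 24 = 7.875 seconds of video.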

Excerpt shown — open the source for the full document.

Notability

notability 6.0/10

NVIDIA image-to-video model, moderate traction.