nvidia/NV-OneFormer
Model Overview
Description:
OneFormer (One Transformer to Rule Universal Image Segmentation) is a universal segmentation architecture that handles panoptic, instance, and semantic image segmentation with a single model. A task token conditions the architecture on the desired segmentation type (e.g., "semantic," "instance," or "panoptic"), guiding it to output task-specific predictions within one unified training and inference process.
This model is ready for commercial use.
License/Terms of Use:
GOVERNING TERMS: Use of this model is governed by the NVIDIA Open Model Agreement
Deployment Geography:
Global
Use Case:
Intended Users: This model is intended for use by computer vision engineers, robotics engineers, and researchers who require a comprehensive and detailed pixel-level understanding of an image.
Intended Use Cases: The model's ability to perform semantic, instance, and panoptic segmentation with a single set of weights makes it well suited for:
- Autonomous Systems: Providing full scene understanding for self-driving cars or drones by identifying both "stuff" (road, sky) and "things" (car 1, car 2, pedestrian 1).
- Robotic Perception: Enabling robots to identify, locate, and separate individual objects for "pick-and-place" tasks in cluttered environments.
- Medical Image Analysis: Segmenting and counting individual cells or tumors (instances) while also classifying surrounding anatomical regions or tissues (semantic).
- Geospatial Analysis: Analyzing satellite imagery to map land use (e.g., "forest," "water") while also detecting and counting individual objects (e.g., "buildings," "vehicles").
- Computational Photography: Powering features like AR effects or portrait mode by creating precise masks that separate subjects from their background.
Release Date:
NGC 05/25/2026 via [URL]
Reference(s):
J. Jain, J. Li, M. Chiu, A. Hassani, N. Orlov, H. Shi: OneFormer: One Transformer to Rule Universal Image Segmentation
Model Architecture:
Architecture Type: The model is a unified segmentor that takes color (RGB) images as inputs and generates segmentation masks and associated labels as outputs.
Network Architecture:
- The backbone feature extractor of this model is the DiNaT-L model.
- The multi-scale features from the backbone are then fed into a Pixel Decoder (similar to an FPN) to generate high-resolution, multi-scale feature maps.
- The core of the architecture is a transformer decoder that takes two sets of inputs: the multi-scale feature maps and a set of learnable queries.
- A key innovation of OneFormer is the use of a task token. This single, learnable token is added to the queries to "prompt" the model, conditioning it to perform a specific task (semantic, instance, or panoptic segmentation) using the exact same weights.
- Finally, the refined query embeddings from the decoder are passed to two parallel prediction heads:
  - A classification head (a linear layer or small MLP) to predict the class label for each query.
  - A mask head (also an MLP) to dynamically generate the final mask for each query by combining the decoder outputs with the pixel decoder's feature maps.
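As an illustration of the mask head described above, the following minimal sketch shows one common way query-based segmenters combine per-query embeddings with per-pixel features: a dot product over the channel dimension, followed by a sigmoid. This is an assumption about the general technique, not a description of the released implementation; the function names, shapes, and the 0.5 threshold are hypothetical.

```python
import math


def mask_logits(query_embeds, pixel_feats):
    """Dot-product each query embedding with the feature vector at every pixel.

    query_embeds: Q x C nested list (one mask embedding per query).
    pixel_feats:  C x H x W nested list (pixel decoder output).
    Returns a Q x H x W nested list of mask logits.
    """
    channels = len(pixel_feats)
    height = len(pixel_feats[0])
    width = len(pixel_feats[0][0])
    logits = []
    for q in query_embeds:
        if len(q) != channels:
            raise ValueError("query and pixel features must share the channel size")
        logits.append([
            [sum(q[c] * pixel_feats[c][y][x] for c in range(channels))
             for x in range(width)]
            for y in range(height)
        ])
    return logits


def binary_masks(logits, threshold=0.5):
    """Apply a sigmoid and threshold it (threshold value is an assumption)."""
    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))
    return [[[sigmoid(v) > threshold for v in row] for row in mask] for mask in logits]


# Toy example: 2 queries, 2 channels, a 1 x 2 feature map.
queries = [[1.0, 0.0], [0.0, 1.0]]
features = [[[1.0, -1.0]], [[0.0, 2.0]]]
masks = binary_masks(mask_logits(queries, features))
```

In this toy case each query "selects" one channel, so the first query fires on the left pixel and the second on the right pixel.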
More Details: This is a universal segmentation model that takes an RGB image and a text string (e.g., "The task is semantic") as input and produces masks and classes as output. The DiNaT-Large backbone was pretrained in a supervised manner on NV-ImageNet, an NVIDIA proprietary dataset that allows commercial usage. The text input is used to generate a task token that conditions the model to perform a specific segmentation task (semantic, instance, or panoptic) using the same set of weights. OneFormer was then trained and fine-tuned on a combination of datasets, including OpenImages and ITS. All raw images used during training carry licenses that permit commercial use.
Number of model parameters: 2.3*10^8 (approximately 230 million)
Input(s):
Input Type(s): Image, Text
Input Format(s):
- Image: Red, Green, Blue (RGB)
- Text: String
Input Parameters:
- Image: Two-Dimensional (2D)
- Text: One-Dimensional (1D)
Other Properties Related to Input: The image height and width should each be divisible by 32, and the text should name the task, e.g. "The task is semantic" (or "instance" or "panoptic").
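The input constraints above can be checked before inference with a small helper. The sketch below is a hypothetical illustration: it computes the smallest size divisible by 32 that contains the image and builds the task string. Whether to pad or resize to that size, and the exact prompt wording the released model expects, are assumptions that should be confirmed against the TAO documentation; the prefix is therefore left as a parameter.

```python
VALID_TASKS = ("semantic", "instance", "panoptic")


def task_prompt(task, prefix="The task is"):
    """Build the task text string; `prefix` is an assumed default wording."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")
    return f"{prefix} {task}"


def padded_size(height, width, multiple=32):
    """Return the smallest (height, width) >= the input that is divisible by `multiple`."""
    def round_up(value):
        return ((value + multiple - 1) // multiple) * multiple
    return round_up(height), round_up(width)


# Example: a 1080 x 1920 frame needs 8 extra rows; the width is already aligned.
target_h, target_w = padded_size(1080, 1920)
prompt = task_prompt("panoptic")
```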
Output(s):
Output Type(s): Label, Mask and Score for each detected object in the input image.
Output Format(s):
- Label: Integer
- Mask: Red, Green, Blue (RGB)
- Score: Float
Output Parameters:
- Label: One-Dimensional (1D)
- Mask: Two-Dimensional (2D)
- Score: One-Dimensional (1D)
Other Properties Related to Output:
- pred_classes: Batch size x Number of queries
- pred_masks: Batch size x Number of queries x Height x Width
- pred_scores: Batch size x Number of queries
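Because every query produces a label, mask, and score, downstream code typically keeps only the confident queries. The following sketch shows one way to do that on plain nested lists with the shapes listed above. The function name and the 0.5 score threshold are hypothetical, and the actual container types returned by the runtime are not specified in this card.

```python
def filter_detections(pred_classes, pred_masks, pred_scores, score_threshold=0.5):
    """Keep, for each image, the queries whose score meets the threshold.

    pred_classes: B x Q, pred_masks: B x Q x H x W, pred_scores: B x Q.
    Returns a list (one entry per image) of dicts with label, score, and mask.
    """
    results = []
    for classes, masks, scores in zip(pred_classes, pred_masks, pred_scores):
        if not (len(classes) == len(masks) == len(scores)):
            raise ValueError("per-image outputs must share the query dimension")
        kept = [
            {"label": label, "score": score, "mask": mask}
            for label, mask, score in zip(classes, masks, scores)
            if score >= score_threshold
        ]
        results.append(kept)
    return results


# Toy batch: one image, two queries, 2 x 2 masks.
detections = filter_detections(
    pred_classes=[[3, 7]],
    pred_masks=[[[[1, 0], [0, 0]], [[0, 1], [1, 1]]]],
    pred_scores=[[0.9, 0.2]],
)
```

Here only the first query survives, since the second one scores below the assumed threshold.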
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
- TAO v6.25.11
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta
Preferred/Supported Operating System(s):
- Linux
The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.
Model Version(s):
v2.0
Training, Testing, and Evaluation Datasets:
Dataset Overview
- Total Number of Datasets: 04
++ Training: NV-ImageNet, Subset of OpenImagesv5, Subset of MSCOCO 2017, ITS (train)
++ Validation: ITS (val), COCO 2017 validation set.
- Data Modality: Image
Images are scaled,…