Model · NVIDIA · published May 25, 2026

nvidia/Cosmos-AnomalyGen-Metal-2B


Model Overview

Description:

Cosmos AnomalyGen — Metal Surface (UC2) generates synthetic anomaly images of magnetic-tile surfaces by inpainting a defect into the region marked by a user-supplied binary mask on a clean reference tile image, conditioned on one of five trained defect types (Blowhole, Break, Crack, Fray, Uneven). The release ships only the few-shot-finetuned modules (a set of anomaly-token embeddings and a 2-layer MLP adapter), which plug into the frozen Cosmos-Predict2 2B Text-to-Image diffusion backbone at inference time, alongside a frozen NV-DINOv2 mask encoder and a frozen T5 text encoder. Cosmos AnomalyGen — UC2 v1.0.0 was developed by NVIDIA as part of the Cosmos AnomalyGen pipeline. This model is ready for commercial use.

License/Terms of Use:

Governing Terms: Use of this model is governed by the NVIDIA Open Model Agreement.

Deployment Geography:

Global

Use Case:

Industrial visual-inspection teams responsible for magnetic-tile / metal-surface QA who have very few (≤5 per defect type) real anomaly examples. The model produces large-scale synthetic anomaly datasets (clean tile + binary mask → realistic Blowhole / Break / Crack / Fray / Uneven image) for training downstream defect-detection or segmentation models, including downstream TAO toolkit consumers via the DAFT v3.0 export path.

Release Date:

GitHub 06/02/2026 via https://github.com/NVIDIA/paidf-anomalygen

Reference(s):

  • Anomaly Diffusion (AAAI 2024) — paper: https://arxiv.org/abs/2312.05767, code: https://github.com/sjtuplayer/anomalydiffusion
  • Cosmos-Predict2 — https://github.com/nvidia-cosmos/cosmos-predict2
  • NV-DINOv2 classification model — https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/nv_dinov2_classification_model
  • Magnetic Tile Defect dataset — https://github.com/abin24/Magnetic-tile-defect-datasets

Model Architecture:

Architecture Type: Transformer (diffusion DiT backbone with learnable conditioning modules)

Network Architecture:

  • anomaly_embedding *(trainable, included in this release)*: token embeddings (256 tokens per object+anomaly pair). Five pairs are trained for UC2: metal_surface+MT_Blowhole, metal_surface+MT_Break, metal_surface+MT_Crack, metal_surface+MT_Fray, metal_surface+MT_Uneven.
  • adapter *(trainable, included in this release)*: 2-layer MLP with GELU activations (input / output hidden size = 1024), projecting the mask encoder output into the diffusion DiT conditioning space.
  • mask_encoder *(frozen, not redistributed in this release)*: NV-DINOv2 (ViT-L) backbone with adaptive pool (kernel = 7); weights are loaded from the separately downloaded NV-DINOv2 classification checkpoint at inference time.
  • text_encoder *(frozen, not redistributed in this release)*: google-t5/t5-large.
  • These modules condition the frozen Cosmos-Predict2 2B T2I DiT denoiser at inference time.

This model was developed based on Cosmos-Predict2-2B-Text2Image.

Number of model parameters: Approximately 3.4×10^6 (3.4 million) trainable parameters in the released modules: anomaly_embedding ≈ 1.3M (256 tokens × 1024 hidden × 5 object+anomaly pairs) plus the 2-layer MLP adapter ≈ 2.1M (1024→1024 with GELU). The trainable modules are distributed as the model/iter_000010000.pt checkpoint file. The frozen Cosmos-Predict2 2B base contributes ~2.0×10^9 (2 billion) parameters, which are used at inference time but not redistributed in this release.
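The parameter figures above can be reproduced with a short back-of-the-envelope calculation. This is a sketch only: it assumes both adapter layers are 1024→1024 linear layers with bias terms, which is consistent with the ≈ 2.1M figure but is not stated explicitly in the card. The helper name linear_params is illustrative.

```python
# Rough parameter count for the released (trainable) modules.
HIDDEN = 1024          # hidden size stated in the card
TOKENS_PER_PAIR = 256  # anomaly tokens per trained pair
NUM_PAIRS = 5          # Blowhole, Break, Crack, Fray, Uneven


def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Weights plus optional bias of one dense layer."""
    return n_in * n_out + (n_out if bias else 0)


# One 1024-dimensional vector per token, per trained pair.
embedding_params = TOKENS_PER_PAIR * HIDDEN * NUM_PAIRS

# Assumption: two 1024 -> 1024 layers with bias (GELU adds no parameters).
adapter_params = 2 * linear_params(HIDDEN, HIDDEN)

total_params = embedding_params + adapter_params

print(f"anomaly_embedding: {embedding_params:,}")  # 1,310,720
print(f"adapter:           {adapter_params:,}")    # 2,099,200
print(f"total:             {total_params:,}")      # 3,409,920
```

Under these assumptions the total comes to roughly 3.41 million, matching the "approximately 3.4 million" stated above.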

Input(s):

Input Type(s): Image, Binary Mask, Text

Input Format(s):

  • Image: PNG / JPG, Red, Green, Blue (RGB)
  • Binary Mask: PNG / JPG, single-channel binary (0 = background, 255 = anomaly region; binarized at threshold 127)
  • Text: anomaly-type string in the form object+anomaly (one of metal_surface+MT_Blowhole, metal_surface+MT_Break, metal_surface+MT_Crack, metal_surface+MT_Fray, metal_surface+MT_Uneven)

Input Parameters:

  • Image: Two-Dimensional (2D)
  • Mask: Two-Dimensional (2D)
  • Text: One-Dimensional (1D)

Other Properties Related to Input: Input clean image and paired mask must have the same dimensions; the model was trained at 512×512 and inference is run at the same resolution. anomaly_type must exactly match one of the five pairs trained for this UC2 checkpoint — passing an unsupported defect string is rejected by scripts/anomaly_gen/sdg-inference/validate_jsonl.py against this checkpoint's ag_config.yaml → dataloader_train.dataset.anomaly_types. The mask should ideally cover a contiguous defect region that resembles the trained mask distribution; the optional Automatic Mask Placement (AMP) tool can constrain placement to legal ROIs.
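The input constraints above can be pre-checked before a generation job is submitted. The following is a minimal sketch in plain Python, not the logic of validate_jsonl.py: the function names are hypothetical, the mask is represented as nested lists of 0-255 values, and treating "threshold 127" as a strict greater-than comparison is an assumption.

```python
from typing import List, Tuple

# The five pairs listed in this card for the UC2 checkpoint.
SUPPORTED_ANOMALY_TYPES = frozenset({
    "metal_surface+MT_Blowhole",
    "metal_surface+MT_Break",
    "metal_surface+MT_Crack",
    "metal_surface+MT_Fray",
    "metal_surface+MT_Uneven",
})


def check_anomaly_type(anomaly_type: str) -> str:
    """Return the string unchanged if it is a trained pair, else raise."""
    if anomaly_type not in SUPPORTED_ANOMALY_TYPES:
        raise ValueError(f"unsupported anomaly_type: {anomaly_type!r}")
    return anomaly_type


def binarize_mask(mask: List[List[int]], threshold: int = 127) -> List[List[int]]:
    """Map each pixel to 255 (anomaly) or 0 (background).

    Assumption: values strictly above the threshold count as anomaly.
    """
    return [[255 if px > threshold else 0 for px in row] for row in mask]


def check_same_size(image_size: Tuple[int, int], mask_size: Tuple[int, int]) -> None:
    """Image and mask must share the same (width, height)."""
    if image_size != mask_size:
        raise ValueError(f"size mismatch: image {image_size} vs mask {mask_size}")


if __name__ == "__main__":
    check_anomaly_type("metal_surface+MT_Crack")
    check_same_size((512, 512), (512, 512))
    print(binarize_mask([[0, 127, 128, 255]]))
```

A check like this only catches malformed requests early; the repository's own validation script remains the authority on what the checkpoint accepts.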

Output(s)

Output Type(s): Image

Output Format(s): PNG; Red, Green, Blue (RGB)

Output Parameters: Two-Dimensional (2D)

Other Properties Related to Output: 512×512 RGB synthetic anomaly image. Anomaly content is generated inside the user-supplied mask region; in the default crop_and_paste=True flow, the inpainted patch is pasted back onto the clean reference image so that non-masked pixels remain identical to the input. Poisson blending can optionally be enabled. Generation metadata (per-sample guidance, crop_ratio, seed, etc.) is written to SDG_result.csv alongside the images.
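The paste-back property described above (pixels outside the mask stay identical to the input) can be illustrated with a hard composite. This sketch is an illustration of the idea only and is not the pipeline's implementation: paste_back is a hypothetical helper, images are nested lists of RGB tuples, the crop step is omitted, and no Poisson blending is applied.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[int]]


def paste_back(clean: Image, generated: Image, mask: Mask) -> Image:
    """Take generated pixels where mask == 255, clean pixels elsewhere."""
    if not (len(clean) == len(generated) == len(mask)):
        raise ValueError("clean, generated and mask must have the same height")
    result: Image = []
    for clean_row, gen_row, mask_row in zip(clean, generated, mask):
        if not (len(clean_row) == len(gen_row) == len(mask_row)):
            raise ValueError("clean, generated and mask must have the same width")
        result.append([
            gen_px if m == 255 else clean_px
            for clean_px, gen_px, m in zip(clean_row, gen_row, mask_row)
        ])
    return result


if __name__ == "__main__":
    clean = [[(10, 10, 10), (20, 20, 20)]]
    generated = [[(90, 90, 90), (80, 80, 80)]]
    mask = [[0, 255]]  # only the second pixel is inside the anomaly region
    print(paste_back(clean, generated, mask))
```

A hard composite like this can leave a visible seam at the mask boundary, which is presumably the motivation for the optional Poisson blending mentioned above.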

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s):

  • PyTorch (via the Cosmos-Predict2 2B T2I pipeline)
  • Cosmos AnomalyGen scripts (scripts.anomaly_gen.synthetic_dataset_generation, torchrun-based)
  • NVIDIA TAO Toolkit — interop via DAFT v3.0 export (scripts.anomaly_gen.convert_to_daft_format)

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere (A100)
  • NVIDIA Hopper (H100)
  • NVIDIA RTX 6000

Supported Operating System(s):

  • Linux

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.

Model Version(s):

v1.0.0 —…

