Model by NVIDIA, published May 15, 2026

nvidia/nemotron-3.5-asr-streaming-0.6b

task: automatic-speech-recognition, license: other, library: nemo, downloads: 5k, likes: 357

Nemotron 3.5 ASR


> [!Note]
> This model is the multilingual extension of nvidia/nemotron-speech-streaming-en-0.6b, adding language-ID prompt conditioning to support transcription across 40 language-locales from a single model.

Nemotron 3.5 ASR is a multilingual, streaming Automatic Speech Recognition (ASR) model built for high-quality transcription in both low-latency streaming and high-throughput batch workloads. Developed by NVIDIA, this 600M-parameter model transcribes speech into text with native support for punctuation and capitalization. The chunk size is configurable at runtime, with options including 80 ms, 160 ms, 320 ms, 560 ms, and 1120 ms.
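To make the chunk sizes concrete, the following minimal sketch converts each listed duration into audio samples and encoder frames. It assumes 16 kHz mono input and an 80 ms encoder frame (a 10 ms feature hop with 8x subsampling, as is typical for FastConformer encoders); neither value is stated in this card, and the function name `chunk_geometry` is hypothetical.

```python
# Assumptions (not stated in the model card): 16 kHz mono audio and an
# 80 ms encoder frame (10 ms feature hop x 8x subsampling).
SAMPLE_RATE_HZ = 16_000
ENCODER_FRAME_MS = 80

# Chunk sizes listed in the model card, in milliseconds.
CHUNK_SIZES_MS = [80, 160, 320, 560, 1120]


def chunk_geometry(chunk_ms: int) -> dict:
    """Return the number of samples and encoder frames in one chunk."""
    if chunk_ms % ENCODER_FRAME_MS != 0:
        raise ValueError(f"{chunk_ms} ms is not a multiple of {ENCODER_FRAME_MS} ms")
    return {
        "chunk_ms": chunk_ms,
        "encoder_frames": chunk_ms // ENCODER_FRAME_MS,
        "samples": chunk_ms * SAMPLE_RATE_HZ // 1000,
    }


for ms in CHUNK_SIZES_MS:
    print(chunk_geometry(ms))
```

Under these assumptions, every listed chunk size is a whole number of encoder frames, which is one plausible reason the options are multiples of 80 ms.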

By leveraging a state-of-the-art Cache-Aware FastConformer-RNNT architecture, the model eliminates redundant overlapping computations common in traditional "buffered" streaming. This allows it to process only new audio chunks while reusing cached encoder context, significantly improving computational efficiency and minimizing end-to-end delay without sacrificing accuracy.
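The idea of reusing cached context can be illustrated with a toy causal convolution. This is not the model's encoder or NeMo code; it is a minimal pure-Python sketch in which the class `StreamingCausalConv` (a hypothetical name) keeps the last few input values as a cache, so each step touches only the new chunk yet reproduces the full-sequence output.

```python
def causal_conv(signal, kernel):
    """Full-sequence causal 1-D convolution (cross-correlation form),
    with zero padding on the left so output i depends only on inputs <= i."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)
    return [
        sum(kernel[j] * padded[i + j] for j in range(k))
        for i in range(len(signal))
    ]


class StreamingCausalConv:
    """Toy cache-aware layer: stores the last k-1 inputs between steps."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        self.cache = [0.0] * (len(self.kernel) - 1)  # starts as zero padding

    def step(self, chunk):
        k = len(self.kernel)
        padded = self.cache + list(chunk)
        out = [
            sum(self.kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(chunk))
        ]
        if k > 1:
            # Keep only what the next chunk needs as left context.
            self.cache = padded[-(k - 1):]
        return out


signal = [float(x) for x in range(1, 11)]
kernel = [0.5, 0.3, 0.2]

layer = StreamingCausalConv(kernel)
streamed = []
for start in range(0, len(signal), 3):  # feed chunks of 3 values
    streamed.extend(layer.step(signal[start:start + 3]))

print(streamed == causal_conv(signal, kernel))
```

The same principle applies to self-attention layers, where the cache would hold past keys and values rather than raw inputs.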

It was trained on a massive ASR dataset and is engineered to perform across diverse and challenging acoustic conditions.

This model is ready for commercial use.

---

License/Terms of Use

Governing Terms: Use of the model is governed by the OpenMDW-1.1 license.

Deployment Geography

Global

Use Case

This model is for transcription of multilingual audio.

Release Date

  • Hugging Face [06/04/2026] via https://huggingface.co/nvidia/nemotron-3.5-asr-streaming-0.6b

References

[1] Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition

[2] Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

[3] NVIDIA Granary

[4] NVIDIA NeMo Framework

Why Choose Nemotron 3.5 ASR?

  • 🌍 Single Multilingual Model: Transcribes 40 language-locales from one model through language-ID prompt conditioning, with optional automatic language detection.
  • Native Streaming Architecture: Cache-aware design enables efficient processing of continuous audio streams, designed and optimized for low-latency voice agent applications.
  • 💰 Improved Operational Efficiency: Delivers superior throughput compared to traditional buffered streaming approaches. This allows for a higher number of parallel streams within the same GPU memory constraints, directly reducing operational costs for production environments.
  • 🎛️ Dynamic Runtime Flexibility: Choose the optimal operating point on the latency-accuracy Pareto curve at inference time. No re-training is required to adjust for different use-case requirements.
  • 📝 Punctuation & Capitalization: Built-in support for punctuation and capitalization in output text.
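The throughput claim in the list above can be made concrete with a toy count of how many encoder frames each approach processes. The sketch below is an illustrative model only, not a benchmark: the function names are hypothetical, and the left-context length of 70 frames is an assumed value chosen for the example.

```python
def buffered_frames_processed(total_frames: int, chunk: int, left_context: int) -> int:
    """Buffered streaming: each step re-encodes the new chunk together with
    up to `left_context` older frames that were already encoded before."""
    processed = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        window_start = max(0, start - left_context)
        processed += end - window_start
    return processed


def cache_aware_frames_processed(total_frames: int, chunk: int) -> int:
    """Cache-aware streaming: each step encodes only the new chunk;
    older context is read from cached activations instead."""
    processed = 0
    for start in range(0, total_frames, chunk):
        processed += min(start + chunk, total_frames) - start
    return processed


# Illustrative values (assumptions): 100 frames of audio, 2-frame chunks,
# and 70 frames of left context for the buffered baseline.
buffered = buffered_frames_processed(total_frames=100, chunk=2, left_context=70)
cached = cache_aware_frames_processed(total_frames=100, chunk=2)
print(f"buffered: {buffered} frames, cache-aware: {cached} frames")
```

In this toy setting the buffered baseline encodes many times more frames than the cache-aware path. Real savings depend on the model, hardware, and cache overhead, none of which this count captures.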

---

Supported Languages

The model supports 40 language-locales in total, across three tiers:

  • Transcription-ready (19 locales): highest-accuracy ASR, ready out of the box.
  • Broad-coverage (13 locales): production ASR across an additional 13 locales.
  • Adaptation-ready (8 locales): recognized by the tokenizer; fine-tune on in-domain data to unlock full transcription.

| Tier | Languages (locales) |
| :--- | :--- |
| Transcription-ready (19 locales) | English (en-US, en-GB), Spanish (es-US, es-ES), French (fr-FR, fr-CA), Italian (it-IT), Portuguese (pt-BR, pt-PT), Dutch (nl-NL), German (de-DE), Turkish (tr-TR), Russian (ru-RU), Arabic (ar-AR), Hindi (hi-IN), Japanese (ja-JP), Korean (ko-KR), Vietnamese (vi-VN), Ukrainian (uk-UA) |
| Broad-coverage (13 locales) | Polish (pl-PL), Swedish (sv-SE), Czech (cs-CZ), Norwegian Bokmål (nb-NO), Danish (da-DK), Bulgarian (bg-BG), Finnish (fi-FI), Croatian (hr-HR), Slovak (sk-SK), Mandarin (zh-CN), Hungarian (hu-HU), Romanian (ro-RO), Estonian (et-EE) |
| Adaptation-ready (8 locales) | Greek (el-GR), Lithuanian (lt-LT), Latvian (lv-LV), Maltese (mt-MT), Slovenian (sl-SI), Hebrew (he-IL), Thai (th-TH), Norwegian Nynorsk (nn-NO) |

> Note: Transcription-ready and broad-coverage locales (32 total) produce ASR transcription out of the box; adaptation-ready locales require fine-tuning on in-domain data to enable full transcription. The model supports uppercase and lowercase letters, punctuation, spaces, and apostrophes.
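For deployments that route audio by locale, the tiers above can be encoded as a small lookup table. The sketch below uses only the locale codes from the table; the names `TIERS`, `LOCALE_TO_TIER`, and `needs_fine_tuning` are hypothetical and not part of any NVIDIA API.

```python
# Locale codes copied from the tier table in this card.
TIERS = {
    "transcription-ready": [
        "en-US", "en-GB", "es-US", "es-ES", "fr-FR", "fr-CA", "it-IT",
        "pt-BR", "pt-PT", "nl-NL", "de-DE", "tr-TR", "ru-RU", "ar-AR",
        "hi-IN", "ja-JP", "ko-KR", "vi-VN", "uk-UA",
    ],
    "broad-coverage": [
        "pl-PL", "sv-SE", "cs-CZ", "nb-NO", "da-DK", "bg-BG", "fi-FI",
        "hr-HR", "sk-SK", "zh-CN", "hu-HU", "ro-RO", "et-EE",
    ],
    "adaptation-ready": [
        "el-GR", "lt-LT", "lv-LV", "mt-MT", "sl-SI", "he-IL", "th-TH",
        "nn-NO",
    ],
}

# Invert the table so a locale maps directly to its tier.
LOCALE_TO_TIER = {
    locale: tier for tier, locales in TIERS.items() for locale in locales
}


def needs_fine_tuning(locale: str) -> bool:
    """True if the card says the locale requires in-domain fine-tuning."""
    tier = LOCALE_TO_TIER.get(locale)
    if tier is None:
        raise KeyError(f"unsupported locale: {locale}")
    return tier == "adaptation-ready"


print(len(LOCALE_TO_TIER), needs_fine_tuning("th-TH"), needs_fine_tuning("de-DE"))
```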

> Note: We recommend the Nemotron ASR Streaming (English) model for English-only transcription use cases. For all other transcription-ready locales, we recommend Nemotron 3.5 ASR to leverage its expanded multilingual capabilities.

> [!Tip]
> Automatic language detection / language tagging: When run with `target_lang=auto`, the model detects the spoken language and emits the corresponding language code/tag in the output following the terminal punctuation. This lets a single deployment transcribe mixed-language traffic and automatically label each utterance with its detected language, with no separate language-ID component required.
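Downstream code will usually want to separate the transcript from the trailing language tag. The card does not specify the tag's exact format, so the sketch below assumes a hypothetical form such as `<de-DE>` appended after the final punctuation; the regular expression and the function name `split_language_tag` are illustrative and should be adjusted to the model's real output.

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: the tag looks like "<xx-XX>" at the very end of the text.
# The real tag format is not documented in this card.
_TAG_PATTERN = re.compile(
    r"^(?P<text>.*?)\s*<(?P<lang>[a-z]{2}-[A-Z]{2})>\s*$", re.DOTALL
)


def split_language_tag(output: str) -> Tuple[str, Optional[str]]:
    """Split a transcript into (text, locale), or (text, None) if untagged."""
    match = _TAG_PATTERN.match(output)
    if match is None:
        return output.strip(), None
    return match.group("text"), match.group("lang")


print(split_language_tag("Guten Morgen. <de-DE>"))
print(split_language_tag("No tag here."))
```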

---

Model Architecture

Architecture Type: FastConformer-CacheAware-RNNT with Prompt

This model consists of a cache-aware streaming Parakeet (FastConformer) encoder with an RNN-T decoder and language-ID prompt conditioning. It is based on the Cache-Aware [\[1\]](#ref-1) FastConformer [\[2\]](#ref-2) architecture with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The cache-aware streaming design enables efficient processing of audio in chunks while maintaining context from previous frames. Unlike buffered inference, this model maintains caches for all encoder self-attention and convolution layers. This enables reuse of hidden states at every streaming step, where cached activations eliminate redundant computations. As a result, there are no overlapping computations; each processed frame is strictly non-overlapping. This model leverages prompts to…

Excerpt shown — open the source for the full document.
