Model · IBM (Granite) · published Mar 10, 2026

ibm-granite/granite-speech-4.1-2b-nar

task: image-feature-extraction · license: apache-2.0 · library: transformers · params: 2.3B · downloads: 30k · likes: 50

Granite-Speech-4.1-2B-NAR

Model Summary: Granite-Speech-4.1-2B-NAR is a non-autoregressive (NAR) speech recognition model that formulates ASR as conditional transcript editing. Instead of decoding tokens one at a time, it edits a CTC hypothesis in a single forward pass using a bidirectional LLM, achieving competitive accuracy with faster inference than autoregressive alternatives. The model is based on the NLE (Non-autoregressive LLM-based Editing) architecture described in this paper.

For applications where accuracy is the primary concern, consider **granite-speech-4.1-2b**, an autoregressive model from the Granite Speech 4.1 family that achieves higher transcription accuracy at the cost of increased inference latency. granite-speech-4.1-2b produces punctuated and capitalized transcripts, supports speech translation (AST) and keyword-biased recognition, and adds Japanese to the supported languages.

When speaker or word-timing information is needed, consider using **granite-speech-4.1-2b-plus**, which extends the above model with speaker-attributed ASR (speaker labels + word transcripts) and word-level timing information.

Release Date: April 2026

License: Apache 2.0

Supported Languages: English, French, German, Spanish, Portuguese

Intended Use: The model is intended for automatic speech recognition tasks, particularly in latency-sensitive applications where fast inference is critical.

Evaluation Results

Open ASR leaderboard results

RTFx-WER results on the Open ASR leaderboard (as of Apr 2026).

Additional results

Greedy decoding with bfloat16 inference. WER computed with jiwer after whisper_normalizer EnglishTextNormalizer normalization. Open ASR Leaderboard results may differ slightly due to normalization and scoring pipeline differences. Measured RTFx of ~1820 on a single H100 GPU (batched inference, batch size 128).

| Dataset | WER | Dataset | WER |
|---|---|---|---|
| LibriSpeech clean | 1.29 | MLS EN | 4.77 |
| LibriSpeech other | 2.75 | MLS DE | 4.75 |
| CommonVoice 15 EN | 6.50 | MLS ES | 3.31 |
| CommonVoice 15 DE | 4.73 | MLS FR | 4.52 |
| CommonVoice 15 ES | 4.02 | MLS PT | 11.86 |
| CommonVoice 15 FR | 7.17 | AMI IHM | 7.91 |
| CommonVoice 15 PT | 2.57 | AMI SDM | 19.59 |
| Earnings-22 | 8.48 | GigaSpeech | 10.12 |
| SPGISpeech | 3.04 | TED-LIUM | 3.67 |
| VoxPopuli | 5.83 | | |
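For readers who want to sanity-check numbers of this kind, the two metrics can be sketched in plain Python. The reported figures come from jiwer with the whisper_normalizer EnglishTextNormalizer; the snippet below is only a stand-in with a much cruder normalizer (lowercasing and punctuation stripping), so it will not reproduce the table exactly. The function names `simple_normalize`, `word_error_rate`, and `rtfx` are illustrative and not part of any library.

```python
import re


def simple_normalize(text: str) -> list[str]:
    """Crude stand-in for EnglishTextNormalizer: lowercase, drop punctuation, split."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return text.split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = simple_normalize(reference), simple_normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row dynamic programme for the Levenshtein distance.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds
```

Note that this scores a single utterance. Corpus-level WER is normally the total number of errors divided by the total number of reference words, not the mean of per-utterance values.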

Usage

Installation

Inference requires flash_attention_2, since this backend supports sequence packing and respects the is_causal=False flag. The model also requires transformers>=5.5.3 and torch>=2.9.1.

# Fresh install (CUDA 12.8, Python 3.10+)
pip install torch==2.9.1 torchaudio==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install "transformers>=5.5.3" accelerate safetensors huggingface-hub tokenizers
pip install soundfile
pip install flash-attn==2.8.3 --no-build-isolation

Inference with transformers

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-4.1-2b-nar"
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    device_map=device,
    dtype=torch.bfloat16,
).eval()
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Load sample audio from the repo
audio_path = hf_hub_download(repo_id=model_name, filename="10226_10111_000000.wav")
waveform, sr = torchaudio.load(audio_path)
if sr != 16000:
    # The encoder expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sr, 16000)
if waveform.shape[0] > 1:
    # Downmix multi-channel audio to mono
    waveform = waveform.mean(dim=0, keepdim=True)
waveform = waveform.squeeze(0)

# Extract features, run transcription, decode
inputs = processor([waveform], device=device)
output = model.transcribe(**inputs)
transcriptions = processor.batch_decode(output.preds)

print(f"Prediction: {transcriptions[0]}")

Model Architecture

The architecture consists of three components:

(1) CTC Speech Encoder (440M params)

A 16-layer Conformer encoder trained with CTC on character-level targets. It processes 16 kHz audio with stacked log-mel features (80 mel bins, 2-frame stacking) and uses block attention with 4-second audio blocks and self-conditioning at layer 8. The encoder has a dual CTC head: alongside the character-level output, a secondary BPE head produces CTC logits over the LLM's 100K token vocabulary. The BPE head uses posterior-weighted pooling (window size 4), with per-frame importance weights derived from mid-layer blank probabilities (1 - blank_prob).
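The pooling step can be illustrated with a small pure-Python sketch. The model card does not say whether the windows overlap, how the weights are normalized, or what happens when every frame in a window is blank, so the choices below (non-overlapping windows, weights normalized to sum to one, fallback to a plain average) are assumptions, and `posterior_weighted_pool` is a hypothetical name rather than a function from the model's code.

```python
def posterior_weighted_pool(frames, blank_probs, window=4):
    """Pool consecutive windows of frame vectors, weighting each frame by 1 - blank_prob.

    frames:      list of equal-length feature vectors (lists of floats), one per frame
    blank_probs: list of CTC blank probabilities in [0, 1], one per frame
    """
    if len(frames) != len(blank_probs):
        raise ValueError("frames and blank_probs must have the same length")
    pooled = []
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        weights = [1.0 - p for p in blank_probs[start:start + window]]
        total = sum(weights)
        if total <= 1e-8:
            # Assumption: if every frame looks blank, fall back to a plain average.
            weights = [1.0] * len(chunk)
            total = float(len(chunk))
        dim = len(chunk[0])
        pooled.append([
            sum(w * vec[d] for w, vec in zip(weights, chunk)) / total
            for d in range(dim)
        ])
    return pooled
```

Under these assumptions, frames the encoder considers likely blanks contribute little to the pooled vector, so each window is dominated by frames that carry a token.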

(2) Q-Former Projector (160M params)

A 2-layer window Q-Former that downsamples the concatenated hidden representations from 4 encoder layers (layers 4, 8, 12, 16) by 5x. Each 15-frame window is reduced to 3 queries via cross-attention, resulting in a 10 Hz acoustic embedding rate for the LLM (2x reduction from the encoder, then 5x from the projector).
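The rate arithmetic can be checked in a few lines. The 100 Hz log-mel frame rate (a 10 ms hop) is an assumption: it is not stated in the model card, but it is the value that makes a 2x and a 5x reduction land on 10 Hz. The helper `num_audio_embeddings` is illustrative only.

```python
MEL_FRAME_RATE_HZ = 100   # assumption: 10 ms hop between log-mel frames
FRAME_STACKING = 2        # 2-frame stacking in the encoder front end
WINDOW_FRAMES = 15        # encoder frames per Q-Former window
QUERIES_PER_WINDOW = 3    # queries emitted per window

encoder_rate_hz = MEL_FRAME_RATE_HZ / FRAME_STACKING      # 50.0 frames per second
projector_factor = WINDOW_FRAMES / QUERIES_PER_WINDOW     # 5.0x downsampling
llm_rate_hz = encoder_rate_hz / projector_factor          # 10.0 embeddings per second


def num_audio_embeddings(audio_seconds: float) -> int:
    """Approximate number of acoustic embeddings the LLM sees (ignores padding)."""
    return int(audio_seconds * llm_rate_hz)
```

With these numbers, a 30-second clip maps to roughly 300 acoustic embeddings, and each 15-frame window covers 0.3 seconds of audio.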

(3) Bidirectional LLM Editor (1B params, LoRA-adapted)

granite-4.0-1b-base with its causal attention mask removed, enabling bidirectional context. Adapted with LoRA (rank 128) applied to both attention and MLP layers. The LLM receives concatenated audio embeddings and an interleaved CTC hypothesis with insertion slots, then predicts the edited transcript in a single parallel forward pass using a CTC objective.

How Granite-speech NAR Works

1. The frozen CTC encoder produces acoustic embeddings and an initial hypothesis.
2. The hypothesis is interleaved with insertion slots (blank tokens between each token).
3. The projected audio embeddings are concatenated with the interleaved hypothesis embeddings.
4. The bidirectional LLM predicts edits (copy, insert, delete, replace) at all positions simultaneously.
5. CTC greedy decoding (argmax + collapse) produces the final transcript.
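Steps 2 and 5 can be sketched on word-level tokens in plain Python. This is a toy illustration, not the model's implementation: the `BLANK` placeholder and both function names are hypothetical, the real model works on BPE token ids, and the per-position predictions are assumed to be already argmaxed.

```python
BLANK = "<blank>"  # hypothetical placeholder for the CTC blank / insertion slot


def interleave_with_slots(hypothesis):
    """Step 2: put one blank insertion slot between each pair of hypothesis tokens."""
    out = []
    for i, tok in enumerate(hypothesis):
        if i > 0:
            out.append(BLANK)
        out.append(tok)
    return out


def ctc_greedy_collapse(predictions):
    """Step 5 (after argmax): merge consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for tok in predictions:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return out


# CTC hypothesis with one wrong word ("sit") and one missing word ("the").
slots = interleave_with_slots(["the", "cat", "sit", "on", "mat"])

# Hypothetical per-position predictions from the editor:
# copy "the", "cat", "on", "mat"; replace "sit" with "sat";
# insert "the" in the slot between "on" and "mat".
predicted = ["the", BLANK, "cat", BLANK, "sat", BLANK, "on", "the", "mat"]
transcript = ctc_greedy_collapse(predicted)
```

A deletion corresponds to predicting the blank at a token position. Because collapse merges consecutive repeats, a genuinely repeated word only survives if a blank separates its two occurrences.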

This design exploits the identity mapping bias of Transformers: residual connections and tied embeddings make the model naturally inclined to copy input tokens, so it focuses learning capacity on corrections rather than full reconstruction.

