Model: IBM (Granite), published Apr 16, 2026

ibm-granite/granite-speech-4.1-2b-plus

task: automatic-speech-recognition | license: apache-2.0 | library: transformers | params: 2.1B | downloads: 17k | likes: 70

Granite-Speech-4.1-2B-Plus

Model Summary

Granite-Speech-4.1-2B-Plus has capabilities similar to those of the Granite-Speech-4.1-2B model. The plus model adds two community-requested rich transcription features that can be activated with a simple prompt change: speaker-attributed ASR (speaker labels alongside word transcripts) and word-level timing information. Unlike the base model, the plus model does not produce punctuation or capitalization.

The model was trained on corpora similar to those used for the Granite-Speech-4.1-2B model, augmented with speaker-turn and word-level timestamp tags. This allows the model to operate in different modes, each selected by a different prompt.

Two additional model variants explore different capabilities and inference optimization:

  • Granite-Speech-4.1-2B, for applications where accuracy is the primary concern. It supports punctuated and capitalized transcripts, AST, and keyword-biased recognition, and it also covers Japanese.
  • Granite-Speech-4.1-2B-NAR, which introduces a novel non-autoregressive architecture for higher throughput.

ASR only mode

In this mode, the model generates only the text transcript, as the Granite-Speech-4.1-2B model does.

Speaker attributed ASR (SAA)

In this mode, the model inserts a speaker tag of the form [Speaker N]:, where $N$ is the speaker number, before each speaker turn. Speakers are numbered in order of appearance, so the first speaker is always marked [Speaker 1]:, the second [Speaker 2]:, and so on. For example: "[Speaker 1]: Hello how are you [Speaker 2]: I'm fine and how are you feeling [Speaker 1]: I feel wonderful".

See [Resources](#resources) for more information about SAA.
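The tag format described above lends itself to simple post-processing. The following is a minimal sketch, not part of the model's own tooling, that splits an SAA transcript into per-speaker turns. The function name `split_speaker_turns` is hypothetical, and the sketch assumes the tags appear exactly as `[Speaker N]:` in the decoded text.

```python
import re

# Matches the speaker tag described above, capturing the speaker number.
SPEAKER_TAG = re.compile(r"\[Speaker (\d+)\]:")


def split_speaker_turns(transcript):
    """Split an SAA transcript into a list of (speaker_number, text) turns.

    Hypothetical helper: assumes every turn is preceded by a
    "[Speaker N]:" tag, as in the example above.
    """
    # With one capture group, re.split returns
    # [text_before_first_tag, num_1, text_1, num_2, text_2, ...].
    parts = SPEAKER_TAG.split(transcript)
    turns = []
    for i in range(1, len(parts) - 1, 2):
        turns.append((int(parts[i]), parts[i + 1].strip()))
    return turns


example = (
    "[Speaker 1]: Hello how are you "
    "[Speaker 2]: I'm fine and how are you feeling "
    "[Speaker 1]: I feel wonderful"
)
for speaker, text in split_speaker_turns(example):
    print(f"Speaker {speaker}: {text}")
```

Any text that precedes the first tag is ignored by this sketch; a stricter parser might treat it as an error instead.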

Word-level timestamps

In this mode, the model adds a timestamp tag after each word, indicating where the word ends in the audio. Silences are transcribed as _ and are also followed by a timestamp tag marking their end. The tag has the format [T:N], where $N$ is an integer giving the time in centiseconds (1/100th of a second). To reduce the number of generated tokens, only the last three digits of $N$ are emitted, so the tag value rolls over every 10 seconds.

The conversion from a time $t$ in seconds to a timestamp is $N = \operatorname{round}(100\,t) \bmod 1000$. To convert back to seconds, use $t = N/100 + 10R$, where $R$ is the rollover counter, i.e. the number of times the tag value has wrapped around so far. See code below for example implementation in Python.

See [Resources](#resources) for more information about timestamps.
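As an illustration of the two formulas above, here is a minimal sketch of the conversion in both directions. It is not the model card's reference implementation. The helper names are hypothetical, the exact spacing between a word and its tag is an assumption, and the rollover counter $R$ is inferred by assuming that a tag value smaller than the previous one means exactly one wrap (i.e. consecutive tags are less than 10 seconds apart).

```python
import re

# A token (word or "_" for silence) followed by its [T:N] end-time tag.
# The whitespace between token and tag is an assumption about the output.
TIMED_TOKEN = re.compile(r"(\S+)\s*\[T:(\d+)\]")


def encode_timestamp(t_seconds):
    """N = round(100 * t) mod 1000, as in the formula above."""
    return int(round(t_seconds * 100)) % 1000


def decode_timestamps(text):
    """Return (token, end_time_in_seconds) pairs, using t = N/100 + 10R.

    Assumption: R increases by one whenever N drops below the previous N.
    """
    rollovers = 0
    previous_n = -1
    result = []
    for token, n_text in TIMED_TOKEN.findall(text):
        n = int(n_text)
        if n < previous_n:
            rollovers += 1  # the three-digit counter wrapped past 999
        previous_n = n
        result.append((token, n / 100 + 10 * rollovers))
    return result


sample = "hello [T:950] _ [T:999] world [T:45]"
for token, end in decode_timestamps(sample):
    print(f"{token!r} ends at {end:.2f}s")
```

Because only three digits are kept, a gap of 10 seconds or more between two consecutive tags cannot be recovered from the tags alone; the sketch would undercount rollovers in that case.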

Incremental decoding

There are cases where we want to transcribe a new audio segment along with previous segments that we've already transcribed. This can be useful for providing longer context for the model in order to improve transcription accuracy or to maintain the speaker numbering in SAA mode. To avoid re-decoding the previous segments, we can provide the previous transcription in the prefix_text field of the conversation template. The model will decode the parts after that. See the code below for examples.

Keyword list biasing (KWB)

Keyword list biasing capability is available to enhance the recognition of keywords, such as names and technical terms. This is particularly useful in tasks where complex terms may otherwise be misrecognized. Keyword biasing can be applied by including the keywords directly in the prompt; for example, in ASR mode: Can you transcribe the speech into a written format? Keywords: …

Users may provide either a single keyword or a list of keywords. The list may include terms that do not appear in the input audio, so the same list can be reused for batch processing or recurring domain-specific use cases.

See [Resources](#resources) for more information about keyword list biasing.
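A prompt of the kind described above can be assembled with ordinary string handling. The sketch below is illustrative only: the helper `build_kwb_prompt` is hypothetical, and the comma separator between keywords is an assumption, since the excerpt does not show how the keyword list is delimited. Check the full model card for the exact format before relying on it.

```python
# Base ASR prompt quoted from the text above.
BASE_PROMPT = "Can you transcribe the speech into a written format?"


def build_kwb_prompt(keywords, separator=", "):
    """Append a keyword list to the ASR prompt.

    Hypothetical helper. The separator is an assumption; the source
    only shows the prompt up to "Keywords:".
    """
    cleaned = [k.strip() for k in keywords if k and k.strip()]
    if not cleaned:
        # No usable keywords: fall back to the plain ASR prompt.
        return BASE_PROMPT
    return f"{BASE_PROMPT} Keywords: {separator.join(cleaned)}"


print(build_kwb_prompt(["Granite", "WDER", "SPGISpeech"]))
```

Keeping the keyword list in one place like this makes it easy to reuse the same domain vocabulary across many audio files.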

Evaluations

Our evaluations showed that this model works well with audio segments up to 9 minutes long for ASR and SAA, and up to 5 minutes for timestamps.

ASR

Performance on **HuggingFace Open ASR leaderboard**:

| model | Average WER | AMI | Earnings22 | Gigaspeech | LS Clean | LS Other | SPGISpeech | Tedlium | Voxpopuli |
| :----------------------------------------- | :-------------: | :-----: | :------------: | :------------: | :----------: | :----------: | :------------: | :---------: | :-----------: |
| ibm-granite/granite-speech-4.1-2b-plus | 5.71 | 8.63 | 8.68 | 10.38 | 1.44 | 3.06 | 3.72 | 3.89 | 5.9 |
| ibm-granite/granite-speech-4.1-2b | 5.33 | 8.09 | 8.37 | 9.8 | 1.33 | 2.5 | 3.78 | 3.07 | 5.7 |
| ibm-granite/granite-speech-4.1-2b-nar | 5.44 | 8.03 | 8.44 | 10.16 | 1.28 | 2.77 | 3.33 | 3.62 | 5.86 |

(Using speculative decoding)

Keyword list biasing accuracy - Keyword F1 score (%, ↑ higher is better):

| Mode | Gigaspeech | LS-C | LS-O | SPGISpeech | VOX | TED_LIUM | Earnings22 | CV-en | CV-de | CV-es | CV-fr | CV-pt |
| ----------- | ---------- | -------- | -------- | ---------- | -------- | -------- | ---------- | -------- | -------- | -------- | -------- | -------- |
| Without KWB | 74.2 | 89.1 | 78.2 | 80.8 | 93.9 | 87.9 | 68.8 | 74.6 | 78.5 | 83.1 | 74.5 | 90.0 |
| With KWB | 84.1 | 96.1 | 93.0 | 92.5 | 96.3 | 94.9 | 81.5 | 91.5 | 92.9 | 93.9 | 90.6 | 95.0 |

Speaker Attributed ASR

Speaker Attributed ASR performance - WDER (%, ↓ lower is better):

| Model | FISHER | CALLHOME English | AMI-SDM | GALE |
| :----------------------------- | :--------: | :------------------: | :---------: | :------: |
| VibeVoice ASR [1] | 2.8 | 7.1 | 27.4 | 44.8 |
| Granite-speech-4.1-2b-plus | 0.9 | 2.2 | 14.6 | 30.2 |

The results are averaged over 2-5 minute speech segments.

(The evaluation metric, Word Diarization Error Rate [WDER], is the percentage of words attributed to the wrong speaker.)
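To make the metric concrete, the following is a simplified sketch of the percentage described above. It is not the evaluation code behind the reported numbers. It assumes the reference and hypothesis words have already been aligned one-to-one and that hypothesis speaker labels have already been mapped onto the reference labels; both steps are outside this sketch. The function name `wder_percent` is hypothetical.

```python
def wder_percent(ref_speakers, hyp_speakers):
    """Percentage of aligned words whose speaker label is wrong.

    Simplifying assumptions: both lists hold one speaker label per
    aligned word, in the same order, with labels already mapped to a
    common numbering.
    """
    if len(ref_speakers) != len(hyp_speakers):
        raise ValueError("label lists must have the same length")
    if not ref_speakers:
        return 0.0
    wrong = sum(r != h for r, h in zip(ref_speakers, hyp_speakers))
    return 100.0 * wrong / len(ref_speakers)


# Four aligned words, one attributed to the wrong speaker.
print(wder_percent([1, 1, 2, 2], [1, 2, 2, 2]))
```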

Timestamps

Word-level timestamp accuracy - AAS (ms, ↓ lower is better):

| Model | AMI-I | AMI-S | LS-C | LS-O | VOX | CV | MLS | TMT | En Avg | MLS-fr | MLS-es | MLS-de |…

Excerpt shown — open the source for the full document.

Notability

notability 7.0/10

Notable IBM speech model with good HF traction