IBM has released a pair of open speech recognition models, Granite Speech 4.1 2B and Granite Speech 4.1 2B-NAR, that show how much a roughly 2B-parameter speech model can do. Both are hosted on Hugging Face under the Apache 2.0 license.
The pair addresses a familiar problem for enterprise AI teams: most production-quality automatic speech recognition (ASR) systems either require substantial compute or give up accuracy to stay cost-effective. IBM's bet is that careful design choices can deliver both speed and accuracy.
What These Models Actually Do
Granite Speech 4.1 2B is a streamlined speech-language model built for multilingual automatic speech recognition (ASR) and two-way automatic speech translation (AST) across English, French, German, Spanish, Portuguese, and Japanese. Its non-autoregressive sibling, Granite Speech 4.1 2B-NAR, handles only ASR — optimized specifically for low-latency use cases — and covers English, French, German, Spanish, and Portuguese, excluding Japanese. This is an important difference: teams requiring Japanese transcription or any speech translation should opt for the standard autoregressive version.
IBM also quietly introduced a third variant alongside these two. Granite Speech 4.1 2B-Plus brings speaker-attributed ASR and word-level timestamps to the table, for use cases that demand knowing who spoke what — and precisely when they said it.
Word Error Rate (WER) is the go-to metric for evaluating transcription quality. Lower numbers indicate better performance. A WER of 5% translates to roughly 5 incorrect words out of every 100. On the Open ASR Leaderboard (as of April 2026), Granite Speech 4.1 2B posts a mean WER of 5.33. Breaking down the benchmark results — on LibriSpeech clean, the model records a WER of 1.33, and 2.5 on LibriSpeech other.
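To make the metric concrete, here is a minimal sketch of how WER is usually computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` is ours, and this is not the leaderboard's scoring code; real evaluations typically also normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {score:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.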
The Architecture, Explained
At a high level, both models are built from the same three-part design — a speech encoder, a modality adapter, and a language model — though the decoding process differs considerably between them.
The first part is the speech encoder. It consists of 16 Conformer blocks trained with Connectionist Temporal Classification (CTC) using two classification heads, one for graphemic (character-level) outputs and another for BPE units, with frame importance sampling to concentrate on the most informative segments of the audio. A Conformer block combines convolution (effective at picking up local acoustic patterns) with self-attention (suited to capturing long-range relationships). CTC is a training objective that lets the model learn from audio-text pairs without requiring precise frame-by-frame alignment.
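The way CTC sidesteps alignment is easiest to see at decoding time. The sketch below shows generic greedy CTC decoding, not IBM's implementation: take the best label per frame, merge consecutive repeats, then drop the blank symbol. The `_` blank and the frame labels are made up for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol, chosen for this illustration


def ctc_greedy_collapse(frame_labels):
    """Merge consecutive duplicate labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# 17 frames of per-frame best labels; the blank between the two "ll" runs
# is what allows a genuine double letter to survive the merge step.
frames = list("__hh_e_ll_ll_oo__")
print(ctc_greedy_collapse(frames))  # hello
```

Many different frame-level label sequences collapse to the same text, which is why training only needs the transcript and not per-frame labels.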
The second part is a speech-text modality adapter. A 2-layer window query transformer (Q-Former) processes blocks of 15 1024-dimensional acoustic embeddings from the final conformer block and reduces the temporal resolution by a factor of 5, using 3 trainable queries per block and per layer. The overall temporal downsampling factor is 10 (5x in the adapter on top of 2x in the encoder), which gives the LLM a 10Hz acoustic embedding rate. This adapter bridges continuous acoustic features and discrete text tokens, shrinking the audio representation so the language model can handle it efficiently. In the NAR model, the Q-Former carries 160M parameters and downsamples the concatenated hidden representations from four encoder layers (layers 4, 8, 12, and 16).
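The adapter's numbers can be checked with a little arithmetic. This sketch uses only the figures quoted above (15-frame blocks, 3 queries per block, 10Hz output); the 50Hz encoder output rate is inferred from them, and padding the final partial block is our assumption.

```python
BLOCK_FRAMES = 15      # encoder embeddings per window (from the text)
QUERIES_PER_BLOCK = 3  # trainable queries per window (from the text)
LLM_RATE_HZ = 10       # embedding rate seen by the LLM (from the text)

adapter_factor = BLOCK_FRAMES // QUERIES_PER_BLOCK  # 15 / 3 = 5
encoder_rate_hz = LLM_RATE_HZ * adapter_factor      # implied: 50 frames/s


def llm_embeddings_for(seconds: float) -> int:
    """Number of acoustic embeddings handed to the LLM for a clip.

    Assumes the last partial block is padded to a full block.
    """
    encoder_frames = round(seconds * encoder_rate_hz)
    blocks = -(-encoder_frames // BLOCK_FRAMES)  # ceiling division
    return blocks * QUERIES_PER_BLOCK


print(adapter_factor, encoder_rate_hz, llm_embeddings_for(30))
```

At this rate a 30-second clip costs the language model about 300 positions, far fewer than feeding it raw encoder frames would.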
The third part is the language model. Granite Speech 4.1 2B leverages an intermediate checkpoint of granite-4.0-1b-base with a 128k context length, fine-tuned across all training corpora. In the NAR variant, this becomes a 1B-parameter bidirectional LLM editor — granite-4.0-1b-base with its causal attention mask stripped away to allow bidirectional context — enhanced with LoRA at rank 128 applied to both attention and MLP layers.
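The difference between a causal and a bidirectional language model comes down to the attention mask. This is a toy sketch in plain Python, not the model's code: a 1 at row i, column j means position i may attend to position j.

```python
def causal_mask(n: int) -> list[list[int]]:
    """Each position sees itself and earlier positions only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> list[list[int]]:
    """Every position sees every other position."""
    return [[1] * n for _ in range(n)]


for name, mask in (("causal", causal_mask(4)),
                   ("bidirectional", bidirectional_mask(4))):
    print(name)
    for row in mask:
        print(" ", row)
```

With the causal mask removed, a token in the middle of a draft can use words on both sides of it. That is useful when correcting an existing transcript, but it rules out ordinary left-to-right generation.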
The Autoregressive vs. Non-Autoregressive Tradeoff
This is the area where the two models differ most significantly, and it carries direct implications for how they’re deployed in production.
In the standard Granite Speech 4.1 2B, text gets generated autoregressively — one token at a time, each conditioned on all tokens before it. This approach delivers reliable, high-quality transcripts with full support for AST, keyword-biased recognition, and punctuation, but it’s inherently sequential and slower when handling large volumes.
Granite Speech 4.1 2B-NAR adopts an entirely different strategy. Instead of producing tokens sequentially, it refines a CTC draft in a single forward pass through a bidirectional LLM, reaching competitive accuracy while outpacing autoregressive alternatives in throughput. This is the NLE (Non-autoregressive LLM-based Editing) architecture. In practice: the CTC encoder generates an initial rough transcript, that draft is interleaved with insertion placeholders, and then a bidirectional LLM simultaneously predicts edits — copy, insert, delete, or replace — at every position in one pass.
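The editing idea can be illustrated with a toy example. Everything here is hypothetical: the `<ins>` placeholder, the operation names, and the hard-coded edits stand in for whatever format the real NLE model uses, and in the real model all edits come out of a single forward pass of the bidirectional LLM.

```python
INS = "<ins>"  # hypothetical insertion placeholder token


def interleave(draft: list[str]) -> list[str]:
    """Put an insertion slot before, between, and after the draft tokens."""
    slots = [INS]
    for token in draft:
        slots += [token, INS]
    return slots


def apply_edits(slots: list[str], edits: list[tuple]) -> list[str]:
    """Apply one (operation, argument) pair per slot, all independently."""
    if len(slots) != len(edits):
        raise ValueError("need exactly one edit per slot")
    output = []
    for slot, (op, arg) in zip(slots, edits):
        if op == "copy" and slot != INS:
            output.append(slot)   # keep the draft token
        elif op in ("replace", "insert"):
            output.append(arg)    # emit the predicted token
        elif op == "delete":
            pass                  # drop the token or leave the slot empty
        else:
            raise ValueError(f"invalid edit {op!r} for slot {slot!r}")
    return output


draft = ["i", "sea", "cat"]  # rough CTC draft with two errors
slots = interleave(draft)    # 7 slots: 3 tokens + 4 placeholders
edits = [("delete", None), ("copy", None), ("delete", None),
         ("replace", "see"), ("insert", "the"), ("copy", None),
         ("delete", None)]
print(" ".join(apply_edits(slots, edits)))  # i see the cat
```

Because each slot's edit is decided independently, the cost is one pass over the sequence regardless of transcript length, whereas autoregressive decoding needs one step per output token.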
The NAR model achieved an RTFx of roughly 1820 on a single H100 GPU with batched inference at batch size 128. RTFx (real-time factor multiplier) measures how many times faster than real-time a model processes audio — an RTFx of 1820 means a one-hour audio file gets transcribed in under two seconds on that hardware. One practical detail engineers should keep in mind: the NAR model needs flash_attention_2 for inference, since this backend handles sequence packing and honors the is_causal=False flag.
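The RTFx arithmetic is simple enough to verify directly. The helper name below is ours, and the figure describes batched throughput, so a single file processed alone will not necessarily finish this quickly.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock time implied by an RTFx figure (audio duration / RTFx)."""
    return audio_seconds / rtfx


one_hour = 60 * 60
print(f"{processing_seconds(one_hour, 1820):.2f} s per hour of audio")
print(f"{processing_seconds(1000 * one_hour, 1820) / 60:.1f} min per 1,000 hours")
```

At an RTFx of 1820, an hour of audio corresponds to just under 2 seconds of compute, and 1,000 hours to roughly 33 minutes on the same hardware.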
Training Data and Infrastructure
The two models were trained on distinct datasets. The standard model was built from 174,000 hours of audio sourced from public corpora for ASR and AST, along with synthetic datasets crafted to support Japanese ASR, keyword-biased ASR, and speech translation. The NAR model was trained on roughly 130,000 hours of speech spanning five languages, drawing from publicly accessible datasets that include CommonVoice 15, MLS, LibriSpeech, LibriHeavy, AMI, Granary VoxPopuli, Granary YODAS, Earnings-22, Fisher, CallHome, and SwitchBoard.
The infrastructure requirements differ as well. Training the standard model took 30 days (26 days for the encoder and 4 days for the projector) on 8 H100 GPUs. The NAR model trained in just 3 days on 16 H100 GPUs (2 nodes) over 5 epochs, a considerably shorter cycle that may reflect the relative simplicity of the editing-based architecture compared to full autoregressive generation.
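Since the two runs used different GPU counts, GPU-days are a fairer basis for comparison than calendar days. This back-of-the-envelope sketch uses only the figures quoted above. It assumes every GPU was busy for the full duration, and it is unclear whether the NAR figure includes training of its encoder.

```python
def gpu_days(gpus: int, days: float) -> float:
    """Total GPU-days, assuming all GPUs run for the whole period."""
    return gpus * days


standard = gpu_days(gpus=8, days=30)  # 26 encoder days + 4 projector days
nar = gpu_days(gpus=16, days=3)

print(f"standard: {standard:.0f} GPU-days")
print(f"NAR:      {nar:.0f} GPU-days")
print(f"ratio:    {standard / nar:.1f}x")
```

On these numbers the standard model used about 240 GPU-days against 48 for the NAR model, a 5x gap rather than the 10x suggested by calendar days alone.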
Key Takeaways
Here are 5 concise takeaways:
- IBM has released two open ASR models — Granite Speech 4.1 2B (autoregressive) and Granite Speech 4.1 2B-NAR (non-autoregressive) — both at roughly 2B parameters, with Apache 2.0 licensing.
- The standard model posts a mean WER of 5.33 on the Open ASR Leaderboard and supports ASR in 6 languages (Japanese included), bidirectional speech translation, keyword biasing, and punctuation/truecasing.
- The NAR model trades breadth for speed — it drops Japanese, AST, and keyword biasing, but hits an RTFx of ~1820 on a single H100 GPU by refining a CTC draft in one pass rather than generating tokens sequentially.
- The architecture rests on three core components — a 16-layer Conformer encoder trained with dual-head CTC, a 2-layer window Q-Former projector that compresses audio to a 10Hz embedding rate, and a fine-tuned granite-4.0-1b-base language model.
- A third variant, Granite Speech 4.1 2B-Plus, is also available — augmenting the standard model with speaker-attributed ASR and word-level timestamps for scenarios where identifying the speaker and exact timing are essential.