NVIDIA’s Nemotron Speech team has launched Nemotron 3.5 ASR, a 600-million-parameter streaming automatic speech recognition model. A single checkpoint enables real-time transcription across 40 language-locales, with built-in punctuation and capitalization. The model is available as open weights on Hugging Face under the OpenMDW-1.1 license. Its architecture is a Cache-Aware FastConformer-RNNT.
What is Nemotron 3.5 ASR
Nemotron 3.5 ASR builds upon nvidia/nemotron-speech-streaming-en-0.6b to support many languages. It incorporates prompt-based language-ID conditioning into the base model, allowing this one 600M-parameter checkpoint to handle 40 language-locales without needing separate language-specific models or model switching.
The model is designed for two primary use cases: low-latency streaming for live audio and high-throughput batch transcription. Outputs include properly cased and punctuated text, eliminating the need for a separate punctuation restoration step.

How Cache-Aware FastConformer-RNNT Works
The model consists of two main components. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture that uses linearly scalable attention. The second component is an RNNT (Recurrent Neural Network Transducer) decoder, which generates text frame by frame as audio streams in.
The “cache-aware” design is the key efficiency mechanism. Traditional buffered streaming re-processes overlapping audio windows at every step, duplicating work and adding delay. This model instead caches encoder self-attention and convolution activations, reusing these cached states as new audio arrives. As a result, each audio frame is processed exactly once without overlap, reducing both computational overhead and end-to-end latency without sacrificing accuracy.
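The saving described above can be illustrated with a toy frame count. This is a minimal sketch, not NVIDIA's implementation: the function names and the window sizes are hypothetical, and it only counts how many encoder frames each strategy touches, assuming the buffered approach re-encodes a fixed window of past frames at every step.

```python
def buffered_frame_visits(total_frames: int, chunk: int, left_context: int) -> int:
    """Count frames encoded when every step re-encodes old context plus the new chunk."""
    visits = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        # The window reaches back over frames that were already encoded earlier.
        window_start = max(0, start - left_context)
        visits += end - window_start
    return visits


def cached_frame_visits(total_frames: int, chunk: int) -> int:
    """Count frames encoded when past activations are reused from a cache."""
    visits = 0
    for start in range(0, total_frames, chunk):
        end = min(start + chunk, total_frames)
        # Only the new frames are encoded; history is read from the cache.
        visits += end - start
    return visits


if __name__ == "__main__":
    # Hypothetical sizes: 100 frames, 4 new frames per step, 16 frames of history.
    print(buffered_frame_visits(100, chunk=4, left_context=16))
    print(cached_frame_visits(100, chunk=4))
```

In this toy setting the cached variant touches each frame once, while the buffered variant's cost grows with the amount of history it re-encodes per step.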
The Latency Knob: att_context_size
A single inference setting controls the trade-off between latency and accuracy: the attention context size, att_context_size. A smaller context produces text faster but considers less future audio, while a larger context improves accuracy at the cost of higher latency.
The same checkpoint supports the full range of settings, corresponding to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, [56,0] enables an 80ms ultra-low-latency mode, while [56,13] provides 1.12s for maximum accuracy. Teams can select the optimal operating point at inference time without any retraining.
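The two settings quoted above are consistent with a simple relationship between the right-context value and the chunk size. The sketch below assumes that each encoder frame covers 80ms and that the chunk spans the current frame plus the right-context frames; both are assumptions that happen to match [56,0] and [56,13], not a documented formula. The helper names are hypothetical.

```python
FRAME_MS = 80  # assumed duration of one encoder output frame


def chunk_ms(att_context_size: list[int]) -> int:
    """Chunk duration implied by [left, right] under the assumption above."""
    _left, right = att_context_size
    return (right + 1) * FRAME_MS


def right_context_for(target_chunk_ms: int) -> int:
    """Right-context value that would yield the given chunk duration."""
    if target_chunk_ms % FRAME_MS != 0:
        raise ValueError("chunk size must be a multiple of the frame duration")
    return target_chunk_ms // FRAME_MS - 1


if __name__ == "__main__":
    for ctx in ([56, 0], [56, 13]):
        print(ctx, chunk_ms(ctx), "ms")
    # Inferred, unconfirmed right-context values for the intermediate chunk sizes.
    for ms in (160, 320, 560):
        print(ms, "ms ->", right_context_for(ms))
```

Under these assumptions the intermediate chunk sizes would map to right-context values of 1, 3, and 6; check the model card before relying on them.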
Language Detection and Coverage
The 40 language-locales encompass English, Spanish, German, and French variants, as well as Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are also supported.
Language conditioning operates in two ways. Setting target_lang to a known locale generally delivers the best accuracy. Alternatively, setting target_lang=auto lets the model detect the language on its own, emitting a language tag after terminal punctuation. This allows a single deployment to transcribe mixed-language audio without requiring a separate language-ID component.
Comparison
| Product | Company | Access | Native streaming | Language coverage | Reported latency | Pricing model |
|---|---|---|---|---|---|---|
| Nemotron 3.5 ASR | NVIDIA | Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra | Yes (cache-aware FastConformer-RNNT) | 40 language-locales | 80ms–1.12s, configurable at inference | Free to self-host; usage-based via host |
| Whisper large-v3 | OpenAI | Open weights (MIT), self-host; API | No (offline/batch) | ~99 languages | Not streaming-native | Self-host free; API ~$0.006/min |
| Nova-3 | Deepgram | Closed API; enterprise on-premise/self-host options | Yes (real-time streams and batch uploads) | Multilingual model, plus 10 single-language models added in January 2026 | Reportedly under 300ms | ~$0.0077/min (Nova-3 Monolingual, pay-as-you-go) |
| Universal-3 Pro Streaming | AssemblyAI | Closed API (EU endpoint also available) | Yes | 6 languages: English, Spanish, French, German, Italian, Portuguese | Under 300ms per official specs; first partial result in about 750ms | Usage-based (pay-as-you-go) |
| Scribe v2 Realtime | ElevenLabs | Closed API | Yes | 90+ languages (99 according to ElevenLabs) | ~150ms (p50) | ~$0.28/hour |
| Ursa / streaming | Speechmatics | API, on-premise, and edge deployment | Yes (streaming and batch) | 50+ languages with automatic language detection | Positioned as ultra-low latency | Enterprise or usage-based |
Results from Fine-Tuning
Because the model weights are publicly available, teams can fine-tune them for a specific language, domain, or accent. NVIDIA demonstrated this with a practical walkthrough using Greek and Bulgarian. The base model checkpoint was fine-tuned using the same Cache-Aware FastConformer-RNNT recipe. Every training audio clip included a target_lang tag to condition the model on the correct language. The training datasets were sourced from Granary, Common Voice, and FLEURS — all publicly available collections.
Performance was evaluated on held-out portions of the FLEURS benchmark, measured as Word Error Rate (WER) at the 80ms latency setting. Greek WER improved from 35 to 24, a 32% relative reduction. Bulgarian WER dropped from 22 to 15, a 31% relative reduction. These figures are absolute WER percentages measured in the most aggressive low-latency streaming mode. NVIDIA emphasizes that measuring performance at the actual deployment latency, on data the model has never seen, gives the most honest picture.
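For readers who want to reproduce the relative figures, the arithmetic is a one-line calculation. The sketch below uses a hypothetical helper name and the rounded WERs quoted above; with those rounded inputs it yields roughly 31.4% for Greek and 31.8% for Bulgarian, so the 32% and 31% figures presumably come from unrounded WER values that this article does not list.

```python
def relative_wer_reduction(before: float, after: float) -> float:
    """Relative WER reduction in percent: (before - after) / before * 100."""
    if before <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Rounded WER values as quoted in the article (80ms setting, FLEURS).
    for language, before, after in (("Greek", 35, 24), ("Bulgarian", 22, 15)):
        print(f"{language}: {relative_wer_reduction(before, after):.1f}% relative reduction")
```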
Strengths and Things to Keep in Mind
Strengths:
- A single 600M-parameter checkpoint serves 40 language-locales, keeping deployment complexity low.
- Cache-aware streaming processes each audio frame just once, reportedly enabling 17 times as many concurrent streams as buffered methods on an H100 GPU.
- The att_context_size parameter lets you dial latency anywhere from 80ms up to 1.12 seconds at inference time, with no retraining needed.
- Punctuation, capitalization, and automatic (auto) language identification are all handled natively by the model.
- Open weights made it possible to achieve a 31–32% relative WER reduction on Greek and Bulgarian through fine-tuning.
Considerations:
- The model supports English, but NVIDIA suggests using its purpose-built English-only model when you only need English transcription.
- The 80ms mode sacrifices a bit of accuracy to achieve the fastest possible response time.
- Japanese and Korean use Character Error Rate (CER) metrics rather than WER, so comparing error rates across languages requires caution.
- All throughput benchmarks were measured on H100 hardware, so performance on other GPUs will vary.
- The production-ready NIM with gRPC streaming has been announced but has not yet been released.
Key Takeaways
- NVIDIA’s Nemotron 3.5 ASR is an open-weights (licensed under OpenMDW-1.1) streaming model with 600 million parameters, capable of transcribing 40 language-locales from one unified checkpoint.
- Its Cache-Aware FastConformer-RNNT architecture processes each audio frame exactly once, reportedly supporting 17 times more concurrent streams than traditional buffered approaches on an H100.
- Latency can be adjusted from 80ms to 1.12 seconds directly at inference time through the att_context_size setting, with no model retraining required.
- A brief fine-tuning run reduced FLEURS WER by 32% on Greek (35→24) and 31% on Bulgarian (22→15) at the 80ms latency setting.
- The model is designed for self-hosted, streaming-first deployment, setting it apart from closed API offerings (Deepgram, AssemblyAI, ElevenLabs) and offline tools like Whisper.




