NVIDIA Nemotron 3.5 ASR: 600M-Parameter Cache-Aware Streaming Model Transcribes 40 Language-Locales In Real Time

NVIDIA’s Nemotron Speech team has launched Nemotron 3.5 ASR, a 600-million-parameter streaming automatic speech recognition model. A single checkpoint enables real-time transcription across 40 language-locales, with built-in punctuation and capitalization. The model is available as open weights on Hugging Face under the OpenMDW-1.1 license. Its architecture is a Cache-Aware FastConformer-RNNT.

What is Nemotron 3.5 ASR

Nemotron 3.5 ASR builds upon nvidia/nemotron-speech-streaming-en-0.6b to support many languages. It incorporates prompt-based language-ID conditioning into the base model, allowing this one 600M-parameter checkpoint to handle 40 language-locales without needing separate language-specific models or model switching.

The model is designed for two primary use cases: low-latency streaming for live audio and high-throughput batch transcription. Outputs include properly cased and punctuated text, eliminating the need for a separate punctuation restoration step.

How Cache-Aware FastConformer-RNNT Works

The model consists of two main components. The first is a Cache-Aware FastConformer encoder with 24 layers. FastConformer is an efficient evolution of the Conformer architecture that uses linearly scalable attention. The second component is an RNNT (Recurrent Neural Network Transducer) decoder, which generates text frame by frame as audio streams in.

The “cache-aware” design is the key efficiency mechanism. Traditional buffered streaming re-processes overlapping audio windows at every step, duplicating work and adding delay. This model instead caches encoder self-attention and convolution activations, reusing these cached states as new audio arrives. As a result, each audio frame is processed exactly once without overlap, reducing both computational overhead and end-to-end latency without sacrificing accuracy.

The Latency Knob: `att_context_size`

A single inference setting controls the trade-off between latency and accuracy: the attention context size, att_context_size. A smaller context produces text faster but considers less future audio, while a larger context improves accuracy at the cost of higher latency.

The same checkpoint supports the full range of settings, corresponding to chunk sizes of 80ms, 160ms, 320ms, 560ms, and 1.12s. For example, [56,0] enables an 80ms ultra-low-latency mode, while [56,13] provides 1.12s for maximum accuracy. Teams can select the optimal operating point at inference time without any retraining.

Language Detection and Coverage

The 40 language-locales encompass English, Spanish, German, and French variants, as well as Arabic, Japanese, Korean, Mandarin, Hindi, and Thai. Several other European and Nordic languages are also supported.

Language conditioning operates in two ways. Setting target_lang to a known locale generally delivers the best accuracy. Alternatively, setting target_lang=auto lets the model detect the language on its own, emitting a language tag after terminal punctuation. This allows a single deployment to transcribe mixed-language audio without requiring a separate language-ID component.

Comparison

Product	Company	Access	Native streaming	Language coverage	Reported latency	Pricing model
Nemotron 3.5 ASR	NVIDIA	Open weights (OpenMDW-1.1), self-host; hosted on DeepInfra	Yes — cache-aware FastConformer-RNNT	40 language-locales	80ms–1.12s, configurable at inference	Free to self-host; usage-based via host
Whisper large-v3	OpenAI	Open weights (MIT), self-host; API	No — offline/batch	~99 languages	Not streaming-native	Self-host free; API ~$0.006/min

Nova-3	Deepgram	Closed API; enterprise on-premise/self-host options	Yes — works with both real-time streams and batch uploads	Multilingual model, plus 10 extra single-language models added in January 2026	Ultra-fast streaming with response times reportedly under 300 milliseconds	Around $0.0077 per minute (Nova-3 Monolingual, pay-as-you-go pricing)
Universal-3 Pro Streaming	AssemblyAI	Closed API (also available via EU-based endpoint)	Yes	Supports six languages: English, Spanish, French, German, Italian, and Portuguese	Under 300ms per official specs; first partial result lands in about 750ms	Pay-as-you-go (PAYG) based on usage
Scribe v2 Realtime	ElevenLabs	Closed API	Yes	Handles over 90 languages (99, according to ElevenLabs)	Approximately 150ms at the 50th percentile (p50)	Roughly $0.28 per hour
Ursa / streaming	Speechmatics	API, on-premise, and edge deployment all available	Yes — supports both streaming and batch	Over 50 languages with automatic language detection	Positioned as ultra-low latency	Enterprise or usage-based pricing

Results from Fine-Tuning

Because the model weights are publicly available, teams can fine-tune them for a specific language, domain, or accent. NVIDIA demonstrated this with a practical walkthrough using Greek and Bulgarian. The base model checkpoint was fine-tuned using the same Cache-Aware FastConformer-RNNT recipe. Every training audio clip included a target_lang tag to condition the model on the correct language. The training datasets were sourced from Granary, Common Voice, and FLEURS — all publicly available collections.

Performance was evaluated on held-out portions of the FLEURS benchmark, measured as Word Error Rate (WER) at the 80ms latency setting. Greek WER improved from 35 down to 24, representing a 32% relative gain. Bulgarian WER dropped from 22 to 15, a 31% relative improvement. These are straightforward WER percentages from the most aggressive low-latency streaming mode. NVIDIA emphasizes that measuring performance at the actual deployment latency, on data the model has never seen, delivers the most honest picture.

Strengths and Things to Keep in Mind

Strengths:

A single 600M-parameter checkpoint serves 40 language-locales, keeping deployment complexity low.
Cache-aware streaming processes each audio just once, reportedly enabling 17 times the concurrent streams compared to buffered methods on an H100 GPU.
The att_context_size parameter lets you dial latency anywhere from 80ms up to 1.12 seconds at inference time — no retraining needed.
Punctuation, capitalization, and automatic (auto) language identification are all handled natively by the model.
Open weights made it possible to achieve a 31–32% relative WER reduction on Greek and Bulgarian through fine-tuning.

Considerations:

The model supports English, but NVIDIA suggests using its purpose-built English-only model when you only need English transcription.
The 80ms mode sacrifices a bit of accuracy to achieve the fastest possible response time.
Japanese and Korean use Character Error Rate (CER) metrics rather than WER, so comparing error rates across languages requires caution.
All throughput benchmarks were measured on H100 hardware, so performance on other GPUs will vary.
The production-ready NIM with gRPC streaming has been announced but has not yet been released.

Key Takeaways

NVIDIA’s Nemotron 3.5 ASR is an open-weights (licensed under OpenMDW-1.1) streaming model with 600 million parameters, capable of transcribing 40 language-locales from one unified checkpoint.
Its Cache-Aware FastConformer-RNNT architecture processes each audio frame exactly once, reportedly supporting 17 times more concurrent streams than traditional buffered approaches on an H100.
Latency can be adjusted from 80ms to 1.12 seconds directly at inference time through the att_context_size setting — no model retraining required.
A brief fine-tuning run reduced FLEURS WER by 32% on Greek (35→24) and 31% on Bulgarian (22→15) at the 80ms latency setting.
The model is designed for self-hosted, streaming-first deployment, setting it apart from closed API offerings (Deepgram, AssemblyAI, ElevenLabs) and offline tools like Whisper.

Marktechpost’s Visual Explainer

NEMOTRON 3.5 ASR
1 / 10

NVIDIA · STREAMING SPEECH AI · OPEN WEIGHTS

Nemotron 3.5 ASR

A 600M-parameter cache-aware streaming model that transcribes 40 language-locales in real time, from a single checkpoint.

600M parameters
40 language-locales
80ms–1.12s latency
OpenMDW-1.1

01 — WHAT IT IS

One model, 40 language-locales

Extends nvidia/nemotron-speech-streaming-en-0.6b with prompt-based language-ID conditioning.
A single 600M-parameter checkpoint covers 40 language-locales. No model-swapping.
Punctuation and capitalization are built in. No separate post-processing step.
Targets two workloads: low-latency streaming and high-throughput batch.
NVIDIA still recommends its English-only model for English-only use.

02 — ARCHITECTURE

Cache-Aware FastConformer-RNNT

A 24-layer FastConformer encoder paired with an RNNT decoder.
Buffered streaming re-processes overlapping audio windows at every step.
This model caches encoder self-attention and convolution states, then reuses them.
Each audio frame is processed exactly once, with no overlap.
Compute and end-to-end latency drop, with no accuracy penalty.

03 — THE LATENCY KNOB

One setting tunes latency vs. accuracy

att_context_size	Chunk (latency)	Use case
`[56,0]`	80ms (Ultra-Low)	Ultra low latency voice agents
`[56,1]`	160ms (Low)	Interactive voice agents
`[56,3]`	320ms (Balanced)	Conversational AI, live caption
`[56,6]`	560ms (Medium)	Higher accuracy, reasonable latency
`[56,13]`	1.12s (High)	Top precision

Identical model checkpoint, selected during inference. No additional training needed.

04 — LANGUAGES & DETECTION

Broad language support with built-in language identification

Supports 40 language variants, such as English, Spanish, German, and French.
Also includes Arabic, Japanese, Korean, Mandarin, Hindi, and Thai.
For optimal results, specify target_lang with a known locale.
Use target_lang=auto to enable automatic language detection.
In automatic mode, the model outputs a language identifier following sentence-ending punctuation.
A single deployment manages multilingual input without requiring a dedicated language-detection module.

05 — THROUGHPUT

Compact size enables higher parallel processing

NVIDIA benchmarks against Parakeet RNNT 1.1B multilingual, a buffered-streaming model.
Nemotron 3.5 ASR is approximately half the parameter count: 0.6B compared to 1.1B.
The team achieves 17 times more simultaneous streams than buffered methods on a single H100 GPU.
Eliminating repeated computations reduces per-stream operational expenses.

The 17x metric comes from the official launch announcement; the model documentation confirms the general performance advantage.

06 — FINE-TUNING RESULTS

Brief fine-tuning significantly improves underperforming languages

Language	Initial WER	After Fine-tuning	Improvement
Greek	35	24	32%
Bulgarian	22	15	31%

Word Error Rate (%) measured on unseen FLEURS test data at 80ms latency. Training data: Granary, Common Voice, FLEURS.

07 — AVAILABILITY & ACCESS

Openly available weights with managed deployment options

Model weights hosted on Hugging Face under the OpenMDW-1.1 license.
Requires NeMo 26.06 or later. Audio input must be single-channel.
Available via DeepInfra, which offers word boosting for specialized terminology.
NVIDIA plans a NIM release later this month, featuring gRPC streaming support.
Supported GPU architectures: Ampere, Hopper, Blackwell, Lovelace, Turing, Volta, and Jetson.

08 — HOW IT COMPARES

Positioning within the current ecosystem

Solution	Availability	Streaming	Language Coverage
Nemotron 3.5 ASR	Open weights	Built-in	40 locales
Whisper large-v3	Open weights	No (batch-only)	~99
Deepgram Nova-3	API / on-premises	Built-in	Multilingual
AssemblyAI U-3 Pro	API	Built-in	6
ElevenLabs Scribe v2	API	Built-in	90+
Google Chirp / Azure	API	Built-in	100+ / 140+

Latency and accuracy metrics vary across providers; this table highlights architectural differences, not performance rankings.

09 — KEY TAKEAWAYS

Summary highlights

A 600-million-parameter open-weight streaming model handling 40 language variants from a single checkpoint.
Cache-optimized architecture processes each audio frame once; delivers 17x greater concurrency than buffered approaches on H100.
Adjustable latency from 80ms to 1.12s during inference, without retraining.
Brief fine-tuning reduced FLEURS word error rate by 32% for Greek and 31% for Bulgarian.
Self-hosted and natively streaming—unlike proprietary APIs or offline-only Whisper.

Curated for AI engineers by Marktechpost — practitioner-first coverage of AI & ML.

Explore the Model weights. Also, follow us on Twitter and join our 150k+ ML SubReddit and subscribe to our Newsletter. On Telegram? Join our channel there too.

Interested in collaborating to promote your GitHub repository, Hugging Face page, product launch, or webinar? Get in touch

Top Posts

Google Unveils Gemma 4 QAT Checkpoints: Q4_0 and a Revolutionary Mobile Format Slash On-Device Memory

Acer Swift Air 14 vs. MacBook Neo: The Budget Laptop Winner After Testing Both

NVIDIA Nemotron 3.5 ASR: 600M-Parameter Cache-Aware Streaming Model Transcribes 40 Language-Locales in Real Time

NVIDIA Nemotron 3.5 ASR: 600M-Parameter Cache-Aware Streaming Model Transcribes 40 Language-Locales in Real Time

Nemotron 3.5 ASR

One model, 40 language-locales

Cache-Aware FastConformer-RNNT

One setting tunes latency vs. accuracy

Broad language support with built-in language identification

Compact size enables higher parallel processing

Brief fine-tuning significantly improves underperforming languages

Openly available weights with managed deployment options

Positioning within the current ecosystem

Summary highlights

After Years of Testing: Why Wireless Security Cameras Beat Wired Systems for Home Protection

Turning AI Against Us – Why Betraying Users Is the Next Frontier

21 Game-Changing Low-Code and No-Code AI Tools You Need in 2026

My 25,000-Mile CarPlay Journey: The Apps I Couldn’t Live Without (and Why)

Predicting the Victor of the 2026 Soccer World Cup

Chinese Peptide Labs Surge on Crypto Capital Influx

Google Unveils Gemma 4 QAT Checkpoints: Q4_0 and a Revolutionary Mobile Format Slash On-Device Memory

Acer Swift Air 14 vs. MacBook Neo: The Budget Laptop Winner After Testing Both

NVIDIA Nemotron 3.5 ASR: 600M-Parameter Cache-Aware Streaming Model Transcribes 40 Language-Locales in Real Time

Frontier AI Models Expose Critical Crypto Vulnerabilities, Yet Experts Claim the Industry Lacks Preparedness

Every New iPhone Until I Tweak These Settings—And The Reason Is a Game-Changer

After Years of Testing: Why Wireless Security Cameras Beat Wired Systems for Home Protection

Building Multi-Agent Systems in Python: A Hands-On Approach

“Massive Bitcoin Crash Looming? DWF Labs Co-founder Sounds Alarm on MicroStrategy and BitMine”

Trending

Google Unveils Gemma 4 QAT Checkpoints: Q4_0 and a Revolutionary Mobile Format Slash On-Device Memory

Acer Swift Air 14 vs. MacBook Neo: The Budget Laptop Winner After Testing Both

Latest Posts

Not More Data, but Better World Models – Unite.AI

OpenAI Is Hiring Head of Preparedness, Amid AI Cyberattack Fears

Subscribe to Updates

Top Posts

NVIDIA Nemotron 3.5 ASR: 600M-Parameter Cache-Aware Streaming Model Transcribes 40 Language-Locales in Real Time

What is Nemotron 3.5 ASR

How Cache-Aware FastConformer-RNNT Works

The Latency Knob: att_context_size

Language Detection and Coverage

Comparison

Results from Fine-Tuning

Strengths and Things to Keep in Mind

Strengths:

Considerations:

Key Takeaways

Marktechpost’s Visual Explainer

Nemotron 3.5 ASR

One model, 40 language-locales

Cache-Aware FastConformer-RNNT

One setting tunes latency vs. accuracy

Broad language support with built-in language identification

Compact size enables higher parallel processing

Brief fine-tuning significantly improves underperforming languages

Openly available weights with managed deployment options

Positioning within the current ecosystem

Summary highlights

Related Posts

The Latency Knob: `att_context_size`