Microsoft AI Unveils MAI-Transcribe-1.5: Record-Breaking 2.4% WER, Top FLEURS Accuracy, And 5x Faster Long-Audio Transcription

Microsoft AI recently unveiled MAI-Transcribe-1.5, the latest version of its in-house speech-to-text model. The update focuses on improving accuracy across 43 languages, a range of accents, and challenging acoustic conditions. Microsoft is positioning the model for enterprise-scale transcription workloads.

Overview of MAI-Transcribe-1.5

MAI-Transcribe-1.5 is an automatic speech recognition (ASR) system. It converts spoken audio into written text. Unlike many competing models, this tool was created entirely in-house. It offers a unified system that understands 43 languages. It is specifically designed to perform well with different dialects and varied real-world sound environments.

Microsoft plans to incorporate this model into several of its products, including Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also accessible through Foundry, Microsoft’s platform for AI models.

Accuracy Metrics

Accuracy here is measured by word error rate (WER): the number of substituted, deleted, and inserted words in a transcript, divided by the number of words in the reference text. A lower WER means fewer errors. Microsoft claims the model achieves the top-ranked WER across 43 languages on the FLEURS benchmark, a widely used standard for evaluating multilingual transcription systems.
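To make the metric concrete, here is a minimal sketch of how WER is commonly computed, using a word-level edit distance. The function name and the simple lowercase, whitespace-based tokenization are assumptions for illustration; benchmark harnesses such as the ones behind FLEURS or Artificial Analysis typically apply their own text normalization, which this sketch does not reproduce.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    # Assumption: lowercase and split on whitespace; real evaluations
    # usually normalize punctuation and numerals as well.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "shaun met aoife and xochitl today"
    hypothesis = "sean met oif and societal today"
    # Three of the six reference words are substituted.
    print(word_error_rate(reference, hypothesis))  # 0.5
```

On this scale, a WER of 2.4% corresponds to roughly 24 word errors per 1,000 reference words.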

On the Artificial Analysis leaderboard, the model posts a WER of 2.4%, which places it third. The picture is therefore mixed: by Microsoft’s account the model ranks first on FLEURS, but it ranks third on Artificial Analysis.

Another major improvement is the expansion of language support. The model now covers 43 languages, up from 25. This expansion was made without reducing accuracy. Ten of the new languages are South Asian (like Bengali, Tamil, and Telugu), and eight are European (such as Ukrainian, Greek, and Catalan).

Processing Speed

MAI-Transcribe-1.5 currently leads in the balance of accuracy and speed on the Artificial Analysis leaderboard. It can process speech up to five times faster than other models with similar accuracy, and this advantage is most noticeable with lengthy recordings. For example, it can transcribe a full hour of audio in under 15 seconds.

Microsoft states the model is up to five times faster than Gemini 3.1, Scribe v2, and GPT-4o-Transcribe. Compared to the previous generation, MAI-Transcribe-1, it reportedly performs long-form transcription up to 5.7 times faster on Azure infrastructure. For workloads that process large volumes of audio, a speedup of this size is highly significant.
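The speed figures above can be turned into a rough capacity estimate. The sketch below converts the "one hour in under 15 seconds" claim into a speedup over real time and applies it to a backlog. Both function names are hypothetical, and the backlog estimate assumes sequential processing at a constant rate with no upload, queueing, or network overhead, which the article does not address.

```python
def real_time_speedup(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def backlog_processing_seconds(total_audio_seconds: float, speedup: float) -> float:
    """Estimated time to transcribe a backlog at a constant speedup."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return total_audio_seconds / speedup


if __name__ == "__main__":
    # One hour of audio in 15 seconds; "under 15 seconds" makes this a lower bound.
    speedup = real_time_speedup(3600, 15)
    print(speedup)  # 240.0

    # Assumption for illustration: a 1,000-hour archive processed sequentially.
    seconds = backlog_processing_seconds(1000 * 3600, speedup)
    print(seconds / 3600)  # hours of processing time
```

At that rate, the hypothetical 1,000-hour archive would take a little over four hours of pure processing time, before any real-world overhead.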

Keyword (Entity) Biasing: A Key Feature

Standard transcription tools frequently struggle with specialized words, such as names of people, products, medical terms, or company-specific jargon. However, these are often the most critical words for business users.

MAI-Transcribe-1.5 introduces keyword biasing, also known as entity biasing. Users can provide a list of specific terms (Azure supports up to 200 keywords), and the model will prioritize those terms when generating its output. Importantly, it does not simply force those words into the transcript; it uses the context of the speech to decide when they are appropriate. Microsoft reports that this feature reduces WER by up to 30% on the FLEURS benchmark.

A quick example demonstrates the effectiveness of this approach. Without biasing, unique names might appear as “Sean,” “Oif,” or “Societal.” With a specific list of names provided, the system correctly identifies them as “Shaun,” “Aoife,” and “Xochitl.” This functionality is particularly valuable in meetings, medical settings, and customer service centers.
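Before a keyword list is sent to the service, it usually helps to clean it and check it against the limit. The following is a minimal client-side sketch under stated assumptions: `prepare_keywords` and `MAX_KEYWORDS` are hypothetical names, not part of any Microsoft SDK, and the article does not show how the list is actually passed to the Azure API. Only the 200-keyword limit comes from the article.

```python
from typing import Iterable, List

# Limit the article cites for Azure; the constant name is hypothetical.
MAX_KEYWORDS = 200


def prepare_keywords(terms: Iterable[str], limit: int = MAX_KEYWORDS) -> List[str]:
    """Trim, de-duplicate (case-insensitively), and size-check a keyword list."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # trim and collapse inner whitespace
        key = normalized.casefold()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)  # keep the first spelling encountered
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} keywords exceed the limit of {limit}")
    return cleaned


if __name__ == "__main__":
    names = ["Shaun", "  Aoife ", "shaun", "", "Xochitl"]
    print(prepare_keywords(names))  # ['Shaun', 'Aoife', 'Xochitl']
```

Raising an error rather than silently truncating is a deliberate choice here: with a hard cap, the caller should decide which terms matter most instead of losing the tail of the list unnoticed.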

Production Use Cases

Microsoft’s Azure documentation highlights several practical applications for this model in production:

Video subtitles: Creating subtitles for digital media and content platforms.
Accessibility support: Providing tools for those who require accurate captions.
Meeting notes: Generating transcripts for collaborative platforms like Microsoft Teams.
Customer service analysis: Analyzing audio from contact centers.
Content workflows: Speeding up the creation of draft transcripts for content creators.
Voice-activated agents: Converting speech to text as input for reasoning systems.

The model includes automatic language detection, which is helpful when the speaker’s language is not known beforehand.

Comparing MAI-Transcribe-1.5 and MAI-Transcribe-1

The table below outlines the differences between the two model generations based on official specifications.

Feature	MAI-Transcribe-1	MAI-Transcribe-1.5
Supported Languages	25	43
Keyword/Entity Biasing	Not available	Supports up to 200 keywords
Long-form Speed	Baseline	Up to 5.7x faster
Artificial Analysis WER	Not specified	2.4% (Ranked #3)
FLEURS Ranking	Previous state-of-the-art	Top-ranked across 43 languages
Automatic Language Detection	Not specified	Yes
Release Status	Initial release	Generally Available (GA)
Input / Output	Audio / Text	Audio / Text

Strengths and Limitations

Strengths:

Covers 43 languages in a single system, an increase from 25.
Keyword biasing improves accuracy, reducing WER by as much as 30% on FLEURS.
Can transcribe one hour of audio in less than 15 seconds.
Available now via Azure AI Foundry.
Designed for reliability in noisy, real-world environments.

Limitations:

Lacks diarization, meaning it cannot attribute speech to individual speakers.
Does not have a native streaming API, which limits real-time use.
Many performance, speed, and cost claims are sourced directly from Microsoft.
Ranks third on the Artificial Analysis leaderboard, following two other models.
