Microsoft AI recently unveiled MAI-Transcribe-1.5, the latest version of its internal speech-to-text technology. This update focuses on improving accuracy across 43 languages, various accents, and challenging acoustic settings. Microsoft is positioning this tool for enterprise-level transcription tasks.
Overview of MAI-Transcribe-1.5
MAI-Transcribe-1.5 is an automatic speech recognition (ASR) system: it converts spoken audio into written text. Unlike many competing models, it was built entirely in-house. A single unified model handles all 43 supported languages, and it is designed to hold up across dialects and varied real-world sound environments.
Microsoft plans to incorporate this model into several of its products, including Copilot, Teams, GitHub, and Dynamics 365 Contact Center. It is also accessible through Foundry, Microsoft’s platform for AI models.
Accuracy Metrics
Accuracy in this context is measured using word error rate (WER): the number of substituted, deleted, and inserted words in a transcript, divided by the number of words in the reference text. A lower WER indicates fewer errors. Microsoft claims the model achieves the best WER across 43 languages on the FLEURS benchmark, a widely recognized standard for testing multilingual transcription systems.
On the Artificial Analysis leaderboard, the model has a WER of 2.4%, which places it third. The picture is therefore mixed: Microsoft claims first place on FLEURS, while Artificial Analysis ranks the model third.
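To make the metric concrete, here is a minimal sketch of a WER calculation using word-level edit distance. It is an illustration only: the function name `word_error_rate` is hypothetical, and real benchmarks such as FLEURS typically apply text normalization (casing, punctuation) before scoring, which this sketch leaves out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from any benchmark): two misheard names
# out of six reference words.
reference = "please call aoife and xochitl today"
hypothesis = "please call oif and societal today"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 33.3%
```

Because the denominator is the reference length, a transcript with many inserted words can score above 100%.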
Another major improvement is the expansion of language support. The model now covers 43 languages, up from 25, and the expansion was made without reducing accuracy. Of the 18 new languages, ten are South Asian (like Bengali, Tamil, and Telugu) and eight are European (such as Ukrainian, Greek, and Catalan).
Processing Speed
MAI-Transcribe-1.5 currently leads in the balance of accuracy and speed on the Artificial Analysis leaderboard. It can process speech up to five times faster than other models with similar accuracy, and this advantage is most noticeable with lengthy recordings. For example, it can transcribe a full hour of audio in under 15 seconds.
Microsoft states the model is up to five times faster than Gemini 3.1, Scribe v2, and GPT-4o-Transcribe. Compared with the previous generation, MAI-Transcribe-1, it reportedly performs long-form transcription up to 5.7 times faster on Azure infrastructure. For teams processing large volumes of audio, a speedup of that size means noticeably shorter turnaround.
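The "one hour in under 15 seconds" figure can be restated as a speed factor relative to real time. The sketch below does that arithmetic and projects it onto a larger archive. The function names are hypothetical, and the projection assumes throughput stays constant and jobs run one after another, which the source does not state.

```python
def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many seconds of audio are transcribed per second of processing."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def estimated_processing_seconds(audio_hours: float, factor: float) -> float:
    """Sequential processing time for an archive, assuming a constant factor."""
    return audio_hours * 3600 / factor


# One hour of audio in 15 seconds (the article's upper bound).
factor = speed_factor(3600, 15)
print(f"{factor:.0f}x real time")  # 240x real time

# Hypothetical archive of 1,000 hours at the same rate.
seconds = estimated_processing_seconds(1000, factor)
print(f"{seconds / 60:.0f} minutes")  # 250 minutes
```

Since the source says "under 15 seconds", 240x is a lower bound on the claimed speed, and real-world throughput will also depend on file sizes, concurrency, and service limits.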
Keyword (Entity) Biasing: A Key Feature
Standard transcription tools frequently struggle with specialized words, such as names of people, products, medical terms, or company-specific jargon. However, these are often the most critical words for business users.
MAI-Transcribe-1.5 introduces keyword biasing, also known as entity biasing. Users can provide a list of specific words (up to 200 keywords are supported on Azure), and the model will prioritize those terms when generating its results. Importantly, it does not simply force those words into the output; it uses the context of the speech to decide when they are appropriate. Microsoft reports that this feature reduces WER by up to 30% on the FLEURS benchmark.
A quick example demonstrates the effectiveness of this approach. Without biasing, unique names might appear as “Sean,” “Oif,” or “Societal.” With a specific list of names provided, the system correctly identifies them as “Shaun,” “Aoife,” and “Xochitl.” This functionality is particularly valuable in meetings, medical settings, and customer service centers.
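Before a keyword list is sent to any service, it usually helps to clean it up and check it against the limit. The sketch below is a client-side helper only, not part of Microsoft's API: `prepare_keywords` and `MAX_KEYWORDS` are hypothetical names, and the 200-term cap is taken from the figure quoted above. How the list is actually passed to the model is not covered by the source and is not shown here.

```python
MAX_KEYWORDS = 200  # limit reported for Azure in the article; treat as an assumption


def prepare_keywords(terms, limit=MAX_KEYWORDS):
    """Trim whitespace, drop blanks and case-insensitive duplicates, enforce the cap."""
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())  # collapse stray whitespace
        if not normalized:
            continue
        key = normalized.casefold()  # compare without regard to case
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(normalized)  # keep the first spelling seen
    if len(cleaned) > limit:
        raise ValueError(f"{len(cleaned)} keywords exceeds the limit of {limit}")
    return cleaned


names = ["Shaun", "  Aoife ", "shaun", "", "Xochitl"]
print(prepare_keywords(names))  # ['Shaun', 'Aoife', 'Xochitl']
```

Raising an error rather than silently truncating is a deliberate choice in this sketch: dropping terms without notice could remove exactly the names the list was meant to protect.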
Production Use Cases
Microsoft's Azure documentation highlights several practical applications for this model in a production environment:
- Video subtitles: Creating subtitles for digital media and content platforms.
- Accessibility support: Providing tools for those who require accurate captions.
- Meeting notes: Generating transcripts for collaborative platforms like Microsoft Teams.
- Customer service analysis: Analyzing audio from contact centers.
- Content workflows: Speeding up draft transcripts for content creators.
- Voice-activated agents: Preparing speech data for reasoning systems.
The model includes automatic language detection, which is helpful when the speaker’s language is not known beforehand.
Comparing MAI-Transcribe-1.5 and MAI-Transcribe-1
The table below outlines the differences between the two model generations based on official specifications.
| Feature | MAI-Transcribe-1 | MAI-Transcribe-1.5 |
|---|---|---|
| Supported Languages | 25 | 43 |
| Keyword/Entity Biasing | Not available | Supports up to 200 keywords |
| Long-form Speed | Baseline | Up to 5.7x faster |
| Artificial Analysis WER | Not specified | 2.4% (Ranked #3) |
| FLEURS Ranking | Previous state-of-the-art | Top-ranked across 43 languages |
| Automatic Language Detection | Not specified | Yes |
| Release Status | Initial release | Generally Available (GA) |
| Input / Output | Audio / Text | Audio / Text |
Strengths and Limitations
Strengths:
- Covers 43 languages in a single system, an increase from 25.
- Keyword biasing improves accuracy, reducing WER by as much as 30% on FLEURS.
- Can transcribe one hour of audio in less than 15 seconds.
- Available now via Azure AI Foundry.
- Designed for reliability in noisy, real-world environments.
Limitations:
- Lacks diarization, meaning it cannot identify different speakers.
- Does not have a native streaming API, which limits real-time use.
- Many performance, speed, and cost claims are sourced directly from Microsoft.
- Ranks third on the Artificial Analysis leaderboard, behind two other models.



