The past year has seen rapid advancements in text-to-speech (TTS) technology. Synthetic voices now sound remarkably close to human speech. Latency in some real-time systems has dropped below 100 milliseconds, and emotional control has shifted from experimental demos to standard features. This guide focuses on the most impactful models in 2026, tailored for AI professionals making production decisions.
Understanding TTS benchmarks in 2026
Two benchmarks are central to most community discussions. The first is the Artificial Analysis Speech Arena Leaderboard, which ranks models using blind human preference via an ELO rating system. By 2026, it evaluates numerous production APIs. The second is the TTS Arena on Hugging Face, run by the community and using the same blind A/B voting approach.
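To make the rating mechanism concrete, here is a minimal sketch of how a single blind A/B vote could move two ELO scores. It uses the textbook Elo formula; the K-factor of 32 and the 400-point scale are conventional defaults assumed for illustration, not parameters published by either leaderboard, which may fit ratings differently.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the new (rating_a, rating_b) after one blind A/B vote.

    k is an assumed K-factor; real leaderboards may use another value or method.
    """
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two models start level, and a listener prefers model A.
a, b = update_elo(1000.0, 1000.0, a_won=True)
print(a, b)  # 1016.0 984.0
```

Because each vote shifts ratings by an amount that depends on how surprising the result was, a model that keeps winning against higher-rated rivals climbs quickly, which is one reason positions change from week to week.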
These leaderboards reflect perceived quality, not technical accuracy, and rankings evolve constantly. As of May 30, 2026, the Artificial Analysis Speech Arena listed Gemini 3.1 Flash TTS, Realtime TTS-2 (Research Preview), Sonic 3.5, Realtime TTS 1.5 Max, and Fun-Realtime-TTS-Preview as its top five by ELO. These positions had changed within the previous weeks and will continue to shift. View any current ranking as a snapshot, not a permanent verdict.
Accuracy must be measured separately. Trelis Research tested ten models using a round-trip character error rate (CER) method: generated audio is transcribed by an ASR model and compared to the original text. Mean Opinion Score (MOS) is used to gauge perceived naturalness. Both metrics have limitations. Round-trip CER relies on the ASR model’s own accuracy, and the UTMOS quality estimator was trained on audio clips up to ten seconds long, resulting in less score variation for longer samples.
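The round-trip CER calculation can be sketched in a few lines. This is an illustration under stated assumptions, not Trelis Research's code: the ASR step is left out, so `transcript` is simply whatever your speech recognizer returned, and the lowercase-and-collapse-whitespace normalization is an assumed choice that will affect the score.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def round_trip_cer(reference: str, transcript: str) -> float:
    """CER between the text sent to TTS and the ASR transcript of its audio."""
    ref, hyp = normalize(reference), normalize(transcript)
    if not ref:
        raise ValueError("reference text must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One wrong digit in a 15-character reference gives a CER of 1/15.
print(round(round_trip_cer("Call me at 5 pm", "call me at 9 pm"), 4))  # 0.0667
```

Note that an ASR mistake counts against the TTS model here exactly as a TTS mistake would, which is the limitation described above.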
Latency is a critical third dimension. For voice agents, the key metric is time-to-first-audio (TTFA). Time-to-first-byte (TTFB) can be misleading, as container headers contain no actual audio. Consistency is as important as median performance. A Gradium benchmark from May 2026 measured the interquartile range across providers. It is tail latency, not the average, that defines user experience at scale.
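A small sketch shows why the median alone can hide a poor experience. The function below summarizes a list of time-to-first-audio measurements with percentiles and the interquartile range. The sample values are invented for illustration and do not come from the Gradium benchmark or any provider.

```python
import statistics


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize time-to-first-audio samples given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p99": cuts[98],
        "iqr": cuts[74] - cuts[24],  # spread of the middle half of requests
    }


# Hypothetical measurements: mostly fast, with a slow tail.
samples = [80, 82, 85, 88, 90, 91, 95, 110, 240, 600]
print(latency_summary(samples))
```

In this made-up sample the median stays close to 90 ms while the upper percentiles are several times higher, and those slow requests are the ones users notice.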
In summary, no single benchmark tells the full story. Quality, accuracy, latency, language support, and cost all involve trade-offs. The best model depends on which factors are non-negotiable for your application.
Top commercial models
#1 Inworld TTS-1.5 and Realtime TTS-2
Inworld AI, a research lab founded by experts from Google and DeepMind, launched TTS-1.5 on January 21, 2026. The model is designed for real-time, consumer-grade applications. Inworld claims a 30% increase in expressive range compared to TTS-1 and a 40% improvement in stability, based on word error rate and output consistency.
TTS-1.5 is available in two variants. The Mini version is optimized for latency-sensitive uses like voice agents and gaming, while the Max version offers greater stability with low latency. Inworld reports P90 time-to-first-audio under 130 milliseconds for Mini and under 250 milliseconds for Max. The model supports 15 languages and provides both instant and professional voice cloning.
Pricing follows a tiered plan structure. For the On-Demand and Creator plans, Inworld charges $25 per million characters for TTS 1.5 Mini and $35 for Realtime TTS-2 and TTS 1.5 Max. The Developer and Growth plans offer lower rates, with Growth dropping to $15 for Mini and $25 for Max/TTS-2. Enterprise pricing can be as low as $5 and $10, respectively. Note that TTS-1.5 covers 15 languages, whereas TTS-2 supports over 100.
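To compare tiers against your own volume, the rates quoted above can be turned into a rough monthly estimate. This sketch encodes only the figures stated in this section; the plan and model keys are hypothetical labels chosen for the example, the Developer plan is omitted because no rate is given for it, and the enterprise numbers are the quoted floor rather than a guaranteed price.

```python
# USD per million characters, as quoted in this article.
PRICE_PER_MILLION_CHARS: dict[str, dict[str, float]] = {
    "on_demand_or_creator": {"mini": 25.0, "max_or_tts2": 35.0},
    "growth": {"mini": 15.0, "max_or_tts2": 25.0},
    "enterprise_floor": {"mini": 5.0, "max_or_tts2": 10.0},  # "as low as"
}


def monthly_cost_usd(characters: int, plan: str, model: str) -> float:
    """Estimate monthly spend for a given character volume."""
    rate = PRICE_PER_MILLION_CHARS[plan][model]
    return characters / 1_000_000 * rate


# 40 million characters per month on the Growth plan with Mini.
print(monthly_cost_usd(40_000_000, "growth", "mini"))  # 600.0
```

Plans may also carry subscription fees or minimums that this estimate ignores, so confirm the totals against Inworld's current pricing page.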
Inworld later introduced Realtime TTS-2 in 2026, described as a closed-loop voice model with enhanced control and expressiveness. Inworld has reported holding three of the top five positions on the Artificial Analysis Speech Arena across multiple snapshots.
Inworld is ideal for developers building voice agents at consumer scale, with its main appeal being the combination of low latency and competitive pricing.
#2 Google Gemini 3.1 Flash TTS
Google DeepMind released Gemini 3.1 Flash TTS on April 15, 2026. It is a preview model accessible via the Gemini API, Google AI Studio, Vertex AI, and Google Vids. The model features over 200 audio tags that control style, tone, pacing, accent, and scene direction.
According to Google, the model achieved an ELO of 1,211 on the Artificial Analysis leaderboard. It supports more than 70 languages and native multi-speaker dialogue. Built on the Gemini platform rather than a standalone speech system, the model treats speech generation as a language task, deciding both content and delivery.
The model has notable deployment constraints. A TTS session has a 32,000-token context window, and Google’s documentation states Gemini TTS does not support streaming. It is designed for controlled text narration, not interactive voice agents; for real-time use, Google recommends its separate Live API. Output quality may vary for generations longer than a few minutes, so Google advises splitting content into chunks. The model includes 30 prebuilt voices, and all generated audio is watermarked with SynthID for AI-content identification.
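The chunking advice can be implemented with a simple sentence-aware splitter. The sketch below is one possible approach, not Google's recommended method: it budgets by characters as a stand-in, because the real limit is expressed in tokens, and the default of 2,000 characters is an arbitrary assumption to tune for your content.

```python
import re


def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Split text into chunks that end on sentence boundaries.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_text("One. Two. Three.", max_chars=9))  # ['One. Two.', 'Three.']
```

Each chunk would then be sent as a separate generation request and the resulting audio concatenated. Splitting on sentence boundaries keeps the joins from landing mid-phrase.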
Gemini 3.1 Flash TTS is well-suited for podcast and audiobook production requiring precise control. It is a natural choice for teams already using Google Cloud.
#3 ElevenLabs v3
ElevenLabs released Eleven v3 in alpha on June 5, 2025, with general availability following in early 2026, as announced by the company. ElevenLabs considers it their most expressive model. It introduced inline audio tags in lowercase square brackets, such as [whispers], [laughs], [sighs], and scene cues like [interrupting]. The model supports over 70 languages.
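When scripts carry inline tags like these, it helps to separate the tags from the words that will actually be spoken, for example to review which cues a script uses or to recover the plain text for an accuracy check. The helper below is a hypothetical utility, not part of any ElevenLabs SDK, and its pattern assumes tags contain only lowercase letters and spaces.

```python
import re

# Assumed tag shape: lowercase words inside square brackets, e.g. [whispers].
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def extract_tags(script: str) -> list[str]:
    """Return the inline audio tags in the order they appear."""
    return TAG_PATTERN.findall(script)


def strip_tags(script: str) -> str:
    """Return only the spoken text, with tags removed and spacing tidied."""
    return " ".join(TAG_PATTERN.sub(" ", script).split())


script = "[whispers] I think someone is here. [laughs] Just kidding."
print(extract_tags(script))  # ['whispers', 'laughs']
print(strip_tags(script))    # I think someone is here. Just kidding.
```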
The general release improved upon the alpha. ElevenLabs reports users preferred the new version about 72% of the time, with enhancements in handling numbers, symbols, and specialized notation.
A standout feature is Text to Dialogue, which combines multiple voices into a single generation. The model aligns prosody and emotional range across speakers and can manage interruptions and mood changes with minimal prompting.
Eleven v3 still requires more prompt engineering than earlier models and is not designed for real-time use. ElevenLabs notes that the larger model and higher-fidelity codec result in longer processing times. For real-time and conversational applications, they recommend Flash v2.5, which streams with low latency—around 75 milliseconds according to vendor data.
ElevenLabs v3 is a strong fit for narrative content, audiobooks, and character-driven work where quality takes priority over speed. It remains a popular starting point for high-quality voice generation.
#4 MiniMax Speech 2.6 HD and later
MiniMax has developed a competitive line of speech models that have received less attention in English-speaking markets. Speech 2.6 HD offers powerful expressive capabilities and supports over 40 languages. It consistently ranks near the top across multiple evaluation rounds. According to one evaluation in January 2026, Speech 2.6 HD earned one of the highest rankings on the Artificial Analysis charts.
The Turbo version is built for agent applications, with a response time that stays below 250 milliseconds. MiniMax’s main appeal is its balance of price and performance: it delivers emotional nuance that rivals costlier top-tier models. Later high-definition versions, like Speech 2.8 HD, appear in 2026 evaluations at higher price points.
MiniMax suits multilingual projects that need rich delivery without paying top-tier costs.
#5 Hume Octave 2
Hume AI follows a different design direction. Octave 2 functions as a speech-language model that interprets meaning before generating audio. It produces speech shaped by emotional context rather than rigid word-by-word rules. The model adjusts its delivery automatically as the text moves from relaxed to urgent, without needing explicit markers or instructions in the script.
There are, however, some limitations. Its supported language set is smaller than what leading multilingual models offer. Integrating custom voice cloning via a paid tier typically requires a direct sales conversation. Pricing sources differ significantly, ranging from below $10 to above $100 per million characters depending on the usage tier. Verify current pricing with Hume directly before including it in your budget planning.
Octave 2 is suitable for scenarios where vocal tone plays a central role. Use cases include interactive companion agents, mental wellness applications, and customer-facing systems where a flat delivery style would diminish the experience.
#6 Cartesia Sonic 3 and Sonic 3.5
Cartesia focuses heavily on speed. Sonic uses a State Space Model (SSM) design instead of the more common transformer architecture. SSM compute grows linearly with sequence length, whereas standard transformer self-attention grows quadratically, which keeps response times low even under heavy usage. Cartesia states that the model’s internal processing takes under 100 milliseconds, with a total time-to-first-audio close to 82 milliseconds on Sonic 3.5.
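The linear scaling of a state space model comes from carrying a fixed-size state forward one step at a time, rather than comparing every position with every other position as self-attention does. The toy scalar recurrence below illustrates that idea only. It is not Cartesia's architecture, and the coefficients `a`, `b`, and `c` are arbitrary illustrative values.

```python
def ssm_scan(inputs: list[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> list[float]:
    """Run a one-dimensional linear state space recurrence.

    state_t  = a * state_{t-1} + b * input_t
    output_t = c * state_t
    One pass over the sequence: work grows linearly with its length.
    """
    state = 0.0
    outputs: list[float] = []
    for u in inputs:
        state = a * state + b * u
        outputs.append(c * state)
    return outputs


# An impulse followed by silence decays by the factor a at each step.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25]
```

Each new output needs only the previous state and the current input, so the cost per step stays constant no matter how long the sequence already is.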
Sonic 3 was released toward the end of 2025. Sonic 3.5 came out in May 2026 and is now the recommended stable version. Both cover 42 languages, including nine languages from India, and offer more than 500 voice options. Cartesia briefly held the top position on the Artificial Analysis chart with Sonic 3.5 before other models surpassed it. The line delivers enhanced speech melody, a wider emotional range, real-time natural laughter, and voice cloning from brief audio samples.
Sonic 3 suits real-time dialogue systems where quick response is the critical requirement. Because it handles voice generation only, teams must pair it with separate speech recognition and language model components.
#7 Speechify SIMBA 3.0
Speechify markets SIMBA 3.0 as an affordable yet high-quality option. The company reported a seventh-place ranking on the Artificial Analysis chart in May 2026. The reported ELO score was around 1,159, with a list price near $10 per million characters. This made it among the most affordable models in the reported top ten.
Because these figures come from Speechify’s own announcement, confirm them through independent sources before committing to the model. SIMBA 3.0 is a good fit for teams seeking competitive benchmark performance without the price of premium-tier systems.
#8 OpenAI gpt-4o-mini-tts and the Realtime line
OpenAI announced gpt-4o-mini-tts in March 2025. It is based on the GPT-4o-mini architecture. Its key feature is the ability to be guided by plain-language instructions. Developers can direct the model on how to deliver the words, not just what the words are. For example, an instruction could be “speak in a calm, empathetic way.” OpenAI also launched a testing tool for it at OpenAI.fm.
OpenAI released an updated version, gpt-4o-mini-tts-2025-12-15, in December 2025. OpenAI reports roughly 35 percent fewer word-level errors on the Common Voice and FLEURS tests. The update also enhanced Custom Voices, a feature that lets companies create a branded voice from a sample recording. The service offers 13 built-in voices and supports over 50 languages. OpenAI charges $0.60 per million text input tokens and $12 per million audio output tokens, which comes to roughly $0.015 per minute of generated audio. OpenAI describes it as their most advanced and dependable voice-generation model; the earlier tts-1 and tts-1-hd models remain available.
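The per-minute figure follows from the token prices, and working backward shows what it implies. In the sketch below, the rate of about 1,250 audio tokens per minute is derived from the two quoted prices. It is an inference from those numbers, not a figure published by OpenAI, and actual usage may differ.

```python
# Prices quoted in this article (USD).
TEXT_INPUT_USD_PER_M_TOKENS = 0.60
AUDIO_OUTPUT_USD_PER_M_TOKENS = 12.00
QUOTED_USD_PER_AUDIO_MINUTE = 0.015


def implied_audio_tokens_per_minute() -> float:
    """Audio tokens per minute implied by the quoted per-minute price."""
    return QUOTED_USD_PER_AUDIO_MINUTE / (AUDIO_OUTPUT_USD_PER_M_TOKENS / 1_000_000)


def estimate_cost_usd(audio_minutes: float, text_input_tokens: int) -> float:
    """Rough cost of a job: generated audio plus the text tokens sent in."""
    audio_tokens = audio_minutes * implied_audio_tokens_per_minute()
    audio_cost = audio_tokens * AUDIO_OUTPUT_USD_PER_M_TOKENS / 1_000_000
    text_cost = text_input_tokens * TEXT_INPUT_USD_PER_M_TOKENS / 1_000_000
    return audio_cost + text_cost


print(round(implied_audio_tokens_per_minute()))  # 1250
print(round(estimate_cost_usd(60, 10_000), 3))   # 0.906
```

An hour of audio costs well under a dollar at these rates, and the text input is a small fraction of the total.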
For interactive agents, OpenAI’s Realtime series has progressed further. The Realtime API became widely available in August 2025. In May 2026, OpenAI launched GPT-Realtime-2, their first voice model with GPT-5-level reasoning. It can manage tool calls, interruptions, and corrections during live, speech-to-speech conversations. OpenAI also introduced GPT-Realtime-Translate and GPT-Realtime-Whisper for real-time translation and transcription.
gpt-4o-mini-tts suits teams already using OpenAI’s platform who need a low-cost, instruction-guided voice model. The Realtime models are better for full speech-to-speech agent setups.
Open-weight models
As of late May 2026, the highest tier of the Artificial Analysis chart was still dominated by proprietary, closed-source models. Open-weight models still hold significant value. They enable self-hosting, deep customization, on-device processing, and full data control. They can eliminate per-character API fees, replacing them with your own computing costs. However, usage licenses differ. Some are freely permissive, while others are restricted to research purposes and require a separate paid license for commercial projects. Always review the license terms before building on any of these models.
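Whether self-hosting actually saves money depends on volume. The break-even point can be estimated by dividing a fixed monthly hosting cost by the API's per-character rate. In the sketch below, the $500 monthly server cost is a purely hypothetical figure, and the calculation ignores engineering time, scaling, and quality differences between models.

```python
def breakeven_chars_per_month(api_usd_per_million_chars: float,
                              hosting_usd_per_month: float) -> float:
    """Monthly character volume at which self-hosting matches the API bill."""
    if api_usd_per_million_chars <= 0:
        raise ValueError("API rate must be positive")
    return hosting_usd_per_month / api_usd_per_million_chars * 1_000_000


# Hypothetical: a $500/month server versus an API at $25 per million characters.
print(breakeven_chars_per_month(25.0, 500.0))  # 20000000.0
```

Below that volume the API is cheaper in this simplified model; above it, the fixed hosting cost starts to pay off.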
#01 Kokoro 82M
Kokoro stands as one of the most resource-efficient open-weight models available. It no longer leads the open-weight rankings; on the current Artificial Analysis chart, its ELO sits around 1,058, trailing Fish Audio S2 Pro, Step Audio EditX, and Voxtral TTS. It is small, with just 82 million parameters. Its design builds on StyleTTS2 and ISTFTNet and skips diffusion and encoder stages, which makes generation faster.
In the Trelis “Tricky TTS” evaluation, Kokoro achieved a 4.5 Mean Opinion Score and a 17% Character Error Rate. This was the highest quality score among the models tested in that round. It runs efficiently on modest hardware, including standard processors. Hosted service rates come in under $1 per million input characters, around $0.65 according to one current provider. The model weights were first released in late December 2024, with a v1.0 update following in 2025. It covers about 15 languages and is distributed under the Apache 2.0 license.
Kokoro is ideal for budget-conscious or edge-device deployments where compact size and fast speed are priorities. Features like emotion markup and cross-lingual voice transfer remain experimental and work best with English.
#02 Fish Audio S2 Pro
Fish Audio S2 Pro holds the highest rank among open-weight models on the current Artificial Analysis chart, with an ELO around 1,123. Fish Audio reports training on more than 10 million hours of audio spanning over 80 languages. The model contains 5 billion parameters. Its design uses a Dual-Autoregressive method with a Residual Vector Quantization audio encoder. It accepts open-domain emotional tags, natively supports multiple speakers in a single output, and maintains latency under 150 milliseconds.
There is a crucial licensing detail. S2 Pro is distributed under the Fish Audio Research License, which is not a fully permissive open-source license. Use for research and non-commercial projects is permitted; commercial deployment requires a Fish Audio commercial license.
S2 Pro is ideal for teams seeking an open-weight solution with professional-grade capabilities, as long as licensing is arranged before implementation.
#03 IndexTTS-2
Developed by IndexTeam, IndexTTS-2 represents a significant step forward in zero-shot text-to-speech technology. Its most notable capability lies in duration management — an essential asset for video dubbing tasks where speech must align with predetermined timeframes. Additionally, the system distinguishes between vocal characteristics and emotional expression, allowing separate manipulation of speaker identity and affective delivery.
The underlying design integrates GPT-based latent representations alongside a progressive training methodology across multiple phases. A flexible instruction framework, developed through Qwen3 fine-tuning, directs emotional nuance via descriptive text inputs. Evaluation results from the research group indicate improvements over existing zero-shot approaches in speech accuracy, voice reproduction, and emotional alignment across multiple test datasets.
IndexTTS-2 is best suited for high-stakes dubbing work and nuanced speech generation where precise timing and fine-grained control take priority. Its two operating modes may add some setup complexity.
#04 CosyVoice 2
Derived from the FunAudioLLM initiative, CosyVoice2-0.5B operates with 0.5 billion parameters. It prioritizes minimal-delay streaming voice generation and accommodates zero-shot voice replication. The compact architecture enables efficient integration into streaming systems operating under hardware constraints.
CosyVoice 2 is well matched to interactive scenarios that need immediate responsiveness without proprietary dependencies.
#05 VibeVoice
Microsoft’s VibeVoice targets long-form speech generation. The 1.5-billion-parameter model supports a context window of up to 64,000 tokens, which allows it to generate audio approaching 90 minutes in length. This capability benefits podcasts and extended narrative content.
There are limits to note. The model is trained only on English and Chinese, and its multi-speaker output is sequential rather than simultaneous. It is best suited to long-form content generation in those two languages.
Additional noteworthy systems
The landscape extends beyond the models ranked above. Several other systems have gained recognition across evaluation platforms and deserve a look when you assess candidates. xAI introduced its own speech synthesis platform in 2026. StepAudio 2.5 TTS positions itself among high-end commercial offerings. Mistral unveiled Voxtral TTS, a 4-billion-parameter model priced at roughly $0.016 per thousand characters. Step Audio EditX and Magpie-Multilingual have emerged as strong open-weight alternatives. Alibaba’s Qwen3-TTS, along with Maya1, adds further versatile and cross-lingual options. Each has distinct advantages suited to particular requirements.
Selecting systems by deployment scenario
The market offers many strong options rather than a single best choice. Define the application first, then match model capabilities to it.
Voice assistants requiring immediate response: Response time is paramount because end-user patience is limited. Cartesia Sonic 3.5 reports the fastest performance, with its SSM design delivering a time-to-first-audio of approximately 82 milliseconds. Inworld’s real-time offerings balance speed with affordability. Deepgram Aura-2 is another responsive option, producing output within 90 milliseconds. ElevenLabs Flash v2.5 is the model ElevenLabs recommends for real-time and conversational use. When complete speech-to-speech capability is essential, OpenAI’s GPT-Realtime-2 warrants attention.
Audiobooks and extended narration: Fidelity matters more than latency. ElevenLabs v3 ranks among the strongest options for realistic delivery. Gemini 3.1 Flash TTS provides fine-grained control through audio tags, with long material split into chunks as Google advises. Among open-weight alternatives, VibeVoice handles long-form generation in English and Chinese.
Multi-language production: Language coverage and consistent output quality are critical. Gemini 3.1 Flash TTS and ElevenLabs v3 each support more than 70 languages. MiniMax Speech covers over 40 languages at economical pricing. Fish Audio S2 Pro leads among open-weight models with coverage of more than 80 languages, provided you obtain the appropriate commercial license.
Narrative and multi-character scenarios: Expressive characters and multi-speaker handling are the foundation. ElevenLabs v3 Text to Dialogue manages interruptions and mood changes across speakers. Gemini 3.1 Flash TTS adds scene direction and native multi-speaker dialogue. Inworld targets interactive entertainment such as games.
Affective communication: Hume Octave 2 interprets meaning and adapts its vocal delivery accordingly, which suits companion agents and other applications where tone shapes the experience.
Edge deployment and budget optimization: Self-hosting eliminates per-character API fees. Kokoro runs on standard processors with modest resources. CosyVoice 2 supports low-latency streaming. Each trades some capabilities for greater control.
Synchronized video voiceover: IndexTTS-2 provides duration control to align speech with video timing, which sets it apart from general-purpose models.
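The scenario-first approach above can be captured as a simple lookup that returns a shortlist to benchmark. The sketch below only restates the pairings made in this section. The scenario keys are hypothetical labels invented for the example, and the list order carries no ranking.

```python
# Shortlists taken from the scenarios described in this section.
SHORTLIST: dict[str, list[str]] = {
    "realtime_agent": [
        "Cartesia Sonic 3.5",
        "Inworld TTS-1.5 / Realtime TTS-2",
        "Deepgram Aura-2",
        "ElevenLabs Flash v2.5",
        "OpenAI GPT-Realtime-2",
    ],
    "long_form_narration": ["ElevenLabs v3", "Gemini 3.1 Flash TTS", "VibeVoice"],
    "multilingual": [
        "Gemini 3.1 Flash TTS",
        "ElevenLabs v3",
        "MiniMax Speech",
        "Fish Audio S2 Pro",
    ],
    "multi_character": ["ElevenLabs v3 Text to Dialogue", "Gemini 3.1 Flash TTS", "Inworld"],
    "emotional": ["Hume Octave 2"],
    "edge_or_budget": ["Kokoro", "CosyVoice 2"],
    "video_dubbing": ["IndexTTS-2"],
}


def shortlist(scenario: str) -> list[str]:
    """Return candidate models to test for a deployment scenario."""
    if scenario not in SHORTLIST:
        known = ", ".join(sorted(SHORTLIST))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")
    return list(SHORTLIST[scenario])


print(shortlist("video_dubbing"))  # ['IndexTTS-2']
```

A shortlist like this is only a starting point: the candidates still need to be tested on your own content before you commit.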
Key Takeaways
- There is no single best model — your decision should hinge on the most critical factor for you: speed, quality, breadth of language support, or cost.
- The current top performers on leaderboards include: Gemini 3.1 Flash TTS, Inworld Realtime TTS-2, Cartesia Sonic 3.5, and ElevenLabs v3.
- Scores and rankings fluctuate every week, so treat any ELO snapshot as time-sensitive rather than permanent.
- Cartesia Sonic 3.5 leads real-time latency with a reported time-to-first-audio of approximately 82 ms, with Deepgram Aura-2 close behind.
- ElevenLabs v3 reached general availability in early 2026 and stands out for expressive, multi-speaker storytelling and narration.
- Gemini 3.1 Flash TTS does not support streaming and is capped at 32k tokens per session; it is meant for controlled narration, not real-time conversational use.
- Fish Audio S2 Pro ranks as the top open-weight model but carries a research-only license; commercial use requires acquiring a paid license.
- Kokoro remains the most resource-efficient open option but is no longer the top-ranked among open-weight models.
- Inworld pricing follows a tiered model: $25/$35 for on-demand use, scaling down to $5/$10 at enterprise volume discounts.
- Public benchmarks can help you narrow the options, but the final decision should be based on your own tests with your specific content.



