Supertone Unveils Supertonic v3: On-Device Text-to-Speech in 31 Languages with Enhanced Expression Tags and Fewer Reading Errors

Supertone has unveiled Supertonic 3, the latest version of its on-device, ONNX-powered text-to-speech engine. This release brings support for 31 languages, improved pronunciation accuracy, fewer repeated or skipped words, and publicly available ONNX assets that remain compatible with v2. The result is fast, multilingual TTS that runs entirely on-device.

What’s New from v2 to v3

Relative to Supertonic 2, the new version significantly cuts repeat and skip errors, improves speaker consistency across shared languages, and broadens language support from 5 to 31. Version 2 covered English, Korean, Spanish, Portuguese, and French. Version 3 keeps those and adds languages including Japanese, Arabic, Bulgarian, Czech, Danish, German, Greek, Estonian, Finnish, Croatian, Hungarian, Indonesian, Italian, Lithuanian, Latvian, Dutch, Polish, Romanian, Russian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, and Vietnamese, for 31 ISO language codes in all. A special "na" fallback option is also available for text whose language is unidentified or falls outside the supported set.

The model has grown slightly to handle the expanded language set. With roughly 99 million parameters across its public ONNX assets, Supertonic 3 remains far more compact than open TTS systems in the 0.7B to 2B parameter range. This smaller footprint offers real-world benefits in terms of download size, initialization speed, and on-device inference performance. The update also brings the total disk space required for the public ONNX assets to 404 MB. On top of that, Supertone recently introduced the Voice Builder, a tool that lets developers build custom, edge-ready TTS models using their own recorded voice samples.
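As a rough sanity check on those figures, the sketch below estimates on-disk weight size from parameter count. It assumes weights are stored as 32-bit floats (4 bytes per parameter), which the article does not state; the helper name `weights_size_mb` is hypothetical. Under that assumption, 99 million parameters come to about 396 MB, close to the reported 404 MB, while models in the 0.7B to 2B range would need several gigabytes at the same precision.

```python
def weights_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Estimate raw weight storage in megabytes (1 MB = 1,000,000 bytes).

    bytes_per_param=4 assumes float32 storage, which is an assumption here,
    not something the article confirms.
    """
    return num_params * bytes_per_param / 1_000_000


# Parameter counts taken from the article's rough figures.
models = {
    "Supertonic 3 (~99M)": 99_000_000,
    "Open TTS, low end (0.7B)": 700_000_000,
    "Open TTS, high end (2B)": 2_000_000_000,
}

for name, params in models.items():
    print(f"{name}: ~{weights_size_mb(params):,.0f} MB at float32")
```

The small gap between the estimate and the reported 404 MB would be consistent with graph metadata and auxiliary files, though that is a guess rather than a documented breakdown.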

A notable addition in v3, absent from v2, is support for expressive tags. Supertonic 3 recognizes a small set of simple inline expression markers that let you weave prosodic cues straight into your input text, with no extra preprocessing step or separate expressiveness model needed. For developers working on voice interfaces or accessibility applications, this means cues such as breathing pauses or laughter can be embedded directly within the text payload.

Architecture and Runtime

The core architecture remains consistent with earlier versions: a speech autoencoder that converts waveforms into continuous latent representations, a flow-matching-based text-to-latent module that maps text to audio features, and a duration predictor that governs natural speech timing. Flow matching is a generative modeling approach that learns a vector field to transform a simple distribution into a target one — it samples more quickly than diffusion models at low step counts, which is what enables Supertonic to generate usable output in as few as 2 inference steps. To further enhance output quality, v3 incorporates Length-Aware Rotary Position Embedding (LARoPE) for improved text-speech alignment and applies a Self-Purifying Flow Matching technique during training to maintain resilience against noisy data labels.
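To make the low-step-count point concrete, here is a toy sketch of flow-matching sampling with a fixed-step Euler integrator. It is not Supertonic's model: the learned, text-conditioned network is replaced by a hand-written velocity field (`make_toy_velocity`, a hypothetical helper) that points along a straight line toward a single target vector. Because that path is perfectly straight, two Euler steps land exactly on the target, which illustrates why near-straight learned flows can tolerate very few inference steps.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a trained network: straight-line field toward one target."""

    def velocity(x: Vector, t: float) -> Vector:
        # Remaining displacement divided by remaining time.
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


start = [1.0, -2.0]    # plays the role of a noise sample
target = [3.0, 4.0]    # plays the role of the desired latent
result = euler_sample(make_toy_velocity(target), start, num_steps=2)
print(result)
```

A real text-to-latent module learns the velocity field from data and conditions it on text, so its paths are only approximately straight and output quality generally still depends on the step count.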

In terms of runtime performance, Supertonic 3 runs efficiently on CPU — even outpacing larger baseline models benchmarked on A100 GPUs — while consuming significantly less memory. It has no GPU dependency, making local, browser-based, and edge deployments far more straightforward.

Reading Accuracy

Across the languages tested, Supertonic 3 maintains a competitive WER/CER range compared to much larger open TTS models like VoxCPM2, all while keeping a lightweight on-device deployment profile. WER (Word Error Rate) and CER (Character Error Rate) are standard TTS intelligibility metrics: a passage is synthesized, ASR is run on the output, and the transcription is compared against the original text. CER is applied for languages without clear word boundaries, while the rest use WER.

The system's efficiency also shows on extreme edge hardware: it achieves an average real-time factor (RTF) of 0.3x on an Onyx Boox Go 6 (an E-ink e-reader) running in airplane mode. Additionally, the ecosystem has grown to include Flutter (with macOS support), .NET 9, and Go, while the web implementation uses onnxruntime-web for fully client-side execution.
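The WER and CER metrics mentioned in this section can be sketched as an edit distance between the original text and the ASR transcript, divided by the reference length. The version below is a minimal illustration, not the evaluation script used for Supertonic 3: it lowercases and splits on whitespace without any of the punctuation or numeral normalization a real benchmark would apply. The `real_time_factor` helper assumes the common definition of RTF as synthesis time divided by audio duration, which the article does not spell out.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring spaces; suited to unsegmented scripts."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means audio is produced faster than it plays back."""
    return synthesis_seconds / audio_seconds


# One dropped word out of six reference words.
print(wer("the cat sat on the mat", "the cat sat on mat"))
# One wrong character out of five.
print(cer("こんにちは", "こんにちわ"))
# 3 seconds of compute for 10 seconds of speech.
print(real_time_factor(3.0, 10.0))
```

A skipped word shows up as a deletion and a repeated word as an insertion, so the repeat and skip errors discussed earlier both raise these scores.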

Text Normalization

A standout feature carried forward from v2 is built-in text normalization. Supertonic handles complex surface forms without requiring any preprocessing pipeline or phonetic annotations: financial expressions like $5.2M, phone numbers with area codes and extensions like (212) 555-0142 ext. 402, time and date formats like 4:45 PM on Wed, Apr 3, 2024, and technical units like 2.3h and 30kph. The competing systems evaluated include ElevenLabs Flash v2.5, OpenAI TTS-1, Gemini 2.5 Flash TTS, and Microsoft. The financial expression "$5.2M" should be read as "five point two million dollars," and "$450K" as "four hundred fifty thousand dollars"; all four competing systems failed on this front. The technical unit "2.3h" should be read as "two point three hours" and "30kph" as "thirty kilometers per hour"; all four competitors failed this category as well.

Category	Input Example	Supertonic 3	ElevenLabs / OpenAI / Gemini / Microsoft
Financial Expression	$5.2M / $450K	✓	✗ All four failed
Time & Date	4:45 PM, Wed Apr 3	✓	✗ All four failed
Phone Number	(212) 555-0142 ext. 402	✓	✗ All four failed
Technical Unit	2.3h at 30kph	✓	✗ All four failed
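For readers without built-in normalization, the expected readings above can be approximated with a small rule-based pass. The sketch below is purely illustrative and says nothing about how Supertonic implements normalization internally; the function names and regular expressions are hypothetical, and it only covers the financial and unit examples from the table (whole parts up to 999, the K/M/B suffixes, and the `h` and `kph` units).

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = {"K": "thousand", "M": "million", "B": "billion"}
UNITS = {"h": "hours", "kph": "kilometers per hour"}


def int_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + int_to_words(rest) if rest else "")


def number_to_words(text: str) -> str:
    """Spell out '5.2' as 'five point two'; decimals are read digit by digit."""
    whole, _, frac = text.partition(".")
    words = int_to_words(int(whole))
    if frac:
        words += " point " + " ".join(ONES[int(d)] for d in frac)
    return words


MONEY = re.compile(r"\$(\d+(?:\.\d+)?)([KMB])\b")
UNIT = re.compile(r"\b(\d+(?:\.\d+)?)(kph|h)\b")


def normalize(text: str) -> str:
    """Expand the money and unit patterns covered by this sketch."""
    text = MONEY.sub(
        lambda m: f"{number_to_words(m.group(1))} {SCALES[m.group(2)]} dollars",
        text,
    )
    return UNIT.sub(
        lambda m: f"{number_to_words(m.group(1))} {UNITS[m.group(2)]}",
        text,
    )


for sample in ["$5.2M", "$450K", "2.3h at 30kph"]:
    print(sample, "->", normalize(sample))
```

Even this narrow version leaves gaps (singular forms such as "1h", other currencies, dates, and phone numbers), which gives a sense of why handling these cases inside the TTS engine is a meaningful convenience.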
