Voice AI has a dirty secret: most of it was never built for real conversation. The standard approach — take text, produce audio — comes straight from audiobook narration and voiceover work, where the model never listens to the person on the other end. That works well enough for a podcast intro. It falls apart when an upset user is trying to get help from an AI agent at 11 p.m.
Inworld AI is addressing that gap head-on with the release of Realtime TTS-2, a new voice model available as a research preview through its Inworld API and Inworld Realtime API. The model listens to the full audio of the conversation, picks up on the user’s tone, rhythm, and emotional state, and then accepts voice direction in plain English — the same way developers prompt an LLM.
What’s Actually Different Here
The core architectural difference with TTS-2 is that it works as a closed-loop system. The model takes the actual audio from previous turns of the conversation as input, not just a transcript — it hears how the user actually sounded. That’s a meaningful distinction. A transcript of “okay, fine” gives you the words. The audio of “okay, fine” tells you whether the person is relieved, resigned, or being sarcastic. TTS-2 is built to leverage that signal.
The same phrase lands differently after a joke than after bad news, and the model recognizes the difference because it heard the previous exchange. Tone, pacing, and emotional state carry forward on their own. In practice, audio context flows across turns within a Realtime session without developers having to pass explicit prior_audio fields or build extra plumbing.
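To make the developer-facing implication concrete, here is a minimal sketch of a two-turn exchange, assuming a hypothetical Python client wrapper (RealtimeSession, send_user_audio, and speak are illustrative names, not Inworld’s actual SDK surface). The point is the shape of the loop: context accumulates inside the session, so nothing audio-related has to be threaded between calls.

```python
# Illustrative sketch only: the class and method names below (RealtimeSession,
# send_user_audio, speak) are hypothetical stand-ins, not Inworld's actual SDK.
# What matters is the shape of the loop: conversational context accumulates
# inside the session, so later speak() calls need no explicit prior_audio field.

from my_realtime_client import RealtimeSession  # hypothetical client wrapper

with RealtimeSession(api_key="YOUR_API_KEY", voice="ashley") as session:
    # Turn 1: the user's raw audio goes into the session, not just a transcript.
    session.send_user_audio(open("user_turn_1.wav", "rb").read())
    session.speak("I'm sorry to hear that. Let's get it sorted out.")

    # Turn 2: no prior-audio plumbing here; the session already carries the
    # audio context from turn 1, so delivery adapts to how the user sounded.
    session.send_user_audio(open("user_turn_2.wav", "rb").read())
    session.speak("Okay, the refund has been reissued. Anything else I can help with?")
```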
Four Capabilities, One Model
The Inworld team is shipping TTS-2 with four key features, positioning the combination — rather than any single piece — as the differentiator.
- Voice Direction: It lets developers guide delivery using natural-language prompts inline at inference time. Rather than picking from a fixed emotion list like [sad] or [excited], developers insert a bracket tag like [speak sadly, as if something bad just happened] directly in the text. Long, descriptive prompts outperform short labels; the model responds much better to full context than to single-word tags. Inline non-verbal markers like [laugh], [sigh], [breathe], [clear_throat], and [cough] can be placed anywhere in the text where the moment should happen, and the model renders them as audio events rather than spoken words (see the sketch after this list).
- Conversational Awareness: This is the closed-loop architecture described above, the architectural shift that sets TTS-2 apart from earlier-generation models that treat each sentence as a stateless generation call.
- Crosslingual support: A single voice identity is maintained across more than 100 languages, including mid-utterance language switches within one generation. No language flag is required; the model manages transitions on its own, keeping timbre, pitch, and character consistent across the switch. The top-tier languages deliver native-speaker quality, while the long-tail languages are described as experimental during the launch window, consistent with the model being released as a research preview.
- Advanced Voice Design: It generates a saved voice from a written prompt with no reference audio needed. Developers can describe a person in prose, save the result as a reusable voice, and call it like any other voice in the application. Voice Design comes with three stability modes: Expressive (for live consumer conversation and companions), Balanced (the default for most agent workloads), and Stable (for IVR and professional deployments where pitch drift is unacceptable).
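Below is a rough sketch of how voice direction, crosslingual switching, and a stability mode might come together in a single request. The synthesis endpoint, authorization scheme, and payload field names are assumptions for illustration; only the bracket-tag syntax, the flag-free language switching, and the three stability mode names come from the announcement.

```python
# Illustrative sketch of inline voice direction. The endpoint URL and JSON
# fields below are assumptions, not Inworld's documented request schema; the
# bracket tags and stability modes are the pieces described in the release.

import requests

API_KEY = "YOUR_API_KEY"

text = (
    "[speak gently, as if calming someone who just got bad news] "
    "I understand, and I'm really sorry this happened. [sigh] "
    "Let's fix it together. "
    # Mid-utterance switch to French: no language flag is passed; the model is
    # said to keep the same voice identity across the transition.
    "Je peux aussi continuer en français si c'est plus simple pour vous."
)

resp = requests.post(
    "https://api.inworld.ai/tts/v1/voice",        # hypothetical endpoint
    headers={"Authorization": f"Basic {API_KEY}"},  # assumed auth scheme
    json={
        "text": text,
        "voiceId": "ashley",                      # placeholder voice ID
        "stability": "balanced",                  # Expressive | Balanced | Stable
    },
)
with open("reply.wav", "wb") as f:
    f.write(resp.content)
```

The descriptive direction tag sits inline with the text rather than in a separate parameter, which is what lets developers steer delivery the same way they would prompt an LLM.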
The Conversational Layer Underneath
Beyond the four key features, Inworld highlights a set of behaviors that push speech further into what it describes as “person paying attention” territory. The most technically interesting is disfluencies: the model produces natural uh and um, self-corrections, mid-noun-phrase pauses, and trailing thoughts that signal warmth and recall rather than malfunction. Critically, different speaker profiles cluster fillers differently, and the model follows the rhythm: filler-as-energy sounds different from filler-as-hesitation. Voice cloning is also supported through a two-step API: upload a reference sample (5–15 seconds, clean, single speaker) to /voices/v1/voices:clone, receive a voice ID, and use it like any other voice.
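A sketch of that two-step cloning flow follows. The /voices/v1/voices:clone path and the 5–15 second reference requirement come from the release; the base URL, request fields, response shape, and the synthesis endpoint are assumptions made for illustration.

```python
# Sketch of the two-step cloning flow described above. The endpoint path
# /voices/v1/voices:clone is from the release notes; the base URL, field
# names, and response shape are assumptions.

import base64
import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.inworld.ai"          # assumed base URL

# Step 1: upload a clean 5-15 second single-speaker reference sample.
with open("reference.wav", "rb") as f:
    sample_b64 = base64.b64encode(f.read()).decode()

clone = requests.post(
    f"{BASE}/voices/v1/voices:clone",
    headers={"Authorization": f"Basic {API_KEY}"},           # assumed auth scheme
    json={"displayName": "support-agent", "audio": sample_b64},  # assumed fields
).json()
voice_id = clone["voiceId"]              # assumed response field

# Step 2: use the returned voice ID like any other voice in a synthesis call.
speech = requests.post(
    f"{BASE}/tts/v1/voice",              # hypothetical synthesis endpoint
    headers={"Authorization": f"Basic {API_KEY}"},
    json={"text": "Thanks for calling back, let's pick up where we left off.",
          "voiceId": voice_id},
)
with open("cloned_reply.wav", "wb") as f:
    f.write(speech.content)
```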
Where It Fits in the Stack
TTS-2 is one layer in Inworld’s broader Realtime API pipeline. The full stack includes Realtime STT, which transcribes and profiles the speaker in a single pass, capturing age, accent, pitch, vocal style, emotional tone, and pacing as structured signals on the same connection; a Realtime Router that routes across 200+ models, selecting the appropriate model and tools based on the user’s state and conversation context; and TTS-2 at the output layer. The pipeline runs over a single persistent WebSocket connection, with sub-200ms median time-to-first-audio for the TTS layer.
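As a rough illustration of the single-connection design, the sketch below opens one WebSocket, sends a turn of user audio, and times the first TTS chunk coming back. The wss:// URL and message types are assumptions; only the STT, Router, and TTS-2 layering and the single persistent connection are described in the announcement.

```python
# Minimal sketch of one persistent WebSocket session, assuming a hypothetical
# wss:// endpoint and JSON message shapes. Measures client-side
# time-to-first-audio for a single turn.

import asyncio
import json
import time

import websockets


async def run_turn(user_audio_b64: str) -> None:
    uri = "wss://api.inworld.ai/realtime/v1/session"   # hypothetical endpoint
    async with websockets.connect(uri) as ws:
        # One connection carries STT input, router decisions, and TTS-2 output.
        await ws.send(json.dumps({"type": "user_audio", "audio": user_audio_b64}))
        sent_at = time.monotonic()

        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "tts_audio_chunk":    # assumed message type
                ttfa_ms = (time.monotonic() - sent_at) * 1000
                print(f"time-to-first-audio: {ttfa_ms:.0f} ms")
                break

# asyncio.run(run_turn(my_audio_b64))
```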

The Bigger Picture
Realtime TTS 1.5 currently holds the top spot on the Artificial Analysis Speech Arena (as of May 5, 2026), outpacing Google (#2) and ElevenLabs (#3). The release of TTS-2 makes it clear that Inworld views raw audio quality as a challenge already overcome — and is now shifting its focus to the behavioral dimension: context-awareness, steerability, and maintaining a consistent identity across multiple languages.
Explore the Docs and Technical details.



