StepAudio 2.5 Arrives: StepFun's Breakthrough Voice Model Masters Emotions And Roleplay With Next-Gen RLHF

StepFun, an AI research firm based in Shanghai, has launched StepAudio 2.5 Realtime — an end-to-end, real-time speech large language model that lets users fully customize personas.

Unlike traditional pipeline systems that run speech recognition, reasoning, and voice synthesis as separate sequential stages, StepAudio 2.5 Realtime handles all three within a single framework. Audio goes in and audio comes out, with no intermediate modules. The model supports both Chinese and English.

It’s accessible through a WebSocket API at wss://api.stepfun.com/v1/realtime, using the model identifier step-2.5-realtime.
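As a rough illustration, the connection parameters could be assembled as in the sketch below. Only the endpoint URL and the model identifier come from the article. The `model` query parameter, the Bearer-token header, and the helper name `build_connection` are assumptions made for the example; check StepFun's API documentation for the actual handshake and message format.

```python
from urllib.parse import urlencode, urlsplit

REALTIME_URL = "wss://api.stepfun.com/v1/realtime"  # endpoint named in the article
MODEL_ID = "step-2.5-realtime"                      # model identifier named in the article


def build_connection(api_key: str) -> tuple[str, dict[str, str]]:
    """Return (url, headers) for opening the realtime WebSocket.

    Assumptions (not confirmed by the article): the model is selected with a
    `model` query parameter and the API key is sent as a Bearer token.
    """
    if not api_key:
        raise ValueError("api_key must be non-empty")
    url = f"{REALTIME_URL}?{urlencode({'model': MODEL_ID})}"
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


# "sk-example" is a placeholder, not a real key.
url, headers = build_connection("sk-example")
print(urlsplit(url).scheme, urlsplit(url).netloc, urlsplit(url).query)
```

The resulting URL and headers would then be handed to whichever WebSocket client you use. The sketch stops short of opening the socket because the article does not describe the session messages.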

The Three Core Innovations

The StepFun research team highlights three foundational architectural breakthroughs powering the model:

1. Persona Data Augmentation at Million-Sample Scale

The team began with more than 10,000 carefully written, high-quality personas and algorithmically expanded this seed set into a persona feature matrix containing millions of distinct profiles, rather than manually crafting millions of examples. The expanded dataset was then paired with millions of actual conversational samples during training. The goal is broad generalization: the model should remain consistent and reliable even in challenging, niche, or uncommon conversational scenarios.
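The article does not say which algorithm StepFun used for the expansion. One simple way to picture growing a small seed set into a much larger one is to recombine attribute values across seeds, as in this minimal sketch. The seed personas, the attribute names, and the `expand_personas` helper are all hypothetical.

```python
from itertools import product

# Hypothetical seed personas; the attribute names are illustrative only.
seeds = [
    {"role": "ship captain", "temperament": "gruff", "speech_style": "terse"},
    {"role": "tea-house owner", "temperament": "warm", "speech_style": "chatty"},
    {"role": "detective", "temperament": "wry", "speech_style": "formal"},
]


def expand_personas(seeds):
    """Recombine attribute values across seeds into new persona profiles."""
    keys = sorted({key for seed in seeds for key in seed})
    # One pool of distinct values per attribute, kept in first-seen order.
    pools = [list(dict.fromkeys(s[k] for s in seeds if k in s)) for k in keys]
    # The Cartesian product of the pools yields every attribute combination.
    return [dict(zip(keys, combo)) for combo in product(*pools)]


profiles = expand_personas(seeds)
print(len(seeds), "seeds ->", len(profiles), "profiles")
```

With three attributes and three values each, the three seeds yield 27 profiles. The count grows multiplicatively with each added attribute or value, which is how a seed set in the thousands could plausibly reach millions. A real pipeline would also need to filter out incoherent combinations.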

2. Tailored RLHF Alignment for Roleplay Stability

One of the most common issues in conversational AI is “out-of-character” (OOC) drift, where the model gradually loses consistency with its assigned persona over the course of a conversation. StepFun’s team carried out dedicated RLHF (Reinforcement Learning from Human Feedback) optimization focused specifically on maintaining persona fidelity during roleplay exchanges. RLHF is a training method in which human preference data is used to build a reward model that, in turn, steers the language model’s outputs. Here the technique is applied with a targeted emphasis on roleplay consistency.
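To make the reward-model step concrete, the sketch below shows the standard pairwise (Bradley-Terry) preference loss commonly used to train RLHF reward models. It is a generic illustration, not StepFun's training code, and the reward scores for the in-character and out-of-character replies are made-up numbers.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical reward-model scores for two replies to the same roleplay prompt.
in_character = 2.0      # reply that stays in persona (preferred by the rater)
out_of_character = 0.5  # reply that breaks persona

print(round(preference_loss(in_character, out_of_character), 4))
print(round(preference_loss(out_of_character, in_character), 4))
```

The loss is small when the reward model already ranks the in-character reply higher and large when it ranks the pair the wrong way round, so minimizing it pushes the reward model toward the raters' preferences.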

3. Integrated Speech Understanding and Generation

StepAudio 2.5 Realtime builds on the text-to-speech capabilities of StepAudio 2.5 and tightly integrates speech understanding with speech generation through reinforcement learning. This integration enables what StepFun refers to as “global scene-level tonal setting” alongside “intra-sentence detail sculpting.” In practice, the model can establish an overall emotional tone for an entire response while simultaneously fine-tuning subtle acoustic nuances within individual sentences.

Paralinguistic Perception

A particularly notable technical strength of this model lies in its paralinguistic perception. Paralinguistics covers the non-verbal acoustic cues embedded in speech — elements such as vocal tone, speaking pace, pauses, sighs, and laughter. By interpreting these signals, the model can gauge the user’s emotional state and detect underlying intentions. For instance, it can recognize fatigue from a subdued tone or frustration from an accelerated speaking rate. Capturing these subtleties requires the model to work directly with raw audio features rather than relying solely on transcribed text.
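As a toy illustration of a cue that exists only in the audio, the sketch below estimates how much of a signal is pause by thresholding short-time RMS energy. It is not how StepAudio 2.5 Realtime perceives paralinguistic signals, which the article does not describe. The function name, frame length, and threshold are assumptions chosen for the example.

```python
import math


def pause_ratio(samples, frame_len=160, threshold=0.02):
    """Fraction of fixed-length frames whose RMS energy is below `threshold`."""
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if not frames:
        return 0.0
    silent = sum(
        1 for frame in frames
        if math.sqrt(sum(x * x for x in frame) / len(frame)) < threshold
    )
    return silent / len(frames)


# Synthetic 16 kHz signal: 0.5 s of a 200 Hz tone followed by 0.5 s of silence.
rate = 16000
tone = [0.3 * math.sin(2 * math.pi * 200 * n / rate) for n in range(rate // 2)]
silence = [0.0] * (rate // 2)

print(pause_ratio(tone + silence))
```

A transcript of this signal would carry no trace of the half-second gap, which is the kind of information a text-only pipeline discards.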

On the paralinguistic comprehension benchmark, StepAudio 2.5 Realtime achieved a score of 82.18, demonstrating its ability to perceive vocal speed, emotional state, age, and other acoustic characteristics.

Benchmark Results

The StepFun research team carried out an extensive set of subjective and objective evaluations, benchmarking StepAudio 2.5 Realtime against leading real-time voice models across five key performance areas.

The subjective score comes from actual mobile app conversations rated by human evaluators. The results are as follows:

Human evaluation (subjective): 80.41
General dialogue (objective): 86.36
Automotive scenario (objective): 84.80
Spoken QA, spanning 11 audio understanding tasks (objective): 79.80
Paralinguistic comprehension (objective): 82.18

Key Takeaways

StepAudio 2.5 Realtime is a fully end-to-end real-time speech large language model, developed by Shanghai-based StepFun.
It combines persona-specific RLHF with data augmentation at million-sample scale to keep character behavior consistent.
The model achieved the top ranking across all five benchmark dimensions, as tested in April 2026.
Paralinguistic comprehension — the ability to detect tone, speaking rate, and emotion from audio — stands out as a key technical advantage.
API access is available via WebSocket at wss://api.stepfun.com/v1/realtime using the model identifier step-2.5-realtime.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
