For many people, our first experiences with AI agents have been by typing into a chat box. And for those of us using agents every day, we've likely gotten quite good at writing detailed prompts or markdown files to guide them.
But some of the moments where agents could be most helpful are not always text-first. You might be on a long commute, juggling a few open sessions, or just wanting to speak naturally to an agent, have it speak back, and continue the interaction.
Adding voice to an agent shouldn't require moving that agent into a separate voice framework. Today, we're releasing an experimental voice pipeline for the Agents SDK.
With @cloudflare/voice, you can add real-time voice to the same Agent architecture you already use. Voice simply becomes another way you can talk to the same Durable Object, with the same tools, persistence, and WebSocket connection model that the Agents SDK already provides.
@cloudflare/voice is an experimental package for the Agents SDK that provides:
withVoice(Agent) for full-conversation voice agents
withVoiceInput(Agent) for speech-to-text-only use cases, like dictation or voice search
useVoiceAgent and useVoiceInput hooks for React apps
VoiceClient for framework-agnostic clients
Built-in Workers AI providers, so you can get started without external API keys
This means you can now build an agent that users can talk to in real time over a single WebSocket connection, while keeping the same Agent class, the same Durable Object instance, and the same SQLite-backed conversation history.
Just as importantly, we want this to be more than one fixed default stack. The provider interfaces in @cloudflare/voice are intentionally small, and we want speech, telephony, and transport providers to build with us, so developers can mix and match the right components for their use case, instead of being locked into a single voice architecture.
Here's the minimal server-side pattern for a voice agent in the Agents SDK:
import { Agent, routeAgentRequest } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string) {
    return `You said: ${transcript}`;
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler<Env>;
That's the whole server. You add a continuous transcriber and a text-to-speech provider, and implement onTurn().
On the client side, you can connect to it with a React hook:
import { useVoiceAgent } from "@cloudflare/voice/react";

function App() {
  const {
    status,
    transcript,
    interimTranscript,
    startCall,
    endCall,
    toggleMute
  } = useVoiceAgent({ agent: "my-agent" });

  return (
    <div>
      <p>Status: {status}</p>
      {interimTranscript && <p>{interimTranscript}</p>}
      <ul>
        {transcript.map((msg, i) => (
          <li key={i}>
            {msg.role}: {msg.text}
          </li>
        ))}
      </ul>
    </div>
  );
}
If you're not using React, you can use VoiceClient directly from @cloudflare/voice/client.
How the voice pipeline works
With the Agents SDK, every agent is a Durable Object: a stateful, addressable server instance with its own SQLite database, WebSocket connections, and application logic. The voice pipeline extends this model instead of replacing it.
At a high level, the flow looks like this:
Here's how the pipeline breaks down, step by step:
Audio transport: The browser captures microphone audio and streams 16 kHz mono PCM over the same WebSocket connection the agent already uses.
STT session setup: When the call starts, the agent creates a continuous transcriber session that lives for the duration of the call.
STT input: Audio streams continuously into that session.
STT turn detection: The speech-to-text model itself decides when the user has finished an utterance and emits a stable transcript for that turn.
LLM/application logic: The voice pipeline passes that transcript to your onTurn() method.
TTS output: Your response is synthesized to audio and sent back to the client. If onTurn() returns a stream, the pipeline sentence-chunks it and starts sending audio as sentences are ready.
Persistence: The user and agent messages are persisted in SQLite, so conversation history survives reconnections and deployments.
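To make the audio transport step concrete, here is an illustrative sketch (not the SDK's actual client code) of converting captured microphone samples into 16 kHz mono PCM, assuming the browser delivers Float32 samples at the device sample rate:

```typescript
// Convert Float32 samples (e.g. from an AudioWorklet at the device rate)
// to 16 kHz mono 16-bit PCM. Naive nearest-neighbor resampling; a real
// client would low-pass filter before decimating to avoid aliasing.
function toPCM16k(samples: Float32Array, inputRate: number): Int16Array {
  const targetRate = 16000;
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(samples.length / ratio);
  const out = new Int16Array(outLength);
  for (let i = 0; i < outLength; i++) {
    // Clamp to [-1, 1] and scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[Math.floor(i * ratio)]));
    out[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return out;
}
```

The resulting buffer can then be sent to the agent as a binary WebSocket frame.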
Why voice should grow with the rest of your agent
Many voice frameworks focus on the voice loop itself: audio in, transcription, model response, audio out. These are important primitives, but there's a lot more to an agent than just voice.
Real agents running in production will grow. They need state, scheduling, persistence, tools, workflows, telephony, and ways to keep all of that consistent across channels. As your agent grows in complexity, voice stops being a standalone feature and becomes part of a larger system.
We wanted voice in the Agents SDK to start from that assumption. Instead of building voice as a separate stack, we built it on top of the same Durable Object-based agent platform, so you can pull in the rest of the primitives you need without re-architecting the application later.
Voice and text share the same state
A user might start by typing, switch to voice, and return to text. With the Agents SDK, these are all just different inputs to the same agent. The same conversation history lives in SQLite, and the same tools are available. This gives you both a cleaner mental model and a much simpler application architecture to reason about.
Lower latency comes from…
Voice experiences feel good or bad very quickly. Once a user stops speaking, the system needs to transcribe, think, and start speaking back fast enough to feel conversational.
A lot of voice latency is not pure model time. It's the cost of bouncing audio and text between different services elsewhere. Audio has to go to STT, transcripts go to an LLM, and responses go to a TTS model, and each handoff adds network overhead.
With the Agents SDK voice pipeline, the agent runs on Cloudflare's network, and the built-in providers use Workers AI bindings. That keeps the pipeline tighter and reduces the amount of infrastructure you have to stitch together yourself.
A voice agent interaction feels much more natural if it speaks the first sentence quickly (also known as time-to-first-audio). When onTurn() returns a stream, the pipeline chunks it into sentences and begins synthesis as sentences complete. That means the user can hear the beginning of the answer while the rest is still being generated.
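The sentence-chunking idea is simple to sketch. The following is a hand-rolled illustration of the technique (not the pipeline's actual implementation): buffer streamed text deltas and yield each sentence the moment it completes, so synthesis can start on the first sentence while the rest is still streaming:

```typescript
// Buffer incoming text deltas and yield complete sentences as soon as
// they are available. Sentence detection here is deliberately naive
// (punctuation followed by whitespace); real pipelines handle
// abbreviations, numbers, and other edge cases.
async function* sentenceChunks(
  stream: AsyncIterable<string>
): AsyncGenerator<string> {
  let buffer = "";
  for await (const delta of stream) {
    buffer += delta;
    // Flush every complete sentence currently in the buffer.
    let match: RegExpMatchArray | null;
    while ((match = buffer.match(/^(.*?[.!?])\s+/s))) {
      yield match[1];
      buffer = buffer.slice(match[0].length);
    }
  }
  if (buffer.trim()) yield buffer.trim(); // trailing partial sentence
}
```

Each yielded sentence can go straight to TTS, which is what makes time-to-first-audio roughly independent of the total response length.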
A more realistic backend
Here is a fuller example that streams an LLM response and starts speaking it back, sentence by sentence:
import { Agent, routeAgentRequest } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";
import { streamText } from "ai";
import { createWorkersAI } from "workers-ai-provider";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const ai = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: ai("@cf/openai/gpt-oss-20b"),
      system: "You are a helpful voice assistant. Be concise.",
      messages: [
        ...context.messages.map((m) => ({
          role: m.role as "user" | "assistant",
          content: m.content
        })),
        { role: "user" as const, content: transcript }
      ],
      abortSignal: context.signal
    });

    return result.textStream;
  }
}

export default {
  async fetch(request: Request, env: Env) {
    return (
      (await routeAgentRequest(request, env)) ??
      new Response("Not found", { status: 404 })
    );
  }
} satisfies ExportedHandler<Env>;
context.messages gives you up-to-date, SQLite-backed conversation history, and context.signal lets the pipeline abort the LLM call if the user interrupts.
Voice as an input: withVoiceInput
Not every speech interface needs to speak back. Sometimes you just need dictation, transcription, or voice search. For those use cases, you can use withVoiceInput:
import { Agent, type Connection } from "agents";
import { withVoiceInput, WorkersAINova3STT } from "@cloudflare/voice";

const InputAgent = withVoiceInput(Agent);

export class DictationAgent extends InputAgent {
  transcriber = new WorkersAINova3STT(this.env.AI);

  onTranscript(text: string, _connection: Connection) {
    console.log("User said:", text);
  }
}
On the client, useVoiceInput gives you a lightweight interface focused on transcriptions:
import { useVoiceInput } from "@cloudflare/voice/react";
const { transcript, interimTranscript, isListening, start, stop, clear } =
  useVoiceInput({ agent: "DictationAgent" });
This is useful when speech is an input method and you don't need a full conversational loop.
Voice and text on the same connection
The same client can call sendText("What's the weather?"), which bypasses STT and sends the text directly to onTurn(). During an active call, the response will be spoken and shown as text. Outside a call, it can remain text-only.
This gives you a genuinely multimodal agent, without splitting the implementation into different code paths.
What else can you build?
Because a voice agent is still an agent, all the usual Agents SDK capabilities still apply.
You can greet a caller when a session starts:
import { Agent, type Connection } from "agents";
import { withVoice, WorkersAIFluxSTT, WorkersAITTS } from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async onTurn(transcript: string) {
    return `You said: ${transcript}`;
  }

  async onCallStart(connection: Connection) {
    await this.speak(connection, "Hi! How can I help you today?");
  }
}
You can schedule spoken reminders and expose tools to your LLM just like any other agent:
import { Agent } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAITTS,
  type VoiceTurnContext
} from "@cloudflare/voice";
import { streamText, tool } from "ai";
import { createWorkersAI } from "workers-ai-provider";
import { z } from "zod";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  transcriber = new WorkersAIFluxSTT(this.env.AI);
  tts = new WorkersAITTS(this.env.AI);

  async speakReminder(payload: { message: string }) {
    await this.speakAll(`Reminder: ${payload.message}`);
  }

  async onTurn(transcript: string, context: VoiceTurnContext) {
    const ai = createWorkersAI({ binding: this.env.AI });

    const result = streamText({
      model: ai("@cf/openai/gpt-oss-20b"),
      messages: [
        ...context.messages.map((m) => ({
          role: m.role as "user" | "assistant",
          content: m.content
        })),
        { role: "user" as const, content: transcript }
      ],
      tools: {
        set_reminder: tool({
          description: "Set a spoken reminder after a delay",
          inputSchema: z.object({
            message: z.string(),
            delay_seconds: z.number()
          }),
          execute: async ({ message, delay_seconds }) => {
            await this.schedule(delay_seconds, "speakReminder", { message });
            return { confirmed: true };
          }
        })
      },
      abortSignal: context.signal
    });

    return result.textStream;
  }
}
The voice pipeline also lets you choose a transcription model dynamically per connection.
For example, you might pick Flux for conversational turn-taking and Nova 3 for higher-accuracy dictation. You can switch at runtime by overriding createTranscriber():
import { Agent, type Connection } from "agents";
import {
  withVoice,
  WorkersAIFluxSTT,
  WorkersAINova3STT,
  WorkersAITTS,
  type Transcriber
} from "@cloudflare/voice";

const VoiceAgent = withVoice(Agent);

export class MyAgent extends VoiceAgent {
  tts = new WorkersAITTS(this.env.AI);

  createTranscriber(connection: Connection): Transcriber {
    const url = new URL(connection.url ?? "http://localhost");
    const model = url.searchParams.get("model");

    if (model === "nova-3") {
      return new WorkersAINova3STT(this.env.AI);
    }
    return new WorkersAIFluxSTT(this.env.AI);
  }
}
On the client, you can pass query parameters through the hook:
const voiceAgent = useVoiceAgent({
  agent: "my-voice-agent",
  query: { model: "nova-3" }
});
You can also intercept data between stages:
afterTranscribe(transcript, connection)
beforeSynthesize(text, connection)
afterSynthesize(audio, text, connection)
These hooks are useful for content filtering, text normalization, language-specific transformations, or custom logging.
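As an illustration of the kind of text normalization a beforeSynthesize hook might do, here is a small normalizer that expands constructs that read fine on screen but sound wrong when spoken. The replacement table is a made-up example, not part of the SDK:

```typescript
// Illustrative spoken-form replacements for a beforeSynthesize hook.
// Each entry maps a written pattern to how it should be read aloud.
const SPOKEN_FORMS: Array<[RegExp, string]> = [
  [/\be\.g\./g, "for example"],
  [/\betc\./g, "et cetera"],
  [/(\d+)\s*%/g, "$1 percent"],  // "50%" -> "50 percent"
  [/\bAPI\b/g, "A P I"]          // spell out the acronym
];

function normalizeForSpeech(text: string): string {
  return SPOKEN_FORMS.reduce(
    (acc, [pattern, spoken]) => acc.replace(pattern, spoken),
    text
  );
}
```

A hook like this keeps the written transcript readable while giving the TTS model text it can pronounce naturally.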
Telephony and transport options
By default, the voice pipeline uses a single WebSocket connection for 1:1 voice agents. But that's not the only option.
You can connect phone calls to the same agent using the Twilio adapter:
import { TwilioAdapter } from "@cloudflare/voice-twilio";
export default {
async fetch(request: Request, env: Env) {
if (new URL(request.url).pathname === "/twilio") {
return TwilioAdapter.handleRequest(request, env, "MyAgent");
}
return (
(await routeAgentRequest(request, env)) ??
new Response("Not found", { status: 404 })
);
}
};
This lets the same agent handle web voice, text input, and phone calls.
One caveat: the default Workers AI TTS provider returns MP3, while Twilio expects 8 kHz mulaw audio. For production telephony, you may want to use a TTS provider that outputs PCM or mulaw directly.
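If you want to experiment with bridging in the meantime, the conversion itself is well understood. Here is a sketch of the standard G.711 mu-law encoder for a single 16-bit PCM sample (you would still need to resample to 8 kHz before encoding):

```typescript
// Standard G.711 mu-law encoding: one signed 16-bit PCM sample in,
// one mulaw byte out. Telephony bridges like Twilio expect 8 kHz
// mono mulaw frames built from these bytes.
function linearToMulaw(sample: number): number {
  const BIAS = 0x84;  // standard mu-law bias
  const CLIP = 32635; // clip before biasing to avoid overflow
  const sign = sample < 0 ? 0x80 : 0;
  let s = Math.min(Math.abs(sample), CLIP) + BIAS;
  // Exponent = position of the highest set bit above bit 7.
  let exponent = 7;
  for (let mask = 0x4000; (s & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (s >> (exponent + 3)) & 0x0f;
  // Mu-law bytes are transmitted inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}
```

Mapping an Int16Array through this function yields a Uint8Array you can frame for the telephony leg.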
If you need a transport that's better suited to rough network conditions, or one that will include multiple participants, the voice package also includes SFU utilities and supports custom transports. The default model is WebSocket-native today, but we plan to build more adapters to connect to our global SFU infrastructure.
The voice pipeline is provider-agnostic by design.
Under the hood, each stage is defined by a small interface: a transcriber opens a continuous session and accepts audio frames as they arrive, while a TTS provider takes text and returns audio. If a provider can stream audio output, the pipeline can use that too.
interface Transcriber {
  createSession(options?: TranscriberSessionOptions): TranscriberSession;
}

interface TranscriberSession {
  feed(chunk: ArrayBuffer): void;
  close(): void;
}

interface TTSProvider {
  synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer>;
}
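To show how small that surface is, here is a deliberately trivial provider that satisfies the TTSProvider shape: it returns zero-filled PCM (silence) sized to the input text, where a real implementation would call a speech API. The interface is repeated so the example is self-contained, and the duration heuristic is made up for illustration:

```typescript
// The TTSProvider shape from the pipeline, repeated here so the example
// is self-contained (the return type is assumed to be raw audio bytes).
interface TTSProvider {
  synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer>;
}

// A trivial provider: "synthesizes" 60 ms of 16 kHz 16-bit PCM silence
// per character. A real provider would call out to a speech service.
class SilenceTTS implements TTSProvider {
  constructor(
    private msPerChar = 60,
    private sampleRate = 16000
  ) {}

  async synthesize(text: string, signal?: AbortSignal): Promise<ArrayBuffer> {
    signal?.throwIfAborted(); // honor interruption, like the pipeline does
    const samples = Math.round(
      (text.length * this.msPerChar * this.sampleRate) / 1000
    );
    return new ArrayBuffer(samples * 2); // 2 bytes per 16-bit sample
  }
}
```

The point is that a provider only has to answer "text in, audio out"; everything else (chunking, transport, persistence) stays in the pipeline.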
We didn't want voice support in the Agents SDK to only work with one fixed combination of models and transports. We wanted the default path to be simple, while still making it easy to plug in other providers as the ecosystem grows.
The built-in providers use Workers AI, so you can get started without external API keys:
WorkersAIFluxSTT for conversational streaming STT
WorkersAINova3STT for dictation-style streaming STT
WorkersAITTS for text-to-speech
But the bigger goal is interoperability. If you maintain a speech or voice service, these interfaces are small enough to implement without needing to know the rest of the SDK internals. If your STT provider accepts streaming audio and can detect utterance boundaries, it can satisfy the transcriber interface. If your TTS provider can stream audio output, even better.
We'd love to work on interoperability with:
STT providers like AssemblyAI, Rev.ai, Speechmatics, or any service with a real-time transcription API
TTS providers like PlayHT, LMNT, Cartesia, Coqui, Amazon Polly, or Google Cloud TTS
telephony adapters for platforms like Vonage, Telnyx, or Bandwidth
transport implementations for WebRTC data channels, SFU bridges, and other audio transport layers
We're also interested in collaborations that go beyond individual providers:
latency benchmarking across STT + LLM + TTS combinations
multilingual support and better documentation for non-English voice agents
accessibility work, especially around multimodal interfaces and speech impairments
If you're building voice infrastructure and want to see a first-class integration, open a PR or reach out.
The voice pipeline is available today as an experimental package:
npm create cloudflare@latest -- --template cloudflare/agents-starter
Add @cloudflare/voice, give your agent a transcriber and a TTS provider, deploy it, and start talking to it. You can also read the API reference.
If you build something interesting, open an issue or PR on github.com/cloudflare/agents. Voice shouldn't require a separate stack, and we think the best voice agents will be the ones built on the same durable application model as everything else.



