Voice · Production · Bedrock

Building Voice Agents: Everything You Know About Chat Agents Still Applies

A voice agent is not a new kind of AI — it's the agent you already know, wearing a microphone. This is the honest guide: the two ways to build one, why latency is the whole product, where Amazon Nova Sonic fits, and how to ship one you can actually talk to.

AgentSwarms Authors
July 2, 2026 · 16 min read

The first time we put a voice in front of one of our agents, it was genuinely magical for about four seconds. You talk, it thinks, it talks back — the sci-fi thing. Then a colleague tried to interrupt it mid-sentence to correct a detail, and it just… kept going. Read its entire paragraph over the top of her while she repeated herself, louder, like you do with a bad phone line. The demo went from “this is the future” to “please stop talking” in one turn. That gap — between a voice demo and a voice agent — is what this post is about.

Here's the reassuring part, and the whole thesis of this piece: a voice agent is not a new species of AI. It's the chat agent you already know how to build, with audio bolted onto the two ends. The system prompt, the tools, the retrieval, the memory, the guardrails, the evals — every principle you've learned carries straight over. What's genuinely new is a thin, unforgiving layer around it: capturing speech, deciding when someone stopped talking, and getting a reply out fast enough that it feels like a conversation instead of a walkie-talkie.

🎙️ Speech-to-text → The same agent core (📜 System prompt · 🔧 Tools · 📚 RAG / knowledge · 🧵 Memory · 🛡️ Guardrails) → 🔊 Text-to-speech

Everything you learned building chat agents still applies. Voice only bolts audio onto the two ends — the reasoning in the middle is unchanged.

The uncomfortable truth for anyone hoping voice is a fresh start: the hard part in the middle is the exact agent you've already built. Voice adds ears and a mouth, not a new brain.

Keep that picture in mind, because it's the antidote to the biggest mistake in voice: treating it as a separate product with its own rules. It isn't. If your chat agent hallucinates, your voice agent hallucinates — out loud, where it's harder to catch. If your chat agent has no guardrails, neither does the one that's now speaking to a customer on the phone. Fix the agent first. Then make it talk.

Two ways to build one

There are exactly two architectures, and almost every product is one, the other, or a deliberate mix. The first is the cascaded pipeline, three separate models in a row: speech-to-text (STT), your LLM, then text-to-speech (TTS). The second is a unified speech-to-speech model: a single multimodal model that takes audio in and emits audio out directly, with no text stop in the middle. Here is how they compare:

Cascaded pipeline: 🎙️ Audio → 📝 STT → 🧠 LLM → 🔊 TTS → 🔈 Audio

  • Latency. Cascaded: higher, three hops. Speech-to-speech: lowest, one model.
  • Debuggability. Cascaded: high, you can read the transcript. Speech-to-speech: low, audio in and audio out.
  • Tools / RAG / memory. Cascaded: full, it's a normal LLM step. Speech-to-speech: model-dependent.
  • Prosody & emotion. Cascaded: flatter, text drops tone. Speech-to-speech: natural, hears and speaks tone.
  • Swap a component. Cascaded: easy, mix vendors. Speech-to-speech: all-or-nothing.

Cascaded vs speech-to-speech. Cascaded is transparent and model-agnostic: you can read the transcript and swap any of the three parts. Speech-to-speech is faster and more natural, but it's a black box and you're wedded to one model's tool support.

The cascaded route is where almost everyone should start, and it's what powers most production voice agents today. The reason is boring and decisive: the middle stage is a normal LLM call. That means the transcript is right there to log, your tools and RAG and guardrails plug in exactly as they do in chat, and if the answer is wrong you can see why — was it a bad transcription, or a bad reply? A speech-to-speech model gives you audio in and audio out and not much to inspect in between. That's a wonderful property for latency and a miserable one at 2am when something's broken.
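To make "the middle stage is a normal LLM call" concrete, here is a minimal sketch of one cascaded turn. The three stage functions are stand-ins passed in as plain callables, and `TurnTrace`, `cascaded_turn`, and the fake stages are hypothetical names invented for this illustration, not part of any vendor SDK. The point is only that the transcript and the reply text both exist as inspectable strings.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TurnTrace:
    """Everything one cascaded turn produced, kept so it can be logged."""
    transcript: str   # what STT heard
    reply_text: str   # what the agent answered
    audio: bytes      # what TTS synthesized


def cascaded_turn(
    mic_audio: bytes,
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> TurnTrace:
    """Run STT -> agent -> TTS in sequence and keep the intermediate text."""
    transcript = stt(mic_audio)        # stage 1: speech-to-text
    reply_text = agent(transcript)     # stage 2: the ordinary chat agent
    audio = tts(reply_text)            # stage 3: text-to-speech
    return TurnTrace(transcript, reply_text, audio)


# Stand-in stages (assumptions for the demo; real ones would call models).
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_agent(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


trace = cascaded_turn(b"where is my order", fake_stt, fake_agent, fake_tts)
print(trace.transcript)   # the text you would inspect for a bad transcription
print(trace.reply_text)   # the text you would inspect for a bad reply
```

Because each stage is just a callable, swapping vendors means swapping one argument, which is the "mix vendors" property described above.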

STT vs TTS vs “speech model” — the vocabulary

STT (speech-to-text, a.k.a. ASR) turns your mic audio into a transcript. TTS (text-to-speech) turns text back into audio. A speech-to-speech (or “realtime”) model collapses STT + reasoning + TTS into one model that never surfaces text at all. Cascaded = STT + LLM + TTS as three parts. Speech-to-speech = one part.

One spoken turn, start to finish

Whichever architecture you pick, a single turn is the same shape: you speak, the words become a request, the agent reasons (maybe calling a tool), and a reply comes back as sound. The only thing that changes between cascaded and speech-to-speech is whether the stages between the two audio ends are three models or one:

🎙️ You speak: mic audio
📝 Speech → text: transcript
🧠 The agent: prompt · tools · RAG
🔊 Text → speech: spoken reply

Three models in a loop. The middle box is an ordinary agent — swap in any chat agent you've already built and it just works.

The loop that repeats once per conversational turn. In cascaded builds each box is a separate model call you can log; in speech-to-speech the three model stages collapse into one.

The latency budget is the entire product

Here is the number that governs everything: people notice conversational lag somewhere north of 300–500 milliseconds, and past ~800ms it stops feeling like a conversation and starts feeling like a bad connection — which is exactly when they talk over you. That budget has to cover the whole loop: detecting that you stopped speaking, the final transcription, the model's time-to-first-token, the first chunk of synthesized audio, and the round-trip network. Miss it and no amount of model quality saves you.

The single lever that matters is streaming everything. Don't wait for the user to finish to start transcribing; transcribe partials live. Don't wait for the full reply to start speaking; feed the first sentence into TTS while the model is still writing the third. The difference is not subtle.
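One way to "feed the first sentence into TTS while the model is still writing the third" is to cut the LLM's token stream into sentences as they complete. The sketch below is an assumption about how that could look, not a description of any particular product. `sentences_from_tokens` is a hypothetical helper, and the punctuation-based boundary rule is deliberately naive.

```python
import re
from typing import Iterable, Iterator

# A sentence boundary: whitespace that follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as it is complete, without waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence; hand it to TTS now.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]             # keep the unfinished tail
    if buffer.strip():
        yield buffer.strip()           # flush whatever is left at end of stream


# Simulated LLM token stream (hypothetical data).
llm_tokens = ["Your order ", "shipped Tuesday. ", "It should arrive ", "by Friday. ", "Anything else?"]
chunks = list(sentences_from_tokens(llm_tokens))
for chunk in chunks:
    print("to TTS:", chunk)
```

A rule this simple will split on abbreviations such as "Dr. Smith", so a real pipeline would likely need a smarter boundary check; the overlap between generation and synthesis is the part that matters here.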

With every stage streamed:

  • End-of-turn (VAD): 120ms
  • Final STT: 90ms
  • LLM first token: 230ms
  • First TTS chunk: 140ms
  • Network + play: 110ms

690ms to first sound, under the ~800ms line. Feels like a conversation.

The same turn, streamed vs not. Waiting for each stage to fully complete before starting the next stacks the delays into something unusable. Streaming overlaps them and drops the time-to-first-sound under the line.
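The streamed figures above are simple arithmetic, and it can help to keep them in code so a regression in any one stage shows up as lost headroom. This is a small sketch using the stage numbers from this post; the ~800ms line is treated as a hard budget purely for illustration, and the variable names are made up.

```python
BUDGET_MS = 800  # the "~800ms line" from the text, treated here as a hard limit

# Per-stage time to first output when every stage streams (figures from this post).
stages_ms = {
    "End-of-turn (VAD)": 120,
    "Final STT": 90,
    "LLM first token": 230,
    "First TTS chunk": 140,
    "Network + play": 110,
}

total_ms = sum(stages_ms.values())      # time to first sound
headroom_ms = BUDGET_MS - total_ms      # what is left before it feels laggy
within_budget = total_ms <= BUDGET_MS

print(f"time to first sound: {total_ms}ms")
print(f"headroom: {headroom_ms}ms, within budget: {within_budget}")
```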

The demo trap

A non-streaming voice agent feels fine when you built it and you're speaking slowly, one clean sentence at a time. It falls apart the instant a real person talks like a real person — fast, with filler words and mid-thought corrections. Test with someone who doesn't know how it works.

Turn-taking is where “robotic” actually lives

Two mechanisms decide whether your agent feels human, and neither is the model. The first is endpointing — knowing when the user has actually finished, not just paused. A naive fixed timeout cuts fast talkers off and makes slow talkers wait; good endpointing uses voice-activity detection (VAD) plus the words themselves to tell “I need a refund…” (still going) from “I need a refund.” (done). The second is barge-in: the moment the user starts talking, you stop playback and throw away the half-spoken reply. This is the exact bug from our opening story:

🔊 Agent: "Sure, your order shipped Tuesday and should arrive by Friday, tracking number one two three…"
🎙️ Caller (cuts in): "…wait, I just need the tracking link."

Playback stops the instant the caller speaks. The half-said sentence is dropped, STT re-opens. Feels human.

Barge-in on vs off. Without it, a long-winded agent is physically impossible to interrupt, which is the thing everyone hates about bad IVR systems. With it, the caller is always in control.
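Both mechanisms are small pieces of logic, and a sketch may make them less abstract. Everything below is an assumption for illustration: the 300ms and 1200ms thresholds, the `TRAILING_WORDS` set, and the `Playback` class are invented here, and the endpointing rule assumes your STT partials include punctuation. Real endpointing is usually a trained model rather than a handful of rules.

```python
from collections import deque

# Words that suggest the speaker is mid-thought (hypothetical, far from complete).
TRAILING_WORDS = {"and", "but", "so", "because", "or", "um", "uh"}


def is_end_of_turn(silence_ms: int, partial_transcript: str,
                   short_ms: int = 300, long_ms: int = 1200) -> bool:
    """Combine VAD silence with the words: wait longer if the sentence looks unfinished."""
    text = partial_transcript.strip().lower()
    if not text:
        return False                                # nothing said yet
    words = text.rstrip(".,!?…").split()
    last_word = words[-1] if words else ""
    looks_unfinished = text.endswith(("…", "...")) or last_word in TRAILING_WORDS
    threshold = long_ms if looks_unfinished else short_ms
    return silence_ms >= threshold


class Playback:
    """Queue of synthesized audio chunks waiting to be played."""

    def __init__(self) -> None:
        self.queue: deque = deque()
        self.playing = False

    def enqueue(self, chunk: bytes) -> None:
        self.queue.append(chunk)
        self.playing = True

    def barge_in(self) -> int:
        """Caller started talking: stop now and throw away the unspoken reply."""
        dropped = len(self.queue)
        self.queue.clear()
        self.playing = False
        return dropped


print(is_end_of_turn(400, "I need a refund."))   # finished sentence, short wait is enough
print(is_end_of_turn(400, "I need a refund…"))   # trailing off, keep listening

player = Playback()
for chunk in (b"Sure, your order", b"shipped Tuesday", b"and should arrive"):
    player.enqueue(chunk)
dropped = player.barge_in()
print("chunks dropped on barge-in:", dropped)
```

In a live system the barge-in call would be triggered by VAD on the caller's channel and would also reopen STT, as described above.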

Prompting for the ear, not the eye

This is the one place a chat prompt genuinely does need to change. The same model that writes a beautiful bulleted, bolded, link-studded answer for a screen produces unlistenable audio — because there's no such thing as a spoken bullet point, and “see https://…/orders/9f3a-… for details” read aloud is a small act of cruelty. Your voice system prompt needs a few explicit rules:

  • Short turns. One to three spoken sentences. Cap max-tokens low — brevity is a feature, not a limitation.
  • No markdown, ever. No lists, headings, code blocks, or tables. They're invisible in audio and make the TTS stumble.
  • Spell out or skip the ugly bits. Read “order number four-two-one,” not “#421”; offer to text or email long URLs and IDs rather than reciting them.
  • Confirm the risky things. Names, numbers, and any action worth doing — read them back. Transcription will mishear “Sarah” as “Sara” and “fifteen” as “fifty.”
  • One question at a time. Voice has no scroll-back; a wall of three questions gets one answer and two dropped.
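Prompt rules are the first line of defense, but models slip, so it can be worth scrubbing the reply before it reaches TTS. The sketch below is one possible safety net, not a complete markdown parser. `VOICE_RULES` is a hypothetical prompt fragment distilled from the list above, and `speakable` and its replacement phrase are names invented for this example.

```python
import re

# Hypothetical system-prompt fragment distilled from the rules above.
VOICE_RULES = (
    "Reply in one to three short spoken sentences. "
    "Never use markdown, lists, headings, tables, or code. "
    "Do not read URLs or long IDs aloud; offer to send them instead. "
    "Read back names, numbers, and actions to confirm them. "
    "Ask only one question at a time."
)


def speakable(text: str, link_phrase: str = "a link I can send you") -> str:
    """Strip formatting that is invisible or painful in audio."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)             # drop code blocks
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)                # [label](url) -> label
    text = re.sub(r"https?://\S+", link_phrase, text)                   # bare URLs
    text = re.sub(r"^\s*#+\s*", "", text, flags=re.MULTILINE)           # heading markers
    text = re.sub(r"^\s*(?:[-*•]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"[*_`]", "", text)                                   # emphasis and inline code
    return re.sub(r"\s+", " ", text).strip()                            # collapse whitespace


raw = "**Your order** shipped.\n- See https://example.com/orders/9f3a for details"
clean = speakable(raw)
print(clean)
```

This does not spell out numbers or enforce turn length; those are still best handled in the prompt and by capping output tokens.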

None of that is exotic — it's just prompt engineering with a new set of constraints, and it's exactly the kind of thing you tune by listening, not reading. Which is the whole reason a voice playground exists: you say something awkward, hear the reply, and immediately know whether the length rule is working.

Amazon Nova Sonic and the speech-to-speech end of the spectrum

Once your agent logic is solid and you're chasing the last bit of naturalness, the speech-to-speech models get interesting. Amazon Nova Sonic is a good concrete example — a unified speech model on Amazon Bedrock that takes streamed audio in and emits streamed, expressive audio out over a single bidirectional connection. Crucially it hears tone, not just words, and can speak with matched prosody — the thing a cascaded pipeline loses when it flattens your voice into plain text in the middle.

Amazon Nova Sonic: unified speech model on Amazon Bedrock, one bidirectional stream

  • 🎧 Speech in. Streamed audio; Sonic hears words AND tone.
  • 🧠 Understand + reason. One model; no separate STT or LLM step.
  • 🔧 Tool use mid-stream. Can call your functions before it answers.
  • 🗣️ Expressive speech out. Streamed audio back, matched prosody.

The trade: lowest latency and natural prosody, at the cost of the readable transcript you get from a cascaded pipeline. Great for the last mile once your agent logic is solid.

Nova Sonic as one bidirectional stream: audio in, understanding + reasoning + optional tool calls, expressive audio out — no text checkpoint you can log in the middle. That's the trade for the latency and the prosody.

The API shape is different from a normal request/response call — it's a persistent bidirectional stream you push audio chunks into and read audio events out of, rather than “send prompt, await reply.” In rough sketch (this is illustrative, not copy-paste production code):

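import asyncio
import boto3

# Illustrative only. The shape of the stream object (.input / .output) and the
# helpers audio_event(), tool_result(), run_tool(), mic and speaker are
# hypothetical stand-ins, not a documented SDK surface. Confirm which SDK and
# client expose the bidirectional-stream operation before building on this.
br = boto3.client("bedrock-runtime")

# One long-lived bidirectional stream, not a request/response call.
stream = br.invoke_model_with_bidirectional_stream(
    modelId="amazon.nova-sonic-v1:0",
)

# Push mic audio in as it arrives (e.g. 16kHz PCM frames from the caller)…
async def send_mic(stream, mic):
    async for frame in mic:                       # streamed, not buffered
        await stream.input.send(audio_event(frame))

# …and read model events out concurrently: audio to play, plus tool calls.
async def play_replies(stream):
    async for event in stream.output:
        if event.type == "audioOutput":
            speaker.play(event.audio)             # start playing the FIRST chunk
        elif event.type == "toolUse":             # yes, it can call your tools
            result = run_tool(event.name, event.input)
            await stream.input.send(tool_result(event.id, result))

# Run both directions at once; neither coroutine waits for the other to finish.
async def run_call(stream, mic):
    await asyncio.gather(send_mic(stream, mic), play_replies(stream))

# Barge-in (not shown): when VAD hears the caller start, stop playback + signal the model.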

Notice two things. First, it still calls tools — a speech-to-speech model isn't a dead end for agentic behavior, it just wires tool use into the stream instead of a text loop. Second, everything is concurrent: you're sending and receiving audio at the same time, which is precisely what makes barge-in and low latency possible. If you're already deploying agents on Bedrock, this slots into the same world as the rest of your stack — we walk through that multi-cloud picture in deploying agents across Bedrock, Azure and GCP, and the human-approval side in HITL with LangGraph and Bedrock AgentCore.

So which do I actually reach for?

Start cascaded — you get transparency, easy tool/RAG/guardrail integration, and the freedom to mix vendors, which is what you want while your agent logic is still moving. Reach for a speech-to-speech model like Nova Sonic when the agent is stable and sub-500ms feel or natural prosody is the product (think consumer companions, high-volume phone support). Plenty of teams run both: cascaded for the complex, tool-heavy flows, speech-to-speech for the chatty front door.

Best practices that survive a live call

A voice demo needs a good model. A voice agent needs the boring stuff around it. Each item below is part of the difference between something that wows in a meeting and something that holds up when a stranger calls it. A demo that skips any of them is the one that falls apart on a live call.

None of these are optional in production. The prompt keeps it listenable, streaming keeps it fast, barge-in keeps it human, confirmation keeps it correct, guardrails + handoff keep it safe, and logging keeps it improvable.

  • Always design a human handoff. A voice agent that can't escalate to a person is a trap for the caller. Make the escape hatch explicit and easy.
  • Meter cost differently. Speech services often bill per minute of audio or per character synthesized rather than per text token, so a hundred concurrent calls is a very different bill and infrastructure profile than a hundred chat sessions.
  • Mind consent and PII. Voice is biometric data in some jurisdictions and call recording is regulated; spoken PII needs the same guardrails as typed PII.
  • Log the whole turn. Transcript, per-stage latency, and where barge-ins happened. “It felt slow” is only fixable if you can see which stage blew the budget.
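As a rough sketch of what "log the whole turn" could mean in practice, the record below keeps the transcript, the reply, per-stage latency, and the barge-in point together, then reports which stage blew the budget. The `TurnLog` class, its field names, and the sample numbers are hypothetical; adapt them to whatever your logging stack expects.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, Optional


@dataclass
class TurnLog:
    """One conversational turn, with enough detail to debug 'it felt slow'."""
    turn_id: int
    transcript: str                      # what STT heard
    reply_text: str                      # what the agent said
    stage_ms: Dict[str, int] = field(default_factory=dict)  # latency per stage
    barge_in_at_ms: Optional[int] = None # when the caller interrupted, if at all

    def total_ms(self) -> int:
        return sum(self.stage_ms.values())

    def slowest_stage(self) -> str:
        """The stage to look at first when the turn is over budget."""
        return max(self.stage_ms, key=self.stage_ms.get)

    def to_json(self) -> str:
        return json.dumps(asdict(self))


# Made-up turn where the LLM stage is the problem.
log = TurnLog(
    turn_id=1,
    transcript="where is my order",
    reply_text="It shipped Tuesday.",
    stage_ms={"vad": 120, "stt": 90, "llm": 600, "tts": 140, "network": 110},
    barge_in_at_ms=None,
)
print(log.total_ms(), "ms total; slowest stage:", log.slowest_stage())
print(log.to_json())
```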

Building one you can talk to, today

This is exactly how voice works in AgentSwarms, and the design follows the thesis of this whole post: it runs the cascaded loop for you, so a voice agent is just an ordinary agent with audio on the ends. In the Agent Builder, click New Voice Agent — you get the full builder plus a dedicated Voice tab for the voice, greeting, and speech models. Every spoken reply routes through the same engine as a chat agent, so your tools, RAG, memory, and guardrails apply unchanged. Nothing about your agent has to be rebuilt to give it a voice.

Then go listen to it. The Voice Playground lets you tap the mic and talk to your agent — or to a ready-made sample (a support-triage agent, a B2B discovery caller, a language tutor) that you can fork in one click. That listen-and-tune loop is where the prompting-for-the-ear rules above stop being abstract. When you're ready for the production and cloud-architecture depth — telephony, the full latency breakdown, and reference stacks for AWS, GCP and Azure — the Voice Agents deep-dive lesson and the voice docs go the rest of the way.

The trap at the start of this post — the agent that couldn't be interrupted — wasn't a model problem. The model was fine. It was everything around the model: no barge-in, no streaming, a chat prompt reading paragraphs at a person. Get those right and the magic that lasted four seconds lasts the whole call. Build the agent well, then give it a voice — in that order.

