Voice agents
A voice agent is an ordinary AgentSwarms agent wrapped in a speech loop: your microphone is transcribed to text, the agent replies with its full brain — tools, knowledge, memory, guardrails — and the reply is spoken back in a natural voice. This page covers how it works on the platform, how to build one, and how to take one to production on any major cloud.
How voice works in AgentSwarms
AgentSwarms uses a cascaded pipeline: three streaming stages, speech-to-text (STT), the LLM, and text-to-speech (TTS), run in sequence and repeat once per conversational turn. This design is transparent and debuggable, and it means a voice agent reuses the exact same agent engine as a chat agent; the audio is simply attached at both ends.
┌─────────┐   audio   ┌──────────────┐   text    ┌───────────────────┐
│   You   │ ────────▶ │  Speech →    │ ────────▶ │  The agent (LLM)  │
│  speak  │           │  Text (STT)  │  message  │  tools · RAG ·    │
└─────────┘           └──────────────┘           │  memory · guards  │
     ▲                                           └─────────┬─────────┘
     │   audio played back  ┌──────────────┐     reply     │ text
     └───────────────────── │  Text →      │ ◀─────────────┘
                            │  Speech (TTS)│
                            └──────────────┘

The alternative is a single multimodal speech-to-speech model that ingests and emits audio directly (OpenAI Realtime, Gemini Live). It gives the lowest latency and most natural prosody, at the cost of debuggability and tool flexibility. AgentSwarms runs the cascaded loop so your tools, retrieval, and guardrails stay in the middle where you can inspect them; the production section below covers when to reach for realtime instead.
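As a rough illustration of the cascaded loop described above, the sketch below wires three pluggable stages into a single conversational turn. The class name `CascadedVoiceLoop`, the stage signatures, and the stub functions are assumptions made for this example, not the AgentSwarms API; a real deployment would wrap streaming STT, LLM, and TTS clients.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stage signatures, chosen only for this sketch.
Transcriber = Callable[[bytes], str]                 # audio in -> transcript
Responder = Callable[[List[Tuple[str, str]]], str]   # chat history -> reply text
Synthesizer = Callable[[str], bytes]                 # reply text -> audio out


@dataclass
class CascadedVoiceLoop:
    stt: Transcriber
    llm: Responder
    tts: Synthesizer
    history: List[Tuple[str, str]] = field(default_factory=list)

    def turn(self, audio_in: bytes) -> bytes:
        """Run one conversational turn: STT, then the LLM, then TTS."""
        transcript = self.stt(audio_in)
        self.history.append(("user", transcript))
        # The middle of the loop is plain text, so it stays easy to inspect.
        reply = self.llm(self.history)
        self.history.append(("assistant", reply))
        return self.tts(reply)


# Stub stages so the sketch runs without any speech service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(history: List[Tuple[str, str]]) -> str:
    return "You said: " + history[-1][1]


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


loop = CascadedVoiceLoop(stt=fake_stt, llm=fake_llm, tts=fake_tts)
audio_out = loop.turn(b"hello")
```

Because each stage is an ordinary function of text or bytes, any one of them can be swapped or logged without touching the other two, which is the debuggability argument for the cascade.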
Creating a voice agent
On the Agents page, click New Voice Agent (the mic button next to New Agent). You get the full agent builder plus a dedicated Voice tab. The Knowledge, Tools, and Guardrails tabs are identical to a normal agent — anything you set there is enforced on every spoken reply.
- Opening greeting: The first line the agent speaks. Because browsers block audio autoplay until the user interacts, the Voice Playground plays it on a tap (a “Play greeting” button) rather than automatically.
- System prompt: The persona and rules. Write for the ear: short replies, no markdown or lists, and paraphrased long IDs and URLs. A voice-channel instruction (be concise, no markdown) is appended automatically at runtime.
- Voice: One of 11 natural voices (alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse).
- Chat model: The LLM that generates the reply, e.g. google/gemini-3-flash-preview (fast, default) or a stronger model for harder conversations.
- Text-to-speech model: gpt-4o-mini-tts (fast, default) or gpt-4o-tts (higher fidelity).
- Speech-to-text model: gpt-4o-mini-transcribe (default) or gpt-4o-transcribe (higher accuracy).
- Temperature / max tokens: Keep temperature moderate (≈0.5) and max tokens low (≈512). Spoken turns should be brief; brevity is a feature.
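To make the Voice tab settings above concrete, here is a minimal sketch of how they might be held and sanity-checked in code. The `VoiceAgentConfig` class, its field names, and the `problems` method are hypothetical and do not reflect the platform's actual schema; the model names, voice list, and suggested defaults are taken from the list above, except the default voice, which is an assumption because the page does not name one.

```python
from dataclasses import dataclass
from typing import List

# Values listed on this page.
VOICES = ("alloy", "ash", "ballad", "coral", "echo", "fable",
          "nova", "onyx", "sage", "shimmer", "verse")
TTS_MODELS = ("gpt-4o-mini-tts", "gpt-4o-tts")
STT_MODELS = ("gpt-4o-mini-transcribe", "gpt-4o-transcribe")


@dataclass
class VoiceAgentConfig:
    """Hypothetical container for the Voice tab fields (not the real schema)."""
    opening_greeting: str
    system_prompt: str
    voice: str = "alloy"  # assumption: the page does not state a default voice
    chat_model: str = "google/gemini-3-flash-preview"
    tts_model: str = "gpt-4o-mini-tts"
    stt_model: str = "gpt-4o-mini-transcribe"
    temperature: float = 0.5
    max_tokens: int = 512

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means nothing was flagged."""
        issues = []
        if self.voice not in VOICES:
            issues.append(f"unknown voice: {self.voice}")
        if self.tts_model not in TTS_MODELS:
            issues.append(f"unknown TTS model: {self.tts_model}")
        if self.stt_model not in STT_MODELS:
            issues.append(f"unknown STT model: {self.stt_model}")
        return issues


config = VoiceAgentConfig(
    opening_greeting="Hi, how can I help?",
    system_prompt="You are a concise support agent. Keep replies short.",
)
```

The chat model is left unchecked on purpose, since the page allows any stronger model in that slot.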
The Voice Playground
Open a voice agent from its Talk button, or go to Voice Playground in the sidebar. Tap the mic, speak, and tap the square when you're done. The status line walks the loop: Listening → Transcribing and thinking → Speaking. Each turn also appears as a text bubble so you can read the transcript.
- Samples — the playground ships ready-to-talk voice agents (a support-triage agent, a B2B discovery caller, a Spanish tutor). Open one and talk to it immediately, no setup.
- Fork — “Fork to my agents” copies a sample into your workspace as a real, editable agent so you can adapt its prompt, voice, and knowledge.
- Your voice agents — any agent you created with a voice config appears in the playground's gallery alongside the samples.
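The status line described in this section walks a fixed sequence of states on every turn. One way to picture it is as a tiny state machine, sketched below. The `Status` enum, the `IDLE` state, and the `advance` helper are illustrative assumptions; the page only names the three visible labels.

```python
from enum import Enum
from typing import List


class Status(Enum):
    IDLE = "Idle"  # assumption: the page does not name the resting state
    LISTENING = "Listening"
    THINKING = "Transcribing and thinking"
    SPEAKING = "Speaking"


# Each state has exactly one successor within a turn.
_NEXT = {
    Status.IDLE: Status.LISTENING,      # user taps the mic
    Status.LISTENING: Status.THINKING,  # user taps the square
    Status.THINKING: Status.SPEAKING,   # reply text is ready for playback
    Status.SPEAKING: Status.IDLE,       # playback finishes
}


def advance(status: Status) -> Status:
    """Move to the next state in the turn."""
    return _NEXT[status]


def walk_turn() -> List[str]:
    """Collect the labels shown on the status line during one turn."""
    status = Status.IDLE
    labels = []
    for _ in range(3):
        status = advance(status)
        labels.append(status.value)
    return labels
```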
Tools, knowledge & guardrails carry over
Voice is a channel, not a downgrade. Every spoken reply from a saved agent routes through the same chat engine as text, so a linked knowledge base (RAG), enabled tools (web search, calculator, SQL, MCP…), long-term memory, and guardrails all apply unchanged. A guardrail that blocks an unsafe answer blocks it before it is ever spoken.
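The ordering claim above, that a blocked answer is never spoken, can be sketched as a guardrail pass that runs before synthesis. Everything named here (`speak_reply`, the `Guardrail` signature, the card-number check, and the fallback sentence) is a hypothetical illustration; how AgentSwarms phrases a blocked reply is not specified on this page.

```python
import re
from typing import Callable, List, Optional

# A guardrail returns a reason string when it blocks, otherwise None.
Guardrail = Callable[[str], Optional[str]]


def speak_reply(
    reply: str,
    guardrails: List[Guardrail],
    synthesize: Callable[[str], bytes],
    fallback: str = "Sorry, I can't help with that.",  # assumed wording
) -> bytes:
    """Check the reply text first; only text that passes reaches TTS."""
    for check in guardrails:
        if check(reply) is not None:
            return synthesize(fallback)
    return synthesize(reply)


def no_card_numbers(reply: str) -> Optional[str]:
    """Example guardrail: block any standalone 16-digit number."""
    return "card number" if re.search(r"\b\d{16}\b", reply) else None


spoken: List[str] = []


def recording_tts(text: str) -> bytes:
    """Stub synthesizer that records what would have been spoken."""
    spoken.append(text)
    return text.encode("utf-8")


speak_reply("Your card is 4111111111111111.", [no_card_numbers], recording_tts)
speak_reply("Your order shipped today.", [no_card_numbers], recording_tts)
```

The stub synthesizer records every string it receives, so the blocked reply can be shown never to have reached the speech stage.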
Beyond the Playground
The Voice Playground is a prototyping surface. Turning a voice agent into a product that answers a real phone number and handles many concurrent callers means adding a telephony layer and a strict latency budget on top of the same STT→LLM→TTS loop — streaming transcription, voice-activity detection, and barge-in so callers can interrupt. That's a deeper topic than this page, and it lives in the curriculum.
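To give a feel for the voice-activity detection and barge-in mentioned above, here is a deliberately simple energy-based sketch over 16-bit PCM frames. The function names, the threshold of 500, and the three-frame minimum are illustrative assumptions, not tuned values; production systems typically use trained VAD models rather than a raw energy threshold.

```python
import array
import math
from typing import Iterable


def frame_rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a frame of native-endian 16-bit samples."""
    samples = array.array("h")
    samples.frombytes(frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def should_barge_in(
    frames: Iterable[bytes],
    threshold: float = 500.0,  # illustrative value, not a tuned setting
    min_voiced: int = 3,       # consecutive loud frames needed to interrupt
) -> bool:
    """Return True once enough consecutive frames exceed the energy threshold.

    Intended to run on microphone frames while the agent is speaking, so that
    a caller who starts talking can cut the playback short.
    """
    run = 0
    for frame in frames:
        run = run + 1 if frame_rms(frame) >= threshold else 0
        if run >= min_voiced:
            return True
    return False


# Synthetic frames: 160 samples each, constant amplitude.
quiet = array.array("h", [10] * 160).tobytes()
loud = array.array("h", [4000] * 160).tobytes()
```

Requiring several consecutive voiced frames keeps a single cough or click from interrupting the agent, at the price of a slightly slower reaction to real speech.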