We built SARA for an ops team running 50–100 voice commands a day. Their first question was predictable: “How much will this cost per month?”
Our answer: $800–$1,200/month, depending on volume. Their follow-up: “Could we do it cheaper?” Yes. And we tried. The $350/month version used Google TTS for the response voice. In beta testing with five users, three said they’d rather type. The main reason wasn’t subtle: it sounded like a phone tree.
That’s the voice AI cost problem in one sentence. The components that reduce the price are the components that make the agent sound like it was built in 2012.
This post breaks down the actual cost stack across four layers, explains which specific engineering choices produce robotic-sounding agents, and gives you a decision framework for SaaS platforms versus custom builds.
Three Types of Voice AI Agents (Category Determines Cost)
“Voice AI agent” means different things depending on who’s asking. The cost structures diverge significantly.
Conversational agents handle multi-turn dialogue: booking a table, answering FAQs, triaging a support ticket. Latency requirements are relaxed. Three- to four-second responses are fine because the user is already waiting for the next question. Bland.ai, Vapi.ai, and Retell.ai are SaaS platforms designed for this category. They’re the right starting point for most teams.
Speech-to-action agents like SARA handle single-command execution: “Add a note to Johnson account: invoice approved.” No conversation, no clarifying questions. The user gives a command and expects an action in under 2 seconds. SaaS platforms handle simple versions. Complex multi-system integrations almost always require a custom build. The full build story is in how we built SARA.
Voice analytics agents (transcribing and scoring calls) don’t need TTS at all. They’re pure input pipelines that pay for STT and LLM only, with no output voice layer. The cost structure is completely different.
This post covers conversational and speech-to-action agents, since those are what most founders are actually evaluating.
The Cost Stack: Four Layers
Layer 1: STT (Speech Recognition)
This converts audio to text. Two dimensions matter: accuracy (word error rate) and latency (time to first token).
| Provider | Per Minute | Latency | Notes |
|---|---|---|---|
| Deepgram Nova-2 | $0.0043 | 150–280ms streaming | Our default. Lowest latency in this comparison. |
| Google Cloud STT | $0.016 | 350–500ms | Higher latency, comparable accuracy on standard English. |
| AWS Transcribe | $0.024 | 400–600ms | Good for AWS-native stacks. |
| OpenAI Whisper (self-hosted) | Infra cost only | Variable | Works, but streaming requires custom setup. GPU adds $300–600/mo. |
At 1,000 interactions per day, assuming 30 seconds of audio per interaction:
- Deepgram: ~$65/month
- Google Cloud: ~$240/month
- AWS Transcribe: ~$360/month
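The arithmetic behind those three figures, as a minimal sketch. The function name and the 30-day month are assumptions for illustration; the per-minute rates are the ones in the table above.

```python
def monthly_stt_cost(rate_per_minute: float,
                     interactions_per_day: int = 1_000,
                     seconds_per_interaction: float = 30.0,
                     days_per_month: int = 30) -> float:
    """Monthly STT spend in dollars for a provider billed per audio minute."""
    minutes_per_month = (
        interactions_per_day * seconds_per_interaction / 60 * days_per_month
    )
    return rate_per_minute * minutes_per_month


# Per-minute rates copied from the comparison table.
STT_RATES = {
    "Deepgram Nova-2": 0.0043,
    "Google Cloud STT": 0.016,
    "AWS Transcribe": 0.024,
}

for provider, rate in STT_RATES.items():
    print(f"{provider}: ${monthly_stt_cost(rate):,.2f}/month")
```

At these assumptions the audio volume is 15,000 minutes a month, so every $0.001 of per-minute price is $15/month.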
Deepgram is cheaper and faster. This is the rare case where the budget choice and the quality choice converge. The Deepgram streaming API lets you start processing partial transcripts before the user finishes speaking. This is critical for sub-2s latency. We covered the Deepgram pipeline setup in more detail in our Deepgram Python walkthrough.
The STT layer isn’t where the robotic problem lives.
Layer 2: Intent Engine (LLM)
This processes the transcript and decides what action to take. For speech-to-action systems, you need structured output (JSON action parameters, not free text). For conversational systems, you need more flexible generation.
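To make “structured output” concrete, here is a rough sketch of validating a model reply before anything acts on it. It uses only the standard library. The `Intent` fields, the action names, and `parse_intent` are hypothetical and are not SARA’s actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names, for illustration only.
ALLOWED_ACTIONS = {"add_note", "mark_contacted"}


@dataclass(frozen=True)
class Intent:
    action: str
    account: str
    note: Optional[str]
    confidence: float


def parse_intent(raw: str) -> Intent:
    """Validate the model's JSON reply; raise ValueError on anything off-schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")

    account = data.get("account")
    if not isinstance(account, str) or not account:
        raise ValueError("missing account")

    note = data.get("note")
    if note is not None and not isinstance(note, str):
        raise ValueError("note must be a string or null")

    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")

    return Intent(action, account, note, float(confidence))


reply = ('{"action": "add_note", "account": "Johnson", '
         '"note": "invoice approved", "confidence": 0.94}')
intent = parse_intent(reply)
print(intent.action, intent.account, intent.note)
```

A free-text reply such as “Sure, adding that note now” fails at the first step, which is the point: a speech-to-action system should refuse to act on anything it cannot parse into parameters.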
| Model | Per 1M Input Tokens | Per 1M Output Tokens | Notes |
|---|---|---|---|
| GPT-4o-mini | $0.15 | $0.60 | Our default. Best cost-performance for structured intent. |
| Claude 3 Haiku | $0.25 | $1.25 | Good alternative. Slightly better at certain intent schemas. |
| GPT-4o | $2.50 | $10.00 | Only needed for complex reasoning in the intent step. |
At 1,000 interactions per day with ~500 input tokens and ~200 output tokens each:
- GPT-4o-mini: ~$9/month
- GPT-4o: ~$90/month
The intent layer is inexpensive either way. Not where the robotic problem lives, either.
Layer 3: TTS (Text-to-Speech)
This is where the robotic problem lives.
| Provider | Per 1K Characters | Quality | Notes |
|---|---|---|---|
| ElevenLabs TurboV2.5 | $0.18 | Natural, near-human | Our default for SARA. |
| PlayHT | $0.12 | Very good | Strong competitor to ElevenLabs. |
| OpenAI TTS-1 | $0.015 | Good | Better than Google, worse than ElevenLabs. 12× cheaper than ElevenLabs. |
| Azure Neural TTS | $0.016 | Decent | Comparable to OpenAI TTS-1. |
| Google Cloud TTS | $0.004 | Robotic | Accurate but flat. “The 2015 robot,” in one user’s words. |
At 1,000 interactions per day, assuming 100-character responses:
- ElevenLabs TurboV2.5: ~$540/month
- OpenAI TTS-1: ~$45/month
- Google Cloud TTS: ~$12/month
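TTS is billed per character, so response length moves the bill as much as volume does. A small sketch under the same assumptions as the figures above (30-day month, rates from the table). The function name and the 200-character scenario are illustrative additions.

```python
def monthly_tts_cost(rate_per_1k_chars: float,
                     interactions_per_day: int = 1_000,
                     chars_per_response: int = 100,
                     days_per_month: int = 30) -> float:
    """Monthly TTS spend in dollars for a provider billed per 1K characters."""
    chars_per_month = interactions_per_day * chars_per_response * days_per_month
    return rate_per_1k_chars * chars_per_month / 1_000


ELEVENLABS_RATE = 0.18  # $/1K chars, from the table
GOOGLE_RATE = 0.004     # $/1K chars, from the table

# Compare the 100-character baseline with responses twice as long.
for chars in (100, 200):
    elevenlabs = monthly_tts_cost(ELEVENLABS_RATE, chars_per_response=chars)
    google = monthly_tts_cost(GOOGLE_RATE, chars_per_response=chars)
    print(f"{chars}-char responses: ElevenLabs ${elevenlabs:,.0f}, "
          f"Google ${google:,.0f}, gap ${elevenlabs - google:,.0f}")
```

Doubling the response length doubles every number, including the gap, so keeping confirmations short is the cheapest TTS optimization available.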
The gap between ElevenLabs and Google is $528/month. It’s almost entirely responsible for whether your agent sounds human or like a phone tree. The ElevenLabs TTS documentation covers latency benchmarks for their Turbo models. TurboV2.5 adds roughly 80–120ms to a response versus their standard model, which is an acceptable trade for streaming latency.
Layer 4: Infrastructure
WebSocket server, state management, audio streaming.
| Setup | Monthly Cost | Notes |
|---|---|---|
| Fly.io 2×CPU/2GB | ~$35 | Works for low volume. |
| Fly.io 4×CPU/8GB | ~$100 | Our SARA production setup. |
| Redis (session state) | $20–60 | Context, dedup, rate limiting. |
| Audio CDN / storage | $10–30 | If you’re recording for compliance. |
For most builds: $100–200/month in infrastructure.
Full Monthly Cost at 1,000 Interactions Per Day
30K interactions/month, 30-second audio inputs, 100-character responses:
| Component | Budget Build | Budget Cost/Mo | Production Build | Production Cost/Mo |
|---|---|---|---|---|
| STT | Google Cloud STT | $240 | Deepgram Nova-2 | $65 |
| LLM | GPT-3.5-turbo | $6 | GPT-4o-mini | $9 |
| TTS | Google Cloud TTS | $12 | ElevenLabs TurboV2.5 | $540 |
| Infra | Shared server | $30 | Dedicated WS + Redis | $130 |
| Total | | $288 | | $744 |
The production build is 2.6× the cost. The difference is almost entirely TTS. And the production build sounds like a person. The budget build sounds robotic.
One counterintuitive observation: the budget build uses Google Cloud STT ($240/month), which is actually more expensive than Deepgram ($65/month) while being slower. If you’re doing a budget build, at least use Deepgram for STT. It’ll save money and reduce latency.
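The totals, the ratio, and the “budget build with Deepgram” variant can be checked directly from the table’s line items. The dictionary layout below is just an illustration; the dollar amounts are the ones in the table.

```python
# Monthly line items in dollars, copied from the table above.
budget = {"STT": 240, "LLM": 6, "TTS": 12, "Infra": 30}
production = {"STT": 65, "LLM": 9, "TTS": 540, "Infra": 130}

budget_total = sum(budget.values())
production_total = sum(production.values())
print(f"Budget: ${budget_total}/mo, production: ${production_total}/mo, "
      f"ratio: {production_total / budget_total:.1f}x")

# Same budget build, but with Deepgram swapped in for Google Cloud STT.
budget_with_deepgram = {**budget, "STT": production["STT"]}
swapped_total = sum(budget_with_deepgram.values())
print(f"Budget build with Deepgram STT: ${swapped_total}/mo "
      f"(saves ${budget_total - swapped_total}/mo)")
```

With the STT swap alone, the budget build drops from $288 to $113 a month, which leaves TTS as the only line item where spending more actually buys something.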
Five Reasons Your Voice Agent Sounds Robotic
These are specific engineering decisions, not bad luck.
1. Google/Azure TTS at Default Settings
Neural TTS models trained on human speech capture prosody (rhythm, stress, intonation) in ways that concatenative and older synthesis systems don’t. Google Cloud TTS and standard Azure TTS are accurate but flat. ElevenLabs, PlayHT, and (more recently) OpenAI TTS-1 capture the natural rise and fall of speech.
The difference is immediately apparent to any listener. You don’t need an A/B test to hear it.
If your voice agent uses Google or Azure TTS with default voice settings, this single choice accounts for roughly 60% of the robotic perception. Switching to OpenAI TTS-1 ($0.015/1K chars vs $0.004) improves quality noticeably at 3.75× the cost, and it’s still cheap. Switching to ElevenLabs TurboV2.5 at $0.18/1K chars gives the best available quality at 45× the Google cost.
2. Batch STT Instead of Streaming STT
Most developers start with batch STT: record the full utterance, send to the API, get the transcript, process. Easier to build. Also slower by 1–2 seconds.
Streaming STT (Deepgram’s real-time endpoint sends partial transcripts as the user speaks) cuts transcription latency to under 300ms. Time-to-response drops from 3–5 seconds to under 1.5 seconds. Anything over 2.5 seconds breaks the interactive feel. This isn’t a gradual perception problem; it’s a threshold effect. Under 2.5s feels like talking to someone. Over 2.5s feels like waiting on hold.
We switched SARA from batch to streaming midway through development. The refactor cost two sprint days and we’d have saved both if we’d started with streaming. The user feedback difference was immediate.
3. No Turn-Taking Design
When should your agent stop listening and start processing? Most basic implementations use voice activity detection (VAD) with a fixed end-of-speech threshold: 500ms of silence equals done talking. This works about 70% of the time. The other 30%, it either cuts off the user mid-sentence or waits too long after they’ve finished.
Production voice agents need configurable VAD: shorter silence threshold for command-style interactions (200ms), longer for conversational (600ms). And they need early termination when intent is clear: if the transcript already matches a high-confidence intent before the sentence ends, there’s no reason to wait.
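A minimal sketch of that end-of-turn decision. The 200ms and 600ms silence thresholds come from the paragraph above. The 0.9 early-exit confidence and all the names are assumptions for illustration, not values we have tuned.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TurnConfig:
    silence_threshold_ms: int     # silence needed before we call the turn over
    early_exit_confidence: float  # intent confidence that ends the turn early


# Thresholds from the text; the 0.9 early-exit value is an assumed placeholder.
COMMAND = TurnConfig(silence_threshold_ms=200, early_exit_confidence=0.9)
CONVERSATION = TurnConfig(silence_threshold_ms=600, early_exit_confidence=0.9)


def should_end_turn(silence_ms: int,
                    intent_confidence: Optional[float],
                    config: TurnConfig) -> bool:
    """Decide whether to stop listening and start processing."""
    # Early termination: the partial transcript already matches an intent.
    if (intent_confidence is not None
            and intent_confidence >= config.early_exit_confidence):
        return True
    # Otherwise fall back to the silence threshold for this interaction style.
    return silence_ms >= config.silence_threshold_ms


print(should_end_turn(250, None, COMMAND))       # pause long enough for a command
print(should_end_turn(250, None, CONVERSATION))  # same pause, conversational mode
print(should_end_turn(50, 0.95, COMMAND))        # intent already clear
```

The same 250ms pause ends a command but not a conversational turn, and a clear intent ends the turn without waiting for silence at all.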
We still don’t have a fully satisfying answer for ambiguous mid-sentence pauses. VAD remains one of the harder unsolved pieces.
4. Stateless Error Recovery
“I didn’t catch that. Could you repeat your request?” is a tone-breaker. It announces the system has failed and reveals that no partial understanding was captured.
Better patterns: confirm partial intent. “I heard ‘add note to Johnson.’ What note should I add?” Or, for very low-confidence transcripts, “Sorry, I lost you there. Go ahead.” Both recover without announcing a system failure. Both require keeping the partial intent from the transcript and gating on a confidence threshold, rather than passing raw transcripts straight to the LLM.
The difference between these two recovery paths is roughly 8 lines of code and a significant improvement in perceived intelligence.
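Roughly what that gate can look like. The 0.82 action threshold is the one we mention later in this post. The 0.4 floor, the function name, and the `missing_slot` parameter are assumptions made up for this sketch.

```python
from typing import Optional

ACT_THRESHOLD = 0.82         # at or above this, act without asking
LOW_CONFIDENCE_FLOOR = 0.40  # assumed cutoff for "we understood almost nothing"


def recovery_prompt(confidence: float,
                    partial_intent: Optional[str],
                    missing_slot: Optional[str]) -> Optional[str]:
    """Return what the agent should say next, or None when it should just act."""
    if confidence >= ACT_THRESHOLD:
        return None
    if partial_intent and missing_slot and confidence >= LOW_CONFIDENCE_FLOOR:
        # Confirm what was understood and ask only for the missing piece.
        return f"I heard '{partial_intent}.' What {missing_slot} should I add?"
    return "Sorry, I lost you there. Go ahead."


print(recovery_prompt(0.60, "add note to Johnson", "note"))
print(recovery_prompt(0.20, None, None))
```

The function body is about eight lines. The real work is upstream, in keeping the partial intent and a confidence score around long enough to use them.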
5. No Session State
A voice agent with no memory of the current session resets after every command. The user says “add a note to Johnson account,” gets a confirmation. Then says “mark that account as contacted” and the agent has no idea which account, because it forgot the previous command.
Lightweight session state (Redis with a 10-minute TTL, keyed to session ID) costs nearly nothing and makes the agent feel significantly more intelligent. It’s not AI. It’s just memory. The absence of it is one of the most common reasons demo-quality voice agents fail in real usage.
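The behavior is easy to see with an in-memory stand-in for Redis. The class and method names here are hypothetical, and the dictionary replaces Redis only so the sketch runs without a server. In Redis the equivalent is a key per session ID written with a 600-second expiry.

```python
import time
from typing import Callable, Dict, Tuple

SESSION_TTL_SECONDS = 600  # the 10-minute TTL from the text


class SessionStore:
    """In-memory stand-in for Redis: one context dict per session ID, with a TTL."""

    def __init__(self, ttl_seconds: float = SESSION_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[str, Tuple[float, dict]] = {}

    def recall(self, session_id: str) -> dict:
        """Return the session context, or an empty dict if missing or expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return {}
        expires_at, context = entry
        if self._clock() >= expires_at:
            del self._data[session_id]
            return {}
        return dict(context)

    def remember(self, session_id: str, **updates) -> None:
        """Merge updates into the session context and refresh its TTL."""
        context = self.recall(session_id)
        context.update(updates)
        self._data[session_id] = (self._clock() + self._ttl, context)


store = SessionStore()
# After "add a note to Johnson account":
store.remember("session-123", last_account="Johnson")
# On "mark that account as contacted", resolve "that account" from memory:
print(store.recall("session-123").get("last_account"))
```

Every write refreshes the TTL, so an active session stays alive and an abandoned one disappears on its own ten minutes after the last command.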
The Adoption Threshold Matters More Than Per-Interaction Cost
We tried the budget build for SARA before committing to the production stack. The beta version used Google STT and Google TTS. Overall API cost: roughly $350/month.
In beta testing with five users over two weeks, three said they’d prefer to type. The specific complaint from two of them was the voice: “it sounds like a robot,” which is the Google TTS default telling you something. The third said the lag (batch STT, 4-second average response) was the problem.
If the ops team rejects the tool because of voice quality and latency, the monthly savings don’t matter. You’ve spent $15K on a build nobody uses.
We rebuilt with Deepgram Nova-2 and ElevenLabs TurboV2.5. Cost went to $1,100/month. Daily active usage after launch: all eight users, within the first week. They didn’t consciously notice the better voice. They just stopped thinking about the tool and started using it. That’s the adoption threshold. When the interaction costs users no cognitive overhead, they adopt.
When SaaS Platforms Win
Use Bland.ai, Vapi.ai, or Retell.ai when:
- You’re validating whether voice AI helps at all, before committing to a custom build
- The use case fits standard conversational flows (appointment booking, FAQ, simple intake)
- Volume is under 10,000 interactions per month
- You don’t need complex multi-system integrations
- Data residency isn’t a hard requirement
Build custom when:
- You need complex integrations (SARA needed to write to three internal tools simultaneously, with different auth schemes)
- Volume exceeds 30,000 interactions per month (at that point, SaaS per-minute pricing exceeds custom infra costs on most platforms)
- You have compliance requirements (some SaaS platforms route audio through US servers only, which matters for non-US regulated industries)
- The latency floor of SaaS platforms is too high for your specific use case
- You need custom VAD thresholds and turn-taking logic
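The two checklists collapse into a short rule: any one custom-build trigger is enough, and everything else starts on a SaaS platform. A sketch of that rule, with hypothetical field and function names; the 30,000 threshold is the one from the list above.

```python
from dataclasses import dataclass
from typing import List, Tuple

CUSTOM_VOLUME_THRESHOLD = 30_000  # interactions per month, from the list above


@dataclass(frozen=True)
class UseCase:
    interactions_per_month: int
    needs_multi_system_integration: bool = False
    has_compliance_or_residency_rules: bool = False
    saas_latency_too_high: bool = False
    needs_custom_turn_taking: bool = False


def recommend(use_case: UseCase) -> Tuple[str, List[str]]:
    """Return ("custom" or "saas", reasons). Any single trigger means custom."""
    reasons: List[str] = []
    if use_case.needs_multi_system_integration:
        reasons.append("complex multi-system integrations")
    if use_case.interactions_per_month > CUSTOM_VOLUME_THRESHOLD:
        reasons.append("volume above 30,000 interactions/month")
    if use_case.has_compliance_or_residency_rules:
        reasons.append("compliance or data residency requirements")
    if use_case.saas_latency_too_high:
        reasons.append("SaaS latency floor too high")
    if use_case.needs_custom_turn_taking:
        reasons.append("custom VAD and turn-taking logic")
    return ("custom", reasons) if reasons else ("saas", reasons)


# A low-volume booking flow versus something shaped like SARA.
print(recommend(UseCase(interactions_per_month=5_000)))
print(recommend(UseCase(interactions_per_month=3_000,
                        needs_multi_system_integration=True)))
```

The second case is low volume and still comes out custom, because integration complexity overrides volume.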
For SARA, the decision was mostly integration complexity. The client’s three internal tools didn’t have out-of-the-box support on any SaaS platform, and the action schema was complex enough that standard conversational flows wouldn’t handle it cleanly.
What We’d Build Today
Starting a voice agent from scratch in mid-2026:
STT: Deepgram Nova-2, streaming mode. Real-time WebSocket connection; partial transcripts feed the intent engine as they arrive.
Intent: GPT-4o-mini with Pydantic-structured output for action classification. Confidence threshold (we use 0.82) gates the error recovery path: below it, we ask for confirmation rather than acting.
TTS: ElevenLabs TurboV2.5 for production quality. OpenAI TTS-1 if budget is the main constraint (noticeable quality drop, but still meaningfully better than Google).
Infra: Fly.io WebSocket server, Redis for session state (10-minute TTL per session ID).
VAD: Deepgram’s built-in endpointing. Configurable thresholds and reasonable defaults. We set ours to 250ms for command-style interactions.
Estimated monthly cost at 500 interactions/day: $350–500/month. Estimated monthly cost at 5,000 interactions/day: $2,000–2,800/month.
One thing we’d do differently from the first SARA build: start with streaming STT from day one. We added it mid-development. The refactor cost two sprint days that didn’t need to happen.
FAQ
How much does a voice AI agent cost per month?
At 1,000 interactions per day, a budget build (Google Cloud STT + Google TTS) runs $250–$300/month in API costs, plus $100–150/month in infra. A production-quality build (Deepgram + ElevenLabs TurboV2.5) runs $700–$900/month. At 5,000 interactions per day, the production build runs $2,000–$2,800/month. The main cost variable is TTS, which scales linearly with interaction volume and response length.
What’s the difference between Bland.ai/Vapi.ai and a custom build?
SaaS platforms handle infrastructure and provide pre-built conversational flows. They’re faster to launch (days vs weeks) and work well for standard use cases at low volume. Custom builds make sense when you need complex multi-system integrations, have compliance or data residency requirements, or volume exceeds roughly 30,000 interactions per month, where SaaS per-minute pricing starts to exceed custom infra costs.
Why does my voice agent sound robotic?
Usually the TTS engine. Google Cloud TTS and basic Azure TTS produce accurate but flat speech. ElevenLabs TurboV2.5, PlayHT, and OpenAI TTS-1 capture natural prosody: the rhythm and stress patterns that make speech sound human. Switching from Google TTS to OpenAI TTS-1 (3.75× more expensive but still cheap at $0.015/1K chars) is a quick improvement. ElevenLabs at $0.18/1K chars gives the best available quality.
Can I use OpenAI TTS to save money versus ElevenLabs?
Yes, and it’s a reasonable trade-off for internal tools. OpenAI TTS-1 sounds noticeably better than Google/Azure and costs $0.015/1K chars versus ElevenLabs TurboV2.5 at $0.18/1K chars. For internal tools where users care more about speed and accuracy than naturalness, OpenAI TTS-1 works well. For customer-facing agents where voice quality affects whether users adopt the tool at all, ElevenLabs is usually worth the 12× premium.
What volume justifies building a custom voice agent instead of using a SaaS platform?
On volume alone, roughly 30,000 interactions per month, the point where SaaS per-minute pricing overtakes custom infra costs. If your use case also has integration complexity that SaaS platforms can’t handle off the shelf, the threshold drops to roughly 10,000–15,000. Below that, a SaaS platform is faster to launch and cheaper to operate. Integration complexity is usually more decisive than volume alone. If your use case requires multi-system writes that no SaaS platform supports, custom is the right call regardless of volume.
Evaluating whether a voice agent makes sense for your product or ops workflow? Book a 30-minute call and we’ll tell you honestly whether your use case needs a custom build or a SaaS platform will do the job.