May 17, 2026 · 13 min read

Voice AI Agents: What They Cost and Why They Sound Robotic

Voice AI agents cost $200–$2,000/month at 500–5K interactions/day. Here's what drives the range and why cheap builds sound robotic.

Anil Gulecha

Ex-HackerRank, Ex-Google

TL;DR

Voice AI agent costs break into four layers: STT (speech recognition), LLM intent engine, TTS (text-to-speech), and infra. The TTS choice alone accounts for 60–70% of the quality gap.
At 1,000 interactions per day, a budget build runs $250–$300/month. A production-quality build runs $700–$900/month. The difference is almost entirely in TTS and streaming STT choices.
We built SARA on Deepgram Nova-2 + ElevenLabs TurboV2.5 + GPT-4o-mini. The beta used Google STT + Google TTS and was rejected by 3 of 5 testers before launch.
SaaS platforms (Bland.ai, Vapi.ai, Retell.ai) work for simple conversational flows at low volume. Custom builds win for speech-to-action agents with complex multi-system integrations.
The adoption threshold matters more than per-interaction cost. A $350/month voice agent nobody uses is more expensive than an $850/month one with 80% daily active usage.

We built SARA for an ops team running 50–100 voice commands a day. Their first question was predictable: “How much will this cost per month?”

Our answer: $800–$1,200/month, depending on volume. Their follow-up: “Could we do it cheaper?” Yes. And we tried. The $350/month version used Google TTS for the response voice. In beta testing with five users, three said they’d rather type. They didn’t explain why. They didn’t have to. It sounded like a phone tree.

That’s the voice AI cost problem in one sentence. The components that reduce the price are the components that make the agent sound like it was built in 2012.

This post breaks down the actual cost stack across four layers, explains which specific engineering choices produce robotic-sounding agents, and gives you a decision framework for SaaS platforms versus custom builds.

Three Types of Voice AI Agents (Category Determines Cost)

“Voice AI agent” means different things depending on who’s asking. The cost structures diverge significantly.

Conversational agents handle multi-turn dialogue: booking a table, answering FAQs, triaging a support ticket. Latency requirements are relaxed. Three- to four-second responses are fine because the user is already waiting for the next question. Bland.ai, Vapi.ai, and Retell.ai are SaaS platforms designed for this category. They’re the right starting point for most teams.

Speech-to-action agents like SARA handle single-command execution: “Add a note to Johnson account: invoice approved.” No conversation, no clarifying questions. The user gives a command and expects an action in under 2 seconds. SaaS platforms handle simple versions. Complex multi-system integrations almost always require a custom build. The full build story is in how we built SARA.

Voice analytics agents (transcribing and scoring calls) don’t need TTS at all. They’re pure input pipelines that pay for STT and LLM only, with no output voice layer. The cost structure is completely different.

This post covers conversational and speech-to-action agents, since those are what most founders are actually evaluating.

The Cost Stack: Four Layers

Layer 1: STT (Speech Recognition)

This converts audio to text. Two dimensions matter: accuracy (word error rate) and latency (time to first token).

Provider	Per Minute	Latency	Notes
Deepgram Nova-2	$0.0043	150–280ms streaming	Our default. Lowest latency in this comparison.
Google Cloud STT	$0.016	350–500ms	Higher latency, comparable accuracy on standard English.
AWS Transcribe	$0.024	400–600ms	Good for AWS-native stacks.
OpenAI Whisper (self-hosted)	Infra cost only	Variable	Works, but streaming requires custom setup. GPU adds $300–600/mo.

At 1,000 interactions per day, assuming 30 seconds of audio per interaction:

Deepgram: ~$65/month
Google Cloud: ~$240/month
AWS Transcribe: ~$360/month
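The arithmetic behind those three figures is short enough to sketch. The rates come from the table above; the 30-day month matches the 30K interactions/month used later in this post, and the function name is purely illustrative.

```python
# Per-minute STT rates (USD) from the comparison table.
STT_RATE_PER_MIN = {
    "Deepgram Nova-2": 0.0043,
    "Google Cloud STT": 0.016,
    "AWS Transcribe": 0.024,
}


def monthly_stt_cost(rate_per_min: float,
                     interactions_per_day: int = 1_000,
                     seconds_per_interaction: float = 30.0,
                     days_per_month: int = 30) -> float:
    """Monthly STT spend for a per-minute rate and a usage profile."""
    minutes = interactions_per_day * days_per_month * seconds_per_interaction / 60
    return rate_per_min * minutes


for provider, rate in STT_RATE_PER_MIN.items():
    print(f"{provider}: ${monthly_stt_cost(rate):,.2f}/month")
```

At the default profile this is 15,000 audio minutes a month. Changing `seconds_per_interaction` is the quickest way to see how longer utterances move the bill.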

Deepgram is cheaper and faster. This is the rare case where the budget choice and the quality choice converge. The Deepgram streaming API lets you start processing partial transcripts before the user finishes speaking. This is critical for sub-2s latency. We covered the Deepgram pipeline setup in more detail in our Deepgram Python walkthrough.

The STT layer isn’t where the robotic problem lives.

Layer 2: Intent Engine (LLM)

This processes the transcript and decides what action to take. For speech-to-action systems, you need structured output (JSON action parameters, not free text). For conversational systems, you need more flexible generation.
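To make "structured output" concrete, here is a minimal sketch of validating a model reply before anything acts on it. It uses only the standard library (dataclasses and `json`) so it runs without dependencies. The action names, field names, and the `parse_intent` helper are hypothetical and are not SARA's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names for illustration only.
ALLOWED_ACTIONS = {"add_note", "mark_contacted"}


@dataclass(frozen=True)
class Intent:
    action: str
    account: str
    confidence: float
    note: Optional[str] = None


def parse_intent(raw: str) -> Intent:
    """Validate the model's JSON reply; raise ValueError on anything unexpected."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    account = data.get("account")
    if not isinstance(account, str) or not account:
        raise ValueError("missing account")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    note = data.get("note")
    if note is not None and not isinstance(note, str):
        raise ValueError("note must be a string")
    return Intent(action, account, float(confidence), note)


reply = ('{"action": "add_note", "account": "Johnson", '
         '"note": "invoice approved", "confidence": 0.93}')
print(parse_intent(reply))
```

Free-text replies fail at `json.loads`, and unknown actions fail at the allow-list check, so downstream code only ever sees a typed `Intent`.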

Model	Per 1M Input Tokens	Per 1M Output Tokens	Notes
GPT-4o-mini	$0.15	$0.60	Our default. Best cost-performance for structured intent.
Claude 3 Haiku	$0.25	$1.25	Good alternative. Slightly better at certain intent schemas.
GPT-4o	$2.50	$10.00	Only needed for complex reasoning in the intent step.

At 1,000 interactions per day with ~500 input tokens and ~200 output tokens each:

GPT-4o-mini: ~$6–9/month
GPT-4o: ~$90–100/month

The intent layer is inexpensive either way. Not where the robotic problem lives, either.

Layer 3: TTS (Text-to-Speech)

This is where the robotic problem lives.

Provider	Per 1K Characters	Quality	Notes
ElevenLabs TurboV2.5	$0.18	Natural, near-human	Our default for SARA.
PlayHT	$0.12	Very good	Strong competitor to ElevenLabs.
OpenAI TTS-1	$0.015	Good	Better than Google, worse than ElevenLabs. 12× cheaper than ElevenLabs.
Azure Neural TTS	$0.016	Decent	Comparable to OpenAI TTS-1.
Google Cloud TTS	$0.004	Robotic	Accurate but flat. “The 2015 robot,” in one user’s words.

At 1,000 interactions per day, assuming 100-character responses:

ElevenLabs TurboV2.5: ~$540/month
OpenAI TTS-1: ~$45/month
Google Cloud TTS: ~$12/month
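TTS spend is linear in both volume and response length, so it is worth seeing how the figures move when replies get longer. The sketch below uses the per-1K-character rates from the table. The 250-character case is a hypothetical longer reply, not a measured SARA number.

```python
# Per-1K-character TTS rates (USD) from the comparison table.
TTS_RATE_PER_1K_CHARS = {
    "ElevenLabs TurboV2.5": 0.18,
    "PlayHT": 0.12,
    "OpenAI TTS-1": 0.015,
    "Azure Neural TTS": 0.016,
    "Google Cloud TTS": 0.004,
}


def monthly_tts_cost(rate_per_1k_chars: float,
                     chars_per_response: int = 100,
                     interactions_per_day: int = 1_000,
                     days_per_month: int = 30) -> float:
    """Monthly TTS spend; linear in response length and in volume."""
    total_chars = chars_per_response * interactions_per_day * days_per_month
    return rate_per_1k_chars * total_chars / 1_000


for length in (100, 250):  # 250 characters is a hypothetical longer reply
    print(f"--- {length}-character responses ---")
    for provider, rate in TTS_RATE_PER_1K_CHARS.items():
        cost = monthly_tts_cost(rate, chars_per_response=length)
        print(f"{provider}: ${cost:,.2f}/month")
```

Keeping confirmations terse is the cheapest optimization available at this layer, because every extra character is billed on every interaction.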

The gap between ElevenLabs and Google is $528/month. It’s almost entirely responsible for whether your agent sounds human or like a phone tree. The ElevenLabs TTS documentation covers latency benchmarks for their Turbo models. TurboV2.5 adds roughly 80–120ms to a response versus their standard model, which is an acceptable trade for streaming latency.

Layer 4: Infrastructure

WebSocket server, state management, audio streaming.

Setup	Monthly Cost	Notes
Fly.io 2×CPU/2GB	~$35	Works for low volume.
Fly.io 4×CPU/8GB	~$100	Our SARA production setup.
Redis (session state)	$20–60	Context, dedup, rate limiting.
Audio CDN / storage	$10–30	If you’re recording for compliance.

For most builds: $100–200/month in infrastructure.

Full Monthly Cost at 1,000 Interactions Per Day

30K interactions/month, 30-second audio inputs, 100-character responses:

Component	Budget Build	Budget Cost/Mo	Production Build	Production Cost/Mo
STT	Google Cloud STT	$240	Deepgram Nova-2	$65
LLM	GPT-3.5-turbo	$6	GPT-4o-mini	$9
TTS	Google Cloud TTS	$12	ElevenLabs TurboV2.5	$540
Infra	Shared server	$30	Dedicated WS + Redis	$130
Total		$288/mo		$744/mo

The production build is 2.6× the cost. The difference is almost entirely TTS. And the production build sounds like a person. The budget build sounds robotic.

One counterintuitive observation: the budget build uses Google Cloud STT ($240/month) which is actually more expensive than Deepgram ($65/month) while being slower. If you’re doing a budget build, at least use Deepgram for STT. It’ll save money and reduce latency.
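A quick way to check these totals, and the effect of the STT swap, is to sum the table directly. The `budget_with_deepgram` variant is a hypothetical mix built from the table's own numbers, not a build we tested.

```python
# Monthly figures (USD) copied from the table above.
BUDGET = {"STT": 240, "LLM": 6, "TTS": 12, "Infra": 30}
PRODUCTION = {"STT": 65, "LLM": 9, "TTS": 540, "Infra": 130}


def total(build: dict) -> int:
    """Sum the monthly cost of every layer in a build."""
    return sum(build.values())


# Hypothetical variant: the budget build with Deepgram swapped in for STT.
budget_with_deepgram = {**BUDGET, "STT": PRODUCTION["STT"]}

# Per-layer change when moving from the budget build to production.
deltas = {layer: PRODUCTION[layer] - BUDGET[layer] for layer in BUDGET}

print("budget:", total(BUDGET))
print("production:", total(PRODUCTION))
print("budget + Deepgram:", total(budget_with_deepgram))
print("production / budget:", round(total(PRODUCTION) / total(BUDGET), 1))
print("per-layer delta:", deltas)
```

The per-layer delta makes the structure visible: TTS adds $528, infra adds $100, and STT gets cheaper by $175.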

Five Reasons Your Voice Agent Sounds Robotic

These are specific engineering decisions, not bad luck.

1. Google/Azure TTS at Default Settings

Neural TTS models trained on human speech capture prosody (rhythm, stress, intonation) in ways that concatenative and older synthesis systems don’t. Google Cloud TTS and standard Azure TTS are accurate but flat. ElevenLabs, PlayHT, and (more recently) OpenAI TTS-1 capture the natural rise and fall of speech.

The difference is immediately apparent to any listener. You don’t need an A/B test to hear it.

If your voice agent uses Google or Azure TTS with default voice settings, this single choice accounts for roughly 60% of the robotic perception. Switching to OpenAI TTS-1 ($0.015/1K chars vs $0.004) improves quality noticeably at 3.75× the cost, and it’s still cheap. Switching to ElevenLabs TurboV2.5 at $0.18/1K chars gives the best available quality at 45× the Google cost.

2. Batch STT Instead of Streaming STT

Most developers start with batch STT: record the full utterance, send to the API, get the transcript, process. Easier to build. Also slower by 1–2 seconds.

Streaming STT (Deepgram’s real-time endpoint sends partial transcripts as the user speaks) brings transcription latency under 300ms. End-to-end time-to-response drops from 3–5 seconds to under 1.5 seconds. Anything over 2.5 seconds breaks the interactive feel. This isn’t a gradual perception problem; it’s a threshold effect. Under 2.5s feels like talking to someone. Over 2.5s feels like waiting on hold.
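The structural difference is that a streaming consumer can start working on partial transcripts instead of waiting for the final one. The sketch below simulates that with `asyncio`. The transcript stream and the keyword matcher are stand-ins invented for illustration; a real client would receive partials over the provider's WebSocket, and this code makes no claim about Deepgram's SDK.

```python
import asyncio
from typing import AsyncIterator, Optional, Tuple

# Simulated (partial transcript, is_final) events; invented for this sketch.
SIMULATED_EVENTS = [
    ("add", False),
    ("add a note", False),
    ("add a note to Johnson", False),
    ("add a note to Johnson account invoice approved", True),
]


async def partial_transcripts(delay_s: float = 0.01) -> AsyncIterator[Tuple[str, bool]]:
    """Yield partial transcripts the way a streaming STT connection would."""
    for text, is_final in SIMULATED_EVENTS:
        await asyncio.sleep(delay_s)
        yield text, is_final


def guess_action(text: str) -> Optional[str]:
    """Toy keyword matcher standing in for the intent engine."""
    return "add_note" if text.lower().startswith("add a note") else None


async def handle_utterance() -> dict:
    action: Optional[str] = None
    matched_on_partial: Optional[int] = None
    final_text = ""
    seen = 0
    async for text, is_final in partial_transcripts():
        seen += 1
        if action is None:
            action = guess_action(text)  # work starts before the user finishes
            if action is not None:
                matched_on_partial = seen
        if is_final:
            final_text = text
    return {"action": action, "matched_on_partial": matched_on_partial,
            "partials_seen": seen, "final_text": final_text}


result = asyncio.run(handle_utterance())
print(result)
```

In this simulation the action is identified on the second of four events, so downstream setup (auth, record lookup) could begin while the rest of the sentence is still arriving.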

We switched SARA from batch to streaming midway through development. The refactor cost two sprint days and we’d have saved both if we’d started with streaming. The user feedback difference was immediate.

3. No Turn-Taking Design

When should your agent stop listening and start processing? Most basic implementations use voice activity detection (VAD) with a fixed end-of-speech threshold: 500ms of silence equals done talking. This works about 70% of the time. The other 30%, it either cuts off the user mid-sentence or waits too long after they’ve finished.

Production voice agents need configurable VAD: shorter silence threshold for command-style interactions (200ms), longer for conversational (600ms). And they need early termination when intent is clear: if the transcript already matches a high-confidence intent before the sentence ends, there’s no reason to wait.
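The silence-threshold trade-off is easy to see on a toy utterance. This sketch assumes a VAD that emits one speech/silence decision per 20ms frame; the frame size, the config names, and the sample utterance are assumptions for illustration, while the 200ms and 600ms thresholds are the ones mentioned above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

FRAME_MS = 20  # assumed VAD frame size


@dataclass(frozen=True)
class EndpointConfig:
    silence_ms: int  # trailing silence that counts as "done talking"


COMMAND = EndpointConfig(silence_ms=200)
CONVERSATIONAL = EndpointConfig(silence_ms=600)


def end_of_speech_ms(frames: Iterable[bool], config: EndpointConfig,
                     frame_ms: int = FRAME_MS) -> Optional[int]:
    """Return the time (ms) at which the utterance is declared finished.

    frames holds per-frame VAD decisions (True = speech). Returns None if
    the silence threshold is never reached after speech has started.
    """
    heard_speech = False
    silence = 0
    for index, is_speech in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence = 0
        elif heard_speech:
            silence += frame_ms
            if silence >= config.silence_ms:
                return (index + 1) * frame_ms
    return None


# 500ms of speech, a 300ms mid-sentence pause, 500ms more speech, then silence.
utterance = [True] * 25 + [False] * 15 + [True] * 25 + [False] * 40

print("command profile:", end_of_speech_ms(utterance, COMMAND))
print("conversational profile:", end_of_speech_ms(utterance, CONVERSATIONAL))
```

With this utterance the 200ms profile fires at 700ms, inside the pause, and cuts the speaker off. The 600ms profile survives the pause but only fires at 1,900ms, a full 600ms after the speaker actually finished at 1,300ms.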

We still don’t have a fully satisfying answer for ambiguous mid-sentence pauses. VAD remains one of the harder unsolved pieces.

4. Stateless Error Recovery

“I didn’t catch that. Could you repeat your request?” is a tone-breaker. It announces the system has failed and reveals that no partial understanding was captured.

Better patterns: confirm partial intent. “I heard ‘add note to Johnson.’ What note should I add?” Or, for very low-confidence transcripts, “Sorry, I lost you there. Go ahead.” Both recover from the failure without announcing it as a system failure. Both require keeping the partial intent extracted from the transcript and gating on a confidence threshold, rather than just passing raw transcripts to the LLM.

The difference between the stateless recovery and these two patterns is roughly 8 lines of code and a significant improvement in perceived intelligence.
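A minimal sketch of that gate might look like the following. The 0.82 threshold is the value quoted later in this post; the 0.4 floor, the function name, and the confirmation wording for the middle band are assumptions made for illustration.

```python
from typing import Optional

CONFIDENCE_THRESHOLD = 0.82   # value quoted later in this post
LOW_CONFIDENCE_FLOOR = 0.4    # hypothetical cut-off for "lost the user entirely"


def recovery_prompt(action: Optional[str], account: Optional[str],
                    note: Optional[str], confidence: float) -> Optional[str]:
    """Return a spoken follow-up, or None when the command can run as-is."""
    if confidence < LOW_CONFIDENCE_FLOOR or not action or not account:
        return "Sorry, I lost you there. Go ahead."
    if action == "add_note" and not note:
        # Partial intent was captured: ask only for the missing piece.
        return f"I heard 'add note to {account}.' What note should I add?"
    if confidence < CONFIDENCE_THRESHOLD:
        return f"Just to confirm: {action.replace('_', ' ')} for {account}?"
    return None


print(recovery_prompt("add_note", "Johnson", None, 0.9))
print(recovery_prompt(None, None, None, 0.2))
print(recovery_prompt("add_note", "Johnson", "invoice approved", 0.95))
```

The key design choice is that the function receives the partial intent fields, not the raw transcript, so it can name what it did understand.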

5. No Session State

A voice agent with no memory of the current session resets after every command. The user says “add a note to Johnson account,” gets a confirmation. Then says “mark that account as contacted” and the agent has no idea which account, because it forgot the previous command.

Lightweight session state (Redis with a 10-minute TTL, keyed to session ID) costs nearly nothing and makes the agent feel significantly more intelligent. It’s not AI. It’s just memory. The absence of it is one of the most common reasons demo-quality voice agents fail in real usage.
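The pattern is small enough to sketch in full. To keep the example runnable without a server, an in-memory dictionary stands in for Redis here; in production the same two operations would map onto a key with a 10-minute expiry. The class and helper names are hypothetical.

```python
import time
from typing import Callable, Optional

SESSION_TTL_S = 600  # 10 minutes, as described above


class SessionStore:
    """In-memory stand-in for Redis keys that expire after a TTL."""

    def __init__(self, ttl_s: float = SESSION_TTL_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._data = {}  # session_id -> (expires_at, context dict)

    def recall(self, session_id: str) -> dict:
        """Return a copy of the live context, or {} if missing or expired."""
        entry = self._data.get(session_id)
        if entry is None:
            return {}
        expires_at, context = entry
        if self._clock() >= expires_at:
            del self._data[session_id]
            return {}
        return dict(context)

    def remember(self, session_id: str, **context) -> None:
        """Merge new context into the session and refresh its expiry."""
        merged = self.recall(session_id)
        merged.update(context)
        self._data[session_id] = (self._clock() + self._ttl_s, merged)


def resolve_account(store: SessionStore, session_id: str,
                    spoken_account: Optional[str]) -> Optional[str]:
    """Use the spoken account if present, else the last one mentioned."""
    if spoken_account:
        store.remember(session_id, last_account=spoken_account)
        return spoken_account
    return store.recall(session_id).get("last_account")


fake_now = [0.0]  # controllable clock so the expiry can be shown instantly
store = SessionStore(clock=lambda: fake_now[0])
print(resolve_account(store, "session-1", "Johnson"))  # "add a note to Johnson account"
print(resolve_account(store, "session-1", None))       # "mark that account as contacted"
fake_now[0] = 601.0
print(resolve_account(store, "session-1", None))       # session has expired
```

The second command resolves "that account" from the stored context, and the third shows the context disappearing once the TTL has passed.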

The Adoption Threshold Matters More Than Per-Interaction Cost

We tried the budget build for SARA before committing to the production stack. The beta version used Google STT and Google TTS. Overall API cost: roughly $350/month.

In beta testing with five users over two weeks, three said they’d prefer to type. Two of them blamed the voice: “it sounds like a robot,” which is what the default Google TTS voice sounds like. The third said the lag (batch STT, 4-second average response) was the problem.

If the ops team rejects the tool because of voice quality and latency, the monthly savings don’t matter. You’ve spent $15K on a build nobody uses.

We rebuilt with Deepgram Nova-2 and ElevenLabs TurboV2.5. Monthly cost went to $1,100. Daily active usage after launch: all eight users, within the first week. They didn’t consciously notice the better voice. They just stopped thinking about the tool and started using it. That’s the adoption threshold. When the interaction costs users no cognitive overhead, they adopt.

When SaaS Platforms Win

Use Bland.ai, Vapi.ai, or Retell.ai when:

You’re validating whether voice AI helps at all, before committing to a custom build
The use case fits standard conversational flows (appointment booking, FAQ, simple intake)
Volume is under 10,000 interactions per month
You don’t need complex multi-system integrations
Data residency isn’t a hard requirement

Build custom when:

Complex integrations required (SARA needed to write to three internal tools simultaneously, with different auth schemes)
Volume exceeds 30,000 interactions per month (at that point, SaaS per-minute pricing exceeds custom infra costs on most platforms)
Compliance requirements (some SaaS platforms route audio through US servers only, which matters for non-US regulated industries)
The latency floor of SaaS platforms is too high for your specific use case
You need custom VAD thresholds and turn-taking logic

For SARA, the decision was mostly integration complexity. The client’s three internal tools didn’t have out-of-the-box support on any SaaS platform, and the action schema was complex enough that standard conversational flows wouldn’t handle it cleanly.

What We’d Build Today

Starting a voice agent from scratch in mid-2026:

STT: Deepgram Nova-2, streaming mode. Real-time WebSocket connection; partial transcripts feed the intent engine as they arrive.

Intent: GPT-4o-mini with Pydantic-structured output for action classification. Confidence threshold (we use 0.82) gates the error recovery path: below it, we ask for confirmation rather than acting.

TTS: ElevenLabs TurboV2.5 for production quality. OpenAI TTS-1 if budget is the main constraint (noticeable quality drop, but still meaningfully better than Google).

Infra: Fly.io WebSocket server, Redis for session state (10-minute TTL per session ID).

VAD: Deepgram’s built-in endpointing. Configurable thresholds, and reasonable defaults. We set ours to 250ms for command-style interactions.

Estimated monthly cost at 500 interactions/day: $350–500/month. Estimated monthly cost at 5,000 interactions/day: $2,000–2,800/month.

One thing we’d do differently from the first SARA build: start with streaming STT from day one. We added it mid-development. The refactor cost two sprint days that didn’t need to happen.

FAQ

How much does a voice AI agent cost per month?

At 1,000 interactions per day, a budget build (Google Cloud STT + Google TTS) runs $250–$300/month in API costs, plus $100–150/month in infra. A production-quality build (Deepgram + ElevenLabs TurboV2.5) runs $700–$900/month. At 5,000 interactions per day, the production build runs $2,000–$2,800/month. The main cost variable is TTS, which scales linearly with interaction volume and response length.

What’s the difference between Bland.ai/Vapi.ai and a custom build?

SaaS platforms handle infrastructure and provide pre-built conversational flows. They’re faster to launch (days vs weeks) and work well for standard use cases at low volume. Custom builds make sense when you need complex multi-system integrations, have compliance or data residency requirements, or volume exceeds roughly 30,000 interactions per month, where SaaS per-minute pricing starts compounding above custom infra costs.

Why does my voice agent sound robotic?

Usually the TTS engine. Google Cloud TTS and basic Azure TTS produce accurate but flat speech. ElevenLabs TurboV2.5, PlayHT, and OpenAI TTS-1 capture natural prosody: the rhythm and stress patterns that make speech sound human. Switching from Google TTS to OpenAI TTS-1 (3.75× more expensive but still cheap at $0.015/1K chars) is a quick improvement. ElevenLabs at $0.18/1K chars gives the best available quality.

Can I use OpenAI TTS to save money versus ElevenLabs?

Yes, and it’s a reasonable trade-off for internal tools. OpenAI TTS-1 sounds noticeably better than Google/Azure and costs $0.015/1K chars versus ElevenLabs TurboV2.5 at $0.18/1K chars. For internal tools where users care more about speed and accuracy than naturalness, OpenAI TTS-1 works well. For customer-facing agents where voice quality affects whether users adopt the tool at all, ElevenLabs is usually worth the 12× premium.

What volume justifies building a custom voice agent instead of using a SaaS platform?

Roughly 10,000–15,000 interactions per month if your use case also has integration complexity that SaaS platforms can’t handle off the shelf; on volume alone, the crossover is closer to 30,000 interactions per month. Below those levels, a SaaS platform is faster to launch and cheaper to operate. Integration complexity is usually more decisive than volume. If your use case requires multi-system writes that no SaaS platform supports, custom is the right call regardless of volume.

Evaluating whether a voice agent makes sense for your product or ops workflow? Book a 30-minute call and we’ll tell you honestly whether your use case needs a custom build or a SaaS platform will do the job.


Written by

Anil Gulecha

Ex-HackerRank, Ex-Google

Anil reviews every architecture decision at Kalvium Labs. He's the engineer who still ships code — making technical trade-offs on RAG vs fine-tuning, model selection, and infrastructure choices. When a CTO evaluates us, Anil is the reason they trust the work.
