The first time SARA executed a command correctly, it was an accident.
We’d been fighting with end-of-speech detection for two days. The system kept cutting users off mid-sentence. On one test run, I said “SARA, add a follow-up reminder for the Johnson account” and then sneezed. The system detected the sneeze as end-of-speech, classified the partial utterance, and somehow created the right reminder anyway. We laughed. Then we quietly documented exactly which silence threshold was active at that moment, because whatever we’d accidentally set was working better than everything we’d tried intentionally.
SARA is a speech-to-action agent we built as a custom AI solution for an enterprise client. Their operations team spends about 40% of the day on data entry and task creation across three internal tools. They wanted voice commands for the repetitive parts. “Add note to customer X.” “Mark task Y complete.” “Schedule follow-up for next Tuesday.” Simple instructions, but typed out 50-100 times a day by eight people.
This is the build story, including the parts that took longer than we’d scoped.
What SARA Actually Does
SARA listens, converts what it hears into an intent with parameters, executes the corresponding action against the client’s internal APIs, and confirms with a spoken response.
The full flow: the user speaks a command, the browser streams audio over WebSocket to our backend, Deepgram’s streaming API converts speech to text in real time, an LLM classifies the intent and extracts parameters using function calling, the backend calls the appropriate internal API, and the system plays a confirmation back to the user.
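Here's a minimal sketch of that relay loop, assuming a FastAPI backend talking to Deepgram's raw streaming WebSocket endpoint. The endpoint path, query parameters, and the final-transcript hook are illustrative rather than our production code, and the auth-header keyword argument differs between `websockets` versions.

```python
# Minimal sketch of the browser → backend → Deepgram relay. Not production code:
# the endpoint path, query parameters, and the final-transcript hook are illustrative.
import asyncio
import json
import os

import websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-2&encoding=linear16&sample_rate=16000"
    "&interim_results=true&endpointing=600"
)

async def handle_final_transcript(browser_ws: WebSocket, transcript: str) -> None:
    # Stub: production kicks off intent classification and action execution here.
    await browser_ws.send_json({"type": "final", "text": transcript})

@app.websocket("/ws/audio")
async def audio_stream(browser_ws: WebSocket) -> None:
    await browser_ws.accept()
    auth = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: the header kwarg is `extra_headers` in older websockets releases,
    # `additional_headers` in newer ones.
    async with websockets.connect(DEEPGRAM_URL, extra_headers=auth) as dg_ws:

        async def pump_audio() -> None:
            # The browser sends 250ms PCM chunks as binary WebSocket frames.
            while True:
                chunk = await browser_ws.receive_bytes()
                await dg_ws.send(chunk)

        async def pump_transcripts() -> None:
            # Deepgram streams back interim and final transcripts as JSON.
            async for message in dg_ws:
                result = json.loads(message)
                if "channel" not in result:
                    continue  # metadata / housekeeping messages
                transcript = result["channel"]["alternatives"][0]["transcript"]
                if not transcript:
                    continue
                if result.get("is_final"):
                    await handle_final_transcript(browser_ws, transcript)
                else:
                    # Relay interim text so users see their words as they speak.
                    await browser_ws.send_json({"type": "interim", "text": transcript})

        await asyncio.gather(pump_audio(), pump_transcripts())
```

The two pumps run concurrently on purpose: audio keeps flowing to Deepgram while transcripts flow back, which is what makes the interim-display trick described below possible.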
Simple on paper. The client’s acceptable threshold for the full voice-to-action loop was 2 seconds. We hit 1.7 seconds on desktop under good conditions and 2.4 seconds on mobile over cellular. Getting anywhere near those numbers took about three weeks of optimization we hadn’t originally planned for.
The Architecture
Audio capture runs in the browser using the Web Audio API, sampled at 16kHz and chunked into 250ms frames. We chose 16kHz because Deepgram’s Nova-2 model is trained on it. Upsampling from 8kHz doesn’t recover accuracy, and 44.1kHz adds unnecessary data with no benefit. Audio chunks go to a FastAPI backend over a persistent WebSocket connection.
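The framing numbers are easy to sanity-check. Assuming 16-bit linear PCM on the wire (the format Deepgram's linear16 encoding expects), each 250ms chunk works out to a few kilobytes:

```python
# Back-of-envelope framing math, assuming 16-bit linear PCM over the WebSocket.
SAMPLE_RATE = 16_000       # Hz, the rate Nova-2 is trained on
CHUNK_MS = 250             # frame length sent from the browser
BYTES_PER_SAMPLE = 2       # 16-bit samples

samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000         # 4,000 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE     # 8,000 bytes per frame
upstream_kbps = SAMPLE_RATE * BYTES_PER_SAMPLE * 8 / 1000  # 256 kbps sustained
```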
The WebSocket approach was non-negotiable for latency. HTTP POST for audio means waiting until the user finishes speaking, then uploading the entire recording, then waiting for transcription. That’s 1.5-3 seconds before intent classification even starts. With WebSocket streaming, Deepgram produces interim transcripts while the user is still talking.
Those interim transcripts are what made 1.1-second perceived latency possible. The user sees their words appear on screen as they speak. Psychologically, the system feels responsive before any action has been taken. When they stop speaking, the final transcript triggers intent classification, and the action executes shortly after. The “thinking” has already been happening in parallel.
Intent classification uses Claude with function calling. We defined 22 action types as functions with their parameter schemas. The model receives the final transcript and returns a function call with extracted parameters. For “add follow-up for the Johnson account next Tuesday,” it returns create_task with account: "Johnson", type: "follow-up", date: "2026-04-22" (resolved from “next Tuesday” with the user’s current date and timezone passed as context).
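A stripped-down version of that classification call looks roughly like this, using Anthropic's Python SDK with one tool instead of 22. The model alias, schema fields, and prompt framing are illustrative, not the production configuration.

```python
# Hedged sketch of the intent-classification call. Tool schema, model alias,
# and prompt framing are illustrative, not the production config.
from datetime import datetime
from zoneinfo import ZoneInfo

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CREATE_TASK_TOOL = {
    "name": "create_task",
    "description": "Create a task on a customer account.",
    "input_schema": {
        "type": "object",
        "properties": {
            "account": {"type": "string", "description": "Customer account name"},
            "type": {"type": "string", "enum": ["follow-up", "call", "email"]},
            "date": {"type": "string", "description": "Due date as YYYY-MM-DD"},
        },
        "required": ["account", "type", "date"],
    },
}

def classify(transcript: str, user_tz: str) -> tuple[str, dict]:
    """Return (action_name, parameters) for a final transcript."""
    now = datetime.now(ZoneInfo(user_tz))
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any tool-calling Claude model
        max_tokens=512,
        tools=[CREATE_TASK_TOOL],          # production passes 22 tools, one per action type
        messages=[{
            "role": "user",
            "content": (
                f"Current local time: {now.isoformat()} ({user_tz}).\n"
                f"Command: {transcript}"
            ),
        }],
    )
    for block in response.content:
        if block.type == "tool_use":
            return block.name, block.input
    raise ValueError(f"No action recognized in: {transcript!r}")
```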
The function execution calls the client’s internal APIs directly. Confirmation audio uses AWS Polly, which was a late change I’ll explain below.
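The Polly call itself is small. A sketch with boto3, assuming a standard voice and MP3 output; the production system streams the resulting audio back to the browser over the existing WebSocket.

```python
# Sketch of the confirmation step with AWS Polly via boto3. Voice and output
# format are illustrative choices.
import boto3

polly = boto3.client("polly")

def synthesize_confirmation(text: str) -> bytes:
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",  # assumption: any standard Polly voice
    )
    return response["AudioStream"].read()

# Example usage:
# audio = synthesize_confirmation("Added a follow-up for the Johnson account, due Tuesday.")
```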
The Hard Parts
End-of-speech detection. This consumed a week. Too short a silence threshold and the system cuts users off mid-thought. Too long and the loop feels sluggish. We landed on 600ms of silence as the default, with a dynamic adjustment based on the user’s speaking pace in the previous utterance. Someone who speaks quickly gets a 400ms threshold. Someone who pauses between words gets 800ms. It still misses about 6% of utterances by cutting too early, down from 23% before the dynamic adjustment.
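The adjustment itself is a few lines. The 400/600/800ms values are the ones above; the words-per-second cutoffs here are illustrative, and the chosen threshold feeds whatever endpointing mechanism you use (Deepgram's endpointing parameter or your own silence timer).

```python
# Sketch of the pace-adaptive threshold. The 400/600/800ms values come from the
# text above; the words-per-second cutoffs are illustrative.
def silence_threshold_ms(prev_transcript: str, prev_duration_s: float) -> int:
    """Pick the end-of-speech silence threshold from the previous utterance's pace."""
    if not prev_transcript or prev_duration_s <= 0:
        return 600  # default until we know something about the speaker
    words_per_second = len(prev_transcript.split()) / prev_duration_s
    if words_per_second > 3.0:   # fast talker: cut sooner
        return 400
    if words_per_second < 1.5:   # deliberate pauses between words: wait longer
        return 800
    return 600
```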
Mobile audio reliability. Desktop Chrome over WiFi is stable. iOS Safari on cellular drops about 8% of audio chunks. A dropped chunk mid-word produces garbled transcript output, and Deepgram can’t recover from corrupted data. We added a 200ms jitter buffer and chunk sequence numbering so the backend can detect and request retransmission of dropped chunks. This added latency but dropped garbled transcripts from 12% of mobile sessions to under 2%.
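The gap detection is simple once chunks carry sequence numbers. This sketch shows only the backend reassembly side; the wire format and the retransmission request are illustrative.

```python
# Sketch of the backend-side gap detection. Wire format and retransmission
# protocol are illustrative; only the reordering logic is shown.
class ChunkReassembler:
    def __init__(self) -> None:
        self.expected_seq = 0
        self.pending: dict[int, bytes] = {}  # out-of-order chunks held in the jitter buffer

    def add(self, seq: int, chunk: bytes) -> tuple[list[bytes], list[int]]:
        """Return (in-order chunks ready for Deepgram, sequence numbers still missing)."""
        self.pending[seq] = chunk
        ready: list[bytes] = []
        while self.expected_seq in self.pending:
            ready.append(self.pending.pop(self.expected_seq))
            self.expected_seq += 1
        # Gaps older than the ~200ms jitter window get re-requested from the browser.
        highest_buffered = max(self.pending, default=self.expected_seq)
        missing = [s for s in range(self.expected_seq, highest_buffered)
                   if s not in self.pending]
        return ready, missing
```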
Relative date parsing. “Next Tuesday,” “end of week,” “in two days,” “Thursday afternoon.” The LLM handles most of these correctly, but relative dates depend on knowing the user’s current date and timezone. We had a bug for two weeks where the system resolved “next Tuesday” in UTC rather than the user’s local timezone, which created tasks due at 4 AM local time. The fix was simple: pass the user’s current timestamp and timezone offset as part of every classification prompt. We just hadn’t thought to do it initially.
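The bug and the fix, side by side. The timezone string here is an example; in production it comes from the client session.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Buggy: relative dates anchored to UTC. "Next Tuesday" resolves against the
# wrong local calendar, and due times land at odd local hours.
utc_now = datetime.now(timezone.utc)

# Fixed: anchor to the user's local clock and state it in the prompt.
local_now = datetime.now(ZoneInfo("America/Chicago"))  # example timezone
prompt_context = f"Current local time: {local_now.isoformat()} (America/Chicago)"
```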
What We Tried That Didn’t Work
ElevenLabs for confirmation audio. The voice was indistinguishable from a real person. Users loved it in demos. In production, about 40% of users reported discomfort in feedback sessions. They said things like “it sounds too real” and “I can’t tell if I’m talking to a person.” One user said she felt “tricked.” We switched to AWS Polly, which has a slightly synthetic quality, and users found it more trustworthy for a business tool. The lesson: the uncanny valley cuts both ways, and for a productivity tool your users interact with 100 times a day, “clearly AI” is the right call.
Streaming intent classification. We tried starting intent classification before the utterance finished, using interim transcripts to get a head start on execution. In theory, if you can predict intent from the first few words, you can execute before the user stops speaking. In practice, partial transcripts are too ambiguous. “Add a follow-up” could be create_task or create_reminder or log_call depending on what comes after. We abandoned this after two days.
Client-side noise cancellation. We tried running noise reduction in the browser using RNNoise compiled to WebAssembly. It worked for background noise but added 80ms of latency and introduced audio artifacts on plosive consonants (hard “p” and “b” sounds). Deepgram’s built-in noise handling was sufficient for office environments, so we removed it.
Where We Landed
After three months in production with the eight-person operations team:
- Average voice-to-action latency: 1.7 seconds (desktop), 2.4 seconds (mobile)
- End-of-speech accuracy: 94% of utterances correctly detected without cutoffs
- Intent classification accuracy: 91% across 22 action types
- Actions requiring correction (user had to repeat or retype): 8.2%
- Self-reported time savings: 45 minutes per person per day on average
The 45-minute number surprised me. I expected voice commands to save time on individual interactions. What I hadn’t expected was the reduction in context switching. When you can add a note without touching the keyboard, you stay focused on the conversation you’re in rather than tabbing between tools. The productivity gain is less about the seconds saved per command and more about the attention preserved.
When a Custom Speech Agent Is the Right Call
Off-the-shelf voice assistants (Siri, Alexa, Google Assistant) can’t authenticate against internal APIs, maintain workflow context across a session, or resolve domain-specific vocabulary like customer names and internal task categories. They’re designed for consumer queries, not business operations.
If you have fewer than 10 action types and they’re all straightforward commands, you might be able to cover the use case with a simpler voice-enabled chatbot and a basic Whisper-based transcription setup. The full SARA architecture earns its complexity above that threshold, when you need sub-2-second latency, multi-turn session context, and integration with proprietary systems.
We’ve used similar approaches in our call analyzer work, where real-time audio processing and accurate transcription were the core engineering challenge.
The architecture overlaps more than you’d expect with standard agent pipelines (see our AI agent architecture post for how tool-calling and error recovery patterns transfer directly). Speech agents are just agents with an audio front-end.
One honest limitation we still don’t have a great answer for: speaker verification. Right now, anyone physically near the microphone can trigger actions. For the current client, that’s fine because it’s a small team in a controlled space. For a use case where different users have different permissions, you’d need speaker identification layered in, and that adds significant complexity (and latency) to the pipeline.
If you’re evaluating whether voice interfaces make sense for your team’s workflow, we can prototype the core loop (speech in, action out) in 48-72 hours. Book a 30-minute call and bring your list of target actions.
FAQ
How much does it cost to build a custom speech agent?
The cost depends on the number of action types, integration complexity, and latency requirements. A focused voice command layer with 10-20 action types and integration into 1-2 internal systems typically takes 3-6 weeks of engineering time. For larger scope (30+ actions, multiple system integrations, mobile optimization), expect 2-3 months. We can scope this accurately in a 30-minute call.
Can speech agents work in noisy open-plan offices?
Yes, with caveats. Deepgram’s Nova-2 handles typical office background noise well. Very loud environments (contact center floors, manufacturing floors) usually need push-to-talk rather than always-on listening, which changes the UX significantly. For SARA, we tested in a quiet office and a standard coworking space. Both worked fine. A contact center floor environment caused too many false end-of-speech triggers.
What languages does Deepgram support at production quality?
Deepgram Nova-2 is optimized for English and delivers production-quality accuracy. Their Whisper-based models support 30+ languages but with higher latency and lower accuracy than Nova-2 on English. If you need non-English speech recognition at production latency, check Deepgram’s language support page and plan for a higher error rate than what I’ve cited here.
How do you handle misrecognized commands?
SARA returns a confidence score with every intent classification. Commands below 0.7 confidence trigger a spoken clarification request (“Did you mean to add a follow-up for Johnson?”). Above 0.85, it executes and confirms. Between 0.7 and 0.85, it executes but logs the command for review. The client’s team reviews the mid-confidence log weekly and uses it to identify which phrasings are causing systematic misclassification, which we address in prompt updates.
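The routing is a three-way branch. The thresholds are the ones above; the handlers below are trivial stubs standing in for the real TTS, execution, and logging code.

```python
# Sketch of the confidence routing described above. Thresholds come from the
# text; the handlers are stubs standing in for TTS, execution, and logging.
import asyncio

CLARIFY_BELOW = 0.70
REVIEW_BELOW = 0.85

async def speak(text: str) -> None:                       # stub: real version plays Polly audio
    print(f"[voice] {text}")

async def execute(intent: str, params: dict) -> dict:     # stub: real version calls internal APIs
    return {"ok": True, "intent": intent, **params}

async def log_for_review(intent: str, params: dict, confidence: float) -> None:
    print(f"[review-log] {intent} {params} conf={confidence:.2f}")

async def route(intent: str, params: dict, confidence: float) -> None:
    if confidence < CLARIFY_BELOW:
        # Too uncertain to act: ask the user to confirm before doing anything.
        await speak(f"Did you mean to {intent.replace('_', ' ')} for {params.get('account', 'that account')}?")
        return
    result = await execute(intent, params)
    if confidence < REVIEW_BELOW:
        # Executed, but flagged for the weekly mid-confidence review.
        await log_for_review(intent, params, confidence)
    await speak(f"Done: {intent.replace('_', ' ')} recorded.")

asyncio.run(route("create_task", {"account": "Johnson", "type": "follow-up"}, 0.78))
```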
Do you need a fine-tuned model or is a general LLM enough?
We haven’t needed custom model training for speech agent intent classification. Claude with a well-structured function schema and 20-30 example utterances per action type achieves 91%+ accuracy across 22 action types. If you had 200+ action types, a fine-tuned classifier for the first routing stage would make more sense. For most business use cases, prompt engineering gets you there without the training infrastructure.
If your team is spending significant time on repetitive data entry that follows predictable patterns, voice commands are often the lowest-friction way to eliminate it. Book a 30-minute call and we’ll walk through whether the use case fits the architecture.