The first time SARA executed a command correctly, it was an accident.
We’d been fighting with end-of-speech detection for two days. The system kept cutting users off mid-sentence. On one test run, I said “SARA, add a follow-up reminder for the Johnson account” and then sneezed. The system detected the sneeze as end-of-speech, classified the partial utterance, and somehow created the right reminder anyway. We laughed. Then we quietly documented exactly which silence threshold was active at that moment, because whatever we’d accidentally set was working better than everything we’d tried intentionally.
SARA is a speech-to-action agent we built as a custom AI solution for an enterprise client. Their operations team spends about 40% of the day on data entry and task creation across three internal tools. They wanted voice commands for the repetitive parts. “Add note to customer X.” “Mark task Y complete.” “Schedule follow-up for next Tuesday.” Simple instructions, but typed out 50-100 times a day by eight people.
This is the build story, including the parts that took longer than we’d scoped.
What SARA Actually Does
SARA listens, converts what it hears into an intent with parameters, executes the corresponding action against the client’s internal APIs, and confirms with a spoken response.
The full flow: the user speaks a command, the browser streams audio over WebSocket to our backend, Deepgram’s streaming API converts speech to text in real time, an LLM classifies the intent and extracts parameters using function calling, the backend calls the appropriate internal API, and the system plays a confirmation back to the user.
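Here's a minimal sketch of that relay loop, assuming a FastAPI backend talking to Deepgram's raw streaming WebSocket endpoint. The endpoint path, query parameters, and the final-transcript hook are illustrative rather than our production code, and the auth-header keyword argument differs between `websockets` versions.

```python
# Minimal sketch of the browser → backend → Deepgram relay. Not production code:
# the endpoint path, query parameters, and the final-transcript hook are illustrative.
import asyncio
import json
import os

import websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()

DEEPGRAM_URL = (
    "wss://api.deepgram.com/v1/listen"
    "?model=nova-2&encoding=linear16&sample_rate=16000"
    "&interim_results=true&endpointing=600"
)

async def handle_final_transcript(browser_ws: WebSocket, transcript: str) -> None:
    # Stub: production kicks off intent classification and action execution here.
    await browser_ws.send_json({"type": "final", "text": transcript})

@app.websocket("/ws/audio")
async def audio_stream(browser_ws: WebSocket) -> None:
    await browser_ws.accept()
    auth = {"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"}
    # Note: the header kwarg is `extra_headers` in older websockets releases,
    # `additional_headers` in newer ones.
    async with websockets.connect(DEEPGRAM_URL, extra_headers=auth) as dg_ws:

        async def pump_audio() -> None:
            # The browser sends 250ms PCM chunks as binary WebSocket frames.
            while True:
                chunk = await browser_ws.receive_bytes()
                await dg_ws.send(chunk)

        async def pump_transcripts() -> None:
            # Deepgram streams back interim and final transcripts as JSON.
            async for message in dg_ws:
                result = json.loads(message)
                if "channel" not in result:
                    continue  # metadata / housekeeping messages
                transcript = result["channel"]["alternatives"][0]["transcript"]
                if not transcript:
                    continue
                if result.get("is_final"):
                    await handle_final_transcript(browser_ws, transcript)
                else:
                    # Relay interim text so users see their words as they speak.
                    await browser_ws.send_json({"type": "interim", "text": transcript})

        await asyncio.gather(pump_audio(), pump_transcripts())
```

The two pumps run concurrently on purpose: audio keeps flowing to Deepgram while transcripts flow back, which is what makes the interim-display trick described below possible.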
Simple on paper. The client’s acceptable threshold for the full voice-to-action loop was 2 seconds. We hit 1.7 seconds on desktop under good conditions and 2.4 seconds on mobile over cellular. Getting anywhere near those numbers took about three weeks of optimization we hadn’t originally planned for.
The Architecture
Audio capture runs in the browser using the Web Audio API, sampled at 16kHz and chunked into 250ms frames. We chose 16kHz because Deepgram’s Nova-2 model is trained on it. Upsampling from 8kHz doesn’t recover accuracy, and 44.1kHz adds unnecessary data with no benefit. Audio chunks go to a FastAPI backend over a persistent WebSocket connection.
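The framing numbers are easy to sanity-check. Assuming 16-bit linear PCM on the wire (the format Deepgram's linear16 encoding expects), each 250ms chunk works out to a few kilobytes:

```python
# Back-of-envelope framing math, assuming 16-bit linear PCM over the WebSocket.
SAMPLE_RATE = 16_000       # Hz, the rate Nova-2 is trained on
CHUNK_MS = 250             # frame length sent from the browser
BYTES_PER_SAMPLE = 2       # 16-bit samples

samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000         # 4,000 samples
bytes_per_chunk = samples_per_chunk * BYTES_PER_SAMPLE     # 8,000 bytes per frame
upstream_kbps = SAMPLE_RATE * BYTES_PER_SAMPLE * 8 / 1000  # 256 kbps sustained
```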
The WebSocket approach was non-negotiable for latency. HTTP POST for audio means waiting until the user finishes speaking, then uploading the entire recording, then waiting for transcription. That’s 1.5-3 seconds before intent classification even starts. With WebSocket streaming, Deepgram produces interim transcripts while the user is still talking.
Those interim transcripts are what made 1.1-second perceived latency possible. The user sees their words appear on screen as they speak. Psychologically, the system feels responsive before any action has been taken. When they stop speaking, the final transcript triggers intent classification, and the action executes shortly after. The “thinking” has already been happening in parallel.
Intent classification uses Claude with function calling. We defined 22 action types as functions with their parameter schemas. The model receives the final transcript and returns a function call with extracted parameters. For “add follow-up for the Johnson account next Tuesday,” it returns create_task with account: "Johnson", type: "follow-up", date: "2026-04-22" (resolved from “next Tuesday” with the user’s current date and timezone passed as context).
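A stripped-down version of that classification call looks roughly like this, using Anthropic's Python SDK with one tool instead of 22. The model alias, schema fields, and prompt framing are illustrative, not the production configuration.

```python
# Hedged sketch of the intent-classification call. Tool schema, model alias,
# and prompt framing are illustrative, not the production config.
from datetime import datetime
from zoneinfo import ZoneInfo

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CREATE_TASK_TOOL = {
    "name": "create_task",
    "description": "Create a task on a customer account.",
    "input_schema": {
        "type": "object",
        "properties": {
            "account": {"type": "string", "description": "Customer account name"},
            "type": {"type": "string", "enum": ["follow-up", "call", "email"]},
            "date": {"type": "string", "description": "Due date as YYYY-MM-DD"},
        },
        "required": ["account", "type", "date"],
    },
}

def classify(transcript: str, user_tz: str) -> tuple[str, dict]:
    """Return (action_name, parameters) for a final transcript."""
    now = datetime.now(ZoneInfo(user_tz))
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: any tool-calling Claude model
        max_tokens=512,
        tools=[CREATE_TASK_TOOL],          # production passes 22 tools, one per action type
        messages=[{
            "role": "user",
            "content": (
                f"Current local time: {now.isoformat()} ({user_tz}).\n"
                f"Command: {transcript}"
            ),
        }],
    )
    for block in response.content:
        if block.type == "tool_use":
            return block.name, block.input
    raise ValueError(f"No action recognized in: {transcript!r}")
```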
The function execution calls the client’s internal APIs directly. Confirmation audio uses AWS Polly, which was a late change I’ll explain below.
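The Polly call itself is small. A sketch with boto3, assuming a standard voice and MP3 output; the production system streams the resulting audio back to the browser over the existing WebSocket.

```python
# Sketch of the confirmation step with AWS Polly via boto3. Voice and output
# format are illustrative choices.
import boto3

polly = boto3.client("polly")

def synthesize_confirmation(text: str) -> bytes:
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat="mp3",
        VoiceId="Joanna",  # assumption: any standard Polly voice
    )
    return response["AudioStream"].read()

# Example usage:
# audio = synthesize_confirmation("Added a follow-up for the Johnson account, due Tuesday.")
```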
The Hard Parts
End-of-speech detection. This consumed a week. Too short a silence threshold and the system cuts users off mid-thought. Too long and the loop feels sluggish. We landed on 600ms of silence as the default, with a dynamic adjustment based on the user’s speaking pace in the previous utterance. Someone who speaks quickly gets a 400ms threshold. Someone who pauses between words gets 800ms. It still misses about 6% of utterances by cutting too early, down from 23% before the dynamic adjustment.
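The adjustment itself is a few lines. The 400/600/800ms values are the ones above; the words-per-second cutoffs here are illustrative, and the chosen threshold feeds whatever endpointing mechanism you use (Deepgram's endpointing parameter or your own silence timer).

```python
# Sketch of the pace-adaptive threshold. The 400/600/800ms values come from the
# text above; the words-per-second cutoffs are illustrative.
def silence_threshold_ms(prev_transcript: str, prev_duration_s: float) -> int:
    """Pick the end-of-speech silence threshold from the previous utterance's pace."""
    if not prev_transcript or prev_duration_s <= 0:
        return 600  # default until we know something about the speaker
    words_per_second = len(prev_transcript.split()) / prev_duration_s
    if words_per_second > 3.0:   # fast talker: cut sooner
        return 400
    if words_per_second < 1.5:   # deliberate pauses between words: wait longer
        return 800
    return 600
```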
Mobile audio reliability. Desktop Chrome over WiFi is stable. iOS Safari on cellular drops about 8% of audio chunks. A dropped chunk mid-word produces garbled transcript output, and Deepgram can’t recover from corrupted data. We added a 200ms jitter buffer and chunk sequence numbering so the backend can detect and request retransmission of dropped chunks. This added latency but dropped garbled transcripts from 12% of mobile sessions to under 2%.
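The gap detection is simple once chunks carry sequence numbers. This sketch shows only the backend reassembly side; the wire format and the retransmission request are illustrative.

```python
# Sketch of the backend-side gap detection. Wire format and retransmission
# protocol are illustrative; only the reordering logic is shown.
class ChunkReassembler:
    def __init__(self) -> None:
        self.expected_seq = 0
        self.pending: dict[int, bytes] = {}  # out-of-order chunks held in the jitter buffer

    def add(self, seq: int, chunk: bytes) -> tuple[list[bytes], list[int]]:
        """Return (in-order chunks ready for Deepgram, sequence numbers still missing)."""
        self.pending[seq] = chunk
        ready: list[bytes] = []
        while self.expected_seq in self.pending:
            ready.append(self.pending.pop(self.expected_seq))
            self.expected_seq += 1
        # Gaps older than the ~200ms jitter window get re-requested from the browser.
        highest_buffered = max(self.pending, default=self.expected_seq)
        missing = [s for s in range(self.expected_seq, highest_buffered)
                   if s not in self.pending]
        return ready, missing
```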
Relative date parsing. “Next Tuesday,” “end of week,” “in two days,” “Thursday afternoon.” The LLM handles most of these correctly, but relative dates depend on knowing the user’s current date and timezone. We had a bug for two weeks where the system resolved “next Tuesday” in UTC rather than the user’s local timezone, which created tasks due at 4 AM local time. The fix was simple: pass the user’s current timestamp and timezone offset as part of every classification prompt. We just hadn’t thought to do it initially.
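The bug and the fix, side by side. The timezone string here is an example; in production it comes from the client session.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Buggy: relative dates anchored to UTC. "Next Tuesday" resolves against the
# wrong local calendar, and due times land at odd local hours.
utc_now = datetime.now(timezone.utc)

# Fixed: anchor to the user's local clock and state it in the prompt.
local_now = datetime.now(ZoneInfo("America/Chicago"))  # example timezone
prompt_context = f"Current local time: {local_now.isoformat()} (America/Chicago)"
```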
What We Tried That Didn’t Work
ElevenLabs for confirmation audio. The voice was indistinguishable from a real person. Users loved it in demos. In production, about 40% of users reported discomfort in feedback sessions. They said things like “it sounds too real” and “I can’t tell if I’m talking to a person.” One user said she felt “tricked.” We switched to AWS Polly, which has a slightly synthetic quality, and users found it more trustworthy for a business tool. The lesson: the uncanny valley cuts both ways, and for a productivity tool your users interact with 100 times a day, “clearly AI” is the right call.
Streaming intent classification. We tried starting intent classification before the utterance finished, using interim transcripts to get a head start on execution. In theory, if you can predict intent from the first few words, you can execute before the user stops speaking. In practice, partial transcripts are too ambiguous. “Add a follow-up” could be create_task or create_reminder or log_call depending on what comes after. We abandoned this after two days.
Client-side noise cancellation. We tried running noise reduction in the browser using RNNoise compiled to WebAssembly. It worked for background noise but added 80ms of latency and introduced audio artifacts on plosive consonants (hard “p” and “b” sounds). Deepgram’s built-in noise handling was sufficient for office environments, so we removed it.
Where We Landed
After three months in production with the eight-person operations team:
- Average voice-to-action latency: 1.7 seconds (desktop), 2.4 seconds (mobile)
- End-of-speech accuracy: 94% of utterances correctly detected without cutoffs
- Intent classification accuracy: 91% across 22 action types
- Actions requiring correction (user had to repeat or retype): 8.2%
- Self-reported time savings: 45 minutes per person per day on average
The 45-minute number surprised me. I expected voice commands to save time on individual interactions. What I hadn’t expected was the reduction in context switching. When you can add a note without touching the keyboard, you stay focused on the conversation you’re in rather than tabbing between tools. The productivity gain is less about the seconds saved per command and more about the attention preserved.
When a Custom Speech Agent Is the Right Call
Off-the-shelf voice assistants (Siri, Alexa, Google Assistant) can’t authenticate against internal APIs, maintain workflow context across a session, or resolve domain-specific vocabulary like customer names and internal task categories. They’re designed for consumer queries, not business operations.
If you have fewer than 10 action types and they’re all straightforward commands, you might be able to cover the use case with a simpler voice-enabled chatbot and a basic Whisper-based transcription setup. The full SARA architecture earns its complexity above that threshold, when you need sub-2-second latency, multi-turn session context, and integration with proprietary systems.
We’ve used similar approaches in our call analyzer work, where real-time audio processing and accurate transcription were the core engineering challenge.
The architecture overlaps more than you’d expect with standard agent pipelines (see our AI agent architecture post for how tool-calling and error recovery patterns transfer directly). Speech agents are just agents with an audio front-end.
One honest limitation we still don’t have a great answer for: speaker verification. Right now, anyone physically near the microphone can trigger actions. For the current client, that’s fine because it’s a small team in a controlled space. For a use case where different users have different permissions, you’d need speaker identification layered in, and that adds significant complexity (and latency) to the pipeline.
If you’re evaluating whether voice interfaces make sense for your team’s workflow, we can prototype the core loop (speech in, action out) in 48-72 hours. Book a 30-minute call and bring your list of target actions.
FAQ
How much does it cost to build a custom speech agent?
The cost depends on the number of action types, integration complexity, and latency requirements. A focused voice command layer with 10-20 action types and integration into 1-2 internal systems typically takes 3-6 weeks of engineering time. For larger scope (30+ actions, multiple system integrations, mobile optimization), expect 2-3 months. We can scope this accurately in a 30-minute call.
Can speech agents work in noisy open-plan offices?
Yes, with caveats. Deepgram’s Nova-2 handles typical office background noise well. Very loud environments (contact center floors, manufacturing floors) usually need push-to-talk rather than always-on listening, which changes the UX significantly. For SARA, we tested in a quiet office and a standard coworking space. Both worked fine. A contact center floor environment caused too many false end-of-speech triggers.
What languages does Deepgram support at production quality?
Deepgram Nova-2 is optimized for English and delivers production-quality accuracy. Their Whisper-based models support 30+ languages but with higher latency and lower accuracy than Nova-2 on English. If you need non-English speech recognition at production latency, check Deepgram’s language support page and plan for a higher error rate than what I’ve cited here.
How do you handle misrecognized commands?
SARA returns a confidence score with every intent classification. Commands below 0.7 confidence trigger a spoken clarification request (“Did you mean to add a follow-up for Johnson?”). Above 0.85, it executes and confirms. Between 0.7 and 0.85, it executes but logs the command for review. The client’s team reviews the mid-confidence log weekly and uses it to identify which phrasings are causing systematic misclassification, which we address in prompt updates.
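The routing is a three-way branch. The thresholds are the ones above; the handlers below are trivial stubs standing in for the real TTS, execution, and logging code.

```python
# Sketch of the confidence routing described above. Thresholds come from the
# text; the handlers are stubs standing in for TTS, execution, and logging.
import asyncio

CLARIFY_BELOW = 0.70
REVIEW_BELOW = 0.85

async def speak(text: str) -> None:                       # stub: real version plays Polly audio
    print(f"[voice] {text}")

async def execute(intent: str, params: dict) -> dict:     # stub: real version calls internal APIs
    return {"ok": True, "intent": intent, **params}

async def log_for_review(intent: str, params: dict, confidence: float) -> None:
    print(f"[review-log] {intent} {params} conf={confidence:.2f}")

async def route(intent: str, params: dict, confidence: float) -> None:
    if confidence < CLARIFY_BELOW:
        # Too uncertain to act: ask the user to confirm before doing anything.
        await speak(f"Did you mean to {intent.replace('_', ' ')} for {params.get('account', 'that account')}?")
        return
    result = await execute(intent, params)
    if confidence < REVIEW_BELOW:
        # Executed, but flagged for the weekly mid-confidence review.
        await log_for_review(intent, params, confidence)
    await speak(f"Done: {intent.replace('_', ' ')} recorded.")

asyncio.run(route("create_task", {"account": "Johnson", "type": "follow-up"}, 0.78))
```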
Do you need a fine-tuned model or is a general LLM enough?
We haven’t needed custom model training for speech agent intent classification. Claude with a well-structured function schema and 20-30 example utterances per action type achieves 91%+ accuracy across 22 action types. If you had 200+ action types, a fine-tuned classifier for the first routing stage would make more sense. For most business use cases, prompt engineering gets you there without the training infrastructure.
If your team is spending significant time on repetitive data entry that follows predictable patterns, voice commands are often the lowest-friction way to eliminate it. Book a 30-minute call and we’ll walk through whether the use case fits the architecture.