Case Study

SARA: Speech-to-Action AI Agent

A voice-driven AI agent that converts spoken instructions into structured actions. Real-time speech recognition, intent parsing, and task execution through natural conversation.

Client: AI Speech Agent
Industry: AI / Voice Technology
Duration: Built in-house
Team: 2 engineers

  • Real-time speech processing
  • Multi-step action chains
  • WebSocket streaming pipeline
  • Sub-2s response latency

Why We Built This

Most voice assistants handle simple commands well. "Set a timer." "What's the weather?" Single intent, single action. They break when conversations get real: multi-step instructions, context that carries across turns, actions that depend on the outcome of previous actions.

SARA (Speech-Action Reasoning Agent) is our internal exploration of what a production-grade speech-to-action agent looks like. Not a chatbot with a microphone. An agent that listens, reasons about what you want done, and executes structured actions in sequence.

What It Does

  • Real-time speech recognition with streaming transcription. You don't wait for silence detection to start processing.
  • Intent decomposition that breaks complex spoken instructions into discrete, ordered actions.
  • Context tracking across conversation turns. "Do the same thing for last month" works because the agent remembers what "the same thing" means.
  • Action execution against connected systems (APIs, databases, tools) with confirmation and error handling.
  • Conversational clarification when instructions are ambiguous. The agent asks instead of guessing.

What makes SARA different from typical voice AI: Most voice interfaces transcribe speech, extract a single intent, and map it to a command. SARA processes speech as a planning problem: decompose the instruction, verify feasibility, execute in order, handle failures at each step. It's an agentic pipeline that happens to take voice as input.

How It Works

The architecture has four layers:

  • Speech layer: WebSocket-based streaming audio capture, real-time transcription via Deepgram, and voice activity detection to handle natural pauses without premature cutoff.
  • Understanding layer: LLM-based intent parsing that converts transcribed speech into a structured action plan. Handles multi-step instructions, references to previous context, and implicit actions.
  • Execution layer: Tool-calling framework that maps parsed intents to specific API calls or system actions. Each action returns a result that feeds into the next step.
  • Response layer: Generates natural spoken confirmation of what was done, including any clarification questions if the instruction was ambiguous. Text-to-speech output via streaming for low-latency responses.
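The four layers compose into a simple per-turn loop. Here's a minimal Python sketch of that loop; the names (`Action`, `handle_turn`) and interfaces are illustrative assumptions, not SARA's actual internals, and the speech layer is represented by an already-final transcript:

```python
from dataclasses import dataclass

# Hypothetical sketch of one conversational turn through the layers.
# The speech layer has already produced `transcript`; parse/execute/respond
# stand in for the understanding, execution, and response layers.

@dataclass
class Action:
    tool: str              # which tool/API to call
    args: dict             # arguments extracted from speech
    result: object = None  # filled in by the execution layer

def handle_turn(transcript, parse, execute, respond):
    """Run one turn: parse into a plan, execute in order, confirm."""
    plan = parse(transcript)              # understanding layer
    context = {}                          # earlier results feed later steps
    for step in plan:                     # execution layer, strictly ordered
        step.result = execute(step, context)
        context[step.tool] = step.result
    return respond(plan)                  # response layer
```

The key property is that each step's result lands in `context` before the next step runs, which is what makes "actions that depend on the outcome of previous actions" possible.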

Technical Decisions

Streaming over batch: We process audio as it arrives rather than waiting for the user to finish speaking. This shaves 1-2 seconds off perceived latency, which matters when you're having a conversation. The tradeoff is handling partial transcripts and mid-sentence corrections, which adds complexity to the intent parsing layer.
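One concrete consequence of streaming: transcripts arrive as revisable partials, and only finalized segments are safe to hand to the intent parser. A sketch of one common way to buffer this (the `is_final` flag mirrors Deepgram-style streaming results; the buffer class itself is a hypothetical illustration):

```python
class TranscriptBuffer:
    """Accumulates streaming ASR results, separating stable from interim text."""

    def __init__(self):
        self.committed = []  # finalized segments, safe to parse for intent
        self.interim = ""    # latest partial; may still be revised mid-sentence

    def feed(self, text, is_final):
        if is_final:
            self.committed.append(text)
            self.interim = ""
        else:
            # Partials supersede each other, so overwrite rather than append
            self.interim = text

    def stable_text(self):
        return " ".join(self.committed)
```

Intent parsing then runs on `stable_text()` only, which is what keeps mid-sentence corrections from triggering spurious actions.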

Action planning over direct mapping: Instead of mapping each utterance to a single command, SARA generates an action plan (an ordered list of steps) and executes them sequentially. If step 3 fails, it can re-plan from that point rather than failing the entire instruction.
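The re-plan-from-failure behavior can be sketched as a loop that replaces the tail of the plan when a step raises. Here `replan` is a hypothetical hook that would ask the LLM for a revised tail given the failure; everything below is an illustrative sketch, not SARA's code:

```python
def run_plan(plan, execute, replan, max_replans=2):
    """Execute steps in order; on failure, swap in a revised tail and continue."""
    i, replans = 0, 0
    results = []
    while i < len(plan):
        try:
            results.append(execute(plan[i]))
            i += 1
        except Exception as err:
            if replans >= max_replans:
                raise  # give up after repeated re-plans
            # Keep completed steps; replace the failed step and everything after
            plan = plan[:i] + replan(plan[i], err)
            replans += 1
    return results
```

Completed work is never redone; only the remainder of the plan is regenerated from the point of failure.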

Sub-2 second response target: End-to-end latency from the user finishing a sentence to hearing the first word of the response. This required careful pipeline optimization: streaming transcription, fast LLM inference (GPT-4o with function calling), and streaming TTS output.
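For a sense of how a sub-2s target forces the streaming choices above, here is an illustrative budget; the numbers are rough assumptions for this kind of pipeline, not measured figures from SARA:

```python
# Illustrative latency budget (milliseconds) from end-of-speech to first
# audible word of the response. All values are assumed, not measured.
budget_ms = {
    "endpoint_detection": 200,  # VAD deciding the user has finished
    "final_transcript": 150,    # streaming ASR flushing the last segment
    "llm_planning": 600,        # LLM producing the first function call
    "first_action": 400,        # first tool call round-trip
    "tts_first_audio": 300,     # streaming TTS time-to-first-byte
}
total = sum(budget_ms.values())  # must stay under the 2000 ms target
```

A batch pipeline would pay the full transcription and TTS costs serially after end-of-speech, which is why streaming each stage is what makes a budget like this plausible.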

Tech Stack

Python · FastAPI · WebSocket · Deepgram · GPT-4o Function Calling · React

Why This Matters for Clients

SARA isn't a client project. It's a capability demonstration. The underlying patterns (real-time speech processing, agentic task execution, multi-step planning) apply to any product where voice is the input:

  • Voice-driven data queries ("Show me last quarter's sales by region, then compare to the year before")
  • Hands-free workflow automation in field operations
  • Accessible interfaces for users who can't interact with screens
  • Customer service agents that handle multi-step requests without transfers

If you're building something that needs voice as an input channel, the architecture decisions we made here translate directly.

Want something like this built?

Tell us the problem. We'll tell you what 72 hours can produce.
