The compliance AI we shipped hits 94% agreement with human reviewers. It processes 200 calls overnight and has results in the dashboard by morning. Running costs land at roughly $0.04 per analyzed call in API fees.
None of those numbers came from picking the right LLM. They came from five architecture decisions made before and during the build, decisions that determine how the pipeline handles the real conditions of production call recordings: variable audio quality, multiple speakers, compliance rubrics that require judgment, and call volume that keeps growing.
We’ve documented the build story elsewhere, including the wrong turns (keyword matching sat at 58% accuracy, real-time streaming introduced an 8-12 second lag that made intervention impossible) and the two-week timeline. This post is specifically about the five architecture choices and what each one costs you if you get it wrong.
Choice 1: Transcription Model (Local vs Managed API)
The first instinct when building a compliance system is to keep data on-premise. Sales call recordings are sensitive. You don’t want audio leaving your infrastructure. The obvious pick is a local transcription model, Whisper large-v3, which is the best open-source option available and genuinely performs well on clean audio.
The problem: clean audio is not what you get from real sales calls.
We started the build with Whisper large-v3. On the clean portion of the client’s validation set, it ran at about 4% word error rate. Good enough. On the noisy portion (conference room speakerphone calls, mobile connections with dropout, home-office ambient noise), WER jumped to around 18%. At 18% WER, required regulatory phrases start failing to transcribe correctly. We were passing calls where required compliance language wasn’t clearly detectable in the transcript, not because the rep skipped it, but because the transcription garbled it.
We switched to Deepgram Nova-2 on day three. Nova-2 held at about 6% WER on the same noisy validation set, a 3x improvement on the calls that actually needed accurate transcription. Processing latency dropped significantly too: around 400 milliseconds per minute of audio versus 3.2 seconds per minute for Whisper on our hardware. At 200 calls per day, that’s roughly 2 hours of processing time saved overnight.
Here’s the cost comparison at the 200 calls/day scale:
| Option | Infrastructure | API cost | Noise WER | Latency/min |
|---|---|---|---|---|
| Whisper large-v3 (A100 on-prem) | ~$600-900/mo GPU amortized + ops | $0 | ~18% | 3.2s |
| Deepgram Nova-2 (managed API) | $0 | ~$0.02/call | ~6% | 400ms |
| Deepgram Nova-2 (self-hosted) | GPU + licensing overhead | per-seat | ~6% | ~500ms |
The on-prem estimate uses a single A100 GPU at roughly $10K hardware cost amortized over 18 months, plus maintenance. Real utilization and ops overhead add another 30-40%, and to process the daily volume reliably without queue backup you’d need at least two GPUs.
For compliance systems, the noise resilience argument outweighs the data-residency instinct in most cases. A 3% WER difference sounds small. In a pass/fail system, a missed phrase means a false negative: the call scores compliant when it wasn’t. That’s the failure mode that matters. The transcription model is the first lever in the pipeline, and errors compound downstream through diarization and LLM analysis.
When to go local: If your recording infrastructure is controlled (VoIP over ethernet, headset-only, call center environment with consistent SNR), Whisper on-prem can be the right answer, particularly when data residency is a hard compliance requirement. Test both models on your actual recording sample before committing. The 18% WER I quoted above is from one client’s environment. Your distribution will differ.
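For that comparison, the managed-API side is a small script. Below is a minimal sketch of a Nova-2 request against Deepgram’s prerecorded /v1/listen endpoint. The parameter names reflect the public REST API as I understand it, so verify them against current Deepgram docs; the WAV content type is an assumption about your recording format.

```python
# Minimal sketch: transcribe one recording with Deepgram Nova-2 via the
# prerecorded REST endpoint. Verify parameters against current Deepgram docs.
import requests

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"

def transcribe(audio_path: str, api_key: str) -> dict:
    with open(audio_path, "rb") as f:
        resp = requests.post(
            DEEPGRAM_URL,
            params={
                "model": "nova-2",       # the model compared above
                "diarize": "true",       # built-in speaker labels (see Choice 2)
                "smart_format": "true",
                "punctuate": "true",
            },
            headers={
                "Authorization": f"Token {api_key}",
                "Content-Type": "audio/wav",   # assumes WAV recordings
            },
            data=f,
            timeout=300,
        )
    resp.raise_for_status()
    # transcript sits under results.channels[0].alternatives[0] in the response
    return resp.json()
```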
Choice 2: Diarization Scope
Compliance is about what the rep said. Not the customer, not both parties combined. To isolate rep turns, you need speaker diarization, which separates the audio by speaker so only rep segments go into LLM analysis.
Most managed transcription APIs include built-in diarization. Deepgram’s works reliably for standard two-party calls: one rep, one customer, clean handoffs. At two parties with reasonable audio quality, built-in diarization produces accurate speaker labels without additional tooling.
Three-party calls are where it breaks.
Multi-party calls (conference rooms, manager-accompanied calls, training sessions with an observer) reliably caused Deepgram’s built-in diarization to produce incorrect speaker attribution. Segments would merge. Rep turns got labeled as customer or vice versa. Because our compliance rubric applies only to the rep, attributing a customer statement to the rep label means we’re evaluating the wrong content.
Our client had about 15% of calls with three or more speakers. We ran pyannote.audio as a fallback for those calls. pyannote uses a different diarization architecture that handles multi-party scenarios more accurately, at roughly 2x the processing latency (around 800ms per minute versus 400ms for Deepgram built-in).
The routing decision: a quick participant-count check against call recording metadata runs before diarization. Two-party calls go to Deepgram built-in. Multi-party calls route to pyannote. This adds complexity (two diarization implementations to maintain) but avoids the latency penalty on 85% of calls.
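Here’s a sketch of that routing step, assuming pyannote.audio 3.x and a call object that already carries the participant count from recording metadata. The call attributes and the align_words_to_turns helper are illustrative names, not part of either library.

```python
# Sketch of the diarization routing: two-party calls keep Deepgram's built-in
# labels, multi-party calls are re-diarized with pyannote.audio.
from pyannote.audio import Pipeline

# Loaded once at worker startup; requires a Hugging Face access token.
pyannote_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="HF_TOKEN"
)

def diarize(call):
    if call.participant_count <= 2:
        # Two-party call: trust the built-in diarization already attached
        # to the Deepgram transcript from Choice 1.
        return call.deepgram_segments
    # Multi-party call: run pyannote and re-align word timestamps to its turns.
    diarization = pyannote_pipeline(call.audio_path)
    turns = [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in diarization.itertracks(yield_label=True)
    ]
    return align_words_to_turns(call.words, turns)  # hypothetical helper
```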
Wrong diarization is worse than no diarization for this use case. If rep and customer turns mix before LLM analysis, the model evaluates both sides against rubric criteria that only apply to one side. False positives and false negatives both degrade reviewer trust faster than any other failure mode in our experience.
The decision to make: What’s your multi-party call percentage? If it’s under 5%, the simpler path is to route all multi-party calls to human review (same as the audio quality gate in Choice 3). If it’s above 10%, the hybrid diarization approach is worth the implementation overhead.
Choice 3: Audio Quality Gate Before the Pipeline
The most counterintuitive decision in this build: routing bad-quality audio to human review rather than through the AI.
The argument against seems obvious. You’re building an AI system specifically to reduce human review. Adding a human review queue feels like a step backward.
The argument for holds under load. When transcription WER climbs above roughly 12-15% on a given call, required regulatory phrases become unreliable to detect. The LLM might parse a garbled phrase as a compliance pass, a compliance fail, or a low-confidence result depending on how the phrase degraded. False negatives on noisy calls erode reviewer trust over time, not on day one but over months as reviewers spot AI results that they know are wrong.
We built a signal-to-noise ratio check that runs before transcription. Calls below the threshold never enter the AI pipeline. They go directly to a human review queue.
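The exact check depends on your recording format, but the core of an SNR gate is a few lines of signal arithmetic: estimate the noise floor and speech level from frame energies and compare against a threshold. A crude sketch follows; the 15 dB threshold and the percentile choices are illustrative, not the values we tuned for this client.

```python
# Crude pre-transcription SNR gate: estimate speech level vs noise floor
# from short-frame energies. Threshold and percentiles are illustrative.
import numpy as np
import soundfile as sf

def estimate_snr_db(path: str, frame_ms: int = 50) -> float:
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)          # collapse stereo to mono
    frame = int(sr * frame_ms / 1000)
    n = len(audio) // frame
    energies = np.array([
        np.mean(audio[i * frame:(i + 1) * frame] ** 2) for i in range(n)
    ])
    energies = energies[energies > 0]
    speech = np.percentile(energies, 90)    # loud frames approximate speech
    noise = np.percentile(energies, 10)     # quiet frames approximate noise floor
    return 10 * np.log10(speech / noise)

SNR_GATE_DB = 15.0                          # illustrative threshold

def route(path: str) -> str:
    return "ai_pipeline" if estimate_snr_db(path) >= SNR_GATE_DB else "human_review"
```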
About 8% of calls triggered the gate at this client. At 200 calls per day, that’s roughly 16 calls in the human queue per day, a small number relative to the alternative of running 200 calls through a pipeline where 16 would produce degraded output.
The operational implication: your human reviewer capacity has to handle the gate queue without creating backlog. At 200 calls/day with an 8% gate rate, that’s roughly 400 additional calls per month in the human queue. If reviewer capacity can’t absorb that, the threshold needs adjustment. The right gate level depends on your recording infrastructure and how much manual capacity you have.
Why this matters for long-term adoption: Reviewer trust is not a metric that appears in standard AI evaluations, but it’s the variable that determines whether the system actually changes behavior. If reviewers learn that the AI produces wrong results on certain calls, they start treating all output as advisory. They manually review flagged calls anyway. The efficiency gain disappears. A system that’s 94% accurate on every call it processes, plus a small human review queue for the ones it can’t handle, generates more actual compliance improvement than a system that covers 100% of calls at 89% accuracy.
Choice 4: LLM Selection and Prompt Architecture
Once you have a clean diarized transcript of rep turns only, the compliance scoring question is: which model, with what prompt structure?
We tested GPT-4o and Claude 3.5 Sonnet against the same 200-call validation set:
| Model | Accuracy vs reviewers | Explanation quality | Cost per 15-min call | Latency p99 |
|---|---|---|---|---|
| GPT-4o | 94% | Adequate | ~$0.018 | 3.2s |
| Claude 3.5 Sonnet | ~93-95% (run-to-run variance) | Noticeably better | ~$0.024 | 2.8s |
On this task, the accuracy difference between GPT-4o and Claude 3.5 Sonnet was within measurement noise. Both models could meet the 94% threshold. The meaningful difference was in explanation quality: Claude 3.5 Sonnet produced clearer natural-language descriptions of why a requirement was failed, which the coaching team found more useful for rep conversations than GPT-4o’s outputs.
We went with GPT-4o for the initial production deploy primarily on cost. The $0.006/call premium for Claude 3.5 Sonnet adds up to roughly $440/year at 200 calls/day. Not enormous, but accuracy was comparable and GPT-4o’s latency was acceptable. The client is now evaluating whether the coaching team’s preference for Claude’s explanations justifies the cost difference.
The decision that paid off: We built a model-agnostic interface. The compliance scoring function takes a model configuration and a prompt as parameters. Swapping to Claude 3.5 Sonnet is a config change, not a code change. One sprint of additional setup work at the start prevented a full rewrite when the client wanted to run the comparison.
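A minimal sketch of what that interface looks like. The provider-specific call functions and build_prompt are stand-ins for code described elsewhere in this post, and the config fields are the obvious ones rather than our full production schema.

```python
# Model-agnostic scoring interface: the model is a config value, not a code path.
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str            # "openai" or "anthropic"
    model: str               # e.g. "gpt-4o" or a Claude 3.5 Sonnet model id
    temperature: float = 0.0

def score_call(transcript: str, rubric: str, cfg: ModelConfig) -> dict:
    """Run the compliance prompt against whichever model the config names.

    build_prompt, _call_openai, and _call_anthropic are hypothetical helpers:
    build_prompt assembles the template shown below, and the provider calls
    wrap the respective SDKs.
    """
    system_prompt, user_prompt = build_prompt(rubric, transcript)
    if cfg.provider == "openai":
        return _call_openai(system_prompt, user_prompt, cfg)
    if cfg.provider == "anthropic":
        return _call_anthropic(system_prompt, user_prompt, cfg)
    raise ValueError(f"unknown provider: {cfg.provider}")
```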
The prompt structure matters as much as model selection. This is what we use:
System: You are a compliance analyst reviewing a sales call transcript.
Evaluate each requirement against rep turns only.
Output must be valid JSON matching the schema exactly.
COMPLIANCE REQUIREMENTS:
[numbered list of specific, measurable requirements]
TRANSCRIPT (rep turns only):
[transcript text]
Schema: {
"results": [
{
"requirement_id": integer,
"result": "PASS" or "FAIL",
"quote": string or null,
"timestamp": "MM:SS-MM:SS" or null,
"note": string or null
}
],
"overall": "COMPLIANT" or "NON_COMPLIANT",
"confidence": "HIGH" or "MEDIUM" or "LOW"
}
Why schema enforcement matters for reliability: Without enforced structured output (response_format: { type: "json_object" } in the API call), LLMs return valid JSON reliably on straightforward calls and freeform explanations on complex or ambiguous ones, exactly the calls where you most need structured output. We measured about 3% JSON parsing failures without schema enforcement on early test runs. At 200 calls per day, that’s 6 calls per day that need error handling and manual re-processing. Enforcing the schema brought parsing failures to near zero.
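For reference, here’s a sketch of the OpenAI branch of the scoring function with the schema enforcement flag set, using the official openai Python SDK. The model name and error handling are illustrative; the retry path belongs to the queue worker in Choice 5.

```python
# Sketch: schema-enforced compliance scoring call via the openai SDK.
import json
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def call_openai(system_prompt: str, user_prompt: str, model: str = "gpt-4o") -> dict:
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={"type": "json_object"},   # the schema enforcement flag
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    raw = resp.choices[0].message.content
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Rare with response_format enforced; surface it rather than
        # silently passing the call as compliant.
        raise ValueError(f"unparseable model output: {raw[:200]}")
```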
Rubric design is the accuracy lever, not model selection. Specific, measurable requirements (“rep must verbally confirm recording consent within the first 90 seconds”) reach 90-95% accuracy with current models. Vague requirements (“rep must explain the product clearly”) reach 70-80% regardless of which model you use. Two days of rubric workshops with the compliance team were more valuable than any model optimization work we did.
GPT-4o-mini for higher volumes: At 1,000 calls/day, GPT-4o LLM costs alone run around $6,570/year. At that scale, it’s worth testing GPT-4o-mini at roughly $0.001-0.002/call (an order of magnitude cheaper). On simple binary rubrics, it frequently maintains acceptable accuracy. On rubrics requiring contextual judgment, it struggles. Test on your specific rubric before committing.
Choice 5: Pipeline Architecture (Sync vs Async Queue)
The last choice is the one teams most often get wrong on their first production pipeline: whether to process calls synchronously or asynchronously.
Synchronous processing: call arrives via webhook, it goes through transcription and diarization and LLM scoring, the result gets stored, the next call starts. Simple to build, easy to reason about, fine at low volume.
The ceiling it hits: if a 15-minute call takes 10 seconds end-to-end through the pipeline, 200 calls processed sequentially take about 2,000 seconds, or 33 minutes for a single worker. That’s acceptable for same-morning results. But if you’re adding new clients, if call volume doubles, if peak days (end-of-month, audit cycles) send 400 calls through the system, synchronous processing creates a backlog that grows faster than it clears.
We built on an async queue: Redis as the broker, a pool of workers pulling from the queue, each processing one call at a time. Calls arrive via webhook and drop into the queue immediately. Workers process continuously. Results land in the database as they complete. Scaling from 4 to 8 workers for a peak period is a config change.
Why this matters for reliability, not just throughput:
Retry logic becomes possible. A failed call (network error, API timeout, LLM refusal on a particularly long call) returns to the queue with exponential backoff: 30 seconds, then 2 minutes, then 10 minutes. After three failures, it goes to a dead-letter queue for manual inspection. Our coverage stays at 100% of processable calls. In a synchronous system, a failed call is a gap in coverage until someone notices and reruns it manually.
Worker failure is isolated. If one worker crashes mid-call, the in-progress message returns to the queue and another worker picks it up (assuming proper queue acknowledgment semantics, which Redis supports with BRPOPLPUSH). In a synchronous single-process system, a crash mid-run means partial results and unclear state.
Monitoring becomes direct. Queue depth, worker utilization, and processing latency per stage are observable metrics. We set an alert when queue depth exceeds 50 calls unprocessed. In a synchronous system, you find out you’re behind when the compliance team asks why this morning’s dashboard is incomplete.
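A sketch of the worker loop tying those pieces together: BRPOPLPUSH to claim a message crash-safely, exponential backoff on failure, and a dead-letter list once the backoff schedule is exhausted. Key names, the process_call pipeline function, and the retry dispatcher are illustrative; this is the shape of the pattern, not our production worker.

```python
# Sketch of a queue worker: crash-safe claiming, backoff, dead-letter queue.
import json
import time
import redis

r = redis.Redis()
QUEUE, PROCESSING, DEAD = "calls:pending", "calls:processing", "calls:dead"
BACKOFF = [30, 120, 600]          # seconds: 30s, 2min, 10min

def worker_loop():
    while True:
        # Atomically move one message to the processing list so a crashed
        # worker leaves it recoverable instead of lost.
        raw = r.brpoplpush(QUEUE, PROCESSING, timeout=5)
        if raw is None:
            continue                      # queue empty, poll again
        job = json.loads(raw)
        try:
            process_call(job["call_id"])  # transcription -> diarization -> LLM (hypothetical)
            r.lrem(PROCESSING, 1, raw)    # ack: remove the claimed message
        except Exception:
            r.lrem(PROCESSING, 1, raw)
            attempts = job.get("attempts", 0) + 1
            if attempts > len(BACKOFF):
                r.lpush(DEAD, raw)        # dead-letter once backoff is exhausted
            else:
                job["attempts"] = attempts
                # Schedule the retry; a small dispatcher (not shown) moves
                # due jobs from this sorted set back onto QUEUE.
                r.zadd("calls:retry",
                       {json.dumps(job): time.time() + BACKOFF[attempts - 1]})
```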
Infrastructure cost at 200 calls/day: Redis on a small managed instance runs about $15-20/month (AWS ElastiCache cache.t3.micro or equivalent). Workers at this scale are lightweight Python processes; four workers on a 2-vCPU instance handle 200 calls/day without saturation. Total queue and compute infrastructure runs roughly $50-70/month, which is small relative to the API costs.
When synchronous is fine: Below roughly 50 calls/day, the failure modes are manageable and the complexity overhead of a queue is genuine. A simple synchronous processor with good error logging is the right starting point. Build the async queue when volume warrants it, and design the synchronous processor so that adding the queue later is a clean step up, not a rewrite.
The Decisions Before Hiring an Engineering Team
These five choices determine production cost and reliability more than model selection does. They’re also made early, by whoever defines the architecture in the first two weeks. By the time you see results in staging, most of these decisions have consequences that are expensive to reverse.
If you’re evaluating vendors or preparing to build this internally, here’s what you should have clear answers to before the first sprint:
- Your audio environment. VoIP and headset only, or are mobile and speakerphone calls common? This determines whether on-prem transcription is viable and what WER you’ll get in practice on the calls that actually need accurate detection.
- Multi-party call percentage. What share of your calls have three or more speakers? Under 5%, route to human review. Above 10%, plan for the hybrid diarization approach we used.
- Human review tolerance. Is your compliance team set up to handle a manual queue for bad-audio calls? What percentage of calls can go to human review without creating backlog? The gate rate and reviewer capacity have to match.
- Rubric specificity. How precise are your compliance requirements right now? “Reps should be professional” is not a rubric the AI can score. “Rep must read the specific disclosure language from the approved script before discussing pricing” is. The rubric workshop is the non-engineering deliverable that determines everything downstream.
- Call volume trajectory. Are you at 50 calls/day now but planning for 500 in six months? Build the async architecture from the start. Retrofitting it into a synchronous pipeline after launch is a week of work you could avoid.
The system we shipped hit 94% accuracy because the pipeline handled the actual conditions of production call recordings, not because we picked an optimal LLM. The model ran on clean, correctly diarized, high-SNR transcript data. That’s what the first four choices provide. The fifth choice is what keeps it working as volume grows.
FAQ
What does it cost to build a sales call compliance AI?
Build cost for a well-scoped system typically runs $15,000 to $25,000 for a four to six week engagement covering transcription pipeline, diarization, LLM scoring, quality gate, and a dashboard. Ongoing API costs at the 200 calls/day scale run roughly $0.04 per call in transcription and LLM fees. Infrastructure (queue, workers, storage) adds $50-80/month fixed. The manual QA alternative (20 minutes of reviewer time per call at $25/hour) costs around $8.33 per call. The economics favor automation above roughly 15 calls per day.
Should I build custom or use Gong, Chorus, or Observe.ai?
Off-the-shelf wins when your compliance requirements align with what those platforms score: standard sales metrics, coaching flags, deal intelligence. Custom build wins when your rubric includes regulatory-specific criteria with required language, when you’re on telephony infrastructure vendor APIs don’t cover, or when 100% call coverage against an auditable custom rubric is a regulatory requirement. We’ve covered that build-vs-buy decision in more detail here.
How accurate is AI compliance scoring versus human review?
On this build, 94% agreement with human reviewers on a 200-call labeled validation set. The determinant is rubric clarity, not model selection. Specific, measurable criteria with quoted passage requirements reach 90-95% consistently with current models. Vague criteria (“rep must fully explain the product”) run 70-80% regardless of which model you use. Keyword matching alone sits around 55-62%.
How long to deploy from scratch?
Two weeks to a working prototype with a validated rubric and a real call sample. Four to six weeks for production-grade with an async queue, dashboard, and integration into your call recording platform. The non-engineering bottleneck is rubric definition: getting compliance, legal, and sales to agree on specific pass/fail criteria for edge cases takes two to three days minimum when starting from nothing. That workshop is not optional and can’t be done async.
What’s the biggest technical mistake teams make building a compliance AI?
Optimizing model accuracy before defining the pipeline. The compliance system we shipped reached 94% because the transcription handled noisy audio reliably, the diarization produced correct speaker attribution, and the quality gate kept bad-audio calls from degrading reviewer trust. Those four choices were made before the first prompt was written. A 97%-accurate model running on degraded transcripts and mixed speaker turns doesn’t produce a 97%-accurate system. It produces a system reviewers stop trusting.
Evaluating whether to build a sales call compliance AI or figuring out where the architecture complexity actually lives? Book a 30-minute call. We’ll walk through these five choices against your specific call volume, recording infrastructure, and rubric requirements.