About six weeks after we shipped the call compliance system, the client’s head of compliance asked for a retrospective. He’d written notes on every major decision point from the project. I expected him to want to revisit the technical choices. Instead, his first slide was titled: “The decisions we made before writing code were worth more than the decisions we made during the build.”
We’ve written about the full architecture, what we tried that didn’t work, and the patterns we found once the system was processing calls at scale. The full engagement is documented in our call analyzer case study. This post is about the five decisions that the client flagged in that retrospective as the ones that determined the outcome. Not all of them are technical.
Decision 1: Pre-Agree on the Accuracy Threshold
Before we ran a single evaluation, we agreed on what “good enough” meant: 94% agreement with human reviewers.
That number came from a specific calculation. The client’s two most experienced compliance reviewers agreed with each other on 95.2% of calls when they independently scored the same set. We targeted slightly below that, because matching human-level agreement was the realistic ceiling for an AI system, not the floor.
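The calculation itself is trivial; here’s a minimal sketch, assuming simple per-call pass/fail labels (the data below is illustrative, not the real evaluation set):

```python
def percent_agreement(scores_a: list[bool], scores_b: list[bool]) -> float:
    """Fraction of calls on which two scorers reached the same pass/fail verdict."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both scorers must rate the same set of calls")
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)

# Illustrative labels only; the real sets were far larger.
reviewer_a = [True, True, False, True, False, True]
reviewer_b = [True, True, False, False, False, True]
ai_system  = [True, True, False, True, False, True]

ceiling = percent_agreement(reviewer_a, reviewer_b)  # human-vs-human: the realistic ceiling
THRESHOLD = 0.94                                     # pre-agreed, in writing

print(percent_agreement(ai_system, reviewer_a) >= THRESHOLD)  # binary pass/fail
```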
The reason this matters: without a pre-agreed threshold, you end up in an unwinnable conversation after the build. The client sees the system scoring at 93.8% and asks whether that’s acceptable. The engineering team says “it’s close to the target.” The client says “it’s below it.” Everyone has a different number in their head, because nobody wrote one down.
Writing it down beforehand turned the evaluation into a binary pass/fail against a known standard. We hit 94% on the evaluation set. The project moved forward without negotiation.
This is a business decision, not an engineering one. Your vendor can tell you what’s achievable; only you can tell them what’s good enough for production. If you’re commissioning a project like this, push to get that number in writing before evaluation starts, not after.
Decision 2: One Rubric Owner, Not a Committee
The compliance checklist took two days to finalize. It could have taken three weeks.
The thing that kept it to two days was naming a single decision-maker on day one. The client’s compliance lead owned the rubric. Legal could add comments; sales could flag concerns; compliance sub-teams could surface edge cases. But the compliance lead had final sign-off, and once he signed off, the rubric was locked.
Without that structure, here’s what we watched almost happen: legal wanted to add eight criteria covering regulatory disclosure language from three different jurisdictions. Sales wanted to remove two criteria they considered “too strict.” Three separate compliance managers each had a list of edge cases they wanted explicitly addressed.
All of those inputs were reasonable. None of them, individually, were the problem. The problem was that without a single owner, every conversation re-opened the whole rubric. We’d have been in workshops for two weeks.
Single owner, final authority, locked rubric on day three. That’s the decision that let us start building.
Decision 3: Route Bad Audio Away from the AI
Deepgram Nova-2 holds at about 6% word error rate on clean audio. On noisy recordings (speakerphone calls, mobile connections with dropout, conference rooms with echo), it can drift significantly higher. At those error rates, the exact regulatory phrases the rubric looks for may not survive transcription intact, which means the system can’t reliably detect them.
The naive response is: make the AI more tolerant of transcription errors. Prompt it to accept variations. Use fuzzy matching.
We went the other way. We built an audio quality check that ran before transcription and flagged low signal-to-noise recordings for human review. Those calls never entered the AI pipeline at all.
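Here’s a rough sketch of that gate, assuming 16 kHz mono 16-bit WAV recordings and a crude energy-based SNR estimate. This isn’t our production check, just the shape of the idea; the cutoff is a placeholder you’d tune against your own recordings:

```python
import wave
import numpy as np

SNR_FLOOR_DB = 15.0  # placeholder cutoff; tune against your own call audio

def estimate_snr_db(path: str, frame_ms: int = 50) -> float:
    """Crude energy-based SNR proxy: loudest frames vs quietest frames."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        samples = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    samples = samples.astype(np.float64)
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    energies = (samples[: n * frame].reshape(n, frame) ** 2).mean(axis=1)
    energies = np.sort(energies[energies > 0])
    k = max(1, len(energies) // 10)
    noise = energies[:k].mean()    # quietest 10% of frames approximates the noise floor
    signal = energies[-k:].mean()  # loudest 10% approximates speech energy
    return 10 * np.log10(signal / noise)

def route(path: str) -> str:
    # Low-SNR calls never enter the AI pipeline; they go straight to humans.
    return "human_review" if estimate_snr_db(path) < SNR_FLOOR_DB else "transcription"
```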
This felt backward at first. We were supposed to reduce manual review workload, and now we were creating a new manual queue. But the alternative was worse. If reviewers learn that the AI produces wrong results on certain calls, they start second-guessing its output on all calls. A system that’s 94% accurate on the calls it processes, plus a smaller human review queue for the ones it can’t handle reliably, beats a system that’s 89% accurate across everything.
Wrong AI results erode trust faster than right ones earn it. The routing decision kept the automated outputs trustworthy.
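The back-of-envelope version, with a hypothetical share of calls gated to the human queue:

```python
# The 12% gated share and the human-queue accuracy are assumptions for
# illustration, not engagement numbers.
gated = 0.12                     # share of calls routed to the human queue
ai_acc, human_acc = 0.94, 0.99   # AI on clean calls; reviewers near-perfect

routed = (1 - gated) * ai_acc + gated * human_acc  # ~0.946 overall
tolerant = 0.89                                    # one fuzzy pipeline for everything

# The routed design wins on overall accuracy, and its automated outputs
# hold at 94%, which is what keeps reviewers trusting them.
```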
Decision 4: Batch First, Real-Time Later
The client wanted live call monitoring. A supervisor watching a dashboard could intervene if a rep missed a required disclosure in the first 90 seconds.
We spent a day evaluating this. The latency math didn’t work: streaming transcription with chunked audio introduced an 8 to 12 second lag between speech and any available text. By the time analysis came back, the moment in the call had passed. And live monitoring requires supervisors actively watching dashboards during every call, which was a significant operational change the client’s team wasn’t set up for.
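For a sense of where the lag comes from, here’s an illustrative budget; the individual line items below are assumptions, but the total lands in the range we saw:

```python
# Illustrative latency budget for the live-monitoring path.
audio_buffer   = 5.0  # seconds of audio chunked before each streaming STT call
transcription  = 2.0  # STT turnaround per chunk
llm_analysis   = 3.0  # scoring the partial transcript against the rubric
alert_push     = 0.5  # surfacing the result on a supervisor dashboard

lag = audio_buffer + transcription + llm_analysis + alert_push  # 10.5 s
# A disclosure missed at 0:30 surfaces near 0:40; the moment has passed.
```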
More importantly, we asked what the client actually needed to solve. The core problem wasn’t catching individual issues in the moment. It was identifying which reps had systematic compliance gaps so managers could run coaching conversations with them. That’s a post-call analysis job.
We scoped live monitoring as an explicit phase two, got written agreement on that, and shipped batch analysis of the previous day’s calls, with results available each morning. Two months after launch, the client told us they still hadn’t needed phase two. The next-morning batch view was enough for everything they were actually doing.
Saying no to a feature isn’t always the right call. In this case, the feature the client asked for would have doubled build complexity while solving a problem they didn’t actually have.
Decision 5: Design for the Monday Morning Workflow
The first version of the dashboard was built around what the system knew: individual call views with per-criterion pass/fail breakdowns, detailed quote extraction, overall compliance score per call.
We watched the QA team use it for one week. They didn’t start with individual call views. They came in Monday morning and immediately asked: who needs coaching this week?
The system had that data. But the default view started with a list of recent calls sorted by date. Getting from that view to “here are the three reps with the worst compliance rates this week” took four clicks. Within two days of watching reviewers use the product, it was clear the entry point was wrong.
We rebuilt the default view as a team summary sorted by compliance rate, with a current-week-versus-last-week comparison at the top. Individual call details moved to a second level. The system had always been able to answer “who needs coaching,” but we’d buried the answer where nobody would find it.
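The fix was a re-aggregation of data the system already had. Here’s a sketch of the new entry point, with illustrative record fields:

```python
from collections import defaultdict
from datetime import date, timedelta

def weekly_compliance(calls: list[dict], week_start: date) -> dict[str, float]:
    """Per-rep pass rate for calls in the week starting at week_start."""
    passed, total = defaultdict(int), defaultdict(int)
    for call in calls:
        if week_start <= call["date"] < week_start + timedelta(days=7):
            total[call["rep"]] += 1
            passed[call["rep"]] += call["compliant"]
    return {rep: passed[rep] / total[rep] for rep in total}

def coaching_view(calls: list[dict], this_week: date) -> list[tuple]:
    """The Monday-morning question, answered in one view."""
    current = weekly_compliance(calls, this_week)
    previous = weekly_compliance(calls, this_week - timedelta(days=7))
    rows = [
        (rep, rate, rate - previous.get(rep, rate))  # (rep, this-week rate, delta)
        for rep, rate in current.items()
    ]
    return sorted(rows, key=lambda row: row[1])  # worst compliance first
```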
This wasn’t a technical insight. It came from watching people use the product, not from asking them what they wanted. When we asked in the initial scoping, they described a system for reviewing calls. When we watched them use it, they were managing coaching conversations. Those are different workflows and they need different entry points.
GPT-4o’s structured output made it relatively easy to restructure the data layer once we understood the real workflow. The harder part was recognizing the mismatch in the first place.
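For context, the scoring call looks roughly like this with the OpenAI Python SDK’s structured-output parsing; the rubric fields and prompt here are hypothetical:

```python
from openai import OpenAI
from pydantic import BaseModel

class CriterionResult(BaseModel):
    criterion: str
    passed: bool
    supporting_quote: str  # verbatim evidence from the transcript

class CallScore(BaseModel):
    results: list[CriterionResult]

client = OpenAI()

def score_call(transcript: str, rubric: str) -> CallScore:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": f"Score this sales call against the rubric below. "
                        f"Quote verbatim evidence for each criterion.\n\n{rubric}"},
            {"role": "user", "content": transcript},
        ],
        response_format=CallScore,
    )
    return completion.choices[0].message.parsed
```

Because results come back as typed objects, re-aggregating them per rep instead of per call was a data-layer change, not a model change.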
The Pattern Across All Five
Looking back, each of these decisions reduced ambiguity in a different place. Pre-agreed accuracy threshold reduced post-build negotiation ambiguity. Single rubric owner reduced design-phase ambiguity. Audio quality gating reduced output ambiguity. Batch-first scoping reduced scope ambiguity. Watching the real workflow reduced requirements ambiguity.
None of these are AI insights. They’re project execution insights that happened to apply to an AI project. The system achieved 94% compliance accuracy, 100% call coverage (with the bad-audio queue), and a 95% reduction in manual QA time. The technical choices contributed to that. But the five decisions above are why the project didn’t spend six months circling in workshops.
If you’re commissioning a call compliance AI, or anything with similar requirements (structured scoring, rubric-based evaluation, scale), these are the decisions worth settling before your first engineering conversation.
FAQ
How much does it cost to build a sales call compliance AI?
For a system processing a few hundred calls per day, the build typically runs $15,000 to $25,000 for a four to six week engagement covering transcription, diarization, LLM scoring, and a basic dashboard. Ongoing costs depend on call volume: at $0.04 per call analyzed, a team running 200 calls per day is looking at around $240 per month in compute. That’s almost always cheaper than the manual QA the system replaces.
How long does it take to build one?
Two weeks to a working prototype with a compliance rubric your team has validated; four to six weeks for a production-grade system with a dashboard and integration into your existing call recording platform. The timeline assumes you can dedicate a compliance lead to the rubric workshops in week one, since that’s the non-engineering bottleneck.
Should I build custom or buy Gong/Chorus/Observe.ai?
Off-the-shelf tools win when your compliance rubric is standard and your call volume doesn’t require custom scoring logic. Custom build wins when your rubric is proprietary, when you need to score every call against regulatory-specific criteria, or when vendor APIs don’t cover your telephony stack. We’ve written a longer comparison in the call analyzer build post.
What accuracy should I expect?
Realistically: 90 to 95% agreement with your human reviewers, depending on how precisely you’ve defined your rubric. The 94% we hit came from a well-defined rubric and a clean evaluation set. Vague rubric criteria or a noisy evaluation set will reduce that number. Keyword matching alone typically runs 55 to 62%.
What do I need to define before we start building?
Three things: (1) a written compliance rubric with specific pass/fail criteria, (2) a labeled evaluation set of 100 to 200 calls that a human has scored against that rubric, and (3) a named decision-maker who can sign off on rubric changes. Engineering teams can move fast once those three things exist. Without them, the project stalls on definition work.
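If it helps to see the shape of those artifacts, here’s a hypothetical version (names and values are illustrative, not any client’s actual rubric):

```python
rubric = [
    {
        "id": "recording_disclosure",
        "description": "Rep discloses that the call is recorded",
        "pass": "Explicit disclosure within the first 60 seconds",
        "fail": "No disclosure, or disclosure after the 60-second mark",
    },
    # ...one entry per criterion, each with unambiguous pass/fail language
]

evaluation_set = [
    {"call_id": "c-0001", "labels": {"recording_disclosure": True}},
    # ...100-200 calls, human-scored against the rubric above
]

rubric_owner = "head_of_compliance"  # the one person who signs off on changes
```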
Building a sales call compliance AI or trying to figure out whether to build or buy? Book a 30-minute call and we’ll walk you through the decision framework we use with clients.