A client asked me in a sprint review whether a feature was “done.” I said yes.
She said, “So I can show this to my users?”
I said, “Not yet. We still need to connect it to your auth system.”
She looked at me. “How is that done?”
She was right to push back. I’d used the word “done” to mean code-complete. She’d heard it as ready-to-ship. Same word, two different definitions.
That conversation happened in my third year running AI development projects, and it’s when I started writing definitions down before the sprint started rather than after it ended.
Why “Done” Is Harder to Define in AI Than in Regular Software
In a standard software sprint, done is relatively clean. The feature renders. The form submits. The API returns the expected payload. You write a test, it passes, you move on.
AI adds a layer that regular software doesn’t have: the model’s behavior isn’t binary. A form either submits or it doesn’t. A language model either understands what the user meant or it doesn’t, but “doesn’t” exists on a spectrum from “subtly off” to “completely wrong,” and the client can’t always tell the difference in a demo.
This creates two failure modes I’ve seen on nearly every project.
Failure mode one: the model works, but no one agreed on what “works” means before the sprint started. The client sees the demo, it looks good, they say done. A week after launch, they discover the model misclassifies a specific query type that their users hit constantly. From the client’s side, this feels like a bug that appeared after delivery. From the engineering side, it was never tested for.
Failure mode two: the model works by the agreed metric, but the client didn’t anticipate what the metric would feel like in practice. An AI document parser was passing our 90% accuracy target, but the 10% failures were concentrated on the highest-priority document type. Technically done. Practically a problem.
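That second failure is easy to catch mechanically if the eval reports per-slice accuracy next to the aggregate instead of a single number. A minimal sketch of the idea (the document types and counts here are hypothetical, not from the project described):

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """Compute aggregate accuracy alongside per-slice accuracy,
    where results is a list of (slice_key, was_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_key, correct in results:
        totals[slice_key][0] += int(correct)
        totals[slice_key][1] += 1
    per_slice = {k: c / n for k, (c, n) in totals.items()}
    overall = (sum(c for c, _ in totals.values())
               / sum(n for _, n in totals.values()))
    return overall, per_slice

# Hypothetical eval results: (document_type, parse_was_correct)
results = ([("invoice", False)] * 8 + [("invoice", True)] * 2  # priority type
           + [("receipt", True)] * 90)                          # easy type
overall, per_slice = accuracy_by_slice(results)
# overall is 0.92 -- above a 90% target -- while the invoice slice sits at 0.20
```

An aggregate metric passes; the slice the client cares most about fails. Reporting both at sprint review surfaces that before launch instead of after.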
Both failures come from defining done after the sprint instead of before it.
The Five Dimensions We Check Before Every Sprint Review
We define done along five dimensions at the start of each sprint. Every one has to be satisfied before we call the sprint complete.
Code complete. The feature has been built and merged to the main branch. No half-built pieces. No blocking TODOs. In AI work it’s easy to have the model layer working while the interface layer is still scaffolded. Code complete means the whole stack is connected and the two layers are talking.
Evaluation passing. The model’s output meets the accuracy or quality target we agreed on in Sprint 0. For a classification task, that might be 88% precision on the client’s test set. For a retrieval system, it might be that 9 out of 10 randomly sampled queries return a relevant result in the top three. The target is written down before any code is written.
If the evaluation isn’t passing, the sprint isn’t done. Even if every other dimension is met.
This is the dimension that catches most projects in the first pass. “The model is working” and “the model is passing the agreed eval” are not the same sentence. I’ve seen teams demo a model that looked impressive in the walkthrough but, when we ran the eval suite afterward, failed more than a third of the test cases. That sprint wasn’t done. Shipping it would have been a trust problem, not a save.
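The difference between those two sentences is a hard gate. A sketch of what “evaluation passing” looks like as code rather than a judgment call at demo time; the label, threshold, and data below are illustrative assumptions, not any client’s actual eval suite:

```python
def precision(preds, labels, positive="route_to_billing"):
    """Precision for one class: of everything the model flagged
    as `positive`, how many actually were."""
    flagged = [l for p, l in zip(preds, labels) if p == positive]
    return sum(l == positive for l in flagged) / len(flagged) if flagged else 0.0

def eval_gate(preds, labels, threshold=0.88):
    """The sprint-level check: pass/fail against the threshold
    agreed in Sprint 0, before any code was written."""
    score = precision(preds, labels)
    return score >= threshold, score

# Toy run: 10 tickets flagged for billing, 9 of them correctly
preds  = ["route_to_billing"] * 10
labels = ["route_to_billing"] * 9 + ["other"]
passed, score = eval_gate(preds, labels)  # score 0.90, passes an 0.88 gate
```

If `passed` is false, the sprint isn’t done, regardless of how the demo felt.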
Integration ready. The feature can receive real input from the client’s system and return usable output. This means the API endpoint exists, authentication is wired, and at least one end-to-end test has run against real (or representative) data, not just synthetic test fixtures.
A model running in isolation is not done. This dimension is the most common culprit in the “done” vs “ready-to-show-users” confusion I described at the start.
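To make “integration ready” concrete: the end-to-end test we mean takes one representative record in the client’s real schema, pushes it through the whole stack, and asserts on the output a user would act on. The function below is a hypothetical stand-in for a deployed endpoint, not a real API:

```python
def classify_ticket(raw_ticket: dict) -> dict:
    """Stand-in for the full stack. In a real project this would call
    the deployed endpoint, auth included, not the model in isolation."""
    text = raw_ticket["body"]  # a schema mismatch fails here -- that's the point
    label = "billing" if "invoice" in text.lower() else "general"
    return {"route": label, "confidence": 0.91}

def test_end_to_end():
    # A representative record in the client's schema, not a synthetic fixture
    ticket = {"id": "T-1042", "body": "My invoice shows the wrong amount"}
    out = classify_ticket(ticket)
    assert set(out) == {"route", "confidence"}
    assert 0.0 <= out["confidence"] <= 1.0

test_end_to_end()
```

The timestamp-format surprise described later in this piece is exactly the class of bug this test exists to catch: the model is fine, the plumbing isn’t.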
Client-reviewable. The feature is in a state the client can actually interact with. Not a notebook output, not a JSON blob, not a curl response screenshot. A real interface, even if minimal. For sprint demos, we usually build a lightweight wrapper using Streamlit or a simple HTML form if the final UI isn’t ready. The client needs to be able to do the thing the sprint was about.
Sprint goal met. This is the meta-check. We write the sprint goal as a single sentence before the sprint starts. “By the end of this sprint, the client’s support team can paste a ticket into the classifier and get a routing suggestion with a confidence score.” At sprint end, we read that sentence out loud. Can they do that? Yes or no.
If yes, the sprint is done.
The Done Checklist We Actually Use
At the start of every sprint, I fill this in with the engineering lead:
Sprint [N]: Definition of Done
Code complete
[ ] Feature branch merged to main
[ ] No blocking TODOs
Evaluation passing
[ ] Target metric: ___
[ ] Threshold: ___
[ ] Test set: ___ (source, size)
[ ] Current score: ___
Integration ready
[ ] End-to-end test run against real data
[ ] Auth wired
[ ] API documented (even informally)
Client-reviewable
[ ] Interface exists for demo
[ ] Client can run through the sprint scenario unassisted
Sprint goal: ___
[ ] Goal met (yes/no)
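One way to keep that checklist honest is to mirror it in a small machine-checkable structure, so “done” is computed rather than asserted in the room. The field names and values below are illustrative, not a prescribed schema:

```python
# Hypothetical sprint record mirroring the checklist above
checklist = {
    "code_complete":     {"merged_to_main": True, "no_blocking_todos": True},
    "evaluation":        {"metric": "precision", "threshold": 0.88,
                          "test_set": "client tickets, n=500", "score": 0.91},
    "integration":       {"e2e_test_on_real_data": True, "auth_wired": True,
                          "api_documented": True},
    "client_reviewable": {"demo_interface": True, "unassisted_run": True},
    "sprint_goal":       {"statement": "Support team can paste a ticket and "
                                       "get a routing suggestion", "met": True},
}

def sprint_done(c):
    """Every dimension has to pass; no partial credit."""
    return (all(c["code_complete"].values())
            and c["evaluation"]["score"] >= c["evaluation"]["threshold"]
            and all(c["integration"].values())
            and all(c["client_reviewable"].values())
            and c["sprint_goal"]["met"])
```

A single flipped flag or a score under threshold makes `sprint_done` false, which is the behavior the prose version of the checklist is trying to enforce.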
The checklist takes about five minutes to fill in at the start of a sprint and ten minutes to review at the end. The Scrum Guide’s definition of done section covers why this kind of explicit agreement matters for velocity consistency. The short version: without it, done means something different to every person in the room, and every person in the room is right by their own definition.
What We Do When a Sprint Fails the Check
The most useful habit I’ve built: I’d rather tell a client “we’re 80% done and here’s exactly what’s left” than call a sprint done when it isn’t and absorb the next week’s debugging silently.
We had a sprint last year where evaluation was passing but integration wasn’t ready. The client’s production database returned timestamps in a different format than the staging environment. Two hours of work to fix. But by the checklist, the sprint wasn’t done.
We called the review anyway and told the client what we had. We walked through the evaluation results, showed the model working against the test set, then said: “Integration isn’t complete yet because we found a data format mismatch this morning. We expect to resolve it by end of day tomorrow and will send you a link to the live environment then.”
The client’s response: “I appreciate you not just showing me the happy path.”
That conversation went better than any sprint review where we’d stretched the definition of done to avoid the uncomfortable part.
For the documentation side of this, we keep a running record in the sprint handoff document that both sides review before the next planning session. When something moves out of a sprint, there’s a written entry explaining why and what trade-off was made.
The Conversation When We’re Almost Done
Sometimes a sprint is 90% done at review time. Everything passes except one dimension, and the client has already cleared their schedule to see the demo.
My default: still do the review, but be explicit about what’s incomplete.
I open those reviews with: “I want to start by telling you where we are. Here’s what’s done and here’s what’s not.” Then I walk through each dimension of the checklist out loud. The PMI’s research on stakeholder communication frames this as managing expectations proactively. In practice, it’s simpler: just tell people what you have.
Clients almost universally appreciate this more than watching a demo that looks polished but has a gap they’ll discover later.
The worst version of almost-done is a team that rushes to make everything look complete for the demo and then spends the next sprint quietly fixing what broke. I’ve inherited those situations. They’re hard to recover from because the client trusted the “done” signal and made plans based on it.
When “Done” Feels Done but the Client Doesn’t Agree
This is a different problem from the ones above, and it comes up less often, but it matters.
A sprint passed all five checks. The evaluation target was 85% precision on the client’s support ticket classifier; we hit 87%. The sprint goal was demonstrated. The client came to the review and said: “It feels off. Some of these responses don’t seem right to me.”
The model wasn’t off. But the client’s intuition about “rightness” didn’t match the metric we’d agreed on. We’d optimized for precision because they’d told us false positives were more costly than missed classifications. That was still true in principle. In practice, seeing specific cases where the model deferred to the “other” category felt wrong even when it was technically the correct behavior given the threshold.
This is why done criteria need to be visible to the client before the sprint starts, not just written down in an engineering doc. The discovery call is where we validate the criteria together. The full process for how we gather the information we need to write accurate done criteria is in our discovery call checklist.
What we did in that case: we didn’t revise the sprint outcome. The sprint was done. We opened a new story in the next sprint backlog: review the classification threshold on the client’s actual traffic data and decide whether to adjust the precision/recall trade-off. That’s a real requirement, and it deserved its own scoping.
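That follow-up story amounts to a threshold sweep: score a labeled sample of real traffic and look at precision and recall at a few candidate cutoffs, then let the client pick the trade-off with numbers in front of them. A sketch with made-up data:

```python
def precision_recall_at(threshold, scored):
    """scored: list of (model_confidence, is_actually_positive) pairs."""
    predicted = [(s, y) for s, y in scored if s >= threshold]
    tp = sum(y for _, y in predicted)
    positives = sum(y for _, y in scored)
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / positives if positives else 0.0
    return prec, rec

# Hypothetical labeled traffic sample: (confidence, true label)
scored = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
          (0.70, True), (0.60, False), (0.55, True)]

sweep = {t: precision_recall_at(t, scored) for t in (0.9, 0.8, 0.6)}
# sweep[0.9] -> (1.0, 0.4): strict cutoff, nothing wrong, most cases deferred
# sweep[0.6] -> (~0.67, 0.8): loose cutoff, more caught, more flagged wrongly
```

Seeing the curve is usually what reconciles the client’s intuition with the metric: the model’s “wrong-feeling” deferrals are the visible cost of the precision they asked for.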
The distinction matters. “The sprint is done, but here’s what we want to improve” is a healthy product relationship. “The sprint isn’t done until everyone feels good about it” is a scope that never ends.
FAQ
What accuracy target should I set for an AI feature before calling it done?
The right answer is always tied to user behavior, not to a universal number. We start every Sprint 0 by asking: what accuracy level lets a real user act on this output without a human review step? For a call compliance classifier, our client decided 88% precision was the bar. For a document retrieval system, 9 out of 10 relevant results in the top three was good enough. Set the target before engineering starts, and write down why you chose it.
What happens if the model keeps improving but never hits the accuracy threshold?
We time-box evaluation cycles in the sprint plan. If we’ve run three evaluation loops and the model isn’t converging, we have a direct conversation: is the target right? Is the training data the issue? Is this a model selection problem? We don’t just keep iterating. We diagnose, report, and adjust. Sometimes the honest answer is that the target needs to move. That’s not failure. It’s information.
How do you handle done criteria when requirements change mid-sprint?
Every scope change that affects a done dimension gets documented explicitly. We don’t absorb it and pretend the original criteria still apply. If the client adds a requirement in week two, the done checklist gets updated, the sprint timeline gets reviewed, and both sides confirm before we continue. We never accept a scope change verbally and figure out the impact later.
How long does it take to write done criteria for a new AI sprint?
Usually 30 to 60 minutes at the start of Sprint 0. For follow-on sprints, about 20 minutes in planning because we’re refining, not building from scratch. The longest part is usually agreeing on the evaluation target, because that requires the client to have an opinion about what “good enough” means for their users.
What do AI development services typically cost when you include evaluation cycles?
Evaluation and test-set refinement add 15 to 25% to raw development time on the first version of a new feature. On a four-week sprint, that’s three to five days of work you might not have planned for. This is why we include evaluation setup as a Sprint 0 deliverable. It doesn’t make sense to start building before you know what done looks like.
Every AI project we run starts with a written definition of done before the first line of code. If you’re scoping a build and want to see what we’d define as done for your specific use case, book a 30-minute call and we’ll walk through it together.