A client asked me in a sprint review whether a feature was “done.” I said yes.
She said, “So I can show this to my users?”
I said, “Not yet. We still need to connect it to your auth system.”
She looked at me. “How is that done?”
She was right to push back. I’d used the word “done” to mean code-complete. She’d heard it as ready-to-ship. Same word, two different definitions.
That conversation happened in my third year running AI development projects, and it’s when I started writing definitions down before the sprint started rather than after it ended.
Why “Done” Is Harder to Define in AI Than in Regular Software
In a standard software sprint, done is relatively clean. The feature renders. The form submits. The API returns the expected payload. You write a test, it passes, you move on.
AI adds a layer that regular software doesn’t have: the model’s behavior isn’t binary. A form either submits or it doesn’t. A language model either understands what the user meant or it doesn’t, but “doesn’t” exists on a spectrum from “subtly off” to “completely wrong,” and the client can’t always tell the difference in a demo.
This creates two failure modes I’ve seen on nearly every project.
Failure mode one: the model works, but no one agreed on what “works” means before the sprint started. The client sees the demo, it looks good, they say done. A week after launch, they discover the model misclassifies a specific query type that their users hit constantly. From the client’s side, this feels like a bug that appeared after delivery. From the engineering side, it was never tested for.
Failure mode two: the model works by the agreed metric, but the client didn’t anticipate what the metric would feel like in practice. An AI document parser was passing our 90% accuracy target, but the 10% failures were concentrated on the highest-priority document type. Technically done. Practically a problem.
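That second failure is easy to catch mechanically if the eval reports per-slice accuracy next to the aggregate instead of a single number. A minimal sketch of the idea (the document types and counts here are hypothetical, not from the project described):

```python
from collections import defaultdict

def accuracy_by_slice(results):
    """Compute aggregate accuracy alongside per-slice accuracy,
    where results is a list of (slice_key, was_correct) pairs."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [correct, total]
    for slice_key, correct in results:
        totals[slice_key][0] += int(correct)
        totals[slice_key][1] += 1
    per_slice = {k: c / n for k, (c, n) in totals.items()}
    overall = (sum(c for c, _ in totals.values())
               / sum(n for _, n in totals.values()))
    return overall, per_slice

# Hypothetical eval results: (document_type, parse_was_correct)
results = ([("invoice", False)] * 8 + [("invoice", True)] * 2  # priority type
           + [("receipt", True)] * 90)                          # easy type
overall, per_slice = accuracy_by_slice(results)
# overall is 0.92 -- above a 90% target -- while the invoice slice sits at 0.20
```

An aggregate metric passes; the slice the client cares most about fails. Reporting both at sprint review surfaces that before launch instead of after.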
Both failures come from defining done after the sprint instead of before it.
The Five Dimensions We Check Before Every Sprint Review
We define done along five dimensions at the start of each sprint. Every one has to be satisfied before we call the sprint complete.
Code complete. The feature has been built and merged to the main branch. No half-built pieces. No blocking TODOs. In AI work it’s easy to have the model layer working while the interface layer is still scaffolded. Code complete means the whole stack is connected and the two layers are talking.
Evaluation passing. The model’s output meets the accuracy or quality target we agreed on in Sprint 0. For a classification task, that might be 88% precision on the client’s test set. For a retrieval system, it might be that 9 out of 10 randomly sampled queries return a relevant result in the top three. The target is written down before any code is written.
If the evaluation isn’t passing, the sprint isn’t done. Even if every other dimension is met.
This is the dimension that catches most projects in the first pass. “The model is working” and “the model is passing the agreed eval” are not the same sentence. I’ve seen teams demo a model that looked impressive in the walkthrough but, when we ran the eval suite afterward, failed more than a third of the test cases. That sprint wasn’t done. Shipping it would have been a trust problem, not a save.
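The difference between those two sentences is a hard gate. A sketch of what “evaluation passing” looks like as code rather than a judgment call at demo time; the label, threshold, and data below are illustrative assumptions, not any client’s actual eval suite:

```python
def precision(preds, labels, positive="route_to_billing"):
    """Precision for one class: of everything the model flagged
    as `positive`, how many actually were."""
    flagged = [l for p, l in zip(preds, labels) if p == positive]
    return sum(l == positive for l in flagged) / len(flagged) if flagged else 0.0

def eval_gate(preds, labels, threshold=0.88):
    """The sprint-level check: pass/fail against the threshold
    agreed in Sprint 0, before any code was written."""
    score = precision(preds, labels)
    return score >= threshold, score

# Toy run: 10 tickets flagged for billing, 9 of them correctly
preds  = ["route_to_billing"] * 10
labels = ["route_to_billing"] * 9 + ["other"]
passed, score = eval_gate(preds, labels)  # score 0.90, passes an 0.88 gate
```

If `passed` is false, the sprint isn’t done, regardless of how the demo felt.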
Integration ready. The feature can receive real input from the client’s system and return usable output. This means the API endpoint exists, authentication is wired, and at least one end-to-end test has run against real (or representative) data, not just synthetic test fixtures.
A model running in isolation is not done. This dimension is the most common culprit in the “done” vs “ready-to-show-users” confusion I described at the start.
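To make “integration ready” concrete: the end-to-end test we mean takes one representative record in the client’s real schema, pushes it through the whole stack, and asserts on the output a user would act on. The function below is a hypothetical stand-in for a deployed endpoint, not a real API:

```python
def classify_ticket(raw_ticket: dict) -> dict:
    """Stand-in for the full stack. In a real project this would call
    the deployed endpoint, auth included, not the model in isolation."""
    text = raw_ticket["body"]  # a schema mismatch fails here -- that's the point
    label = "billing" if "invoice" in text.lower() else "general"
    return {"route": label, "confidence": 0.91}

def test_end_to_end():
    # A representative record in the client's schema, not a synthetic fixture
    ticket = {"id": "T-1042", "body": "My invoice shows the wrong amount"}
    out = classify_ticket(ticket)
    assert set(out) == {"route", "confidence"}
    assert 0.0 <= out["confidence"] <= 1.0

test_end_to_end()
```

The timestamp-format surprise described later in this piece is exactly the class of bug this test exists to catch: the model is fine, the plumbing isn’t.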
Client-reviewable. The feature is in a state the client can actually interact with. Not a notebook output, not a JSON blob, not a curl response screenshot. A real interface, even if minimal. For sprint demos, we usually build a lightweight wrapper using Streamlit or a simple HTML form if the final UI isn’t ready. The client needs to be able to do the thing the sprint was about.
Sprint goal met. This is the meta-check. We write the sprint goal as a single sentence before the sprint starts. “By the end of this sprint, the client’s support team can paste a ticket into the classifier and get a routing suggestion with a confidence score.” At sprint end, we read that sentence out loud. Can they do that? Yes or no.
If yes, the sprint is done.
The Done Checklist We Actually Use
At the start of every sprint, I fill this in with the engineering lead:
Sprint [N]: Definition of Done
Code complete
[ ] Feature branch merged to main
[ ] No blocking TODOs
Evaluation passing
[ ] Target metric: ___
[ ] Threshold: ___
[ ] Test set: ___ (source, size)
[ ] Current score: ___
Integration ready
[ ] End-to-end test run against real data
[ ] Auth wired
[ ] API documented (even informally)
Client-reviewable
[ ] Interface exists for demo
[ ] Client can run through the sprint scenario unassisted
Sprint goal: ___
[ ] Goal met (yes/no)
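One way to keep that checklist honest is to mirror it in a small machine-checkable structure, so “done” is computed rather than asserted in the room. The field names and values below are illustrative, not a prescribed schema:

```python
# Hypothetical sprint record mirroring the checklist above
checklist = {
    "code_complete":     {"merged_to_main": True, "no_blocking_todos": True},
    "evaluation":        {"metric": "precision", "threshold": 0.88,
                          "test_set": "client tickets, n=500", "score": 0.91},
    "integration":       {"e2e_test_on_real_data": True, "auth_wired": True,
                          "api_documented": True},
    "client_reviewable": {"demo_interface": True, "unassisted_run": True},
    "sprint_goal":       {"statement": "Support team can paste a ticket and "
                                       "get a routing suggestion", "met": True},
}

def sprint_done(c):
    """Every dimension has to pass; no partial credit."""
    return (all(c["code_complete"].values())
            and c["evaluation"]["score"] >= c["evaluation"]["threshold"]
            and all(c["integration"].values())
            and all(c["client_reviewable"].values())
            and c["sprint_goal"]["met"])
```

A single flipped flag or a score under threshold makes `sprint_done` false, which is the behavior the prose version of the checklist is trying to enforce.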
The checklist takes about five minutes to fill in at the start of a sprint and ten minutes to review at the end. The Scrum Guide’s definition of done section covers why this kind of explicit agreement matters for velocity consistency. The short version: without it, done means something different to every person in the room, and every person in the room is right by their own definition.
What We Do When a Sprint Fails the Check
The most useful habit I’ve built: I’d rather tell a client “we’re 80% done and here’s exactly what’s left” than call a sprint done when it isn’t and absorb the next week’s debugging silently.
We had a sprint last year where evaluation was passing but integration wasn’t ready. The client’s production database returned timestamps in a different format than the staging environment. Two hours of work to fix. But by the checklist, the sprint wasn’t done.
We called the review anyway and told the client what we had. We walked through the evaluation results, showed the model working against the test set, then said: “Integration isn’t complete yet because we found a data format mismatch this morning. We expect to resolve it by end of day tomorrow and will send you a link to the live environment then.”
The client’s response: “I appreciate you not just showing me the happy path.”
That conversation went better than any sprint review where we’d stretched the definition of done to avoid the uncomfortable part.
For the documentation side of this, we keep a running record in the sprint handoff document that both sides review before the next planning session. When something moves out of a sprint, there’s a written entry explaining why and what trade-off was made.
The Conversation When We’re Almost Done
Sometimes a sprint is 90% done at review time. Everything passes except one dimension, and the client has already cleared their schedule to see the demo.
My default: still do the review, but be explicit about what’s incomplete.
I open those reviews with: “I want to start by telling you where we are. Here’s what’s done and here’s what’s not.” Then I walk through each dimension of the checklist out loud. The PMI’s research on stakeholder communication frames this as managing expectations proactively. In practice, it’s simpler: just tell people what you have.
Clients almost universally appreciate this more than watching a demo that looks polished but has a gap they’ll discover later.
The worst version of almost-done is a team that rushes to make everything look complete for the demo and then spends the next sprint quietly fixing what broke. I’ve inherited those situations. They’re hard to recover from because the client trusted the “done” signal and made plans based on it.
When “Done” Feels Done but the Client Doesn’t Agree
This is a different problem from the ones above, and it comes up less often, but it matters.
A sprint passed all five checks. The evaluation target was 85% precision on the client’s support ticket classifier; we hit 87%. The sprint goal was demonstrated. The client came to the review and said: “It feels off. Some of these responses don’t seem right to me.”
The model wasn’t off. But the client’s intuition about “rightness” didn’t match the metric we’d agreed on. We’d optimized for precision because they’d told us false positives were more costly than missed classifications. That was still true in principle. In practice, seeing specific cases where the model deferred to the “other” category felt wrong even when it was technically the correct behavior given the threshold.
This is why done criteria need to be visible to the client before the sprint starts, not just written down in an engineering doc. The discovery call is where we validate the criteria together. The full process for how we gather the information we need to write accurate done criteria is in our discovery call checklist.
What we did in that case: we didn’t revise the sprint outcome. The sprint was done. We opened a new story in the next sprint backlog: review the classification threshold on the client’s actual traffic data and decide whether to adjust the precision/recall trade-off. That’s a real requirement, and it deserved its own scoping.
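That follow-up story amounts to a threshold sweep: score a labeled sample of real traffic and look at precision and recall at a few candidate cutoffs, then let the client pick the trade-off with numbers in front of them. A sketch with made-up data:

```python
def precision_recall_at(threshold, scored):
    """scored: list of (model_confidence, is_actually_positive) pairs."""
    predicted = [(s, y) for s, y in scored if s >= threshold]
    tp = sum(y for _, y in predicted)
    positives = sum(y for _, y in scored)
    prec = tp / len(predicted) if predicted else 0.0
    rec = tp / positives if positives else 0.0
    return prec, rec

# Hypothetical labeled traffic sample: (confidence, true label)
scored = [(0.95, True), (0.90, True), (0.85, False), (0.80, True),
          (0.70, True), (0.60, False), (0.55, True)]

sweep = {t: precision_recall_at(t, scored) for t in (0.9, 0.8, 0.6)}
# sweep[0.9] -> (1.0, 0.4): strict cutoff, nothing wrong, most cases deferred
# sweep[0.6] -> (~0.67, 0.8): loose cutoff, more caught, more flagged wrongly
```

Seeing the curve is usually what reconciles the client’s intuition with the metric: the model’s “wrong-feeling” deferrals are the visible cost of the precision they asked for.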
The distinction matters. “The sprint is done, but here’s what we want to improve” is a healthy product relationship. “The sprint isn’t done until everyone feels good about it” is a scope that never ends.
FAQ
What accuracy target should I set for an AI feature before calling it done?
The right answer is always tied to user behavior, not to a universal number. We start every Sprint 0 by asking: what accuracy level lets a real user act on this output without a human review step? For a call compliance classifier, our client decided 88% precision was the bar. For a document retrieval system, 9 out of 10 relevant results in the top three was good enough. Set the target before engineering starts, and write down why you chose it.
What happens if the model keeps improving but never hits the accuracy threshold?
We time-box evaluation cycles in the sprint plan. If we’ve run three evaluation loops and the model isn’t converging, we have a direct conversation: is the target right? Is the training data the issue? Is this a model selection problem? We don’t just keep iterating. We diagnose, report, and adjust. Sometimes the honest answer is that the target needs to move. That’s not failure. It’s information.
How do you handle done criteria when requirements change mid-sprint?
Every scope change that affects a done dimension gets documented explicitly. We don’t absorb it and pretend the original criteria still apply. If the client adds a requirement in week two, the done checklist gets updated, the sprint timeline gets reviewed, and both sides confirm before we continue. We never accept a scope change verbally and figure out the impact later.
How long does it take to write done criteria for a new AI sprint?
Usually 30 to 60 minutes at the start of Sprint 0. For follow-on sprints, about 20 minutes in planning because we’re refining, not building from scratch. The longest part is usually agreeing on the evaluation target, because that requires the client to have an opinion about what “good enough” means for their users.
What do AI development services typically cost when you include evaluation cycles?
Evaluation and test-set refinement add 15 to 25% to raw development time on the first version of a new feature. On a four-week sprint, that’s three to five days of work you might not have planned for. This is why we include evaluation setup as a Sprint 0 deliverable. It doesn’t make sense to start building before you know what done looks like.
Every AI project we run starts with a written definition of done before the first line of code. If you’re scoping a build and want to see what we’d define as done for your specific use case, book a 30-minute call and we’ll walk through it together.