A founder called me during sprint 3 of a content automation build. We’d shipped the generation pipeline, outputs were the strongest we’d seen, and I went into the review call genuinely pleased.
He opened with: “I’m confused about what the numbers mean.”
The handoff document had said: “Model performance: 84% accuracy.”
What it didn’t say: accuracy against which test set, whether 84% was better than sprint 2, what “accuracy” meant for a text generation task (it’s not the same as classification accuracy), or why we’d quietly updated the evaluation criteria between sprints, which made the two scores technically incomparable.
That call lasted 52 minutes. We answered every question. But the call happened because the handoff wasn’t built for AI.
Standard sprint handoffs work fine for deterministic software. When you ship a feature, it either works or it doesn’t. “The export button now generates a PDF” is a complete statement. The client can verify it on their own.
AI sprints are different. The outputs are probabilistic. The criteria shift as you learn more about the data. A metric that’s genuinely good in week 3 might mean something entirely different from what the same number meant in week 1. Without context, clients can’t evaluate whether the sprint moved them forward or sideways.
I’ve been running sprint handoffs for standard software projects for a long time. The structure there is solid: five sections, 15 minutes to write, sent within two hours of the sprint review. For AI projects, that structure needs three extra fields and a different framing on two existing ones. This is the version I use now.
What Changes in an AI Sprint
The core problem is uncertainty communication. Standard sprint reports don’t have a mechanism for it.
In regular software sprints, done means done. In AI sprints, done means “we’ve reached a state worth discussing, here’s the honest picture of where it sits, and here’s what we still don’t know.”
Four things make this different from standard project work.
Metrics need context. A number without a baseline is an opinion. The client needs to know the current metric, what it was last sprint, what the target is, and what test set it’s measured against. Those four data points are the difference between “we made progress” and “here’s the specific progress we made.”
Evaluation criteria change. This happens constantly in AI projects. You start sprint 1 measuring precision. By sprint 3, you’ve learned that recall matters more because false negatives are more costly than false positives. If you don’t document that shift, the client sees the numbers change and assumes something went wrong. Usually something went right: you understood the problem better.
Data quality is a first-class output. AI sprints regularly surface things about a client’s own data that were unknown at the start. Inconsistent labeling, missing fields, format drift across time periods, vocabulary differences between departments. These findings change scope. Documenting them isn’t optional.
Uncertainty carries forward. Some unknowns from sprint 1 won’t resolve until sprint 4. That’s normal. Clients who’ve worked with standard software agencies often expect sprint outcomes to be final. The handoff needs to name what’s open without making it sound like failure.
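To make the "evaluation criteria change" point concrete, here is a minimal sketch of how one evaluation run yields two different headline numbers depending on whether you report precision or recall. The confusion counts are invented for illustration and do not come from any project described here.

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive predictions that were correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of actual positives that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts from a single evaluation run (illustrative only).
tp, fp, fn = 90, 10, 30

print(f"precision: {precision(tp, fp):.0%}")  # 90 correct out of 100 predicted
print(f"recall:    {recall(tp, fn):.0%}")     # 90 found out of 120 actual
```

Nothing about the model differs between those two lines. If the handoff switches from the first number to the second without saying so, the client sees a drop that never happened.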
The Template
Copy this. Adapt the section names if you need to, but keep all six sections.
Sprint [N] Handoff: [Project Name]
Sent: [date, within 2 hours of sprint review]
What we shipped
[Plain-English description of what the sprint produced and what state it’s in. One to two paragraphs. Include any known limitations.]
Current performance: [metric] ([context]: [up/down] from [baseline sprint N-1], tested against [test set description, sample size], target for launch is [target].)
What changed from plan
[What you planned vs. what you actually did, and why. One paragraph. Don’t hide changes.]
Evaluation criteria update (if applicable): [what changed about how you measure success, and why it changed.]
Data findings
[What you discovered about the client’s data this sprint. If no new findings, write “No significant data findings this sprint.” Don’t skip this section.]
What’s next (Sprint [N+1])
[Three outcomes, not tasks. “By end of next sprint, a logged-in user will be able to X.” One paragraph or a short list.]
Blockers
[Items waiting on someone else. Name, deadline, consequence if delayed. If none: “No active blockers.”]
Decisions needed
[Each decision needs: what you’re deciding, a deadline, and what changes depending on the answer. No vague asks.]
That’s the template. It’s deliberately short. The value is in the specificity of each section, not in the length of the document.
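If you generate handoffs from a script or a tracking sheet, the "Current performance" line is the easiest part to make hard to get wrong. The sketch below is one possible shape, not a prescribed tool: the class name `PerformanceLine` and its field names are my own assumptions. Every field is required, so the line cannot be rendered without a baseline, a test set, and a target.

```python
from dataclasses import dataclass


@dataclass
class PerformanceLine:
    """Hypothetical container for the template's 'Current performance' line."""
    metric_name: str       # e.g. "intent classification accuracy"
    current: float         # fraction between 0.0 and 1.0
    previous: float        # same metric from the previous sprint
    previous_sprint: int   # sprint number the baseline comes from
    test_set: str          # description including sample size
    target: float          # launch target, as a fraction

    def render(self) -> str:
        if self.current > self.previous:
            direction = "up"
        elif self.current < self.previous:
            direction = "down"
        else:
            direction = "unchanged"
        return (
            f"Current performance: {self.current:.0%} {self.metric_name} "
            f"({direction} from {self.previous:.0%} in sprint {self.previous_sprint}, "
            f"tested against {self.test_set}, "
            f"target for launch is {self.target:.0%})."
        )


# Illustrative values only.
line = PerformanceLine(
    metric_name="intent classification accuracy",
    current=0.89,
    previous=0.71,
    previous_sprint=3,
    test_set="200 internal team utterances",
    target=0.93,
)
print(line.render())
```

Leaving out any field raises a `TypeError` at construction time. That is the point: a bare number never reaches the client.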
A Real AI Sprint Example
Here’s a simplified version from a voice agent project. The client was building a speech-to-action tool for their internal ops team. Sprint 4 of a six-sprint engagement.
Sprint 4 Handoff: Voice Agent Project
Sent: April 9, 2026, 6:52 PM
What we shipped
The intent parser is live on staging. It correctly identifies the action type (create, update, delete) and the target entity for 89% of queries in our test set, up from 71% in sprint 3. Median latency: 340ms per query, down from 580ms after we switched from GPT-4o to GPT-4o-mini for the intent layer (see what changed).
Known limitation: the parser fails on compound queries (“move the deadline and also assign it to Priya”). That’s in scope for sprint 5.
Current performance: 89% intent classification accuracy, tested against 200 internal team utterances. Target for launch is 93%.
What changed from plan
We planned to ship multi-entity resolution this sprint. After measuring the sprint 3 baseline at 71%, we decided to stabilize single-entity handling before adding compound resolution. Multi-entity moves to sprint 5. Single-entity is now at 89%, which is stable enough to build on.
Model change: moved from GPT-4o to GPT-4o-mini for the intent layer. Latency dropped by 240ms with no meaningful accuracy difference (89.0% vs 89.3% on the same test set, within statistical noise). Monthly inference cost at projected volume: $180-$220 instead of $620-$800.
Evaluation criteria update: we switched from overall accuracy to per-action-type accuracy this sprint, because “create” and “delete” need to behave differently when uncertain. This makes sprint 3 and sprint 4 numbers comparable within each action type, but not as a single aggregate.
Data findings
The client’s historical query logs (used to build the test set) contain inconsistent action verbs across departments. “Assign,” “delegate,” and “give to” are used interchangeably for the same intent. The parser handles this at 91% accuracy currently. We found 14 specific phrasings from the legal team that it misclassifies consistently. We’ve added them to the training examples.
What’s next (Sprint 5)
Three goals: the model correctly identifies intent for compound queries containing two actions or two entities; the staging integration connects to the client’s project management tool via webhook; end-to-end latency stays under 300ms including the webhook round-trip.
Blockers
Webhook credentials: API key and base URL for the project management tool, needed by April 12. If we don’t have them by then, sprint 5 starts late and the webhook integration moves to sprint 6.
Decisions needed
By April 13: should we build a fallback response for low-confidence queries (below 0.75 confidence score), or fail silently? Fallback adds two engineering days to sprint 5 and requires a small UI component to surface the message. Silent failure ships Wednesday; fallback ships Friday. The compliance implication: if the agent misses a command and doesn’t tell the user, they may not notice. The founder’s call on risk tolerance.
That took 13 minutes to write. The client responded within 90 minutes with the webhook credentials and a clear answer on fallback (build it). Sprint 5 started on time.
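The evaluation criteria update in that example moves from one aggregate score to per-action-type accuracy. Here is a minimal sketch of that calculation, assuming test results are stored as (expected action, predicted action) pairs. The function name and the sample data are hypothetical, not taken from the project.

```python
from collections import defaultdict


def accuracy_by_action(results):
    """Compute accuracy per expected action type.

    results: iterable of (expected_action, predicted_action) pairs.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for expected, predicted in results:
        total[expected] += 1
        if predicted == expected:
            correct[expected] += 1
    return {action: correct[action] / total[action] for action in total}


# Hypothetical labeled results (illustrative only).
sample = [
    ("create", "create"), ("create", "create"),
    ("create", "update"), ("create", "create"),
    ("delete", "delete"), ("delete", "update"),
    ("update", "update"), ("update", "update"),
]

per_action = accuracy_by_action(sample)
overall = sum(expected == predicted for expected, predicted in sample) / len(sample)

print(per_action)
print(f"overall: {overall:.0%}")
```

In this made-up sample the aggregate looks acceptable while "delete" is right only half the time. That is the kind of gap a single number hides.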
The Section That Gets Skipped Most
Data findings. Teams routinely leave it out.
The reasoning, usually: “we didn’t find anything unexpected this sprint.” But “unexpected” is doing a lot of work in that sentence. Clients often don’t know what to look for in their own data. When you’re building a model against a corpus they gave you, you’ll regularly find things they’ve never had a reason to look for: inconsistent labeling across time periods, field values that mean different things in different departments, export artifacts that contaminate the training set.
These findings change scope. If you don’t document them, the client learns about the scope change through the symptoms (a sprint that takes longer than expected, a metric that suddenly plateaus) rather than through a conversation. That’s a much harder fix.
Google’s People + AI Guidebook, from the People + AI Research (PAIR) team, covers communicating uncertainty and data constraints in AI systems as a design problem, not just an engineering one. The same principle applies to project communication: making uncertainty visible and legible is part of the delivery, not a footnote.
The format I use: what we found, how it affects performance now, and whether it changes anything in the plan. If you genuinely have no new findings, one sentence is fine. Leaving the section blank creates ambiguity about whether you looked.
Communicating Model Regressions
One thing the template doesn’t handle explicitly: what happens when the sprint produced a regression, not an improvement.
Name it directly. Don’t soften it.
“Sprint 5 changes dropped precision from 87% to 79% on the validation set. We traced it to the compound-query handling changes interacting unexpectedly with the existing single-entity classifier. Sprint 6 starts by reverting the compound-query layer and reintroducing it incrementally. We expect to recover above 85% within two days of sprint 6 starting.”
That paragraph tells the client what happened, why, and what the plan is. Clients who discover regressions through the product rather than through the handoff lose trust faster than clients who hear about it in writing first. Not committing to timelines on unfamiliar requirements is the same principle applied to planning: honesty early is cheaper than revision later.
The PAIR guidebook puts it plainly: the goal is to communicate confidence, not just results. A regression with a clear cause and a clear recovery plan is very different from a regression with no explanation. Both the cause and the plan need to be in the handoff.
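One way to make sure a regression is never reported as a neutral number is to generate the opening sentence from the two metrics themselves. This is a small sketch under my own assumptions: the function name and the one-point `tolerance` are hypothetical, and you should pick a tolerance that matches the noise in your test set.

```python
def describe_change(metric: str, previous: float, current: float,
                    tolerance: float = 0.01) -> str:
    """Return a plain-language first sentence for the performance section.

    tolerance is the absolute change (as a fraction) treated as flat.
    """
    delta = current - previous
    if delta < -tolerance:
        return f"REGRESSION: {metric} dropped from {previous:.0%} to {current:.0%}."
    if delta > tolerance:
        return f"Improved: {metric} rose from {previous:.0%} to {current:.0%}."
    return f"Flat: {metric} is at {current:.0%} (was {previous:.0%})."


print(describe_change("precision", 0.87, 0.79))
```

The cause and the recovery plan still have to be written by a person. The sketch only guarantees the drop is named in the first sentence.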
The 15-Minute Rule
I’ve written about the 15-minute rule before. If you can’t write the handoff in 15 minutes, the sprint had structural problems.
That rule applies here too, with one honest caveat: the data findings section might take an extra few minutes in the early sprints while the picture is still forming. Once you’re mid-project and the data is well-understood, it either writes quickly (“no significant new findings”) or it writes quickly because you’ve been tracking the finding across sprints and can summarize it in two sentences.
The goal isn’t a long document. It’s a document a founder can read in three minutes and respond to by end of day. The Continuous Delivery for Machine Learning article on Martin Fowler’s site frames it well: feedback loops in AI projects need to be short and explicit, not long and assumed. The sprint handoff is one feedback loop. Keep it tight.
FAQ
How is this different from the standard sprint handoff template?
The standard handoff has five sections: what we shipped, what changed, what’s next, blockers, decisions needed. This version adds two fields to “what we shipped” (performance baseline and test set context), adds an evaluation criteria update field to “what changed,” and adds data findings as its own section, which brings the total to six. The structure is familiar; the additions are specific to AI project uncertainty.
How specific should model performance numbers be?
Specific enough to be comparable across sprints. The minimum: current metric, previous metric, test set size, and target. “84% accuracy, up from 71% in sprint 2, tested against 200 labeled samples, target 90% for launch” is complete. “Strong performance on the validation set” is not.
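Test set size belongs in that minimum because it determines how much a small change in the number can mean. The sketch below computes an approximate 95% Wilson score interval for a proportion. It is a standard statistical formula, not something the handoff template prescribes, and the function name is my own.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# 84% accuracy on 200 labeled samples means 168 correct.
low, high = wilson_interval(168, 200)
print(f"{low:.1%} to {high:.1%}")
```

For 168 correct out of 200, the interval runs from roughly 78% to 88%. A one-point move between sprints on a test set this size is not evidence of anything, so the handoff should say so.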
What if the sprint produced a model regression?
Name it directly in the handoff. State the current metric, the baseline, the cause (if known), and the recovery plan. Clients who find out about regressions through the product, rather than through the handoff, almost always respond worse than clients who read about it in writing first.
We’re using an off-the-shelf model, not a fine-tuned one. Does this template still apply?
Yes. The performance context sections apply as much to a prompted GPT-4o setup as to a fine-tuned model. “Prompt version 3 produces 91% correct structured outputs on our test set, up from 78% with prompt version 2” is the same format. What matters is that the client can evaluate sprint-to-sprint progress, not how the model is implemented.
How do you handle the decisions section when a client is slow to respond?
One follow-up, with the consequence named: “If we don’t have a decision on fallback handling by tomorrow morning, sprint 5 will default to silent failure and we’ll revisit in sprint 6.” If there’s still no response, I document the default choice in the next handoff. That creates a written record and usually prompts a reply faster than a second question does.
If you’re running AI development projects and want to see what this looks like across a full engagement, book a 30-minute call. I’ll walk you through the handoff from a recent project and tell you honestly where the template held and where we had to adapt it.