A founder I work with called me eight months after we shipped his product. He’d been happy with it since launch. The model was working, clients were using it, the product team wasn’t getting bug reports.
“I need to renew with you,” he said. “What does year two look like?”
I pulled up the post-launch maintenance estimate I’d put in the original proposal. He’d seen it in the kickoff call, and then we’d both moved on to the build itself.
“This is more than I expected,” he said.
He wasn’t wrong to be surprised. The year two number looked different when he was holding it as a real invoice rather than a line in a slide. And I realized the estimate itself was underselling what maintenance actually costs, because I’d written it before we’d run the product in production for a full year.
This post is the year two conversation I now have with every client before we start the build, not after it.
Why Year Two Costs Are Different
Year one AI costs are mostly build costs: engineering hours, sprint fees, infrastructure setup, the initial integration tax. These are one-time expenses that show up on a proposal and feel finite.
Year two costs are operational. They’re smaller per line item, but they compound, they don’t end, and they arrive on schedules you don’t control.
I've written separately about what clients underestimate in year one: token bills and API charges, costs that scale with usage. Year two maintenance is different. It's about keeping the product from degrading, not just keeping it running.
The five cost categories I now build into every year two estimate are below. None of them are exotic. All of them come up in real maintenance work.
Model Deprecations (The Vendor Forces Your Hand)
OpenAI deprecated GPT-3.5-turbo-0301 in June 2023. Then GPT-4-0314 in September 2023. Then gpt-4-0613 in October 2024. Then gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k-0613 in December 2024. That’s five significant model versions in 18 months, all requiring migration on a deadline.
You can track this yourself on OpenAI's model deprecation page. The pattern is consistent: roughly six months' notice, then the model stops responding. When a model you're running in production gets deprecated, you don't just swap the model name in one config file. You have to test the replacement on your use case, because model behavior is consistent within a version but not always predictable across versions. A prompt that reliably returns JSON from GPT-4o-mini doesn't always behave the same on GPT-4.1-mini without adjustment. The JSON schema still validates, but the tone and specificity of the reasoning shift.
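A migration test of this kind can be sketched as a small harness that replays a frozen prompt set against both the current and candidate models and diffs the structural checks. Everything below is a stand-in: `call_model` is a placeholder for your vendor client, and the model names and canned responses are hypothetical.

```python
import json

# Placeholder for a real vendor API call; the canned responses below are
# illustrative stand-ins, not real model output.
def call_model(model_name: str, prompt: str) -> str:
    canned = {
        "old-model": '{"category": "billing", "confidence": 0.92}',
        "new-model": '{"category": "billing", "confidence": 0.88}',
    }
    return canned[model_name]

def output_is_valid(raw: str, required_keys: set) -> bool:
    """A response passes if it parses as JSON and has the expected keys."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return required_keys.issubset(data)

def migration_smoke_test(prompts, old_model, new_model, required_keys):
    """Replay the same prompts on both models; return the prompts where the
    replacement fails a check the current model passes."""
    regressions = []
    for prompt in prompts:
        old_ok = output_is_valid(call_model(old_model, prompt), required_keys)
        new_ok = output_is_valid(call_model(new_model, prompt), required_keys)
        if old_ok and not new_ok:
            regressions.append(prompt)
    return regressions

prompts = ["Classify this ticket: 'I was double charged this month.'"]
print(migration_smoke_test(prompts, "old-model", "new-model",
                           {"category", "confidence"}))  # [] when no regressions
```

The structural check here is deliberately narrow; in practice you'd add graded accuracy checks on top, since most migration regressions are behavioral rather than schema-level.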
On two of the products we maintain, model migrations took two to three engineering days each. Not catastrophic. But two to three days of unplanned work, on a deadline, is a maintenance event the original build budget didn’t account for.
What I now include in year two estimates: one model migration per year, with a two-day engineering buffer. Some years you’ll use it. Some years you won’t. You should price it in either way.
Prompt Drift (Accuracy Degrades Quietly)
A model update from the vendor doesn’t have to be a deprecation to change your product’s behavior. Minor version updates, under-the-hood safety adjustments, and changes to the model’s RLHF tuning can all shift how consistently your prompts perform.
We’ve seen this show up in two ways. First: the model starts returning outputs in a slightly different format than before. If your downstream code parses those outputs with any fragility, it breaks. Second: the model starts hedging more on edge cases the client’s users hit regularly. Not wrong, just softer and less confident, which drags accuracy scores down.
Both are quiet problems. They don’t trigger errors. They don’t generate exceptions in your monitoring. They generate client complaints, usually framed as: “it’s not working as well as it used to.”
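The format-drift case is the one you can partially defend against in code. A minimal sketch, assuming the drift takes the common shapes of code fences or prose wrapped around the JSON rather than a changed payload:

```python
import json
import re

FENCE = "`" * 3  # built programmatically so this snippet stays self-contained

def parse_model_json(raw: str):
    """Tolerantly extract a JSON object from a model response, absorbing
    common drift: code fences or explanatory prose around the JSON."""
    fenced = re.search(FENCE + r"(?:json)?\s*(\{.*?\})\s*" + FENCE,
                       raw, re.DOTALL)
    candidate = fenced.group(1) if fenced else raw
    if not fenced:
        # Fall back to the first {...} span anywhere in the text.
        brace = re.search(r"\{.*\}", candidate, re.DOTALL)
        if brace:
            candidate = brace.group(0)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # Caller routes this to human review instead of crashing.

wrapped = "Here you go:\n" + FENCE + "json\n" + '{"label": "ok"}' + "\n" + FENCE
print(parse_model_json('{"label": "ok"}'))  # {'label': 'ok'}
print(parse_model_json(wrapped))            # {'label': 'ok'}
```

The hedging case has no parsing fix, which is why the next step matters.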
Catching this requires active evaluation: running a representative test set against the live model on a schedule and comparing scores against the baseline. Without that, you’re waiting for the client to find it.
The maintenance cost: quarterly evaluation runs and the engineering time to investigate when scores shift. Usually four to eight hours per quarter, more if an adjustment is needed. Small on its own. Real over a year.
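A minimal version of that quarterly run: score a frozen test set against the live model and alert when the score falls more than a tolerance below the stored baseline. The grader, test set, and classifier here are toy stand-ins for whatever your product actually does.

```python
def grade(expected: str, actual: str) -> bool:
    return expected == actual  # Real graders are usually fuzzier than this.

def run_eval(test_set, predict) -> float:
    """Fraction of test cases the live model still gets right."""
    correct = sum(grade(case["expected"], predict(case["input"]))
                  for case in test_set)
    return correct / len(test_set)

def drift_alert(score: float, baseline: float, tolerance: float = 0.03) -> bool:
    """True when the live score drops more than `tolerance` below baseline."""
    return score < baseline - tolerance

test_set = [
    {"input": "refund request", "expected": "billing"},
    {"input": "password reset", "expected": "account"},
    {"input": "API is down", "expected": "technical"},
    {"input": "cancel my plan", "expected": "billing"},
]

# Toy classifier standing in for the live model call.
predict = lambda text: "billing" if "refund" in text or "cancel" in text else "account"
score = run_eval(test_set, predict)  # 0.75 here: the "technical" case misses
print(score, drift_alert(score, baseline=0.90))
```

The tolerance is a judgment call; too tight and every minor vendor update pages someone, too loose and you're back to waiting for the client to notice.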
Monitoring Infrastructure You Didn’t Budget for at Launch
The first version of an AI product often ships with lightweight monitoring: error rates, response times, the basics. This is fine for launch. It stops being fine at around the six-month mark, when the client’s team is large enough that edge cases become regular cases.
By month six or seven, most clients start asking for things that require observability infrastructure: how is the model performing on documents from a specific department, why did the classification fail on this particular format, can we see a breakdown of accuracy by user segment.
Answering those questions requires logging at a granularity that a basic launch stack doesn’t have. The data that should have been captured wasn’t. Building it retroactively takes longer than building it at launch, because you have to reconcile historical gaps alongside new instrumentation.
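The cheap version of capturing it at launch is one structured record per model call, written as JSON lines. The field names below are hypothetical; the point is that segment metadata (department, document format) gets recorded at write time, so the month-six breakdown is a query rather than a rebuild.

```python
import io
import json
import time

def log_model_call(sink, *, request_id, model, department, doc_format,
                   latency_ms, prediction, confidence):
    """Append one JSON-lines record per model call, including the segment
    metadata that later accuracy breakdowns will group by."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "model": model,
        "department": department,
        "doc_format": doc_format,
        "latency_ms": latency_ms,
        "prediction": prediction,
        "confidence": confidence,
    }
    sink.write(json.dumps(record) + "\n")

# In production the sink is a file or a log shipper; StringIO keeps the
# demo self-contained.
buf = io.StringIO()
log_model_call(buf, request_id="req-1", model="some-model", department="legal",
               doc_format="pdf", latency_ms=840, prediction="contract",
               confidence=0.93)
record = json.loads(buf.getvalue())
print(record["department"], record["prediction"])  # legal contract
```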
The vendors who help here include Langfuse, LangSmith, and Arize. Costs range from $50-400/month depending on log volume, plus one to two days of engineering to instrument properly.
I now recommend including a lightweight observability setup in every sprint plan, not as an afterthought. A little infrastructure at month two saves a scramble at month eight.
Human Review Queues That Keep Growing
Almost every AI product we build includes some form of human review: a queue where a human confirms or overrides the AI’s output before it takes effect. This is the right pattern for any use case where errors have real consequences: compliance flagging, financial outputs, legal document classification.
At launch, most clients staff this review queue at 5-10% of outputs: the low-confidence ones the model flags itself. At $0.50-1.00 per human review item, this feels manageable.
Six months in, the queue has usually grown. Not because the model got worse, but because the client’s risk tolerance changed as they understood the edge cases better. A compliance team that was comfortable reviewing 5% of calls in month one is reviewing 18% by month six, because they’ve discovered three categories of call they don’t trust the model to classify alone.
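In code, that shift is usually just two knobs on the routing rule: the confidence threshold and a set of categories that always get a human. A sketch with hypothetical values:

```python
def route(output: dict, threshold: float = 0.85,
          always_review: frozenset = frozenset()) -> str:
    """Return 'auto' or 'review'. `always_review` holds categories the
    client has decided a human must check regardless of confidence."""
    if output["category"] in always_review:
        return "review"
    return "auto" if output["confidence"] >= threshold else "review"

call = {"category": "billing", "confidence": 0.91}
# Month one: only low-confidence outputs go to humans.
print(route(call))                                        # auto
# Month six: same output, now reviewed because the category is flagged.
print(route(call, always_review=frozenset({"billing"})))  # review
```

The routing logic is trivial; the cost driver is the human time behind the "review" branch, which is why the threshold and category set belong in the budget conversation, not just the codebase.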
That’s not a product failure. It’s a rational response to production data. But it’s a cost line that wasn’t in the original estimate.
What I build into year two projections: a human review assumption that grows to 15-20% by the end of the year, with a per-item cost attached. If the model keeps the number lower, the budget goes unspent. If it drifts up, you’re not surprised.
Reliability Work as Usage Scales
A product that handled 500 requests a day at launch might be handling 3,000 a day by month twelve. The architecture that worked at 500 frequently hits edge cases at 3,000: rate limits, queue backpressure, context window management when multiple requests hit simultaneously, retry logic that turns out to be less robust than it looked in testing.
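The retry piece in particular deserves more than a naive loop. A sketch of exponential backoff with jitter, with `RetryableError` standing in for whatever your vendor client raises on a rate limit or transient server error:

```python
import random
import time

class RetryableError(Exception):
    """Stand-in for a vendor rate-limit or transient server error."""

def call_with_backoff(fn, max_attempts=5, base_delay=0.5):
    """Retry a flaky call, doubling the delay each attempt and adding
    jitter so simultaneous clients don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo: fails twice, then succeeds on the third attempt.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError()
    return "ok"

print(call_with_backoff(flaky, base_delay=0.01))  # ok
```

At 500 requests a day you may never hit the retry path; at 3,000, naive immediate retries can turn one rate-limit response into a pile-on.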
We’ve seen this pattern across products with real usage growth. The reliability work isn’t a failure of the original build. It’s expected engineering at scale. But expected engineering at scale is still engineering time, and engineering time costs money.
My rough rule: if a product’s request volume doubles in year one, budget a week of reliability-focused engineering time in year two. Not all of that goes to fixes; some of it goes to capacity planning, load testing, and setting up alerting that would have caught problems earlier. That week of time is usually much cheaper than a reliability incident that reaches the client’s users.
What a Year Two Budget Actually Looks Like
Here’s how I now present year two estimates to clients before the build starts:
| Item | Annual estimate |
|---|---|
| Model migration (1 event/year, 2-day buffer) | $2,000-4,000 |
| Prompt evaluation and drift monitoring | $3,000-6,000 |
| Observability infrastructure (tools + setup) | $2,000-8,000 |
| Human review queue scaling | $5,000-20,000+ (depends on volume) |
| Reliability and scale work | $3,000-8,000 |
| Total maintenance reserve | $15,000-46,000 |
These ranges are wide because they depend heavily on usage volume and the risk tolerance of the product. A compliance-critical product with high review requirements sits at the high end. An internal tool with a small team and low stakes sits at the low end.
The number that holds across almost all cases: budget 20-30% of your year one build cost as an annual maintenance reserve. A $50,000 build should carry a $10,000-15,000/year maintenance line. That’s not a contract commitment. It’s a planning assumption that prevents the “why is year two costing this much” conversation from becoming a trust problem.
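The planning rule as arithmetic, so you can drop your own build cost in:

```python
def maintenance_reserve(build_cost: float, low: float = 0.20, high: float = 0.30):
    """Annual maintenance reserve as 20-30% of the year one build cost."""
    return build_cost * low, build_cost * high

print(maintenance_reserve(50_000))  # (10000.0, 15000.0)
```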
If you’re mid-build and this number is new information, the best time to address it is before launch, when the architecture choices are still flexible. Adding observability at month two costs a sprint. Retrofitting it at month ten costs three.
FAQ
How much does ongoing AI maintenance actually cost per year?
For most products we build (B2B tools with 500-5,000 daily requests, real stakes on accuracy), year two maintenance runs $15,000-46,000 per year in dedicated engineering and review costs. The wide range is driven by how much human review the product requires and how aggressively the client’s usage scales. A lightweight internal tool with a small team and low-stakes outputs can run on the low end. A compliance-critical product routing external decisions needs the full reserve.
Does the AI model need regular updates or does it maintain itself?
The model itself is maintained by the vendor. But that’s the problem: vendors update on their own schedule, which sometimes includes behavior changes or deprecations that require work on your side. What needs active maintenance is your prompt layer, your evaluation suite, and your integration code. These degrade without attention, not all at once, but consistently over the course of a year.
Who should own AI maintenance inside a startup?
This is the question most founding teams don’t answer until it becomes urgent. In most B2B startups we work with, the product owner handles evaluation and human review decisions, while a part-time engineer handles infrastructure and migration work. If you don’t have an internal engineer, working with the AI development team that built the product on a maintenance retainer is usually cheaper than calling a new agency every time something needs attention.
What’s the cost of not maintaining an AI product?
The compounding cost of deferred maintenance is what I see most often. Prompt drift goes uncaught, accuracy drops from 91% to 84% over four months, the client’s team starts routing around the tool, and eventually you’re rebuilding trust along with the product. The engineering cost to fix a six-month drift is usually two to three times the cost of catching it quarterly. One month of deferred model migration means the fix happens under deadline pressure, which is always more expensive than planned work.
When is a maintenance retainer worth it vs hiring internally?
For most seed and Series A companies, a maintenance retainer from an AI development services partner is cheaper than a full-time hire until you’re above about 10,000 daily requests and have three or more active AI surfaces in production. Below that threshold, the work is too sporadic to justify a full-time hire and too technical to hand to a generalist. Above it, the internal case usually makes sense on the math.
Building an AI product and want to understand what year two looks like before you commit the year one budget? Book a 30-minute call. We’ll walk through the real maintenance numbers for your specific use case.