
AI Maintenance Costs in Year 2: What Most Founders Miss

Five maintenance costs that don't appear in Year 1 AI budgets: model deprecations, prompt drift, monitoring, human review queues, and reliability work.

Dharini S
People and process before product — turning founder visions into shipped tech
TL;DR
  • Year 1 AI costs are mostly build costs. Year 2 costs are mostly maintenance. Not the same number.
  • Model deprecations force rewrites on a vendor's schedule, not yours. OpenAI has deprecated five major models in 18 months.
  • Prompt drift is invisible until accuracy drops. Most teams discover it from a client complaint, not a monitoring alert.
  • Human review queues that started at 5% of outputs tend to creep toward 20% within six months of launch. That's a staffing line item.
  • Budget 20-30% of your Year 1 build cost as an annual maintenance reserve. If you don't spend it, great. If you need it, you're not scrambling.

A founder I work with called me eight months after we shipped his product. He’d been happy with it since launch. The model was working, clients were using it, the product team wasn’t getting bug reports.

“I need to renew with you,” he said. “What does year two look like?”

I pulled up the post-launch maintenance estimate I’d put in the original proposal. He’d seen it in the kickoff call, and then we’d both moved on to the build itself.

“This is more than I expected,” he said.

He wasn’t wrong to be surprised. The year two number looked different when he was holding it as a real invoice rather than a line in a slide. And I realized the estimate itself was underselling what maintenance actually costs, because I’d written it before we’d run the product in production for a full year.

This post is the year two conversation I now have with every client before we start the build, not after it.

Why Year Two Costs Are Different

Year one AI costs are mostly build costs: engineering hours, sprint fees, infrastructure setup, the initial integration tax. These are one-time expenses that show up on a proposal and feel finite.

Year two costs are operational. They’re smaller per line item, but they compound, they don’t end, and they arrive on schedules you don’t control.

A separate post, "What clients underestimate about Year 1 AI costs," covers token bills and API charges, the costs that scale with usage. Year two maintenance is different. It's about keeping the product from degrading, not just keeping it running.

The five cost categories I now build into every year two estimate are below. None of them are exotic. All of them come up in real maintenance work.

Model Deprecations (The Vendor Forces Your Hand)

OpenAI deprecated gpt-3.5-turbo-0301 in June 2023. Then gpt-4-0314 in September 2023. Then gpt-4-0613 in October 2024. Then gpt-3.5-turbo-0613 and gpt-3.5-turbo-16k-0613 in December 2024. That's five significant model versions in 18 months, all requiring migration on a deadline.

You can track this yourself on OpenAI's model deprecation page. The pattern is consistent: roughly six months' notice, then the model stops responding. When a model you're running in production gets deprecated, you don't just swap the model name in one config file. You test the replacement on your use case, because model behavior changes between versions in ways that are consistent within a version but not always predictable across versions. A prompt that reliably returns JSON from gpt-4o-mini doesn't always do the same thing from gpt-4.1-mini without adjustment. The JSON schema works. The tone and specificity of reasoning shifts.
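
To make that concrete, here's a minimal sketch of the kind of pre-migration check described above, assuming the OpenAI Python SDK; the golden cases and the gpt-4o-mini / gpt-4.1-mini pairing are illustrative placeholders, not a recommendation:

```python
import json

from openai import OpenAI  # assumes the OpenAI Python SDK; adapt for your vendor

client = OpenAI()

# A handful of real production prompts, each with the JSON keys the
# downstream code expects. These two cases are illustrative only.
GOLDEN_CASES = [
    {"prompt": "Classify this support ticket and reply as JSON with keys "
               "'category' and 'urgency': 'Refund not received after 30 days'",
     "required_keys": {"category", "urgency"}},
    {"prompt": "Extract invoice fields as JSON with keys 'vendor', 'total', "
               "'due_date': 'Acme Corp, $1,200, due 2025-03-01'",
     "required_keys": {"vendor", "total", "due_date"}},
]

def json_pass_rate(model: str) -> float:
    """Fraction of golden cases that parse as JSON and contain the expected keys."""
    passed = 0
    for case in GOLDEN_CASES:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": case["prompt"]}],
            temperature=0,
        )
        raw = resp.choices[0].message.content or ""
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and case["required_keys"].issubset(data):
                passed += 1
        except json.JSONDecodeError:
            pass  # a format regression counts as a failure, not a crash
    return passed / len(GOLDEN_CASES)

if __name__ == "__main__":
    current = json_pass_rate("gpt-4o-mini")     # the model being deprecated
    candidate = json_pass_rate("gpt-4.1-mini")  # the proposed replacement
    print(f"current: {current:.0%}  candidate: {candidate:.0%}")
    # Migrate only once the candidate matches or beats the current model.
```

The useful output isn't the pass rate by itself; it's the specific cases the candidate fails, because those are the prompts you'll be adjusting during the migration window.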

On two of the products we maintain, model migrations took two to three engineering days each. Not catastrophic. But two to three days of unplanned work, on a deadline, is a maintenance event the original build budget didn’t account for.

What I now include in year two estimates: one model migration per year, with a two-day engineering buffer. Some years you’ll use it. Some years you won’t. You should price it in either way.

Prompt Drift (Accuracy Degrades Quietly)

A model update from the vendor doesn’t have to be a deprecation to change your product’s behavior. Minor version updates, under-the-hood safety adjustments, and changes to the model’s RLHF tuning can all shift how consistently your prompts perform.

We’ve seen this show up in two ways. First: the model starts returning outputs in a slightly different format than before. If your downstream code was parsing those outputs with any fragility, it breaks. Second: the model starts hedging more on edge cases the client’s users hit regularly. Not wrong, just softer and less confident, which causes accuracy scores to drift down.

Both are quiet problems. They don’t trigger errors. They don’t generate exceptions in your monitoring. They generate client complaints, usually framed as: “it’s not working as well as it used to.”

Catching this requires active evaluation: running a representative test set against the live model on a schedule and comparing scores against the baseline. Without that, you’re waiting for the client to find it.

The maintenance cost: quarterly evaluation runs and the engineering time to investigate when scores shift. Usually four to eight hours per quarter, more if an adjustment is needed. Small on its own. Real over a year.
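
For illustration, here's a stripped-down version of what that quarterly run can look like, again assuming the OpenAI SDK; the test cases, the 0.91 baseline, and the three-point alert threshold are placeholder numbers, and a real grader is usually more involved than an exact-match check:

```python
import statistics
from datetime import date

from openai import OpenAI  # assumes the OpenAI Python SDK; adapt for your vendor

client = OpenAI()

MODEL = "gpt-4o-mini"      # whatever is live in production
BASELINE = 0.91            # mean score recorded when the product shipped
ALERT_THRESHOLD = 0.03     # investigate drops larger than three points

# Representative cases pulled from production, with the label a human agreed on.
TEST_SET = [
    {"prompt": "Classify this ticket as 'billing', 'bug', or 'other': "
               "'Refund not received after 30 days'", "expected": "billing"},
    {"prompt": "Classify this ticket as 'billing', 'bug', or 'other': "
               "'App crashes when uploading a PDF'", "expected": "bug"},
]

def score_case(case: dict) -> float:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": case["prompt"]}],
        temperature=0,
    )
    answer = (resp.choices[0].message.content or "").strip().lower()
    return 1.0 if case["expected"] in answer else 0.0

if __name__ == "__main__":
    current = statistics.mean(score_case(c) for c in TEST_SET)
    print(f"{date.today()}: baseline={BASELINE:.2f} current={current:.2f}")
    if BASELINE - current > ALERT_THRESHOLD:
        # Wire this to Slack, email, whatever the team actually reads. The point
        # is that an engineer hears about the drop before the client does.
        print("ALERT: eval score drifted below baseline; investigate the prompt layer")
```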

Monitoring Infrastructure You Didn’t Budget for at Launch

The first version of an AI product often ships with lightweight monitoring: error rates, response times, the basics. This is fine for launch. It stops being fine at around the six-month mark, when the client’s team is large enough that edge cases become regular cases.

By month six or seven, most clients start asking for things that require observability infrastructure: how is the model performing on documents from a specific department, why did the classification fail on this particular format, can we see a breakdown of accuracy by user segment.

Answering those questions requires logging at a granularity that a basic launch stack doesn’t have. The data that should have been captured wasn’t. Building it retroactively takes longer than building it at launch, because you have to reconcile historical gaps alongside new instrumentation.
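
You don't need a vendor to start capturing this. Here's a minimal sketch of a per-request record that makes those month-six questions answerable; the field names are illustrative assumptions, and what your product actually slices by (department, document type, prompt version) is the part to get right early:

```python
import json
import time
import uuid
from datetime import datetime, timezone

LOG_PATH = "model_calls.jsonl"  # or a table in whatever store you already run

def log_model_call(*, user_id: str, department: str, doc_type: str,
                   model: str, prompt_version: str, latency_ms: float,
                   confidence: float, outcome: str) -> None:
    """Append one structured record per model call, as a JSON line."""
    record = {
        "id": str(uuid.uuid4()),
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "department": department,    # the dimensions the client will ask
        "doc_type": doc_type,        # to slice by at month six
        "model": model,
        "prompt_version": prompt_version,
        "latency_ms": round(latency_ms, 1),
        "confidence": confidence,
        "outcome": outcome,          # "auto_approved", "sent_to_review", "parse_error", ...
    }
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")

# Usage: wrap whatever call you already make.
start = time.perf_counter()
# result = run_classification(document)   # your existing model call
log_model_call(user_id="u_123", department="claims", doc_type="invoice",
               model="gpt-4o-mini", prompt_version="v7",
               latency_ms=(time.perf_counter() - start) * 1000,
               confidence=0.82, outcome="auto_approved")
```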

The vendors who help here include Langfuse, LangSmith, and Arize. Costs range from $50-400/month depending on log volume, plus one to two days of engineering to instrument properly.

I now recommend including a lightweight observability setup in every sprint plan, not as an afterthought. A little infrastructure at month two saves a scramble at month eight.

Human Review Queues That Keep Growing

Almost every AI product we build includes some form of human review: a queue where a human confirms or overrides the AI's output before it takes effect. This is the right pattern for any use case where errors have real consequences: compliance flagging, financial outputs, legal document classification.

At launch, most clients staff this review queue at 5-10% of outputs: the low-confidence ones the model flags itself. At $0.50-1.00 per human review item, this feels manageable.

Six months in, the queue has usually grown. Not because the model got worse, but because the client’s risk tolerance changed as they understood the edge cases better. A compliance team that was comfortable reviewing 5% of calls in month one is reviewing 18% by month six, because they’ve discovered three categories of call they don’t trust the model to classify alone.

That’s not a product failure. It’s a rational response to production data. But it’s a cost line that wasn’t in the original estimate.

What I build into year two projections: a human review assumption that grows to 15-20% by the end of the year, with a per-item cost attached. If the model keeps the number lower, the budget goes unspent. If it drifts up, you’re not surprised.
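
As a back-of-envelope illustration of that projection, here's the arithmetic in code, with every number a placeholder to swap for your own volume and per-item cost:

```python
# Rough projection of the year two review line. All values are assumptions:
# an illustrative 500 outputs a day, a $0.75 per-item cost (midpoint of the
# $0.50-1.00 range above), and a review rate that creeps from 5% toward 18%.
DAILY_OUTPUTS = 500
COST_PER_REVIEW = 0.75

# Review rate by month over year two.
MONTHLY_REVIEW_RATE = [0.05, 0.06, 0.08, 0.10, 0.12, 0.14,
                       0.15, 0.16, 0.17, 0.18, 0.18, 0.18]

annual_cost = sum(rate * DAILY_OUTPUTS * 30 * COST_PER_REVIEW
                  for rate in MONTHLY_REVIEW_RATE)
print(f"Projected year two review cost: ${annual_cost:,.0f}")
# With these placeholder numbers: about $17,700, near the top of the
# $5,000-20,000+ range in the budget table below.
```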

Reliability Work as Usage Scales

A product that handled 500 requests a day at launch might be handling 3,000 a day by month twelve. The architecture that worked at 500 frequently hits edge cases at 3,000: rate limits, queue backpressure, context window management when multiple requests hit simultaneously, retry logic that turns out to be less robust than it looked in testing.
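
The retry point is worth making concrete. Here's a minimal sketch of backoff with jitter and a bounded attempt count; the exception types to retry on are placeholders for whatever your SDK actually raises:

```python
import random
import time

def call_with_retries(fn, *, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      retryable=(TimeoutError,)):
    """Run fn(), retrying only on transient errors, with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_attempts:
                raise  # surface the failure instead of retrying forever
            # Exponential backoff plus jitter, so simultaneous requests don't
            # all retry at the same instant and re-trigger the rate limit.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay + random.uniform(0, delay / 2))

# Usage: pass the transient exceptions your vendor SDK raises, e.g. its
# rate-limit and timeout errors.
# result = call_with_retries(
#     lambda: client.chat.completions.create(model=MODEL, messages=messages),
#     retryable=(RateLimitError, TimeoutError),
# )
```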

We’ve seen this pattern across products with real usage growth. The reliability work isn’t a failure of the original build. It’s expected engineering at scale. But expected engineering at scale is still engineering time, and engineering time costs money.

My rough rule: if a product’s request volume doubles in year one, budget a week of reliability-focused engineering time in year two. Not all of that goes to fixes; some of it goes to capacity planning, load testing, and setting up alerting that would have caught problems earlier. That week of time is usually much cheaper than a reliability incident that reaches the client’s users.

What a Year Two Budget Actually Looks Like

Here’s how I now present year two estimates to clients before the build starts:

  • Model migration (1 event/year, 2-day buffer): $2,000-4,000
  • Prompt evaluation and drift monitoring: $3,000-6,000
  • Observability infrastructure (tools + setup): $2,000-8,000/yr
  • Human review queue scaling: $5,000-20,000+ (depends on volume)
  • Reliability and scale work: $3,000-8,000
  • Total maintenance reserve: $15,000-46,000/yr

These ranges are wide because they depend heavily on usage volume and the risk tolerance of the product. A compliance-critical product with high review requirements sits at the high end. An internal tool with a small team and low stakes sits at the low end.

The number that holds across almost all cases: budget 20-30% of your year one build cost as an annual maintenance reserve. A $50,000 build should carry a $10,000-15,000/year maintenance line. That’s not a contract commitment. It’s a planning assumption that prevents the “why is year two costing this much” conversation from becoming a trust problem.

If you’re mid-build and this number is new information, the best time to address it is before launch, when the architecture choices are still flexible. Adding observability at month two costs a sprint. Retrofitting it at month ten costs three.

FAQ

How much does ongoing AI maintenance actually cost per year?

For most products we build (B2B tools with 500-5,000 daily requests, real stakes on accuracy), year two maintenance runs $15,000-46,000 per year in dedicated engineering and review costs. The wide range is driven by how much human review the product requires and how aggressively the client’s usage scales. A lightweight internal tool with a small team and low-stakes outputs can run on the low end. A compliance-critical product routing external decisions needs the full reserve.

Does the AI model need regular updates or does it maintain itself?

The model itself is maintained by the vendor. But that’s the problem: vendors update on their own schedule, which sometimes includes behavior changes or deprecations that require work on your side. What needs active maintenance is your prompt layer, your evaluation suite, and your integration code. These degrade without attention, not all at once, but consistently over the course of a year.

Who should own AI maintenance inside a startup?

This is the question most founding teams don’t answer until it becomes urgent. In most B2B startups we work with, the product owner handles evaluation and human review decisions, while a part-time engineer handles infrastructure and migration work. If you don’t have an internal engineer, working with the AI development team that built the product on a maintenance retainer is usually cheaper than calling a new agency every time something needs attention.

What’s the cost of not maintaining an AI product?

The compounding cost of deferred maintenance is what I see most often. Prompt drift goes uncaught, accuracy drops from 91% to 84% over four months, the client’s team starts routing around the tool, and eventually you’re rebuilding trust along with the product. The engineering cost to fix a six-month drift is usually two to three times the cost of catching it quarterly. One month of deferred model migration means the fix happens under deadline pressure, which is always more expensive than planned work.

When is a maintenance retainer worth it vs hiring internally?

For most seed and Series A companies, a maintenance retainer from an AI development services partner is cheaper than a full-time hire until you’re above about 10,000 daily requests and have three or more active AI surfaces in production. Below that threshold, the work is too sporadic to justify a full-time hire and too technical to hand to a generalist. Above it, the internal case usually makes sense on the math.


Building an AI product and want to understand what year two looks like before you commit the year one budget? Book a 30-minute call. We’ll walk through the real maintenance numbers for your specific use case.



Written by Dharini S

People and process before product — turning founder visions into shipped tech

Dharini sits between the founder's vision and the engineering team, making sure things move in the right direction — whether that's a full-stack product, an LLM integration, or an agent-based solution. Her background in instructional design and program management means she thinks about people first — how they process information, where they get stuck, what they actually need — before jumping to solutions.
