
How We Estimate AI Projects: The Formula We Use

Two-stage AI project estimation: real line items, the multipliers that change every scope, and a template our PM uses on every build.

Dharini S
People and process before product — turning founder visions into shipped tech
TL;DR
  • AI projects need two estimates: a rough range before discovery, and a binding number after Sprint 0 resolves the real unknowns.
  • Every estimate has four components: model development, integration, evaluation/QA, and PM/delivery overhead.
  • Three multipliers shift every estimate by 30-60%: data readiness, integration complexity, and the accuracy bar the client actually needs.
  • The most common estimate failure we've seen: assuming the client's data is cleaner than it is. We now ask for a sample before committing to any number.
  • A complete Sprint 0 estimate takes 30-60 minutes to build correctly. We don't shortcut this step.

A client asked me in our first call last year: “Before we get into specifics, can you give me a ballpark number? Just so we know if we’re in the same range.”

I understand why she asked. Two hours of discovery is a real commitment. Nobody wants to spend that time and then find out the budget gap is 4x.

I told her: I can give you a range right now, but the binding estimate comes after our first planning week. Then I explained exactly what goes into it.

That conversation is more or less how every estimate starts for us now. Here’s what I’ve learned about doing this well.

Why a Single Number Before Discovery Is Usually Wrong

Estimating regular software has a workable path. You decompose requirements into tasks, add up the days, apply a buffer, and you have something you can commit to. It’s imperfect, but the inputs are stable.

AI work has unstable inputs. Before Sprint 0, we typically don’t know:

  • Whether the client’s existing data is clean enough to evaluate against
  • How complex the integration is (a “simple API” can hide a 9-year-old monolith)
  • What accuracy target actually makes sense for the user’s workflow
  • Whether this needs custom fine-tuning, a RAG pipeline, or just solid prompting

Those four unknowns can move an estimate by 40-60%. I’ve seen projects where we budgeted six weeks on the model layer and the real work was three. I’ve also seen the reverse. Giving a confident single number before Sprint 0 isn’t being helpful. It’s just guessing with extra authority.

So we give a range before discovery, and a real estimate after it.

The Two-Stage Estimate

Stage one: pre-discovery range. Based on what the client shares in the first call, I put the project into one of three buckets. These are honest approximations, not quotes:

  • Small scope. A single well-defined AI feature, clear inputs and outputs, connects to one system. Range: 4-8 weeks, $5,000-$8,000.
  • Medium scope. Two to three AI features, some data uncertainty, multiple integration points. Range: 8-16 weeks, $15,000-$25,000.
  • Large scope. A multi-model pipeline, significant data work, deep integration. Range: 3-6 months, $30,000+.

I tell the client which bucket they’re in, why, and what Sprint 0 will resolve.
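
For illustration only, here’s that first-call bucketing as a rough lookup. The thresholds are a simplification of the scope descriptions above, not a rule we apply mechanically:

```python
# Rough pre-discovery bucketing. The thresholds are illustrative
# simplifications of the scope descriptions above, not a formal rule.

def pre_discovery_bucket(ai_features: int, integration_points: int,
                         data_uncertain: bool) -> str:
    if ai_features <= 1 and integration_points <= 1 and not data_uncertain:
        return "small: 4-8 weeks, $5,000-$8,000"
    if ai_features <= 3:
        return "medium: 8-16 weeks, $15,000-$25,000"
    return "large: 3-6 months, $30,000+"

print(pre_discovery_bucket(ai_features=1, integration_points=1,
                           data_uncertain=False))
# small: 4-8 weeks, $5,000-$8,000
```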

Stage two: post-Sprint 0 binding estimate. After one week of discovery, the major unknowns are documented (and usually resolved). The data has been sampled, the integration depth is clear, the accuracy target has been validated with real examples. This is the number we commit to, broken into components.

The Four Components Every Estimate Has

No matter the project type, every estimate I build has the same four buckets. The ratio between them shifts, but nothing falls outside these four.

Model development. Prompting, retrieval setup, fine-tuning, evaluation loops. This is the work of making the AI do the thing. On a straightforward classification or retrieval project, it runs at 40-50% of the total. On a multi-step pipeline with complex reasoning, it’s often 60-70%.

I don’t estimate “the model layer” as a single number. I break it into: baseline prototype (typically 2-3 days), evaluation against the test set (1-2 days per cycle), and iteration toward the accuracy target. The number of evaluation cycles is the biggest source of variance in this bucket. A well-specified problem usually takes 2-3 cycles to converge. Underspecified problems can take 6-8.
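
A back-of-envelope version of that variance, using the day ranges above:

```python
# Spread in the model-development bucket driven purely by evaluation
# cycle count, using the day ranges quoted above.
prototype_days = 3        # baseline prototype: typically 2-3 days
days_per_cycle = 2        # one evaluation cycle: 1-2 days

well_specified = prototype_days + 3 * days_per_cycle  # 2-3 cycles -> ~9 days
underspecified = prototype_days + 8 * days_per_cycle  # 6-8 cycles -> ~19 days
# Same feature, roughly 2x the model-layer time, before any multiplier.
```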

Integration. API endpoints, authentication, data pipelines, UI layer. On a greenfield project where we control the stack, this is 20-30% of the work. On a project where we’re adding AI features to a complex existing product, it’s often 40-50%.

We ask to see the integration surface before estimating this bucket. “A simple API connection” has meant everything from a 4-hour task to three weeks of work, depending on what’s on the other side. Every time I’ve estimated integration without looking at the actual system, I’ve been wrong.

Evaluation and QA. Test set creation, accuracy measurement, regression testing. Most clients underestimate how much time this takes. We budget 15-20% of total time for evaluation on every project. It’s not optional. A model that’s never tested against real data isn’t done, it’s just untested. The Hugging Face Evaluate library is what we use for most text classification and retrieval evaluation; their documentation explains the methodology clearly if you want to understand what this work involves.
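
For a sense of what one measurement pass looks like, here’s a minimal sketch with Evaluate. The predictions and labels are invented; on a real project this runs against the Sprint 0 test set:

```python
import evaluate  # pip install evaluate

# Toy binary-classification scoring pass. Predictions and reference
# labels are invented for illustration.
accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

predictions = [0, 1, 1, 0, 1, 1]
references  = [0, 1, 0, 0, 1, 1]

print(accuracy.compute(predictions=predictions, references=references))
# {'accuracy': 0.8333...}
print(f1.compute(predictions=predictions, references=references))
# {'f1': 0.8571...}
```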

PM and delivery overhead. Sprint planning, client demos, handoff documentation, async communication. This runs at about 10-15% of total engineering time. We include it explicitly in every estimate. It’s not hidden in “the work.”

The Three Multipliers

These three inputs can move an estimate by 30-60%:

Data readiness. Well-structured, representative, reasonably labeled data: 0.85x the base estimate. Raw but available data with some cleanup: 1.0x (the baseline). Sparse, inconsistent, or mostly synthetic data: 1.3-1.5x.

I ask for a data sample before closing any binding estimate. Not a description of the data, an actual sample. Every time I’ve skipped this step, I’ve regretted it. The most painful project we’ve run started with the client assuring us the training data was “ready to go.” It was four years of CRM exports in three different column formats, about 40% unlabeled. We added three weeks and had an uncomfortable conversation.

Integration complexity. Greenfield build where we control the stack: 0.8x. Adding AI to an existing product with a documented API: 1.0x. Integrating into a legacy system with undocumented internals or a third-party SaaS product that doesn’t fully expose its API: 1.3-1.5x.

Accuracy bar. Some clients can run their workflow with 80% accuracy and handle the exceptions manually. Others need 95% or the use case doesn’t work at all. I ask this directly: “What happens operationally when the model is wrong 10% of the time? Can the workflow absorb that?”

If the answer is no, the estimate goes up. Going from 80% to 92% accuracy typically takes 2-3x the evaluation and iteration work of getting to 80% in the first place. This isn’t a cost we manufacture. It’s the shape of how model performance curves actually work. The Anthropic model documentation gives a sense of where different models sit on the accuracy-capability spectrum, which is useful context when clients are setting expectations.

The Estimation Template

At the end of Sprint 0, I build a line-by-line estimate and walk through it with the client. Here’s a simplified version of the template:

Sprint 0 completed: [Date]

MODEL DEVELOPMENT
  Baseline prototype:      [X] days
  Evaluation loop × [N]:   [X] days each
  Fine-tuning/iteration:   [X] days
  Subtotal:                [X] days

INTEGRATION
  API/endpoint setup:      [X] days
  Auth + data pipelines:   [X] days
  UI layer (if applicable):[X] days
  Subtotal:                [X] days

EVALUATION & QA
  Test set creation:       [X] days
  Regression testing:      [X] days
  Subtotal:                [X] days

PM & DELIVERY OVERHEAD
  Sprint planning + demos: [X] days
  Handoff documentation:   [X] days
  Subtotal:                [X] days

BASE TOTAL:                [X] days

MULTIPLIERS
  Data readiness:          [0.85-1.5]
  Integration complexity:  [0.8-1.5]
  Accuracy bar:            [1.0-1.3]

ADJUSTED ESTIMATE:         [X] days
BUFFER (15%):              [X] days

DELIVERY ESTIMATE:         [X] days
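
To make the arithmetic concrete, here’s the same template as a worked example. Every number is hypothetical, and compounding the three multipliers is our convention; nothing here is a quote:

```python
# Hypothetical worked example of the template above. All day counts and
# multiplier values are invented for illustration.
model_dev   = 3 + 3 * 2 + 4  # prototype + 3 eval cycles x 2 days + iteration
integration = 4 + 3 + 5      # endpoints + auth/pipelines + UI layer
eval_qa     = 3 + 2          # test set creation + regression testing
pm_delivery = 3 + 2          # sprint planning/demos + handoff docs

base_total = model_dev + integration + eval_qa + pm_delivery  # 35 days

# Multipliers from Sprint 0 findings: raw-but-available data, documented
# API, a workflow that can absorb ~10% error.
data_readiness         = 1.0
integration_complexity = 1.0
accuracy_bar           = 1.0

adjusted = base_total * data_readiness * integration_complexity * accuracy_bar
delivery = adjusted * 1.15  # 15% buffer

print(f"base {base_total}d, adjusted {adjusted:.0f}d, delivery {delivery:.0f}d")
# base 35d, adjusted 35d, delivery 40d
```

Swap in a 1.3x data-readiness multiplier and the same base total lands at about 52 delivery days, which is why the multipliers, not the line items, dominate the walkthrough.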

I share this with every client at Sprint 0 close, not as a take-it-or-leave-it number but as a walkthrough. Every line is a conversation. If a line looks wrong to them, that’s important information. Often they know something about their stack or data that changes a multiplier.

Where Estimates Break

Every project has at least one surprise. The three I see most often:

Data assumptions that don’t hold. Covered above. Ask for a sample. Always.

Integration depth we couldn’t see before Sprint 0. A client described their backend as “pretty standard.” It turned out to be a 9-year-old Rails monolith where authentication was coupled to the data model in a non-obvious way. Integration took three weeks instead of one. We now ask for a read-only environment to probe the integration surface during Sprint 0, not a description of it.

Accuracy expectations that changed mid-project. The client agreed to 82% precision during scoping. When the model was running on actual production data, they decided 82% felt too uncertain for their support team. We reset the target to 90% and ran two more evaluation cycles. This is why we validate the accuracy bar with concrete examples in Sprint 0, not just a percentage. “What would you do if you saw this specific wrong output?” is a better question than “what’s your accuracy requirement?”

For the scope change side of estimation drift, the scope creep post goes deeper into how we handle mid-project additions without blowing the original estimate.

FAQ

How much does a typical AI development project cost?

For a single well-defined AI feature (classification, retrieval, summarization), the typical range is $5,000-$8,000 for 4-8 weeks of work. Multi-feature builds run $15,000-$25,000. Multi-model pipelines start at $30,000 and go up based on scope. These ranges assume reasonable data readiness and standard integration complexity. The binding number comes after Sprint 0.

How long does a typical AI project take from kickoff to delivery?

Most single-feature builds land in 4-10 weeks. The fastest project we’ve run was 11 days: small scope, clean data, greenfield integration, low accuracy bar. The longest was 5 months: multi-model pipeline, legacy system integration, 94% accuracy requirement. Data readiness and integration complexity drive the timeline more than anything that happens in the model layer.

What’s included in the AI development services estimate you give?

Every estimate has four buckets: model development, integration, evaluation/QA, and PM/delivery overhead. We show the breakdown, not just a total. If a provider gives you a flat $X number without showing the component split, ask them to show the work. The ratio between components tells you a lot about whether the estimate is realistic.

What happens if the estimate runs over mid-project?

We track actuals against estimate at every sprint. If we’re trending 15% or more over on any component, we call it out at the demo, not at the end of the project. Earlier variance signals mean more options: adjust scope, adjust timeline, adjust requirements. The worst version of an over-running project is one where the team absorbed it silently and hoped to catch up. We don’t do that.
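
The per-sprint check is simple enough to sketch. Component names and day counts here are invented; the 15% threshold is the policy described above:

```python
# Minimal per-sprint variance check. Components and day counts are
# invented; the 15% threshold matches the policy described above.
def flag_overruns(estimate, actuals, threshold=0.15):
    """Return components trending more than `threshold` over estimate."""
    return [component for component, estimated in estimate.items()
            if actuals.get(component, 0.0) > estimated * (1 + threshold)]

print(flag_overruns(
    {"model_dev": 13, "integration": 12, "eval_qa": 5, "pm": 5},
    {"model_dev": 16, "integration": 8, "eval_qa": 2, "pm": 2},
))
# ['model_dev']
```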

How detailed does my brief need to be to get a reliable estimate?

The most useful inputs are: what the AI needs to do (a specific input-to-output description, not a category), what data you have, and what systems it connects to. If you can describe a real user scenario, that’s usually enough for a pre-discovery range. We’ll get the rest in Sprint 0. The discovery call checklist explains exactly what we’re trying to resolve in that first week.


If you’re scoping an AI build and want to see this framework applied to your specific requirements, book a 30-minute call. We’ll walk through the estimate together and give you a real range before the call ends.
