A client asked me in our first call last year: “Before we get into specifics, can you give me a ballpark number? Just so we know if we’re in the same range.”
I understand why she asked. Two hours of discovery is a real commitment. Nobody wants to spend that time and then find out the budget gap is 4x.
I told her: I can give you a range right now, but the binding estimate comes after our first planning week. Then I explained exactly what goes into it.
That conversation is more or less how every estimate starts for us now. Here’s what I’ve learned about doing this well.
Why a Single Number Before Discovery Is Usually Wrong
Estimating conventional software follows a workable path. You decompose requirements into tasks, add up the days, apply a buffer, and you have something you can commit to. It’s imperfect, but the inputs are stable.
AI work has unstable inputs. Before Sprint 0, we typically don’t know:
- Whether the client’s existing data is clean enough to evaluate against
- How complex the integration is (a “simple API” can hide a 9-year-old monolith)
- What accuracy target actually makes sense for the user’s workflow
- Whether this needs custom fine-tuning, a RAG pipeline, or just solid prompting
Those four unknowns can move an estimate by 40-60%. I’ve seen projects where we budgeted six weeks on the model layer and the real work was three. I’ve also seen the reverse. Giving a confident single number before Sprint 0 isn’t being helpful. It’s just guessing with extra authority.
So we give a range before discovery, and a real estimate after it.
The Two-Stage Estimate
Stage one: pre-discovery range. Based on what the client shares in the first call, I put the project into one of three buckets. These are honest approximations, not quotes:
- Small scope. A single well-defined AI feature, clear inputs and outputs, connects to one system. Range: 4-8 weeks, $5,000-$8,000.
- Medium scope. Two to three AI features, some data uncertainty, multiple integration points. Range: 8-16 weeks, $15,000-$25,000.
- Large scope. A multi-model pipeline, significant data work, deep integration. Range: 3-6 months, $30,000+.
I tell the client which bucket they’re in, why, and what Sprint 0 will resolve.
Stage two: post-Sprint 0 binding estimate. After one week of discovery, the major unknowns are documented (and usually resolved). The data has been sampled, the integration depth is clear, the accuracy target has been validated with real examples. This is the number we commit to, broken into components.
The Four Components Every Estimate Has
No matter the project type, every estimate I build has the same four buckets. The ratio between them shifts, but nothing falls outside these four.
Model development. Prompting, retrieval setup, fine-tuning, evaluation loops. This is the work of making the AI do the thing. On a straightforward classification or retrieval project, it runs at 40-50% of the total. On a multi-step pipeline with complex reasoning, it’s often 60-70%.
I don’t estimate “the model layer” as a single number. I break it into: baseline prototype (typically 2-3 days), evaluation against the test set (1-2 days per cycle), and iteration toward the accuracy target. The number of evaluation cycles is the biggest source of variance in this bucket. A well-specified problem usually takes 2-3 cycles to converge. Underspecified problems can take 6-8.
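To make that variance concrete, here’s a rough back-of-the-envelope sketch. The day figures are illustrative placeholders pulled from the typical ranges above, not a formula we apply blindly:

```python
# Rough model-development budget: baseline prototype plus N evaluation cycles.
# Day figures are illustrative, not fixed rates.
def model_dev_days(baseline_days: float, cycles: int, days_per_cycle: float) -> float:
    return baseline_days + cycles * days_per_cycle

well_specified = model_dev_days(baseline_days=2.5, cycles=3, days_per_cycle=1.5)   # 7.0 days
underspecified = model_dev_days(baseline_days=2.5, cycles=7, days_per_cycle=1.5)   # 13.0 days
print(well_specified, underspecified)
```

Same baseline, same cost per cycle; the cycle count alone nearly doubles the bucket. That’s why pinning down the problem definition in Sprint 0 matters so much.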
Integration. API endpoints, authentication, data pipelines, UI layer. On a greenfield project where we control the stack, this is 20-30% of the work. On a project where we’re adding AI features to a complex existing product, it’s often 40-50%.
We ask to see the integration surface before estimating this bucket. “A simple API connection” has meant everything from a 4-hour task to three weeks of work, depending on what’s on the other side. Every time I’ve estimated integration without looking at the actual system, I’ve been wrong.
Evaluation and QA. Test set creation, accuracy measurement, regression testing. Most clients underestimate how much time this takes. We budget 15-20% of total time for evaluation on every project. It’s not optional. A model that’s never tested against real data isn’t done, it’s just untested. The HuggingFace Evaluate library is what we use for most text classification and retrieval evaluation; their documentation explains the methodology clearly if you want to understand what this work involves.
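If you want a feel for what a single measurement pass looks like, here’s a minimal sketch using the Evaluate library on a toy classification set. The labels and predictions below are placeholders, not client data:

```python
import evaluate  # pip install evaluate

# Toy test set: reference labels vs. model predictions (placeholders, not real data)
references  = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

print(accuracy.compute(predictions=predictions, references=references))  # {'accuracy': 0.75}
print(f1.compute(predictions=predictions, references=references))        # {'f1': 0.75}
```

The measurement itself is the easy part. Building a test set that actually represents production inputs, and re-running it after every change, is where the 15-20% goes.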
PM and delivery overhead. Sprint planning, client demos, handoff documentation, async communication. This runs at about 10-15% of total engineering time. We include it explicitly in every estimate. It’s not hidden in “the work.”
The Three Multipliers
These three inputs can move an estimate by 30-60%:
Data readiness. Well-structured, representative, reasonably labeled data: 0.85x the base estimate. Raw but available data with some cleanup: 1.0x (the baseline). Sparse, inconsistent, or mostly synthetic data: 1.3-1.5x.
I ask for a data sample before closing any binding estimate. Not a description of the data, an actual sample. Every time I’ve skipped this step, I’ve regretted it. The most painful project we’ve run started with the client assuring us the training data was “ready to go.” It was four years of CRM exports in three different column formats, about 40% unlabeled. We added three weeks and had an uncomfortable conversation.
Integration complexity. Greenfield build where we control the stack: 0.8x. Adding AI to an existing product with a documented API: 1.0x. Integrating into a legacy system with undocumented internals or a third-party SaaS product that doesn’t fully expose its API: 1.3-1.5x.
Accuracy bar. Some clients can run their workflow with 80% accuracy and handle the exceptions manually. Others need 95% or the use case doesn’t work at all. I ask this directly: “What happens operationally when the model is wrong 10% of the time? Can the workflow absorb that?”
If the answer is no, the estimate goes up. Going from 80% to 92% accuracy typically takes 2-3x the evaluation and iteration work of getting to 80% in the first place. This isn’t a cost we manufacture; it’s how model performance curves behave, with each additional point of accuracy costing more than the last. The Anthropic model documentation gives a sense of where different models sit on the accuracy-capability spectrum, which is useful context when clients are setting expectations.
The Estimation Template
At the end of Sprint 0, I build a line-by-line estimate and walk through it with the client. Here’s a simplified version of the template:
Sprint 0 completed: [Date]
MODEL DEVELOPMENT
Baseline prototype: [X] days
Evaluation loop × [N]: [X] days each
Fine-tuning/iteration: [X] days
Subtotal: [X] days
INTEGRATION
API/endpoint setup: [X] days
Auth + data pipelines: [X] days
UI layer (if applicable): [X] days
Subtotal: [X] days
EVALUATION & QA
Test set creation: [X] days
Regression testing: [X] days
Subtotal: [X] days
PM & DELIVERY OVERHEAD
Sprint planning + demos: [X] days
Handoff documentation: [X] days
Subtotal: [X] days
BASE TOTAL: [X] days
MULTIPLIERS
Data readiness: [0.85-1.5]
Integration complexity: [0.8-1.5]
Accuracy bar: [1.0-1.3]
ADJUSTED ESTIMATE: [X] days
BUFFER (15%): [X] days
DELIVERY ESTIMATE: [X] days
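For readers who want to see the arithmetic behind the bottom three lines, here’s a minimal sketch. The day counts are illustrative, the multipliers are picked from the ranges above, and the three multipliers compound multiplicatively before the buffer is added:

```python
# Illustrative component subtotals in days (placeholders, not a real estimate)
components = {
    "model_development": 14,
    "integration": 8,
    "evaluation_qa": 5,
    "pm_delivery": 3,
}

# Multipliers chosen from the ranges above, based on Sprint 0 findings
multipliers = {
    "data_readiness": 1.0,          # raw but available data
    "integration_complexity": 1.3,  # legacy system, partially documented
    "accuracy_bar": 1.0,            # workflow can absorb the error rate
}

base_total = sum(components.values())   # 30 days
adjusted = base_total
for m in multipliers.values():
    adjusted *= m                       # 39.0 days
delivery_estimate = adjusted * 1.15     # 15% buffer -> 44.85 days

print(f"Base: {base_total}, adjusted: {adjusted:.1f}, with buffer: {delivery_estimate:.1f} days")
```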
I share this with every client at Sprint 0 close, not as a take-it-or-leave-it number but as a walkthrough. Every line is a conversation. If a line looks wrong to them, that’s important information. Often they know something about their stack or data that changes a multiplier.
Where Estimates Break
Every project has at least one surprise. The three I see most often:
Data assumptions that don’t hold. Covered above. Ask for a sample. Always.
Integration depth we couldn’t see before Sprint 0. A client described their backend as “pretty standard.” It turned out to be a 9-year-old Rails monolith where authentication was coupled to the data model in a non-obvious way. Integration took three weeks instead of one. We now ask for a read-only environment to probe the integration surface during Sprint 0, not a description of it.
Accuracy expectations that changed mid-project. The client agreed to 82% precision during scoping. When the model was running on actual production data, they decided 82% felt too uncertain for their support team. We reset the target to 90% and ran two more evaluation cycles. This is why we validate the accuracy bar with concrete examples in Sprint 0, not just a percentage. “What would you do if you saw this specific wrong output?” is a better question than “what’s your accuracy requirement?”
For the scope change side of estimation drift, the scope creep post goes deeper into how we handle mid-project additions without blowing the original estimate.
FAQ
How much does a typical AI development project cost?
For a single well-defined AI feature (classification, retrieval, summarization), the typical range is $5,000-$8,000 for 4-8 weeks of work. Multi-feature builds run $15,000-$25,000. Multi-model pipelines start at $30,000 and go up based on scope. These ranges assume reasonable data readiness and standard integration complexity. The binding number comes after Sprint 0.
How long does a typical AI project take from kickoff to delivery?
Most single-feature builds land in 4-10 weeks. The fastest project we’ve run was 11 days: small scope, clean data, greenfield integration, low accuracy bar. The longest was 5 months: multi-model pipeline, legacy system integration, 94% accuracy requirement. Data readiness and integration complexity drive the timeline more than anything in the model layer does.
What’s included in the AI development services estimate you give?
Every estimate has four buckets: model development, integration, evaluation/QA, and PM/delivery overhead. We show the breakdown, not just a total. If a provider gives you a flat $X number without showing the component split, ask them to show the work. The ratio between components tells you a lot about whether the estimate is realistic.
What happens if the estimate runs over mid-project?
We track actuals against estimate at every sprint. If we’re trending 15% or more over on any component, we call it out at the demo, not at the end of the project. Earlier variance signals mean more options: adjust scope, adjust timeline, adjust requirements. The worst version of an over-running project is one where the team absorbed it silently and hoped to catch up. We don’t do that.
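In practice the check is simple enough to live in a spreadsheet, but here’s the logic as a short sketch, with made-up numbers:

```python
# Planned days per component vs. our current forecast at completion (made-up numbers)
planned  = {"model_development": 14, "integration": 8, "evaluation_qa": 5, "pm_delivery": 3}
forecast = {"model_development": 13, "integration": 11, "evaluation_qa": 5, "pm_delivery": 3}

THRESHOLD = 0.15  # flag once a component is trending 15%+ over plan

for component, plan in planned.items():
    variance = (forecast[component] - plan) / plan
    if variance >= THRESHOLD:
        print(f"{component}: {variance:.0%} over plan - raise it at the sprint demo")
```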
How detailed does my brief need to be to get a reliable estimate?
The most useful inputs are: what the AI needs to do (a specific input-to-output description, not a category), what data you have, and what systems it connects to. If you can describe a real user scenario, that’s usually enough for a pre-discovery range. We’ll get the rest in Sprint 0. The discovery call checklist explains exactly what we’re trying to resolve in that first week.
If you’re scoping an AI build and want to see this framework applied to your specific requirements, book a 30-minute call. We’ll walk through the estimate together and give you a real range before the call ends.