Three founders called me in the past two months. Each had already signed a contract with an AI development agency. Each was calling because something had gone wrong.
First founder: B2B SaaS, $2M raised, needed an AI classification feature for their dashboard. Got a 38-page proposal within 36 hours of a discovery call. The proposal mentioned “large language models” 14 times and named zero specific models. They signed anyway. Month four: still in development. The agency blamed model latency. The actual problem was they’d scoped the project without understanding the founder’s data shape.
Second founder: marketplace startup, needed a matching algorithm with AI scoring. The agency promised “live in 8 weeks.” It took 22. Their explanation: “AI is inherently uncertain.” The contract had no milestone gates and the founder had no recourse.
Third founder: SaaS tool for HR teams, needed a document processing pipeline. The agency said they required a 6-week, $30K discovery phase before they could quote the build. The founder paid. What they received was a requirements document and a slightly more expensive proposal.
None of these are bad founders. They’re smart people who got burned by an evaluation process that rewards polished sales conversations over actual engineering capability. That’s the AI agency vendor selection problem: the signals that predict failure are visible during the sales cycle, not after the contract is signed. You just need to know what to watch for.
Why AI Agencies Are Harder to Evaluate Than Other Vendors
If you’re buying from a web design agency or a traditional software shop, evaluation is relatively straightforward. You look at portfolios, call a reference, compare deliverables to a spec.
AI development doesn’t work that way. Three structural problems make it harder:
Legitimate uncertainty creates cover for overpromising. AI projects genuinely have more variance than traditional software builds. Agencies exploit this. “AI is unpredictable” is both true and an excuse. Real AI teams distinguish between what’s genuinely uncertain (does your data support this use case?) and what’s just poor scoping (they didn’t ask the right questions upfront).
NDAs make track records harder to verify. Most AI agency case studies are anonymized. This is legitimate, but it means you can’t easily call a reference and validate the story you’re being told. You’re making a significant decision with limited verifiable evidence.
Anyone can claim AI experience. Unlike regulated industries where credentials matter, any software shop can rebrand as an “AI development agency” without having shipped a production AI system. The market grew fast in 2023-2024. A lot of the shops that appeared in that window are building their first real AI product alongside yours.
So the vendor conversation ends up carrying most of the evaluation. The question is which signals in that conversation predict good outcomes and which predict bad ones.
Three Questions to Ask Yourself First
Before you run through the red flags, get clear on your own situation.
What specific outcome do you need in 90 days? Not “build AI into our product” but something like: “reduce average support ticket resolution from 8 minutes to under 2 minutes using AI classification and suggested responses.” The more specific you can be, the faster you’ll surface whether the agency has actually built this before.
Can you verify any of their claims? Do they have a live public demo? A named client willing to take a reference call? A GitHub repo showing relevant prior work? The less you can verify, the more weight the rest of the conversation has to carry.
What happens if this goes sideways? If you’re betting your Q3 roadmap on a single agency delivering, the evaluation stakes are higher. If this is an experiment with runway to recover, your tolerance for uncertainty is different. Know your position before you negotiate.
Now the red flags.
Red Flags in the Proposal Phase (1–4)
1. The proposal arrived within 24 hours of your first call.
A serious technical team needs time to think about your problem. Sales documents don't require thinking. If you got a 20-30 page proposal within a day of a one-hour discovery call, what you received was a template with your company name pasted in. Ask yourself what they could have learned about your problem in 60 minutes that actually informed this proposal. If the honest answer is nothing, that's the tell.
2. The proposal is long but contains no specific architecture decisions.
“We’ll use state-of-the-art AI models and modern cloud infrastructure” means nothing. A proposal from a team that’s actually thought about your problem should name a model or model family, explain why, note a relevant trade-off, and flag one thing they don’t know yet that affects the approach. Length without specificity signals that nobody technical read your brief.
3. They’re charging for discovery before any code runs.
A discovery phase costing $15K-30K over 4-6 weeks before any development starts is usually a euphemism for “we need to figure out what we’re building.” Real AI teams scope fast. At Kalvium Labs, we prototype your core use case in 72 hours, not because we’re rushing but because prototyping is how we scope. Paying to discover what you’re buying is backwards.
4. The timeline is fixed before they’ve seen your data.
“12 weeks to production” before they’ve seen your dataset, your API constraints, or your production environment is sales math, not engineering math. The right answer to “how long will this take?” is: “here are three unknowns we need to resolve in the first two weeks, and here’s how the timeline shifts depending on what we find.” Certainty before discovery is a warning, not a comfort.
Red Flags in the Technical Conversation (5–8)
5. Nobody can name the specific model for your use case.
Ask directly: “what model would you use for this problem, and why?” A good answer names something specific. Claude 3.5 Sonnet because you need complex reasoning and cost-per-token matters. Deepgram Nova-2 because real-time transcription latency is the binding constraint. GPT-4o with a retrieval layer because your context window requirements exceed what the alternatives handle at that price point. “We’d evaluate options” isn’t an answer. It means they haven’t built this before.
6. The technical person answering your questions won’t build your product.
This pattern shows up most visibly in mid-sized agencies with a sales layer. You have a detailed technical conversation with someone who clearly knows their stuff. You sign. Week two, you’re introduced to the “delivery team.” The person you were impressed by is now unavailable. Ask before you sign: “Who specifically will build this, and can I meet them?” If that’s not possible, you don’t know what you’re buying.
7. Their “prototype” is a Figma mockup or a slide deck.
There’s a meaningful difference between showing you what an AI feature could look like and showing you what it actually does. Polished UI mockups tell you about design skills. A rough working demo tells you about technical instincts. If you’re paying for AI development, you need the latter. Any team that’s built a relevant system before can demonstrate it; a team that hasn’t will show you slides.
8. They agree with everything in your spec.
A team that doesn’t push back on any part of your scope either hasn’t thought about it carefully or isn’t going to tell you the hard things. Real AI teams have opinions. “The retrieval approach you’re describing will hit latency issues at your query volume; here’s what we’d change” is a green flag. “Sounds great, we can build that” to every item is not. Agreement without qualification usually means problems surfacing later, on your dime.
Red Flags in Commercial Terms (9–10)
9. Purely hourly billing with no outcome gates.
Hourly billing in AI development means the vendor earns more when things go wrong. Scope creep, model iterations, debugging cycles: all of these generate hours. You want a commercial structure where the vendor has skin in the outcome. Milestone-based billing, pod-based retainers tied to deliverables, or fixed-bid with clearly defined done criteria all work better than open-ended hourly. Pure hourly is the structure that maximizes agency revenue on a struggling project. For what realistic pricing looks like across project sizes, this breakdown of actual AI development costs covers the numbers most vendors won’t give you upfront.
10. “Done” isn’t defined anywhere in the contract.
What does the deliverable look like? What performance threshold does it need to hit? How do you verify it? If none of this is in writing, you’re paying for effort, not results. Any serious AI agency should be willing to define success criteria before the build starts. If they’re resistant to this, ask why.
Red Flags in Their Track Record (11–12)
11. Every case study is confidential with nothing verifiable.
Some level of client anonymization is normal, even expected. But an AI agency with no live public demos, no named clients willing to take a reference call, no GitHub repos showing prior work, and no working tool you can try is an agency with no verifiable track record. Possible they’ve done great work under strict NDAs. Also possible they’re overstating what they’ve shipped. Without any verifiable evidence, you can’t distinguish the two.
12. They talk about AI’s potential, not your specific problem.
The clearest tell of all. “AI is transforming every industry” and “we help companies unlock the power of AI” are marketing sentences. “Given that you have 18 months of call recordings and a known compliance rubric, we’d build a pipeline that does X, and the hard part is Y” is an engineering sentence. One of these indicates the person has built something similar before. The other indicates they’ve read about it.
What a Clean Vendor Conversation Looks Like
For contrast, here’s what the positive signals look like.
They show you a working demo of something adjacent to your problem before you sign anything. Not a marketing video, not a recorded walkthrough, but a rough working system that demonstrates they’ve solved a nearby technical problem.
They push back on at least one thing in your spec. “Your timeline is aggressive given what we know about your data” or “that architecture will hit issues at your query volume; here’s what we’d do instead.”
When you ask what might fail, they have a specific list. Not “AI is uncertain” but “the latency on the retrieval step might not hit your threshold, here’s what we’d do if it doesn’t.”
They name real tools and models. You can Google those names, read the docs, verify they’re appropriate for your use case.
The One Question That Filters Most Bad Agencies
I’ve started asking this in every vendor conversation: “What percentage of your builds go live within the originally scoped timeline, and what caused the ones that didn’t?”
Good agencies can answer this. Not perfectly. Nobody has 100% on-time delivery. But they can name a specific project, explain what went wrong, and tell you what changed because of it.
Agencies that can’t answer this haven’t been tracking their failure rate. That usually means the clients who got burned moved on without making noise.
For the separate question of whether to hire AI engineers or use an agency at all, the math breaks down differently than most Series A founders expect.
FAQ
How do I verify that an AI agency has actually built what they claim?
Ask for a live demo, not a recorded video. Ask whether any clients have given reference permission (most agencies have at least one or two willing to take a call). Search their engineering leads on LinkedIn and GitHub for relevant prior work. Request a 30-minute technical call with the person who will build your specific project, not the person selling it. If none of these are possible, treat the track record as unverified.
What’s a reasonable discovery phase for an AI project?
A scoping sprint of 1-2 weeks that ends with a working prototype demonstrating the core technical hypothesis is reasonable. Anything that costs significant money without producing running code is worth questioning. McKinsey’s research on AI project success rates consistently shows that the biggest failure mode is inadequate upfront alignment on feasibility, not a lack of discovery budget. A prototype clarifies feasibility faster than a requirements document.
What should an AI agency contract include to protect me?
At minimum: a clear definition of done with measurable criteria, milestone-based payments tied to deliverables (not hours logged), IP ownership that vests when you pay, and a handoff clause requiring working documentation of the codebase when the engagement ends. Review the scope-change process carefully; that’s where runaway billing usually starts.
How do I evaluate an AI agency if I’m not technical?
Bring someone technical to at least one conversation: a CTO advisor, a technical angel, or a trusted engineer. Ask your network whether anyone has worked with the agency. Require a paid proof-of-concept sprint ($3K-8K, 1-2 weeks) that produces something working before committing to the full build. Any legitimate AI agency should be willing to do this. The ones that aren’t are the ones you shouldn’t hire.
What’s the difference between an AI agency and an AI product studio?
An agency typically takes a fixed scope, builds it, and hands it off. A product studio works like an embedded team, with the same engineers across multiple sprints, iterating toward product-market fit rather than executing a spec. Which is better depends on how defined your problem is. If you know exactly what you’re building and need execution, agency structure works. If you’re still discovering what the AI feature should actually do, a studio model handles iteration better.
Building an AI feature and want a direct conversation about whether we’d take it on, including what we’d push back on in your scope? Book a 30-minute call. We prototype before we propose.