A founder asked me last month: “How do I evaluate an AI agency without already being a CTO?”
She had three proposals on her desk. Two from US agencies at $80K each, one from an offshore team at $25K. The pitches sounded similar. The case studies sounded similar. The websites looked similar. She’d been told to compare on price, hours, and timeline. She had no idea which one would actually deliver.
I gave her seven questions. They came from running our agency, hiring at HackerRank, building infrastructure at Google, and sitting on the buying side myself, evaluating agencies for projects I didn’t have time to build. The questions take about 30 minutes per agency to ask. They disqualified two of her three options before she got to pricing.
She picked the third one. Six weeks later it shipped.
This post is that checklist. Run it on every AI agency you evaluate, including us. The point is to make a bad fit visible in week one, not month two.
Why Most Agency Evaluations Go Wrong
Founders default to evaluating AI agencies the way they evaluate web dev agencies. That worked fine in 2018. It does not work in 2026.
In a web dev project, the deliverable is mostly known. You can compare React vs Vue, point estimates vs hourly rates, US team vs offshore. The execution risk is bounded. Most competent teams will produce something usable.
AI projects don’t have that property. The same prompt to the same model on the same day produces very different products in different teams’ hands, because the surrounding architecture (eval pipeline, retry logic, structured output handling, prompt versioning, monitoring) determines whether the system is reliable or merely “works on the demo.” Two agencies can ship the “same” feature where one quietly fails on 10% of inputs and the other does not. The difference is invisible until production.
That is why the marketing-quality questions (“Show us your portfolio,” “What’s your team size,” “What’s your hourly rate”) fail. They cannot detect the difference. The questions in this post can. (And if you’re earlier in the process and still deciding between an AI agency vs AI product studio vs freelancers, that comparison is worth reading first.)
The Seven Questions
Question 1: Show Me Production Code from Your Last Build
Not a slide deck. Not a case study. Actual code. Either a public repo, a sanitized internal one, or a screen-share where they walk you through their last LLM application.
What you’re testing: whether they’ve shipped AI to production at all.
A good answer: they pull up a repo or screen-share within 24 hours. They show you prompt versioning, the eval pipeline, the retry/fallback logic for when the model returns something unparseable, and the monitoring setup. They use words like “we hit this issue in production” in past tense, with a specific failure they fixed.
A bad answer: they describe the project at a high level. They mention the libraries they used (LangChain, OpenAI, etc.) but can’t pull up the file where the prompt is defined. They show you the front-end demo. When you ask about evaluation, they say “the client tested it.”
In our experience, a meaningful share of AI agencies fail this question. They have a chatbot demo on their site, but they haven’t actually shipped a system that handles real production traffic. Move on.
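To make that concrete, here is the shape of the retry/fallback and structured-output handling you’re asking to see. This is a minimal sketch under assumed names (the `call_model` placeholder and the example schema stand in for whatever SDK and output format the team actually uses), not any agency’s production code:

```python
import json
import time

def call_model(prompt: str) -> str:
    # Placeholder: wire this to your provider's SDK (OpenAI, Anthropic, etc.).
    raise NotImplementedError

REQUIRED_KEYS = {"category", "confidence", "summary"}  # example output schema

def extract_structured(prompt: str, max_retries: int = 3) -> dict:
    """Ask for JSON, validate it, and retry with the parse error fed back in."""
    last_error = ""
    for attempt in range(max_retries):
        suffix = f"\n\nYour previous output was invalid: {last_error}" if last_error else ""
        raw = call_model(prompt + suffix)
        try:
            data = json.loads(raw)
            missing = REQUIRED_KEYS - data.keys()
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return data
        except (json.JSONDecodeError, ValueError) as exc:
            last_error = str(exc)
            time.sleep(2 ** attempt)  # back off before the next attempt
    # Fallback: a bad parse should degrade gracefully, not crash the request.
    return {"category": "unknown", "confidence": 0.0, "summary": "", "error": last_error}
```

An agency that has shipped to production will have their own version of this, with prompt versions and monitoring hooks wrapped around it. An agency that hasn’t will show you the front-end.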
Question 2: The Failed Project Test
“Walk me through your last failed project.”
This is the single most diagnostic question on the list. Anyone who has shipped AI to multiple clients has had a failure. The willingness to talk about it tells you everything.
What you’re testing: maturity, honesty, and learning loop.
A good answer: they describe a specific project, what went wrong, what they did to fix it, what they’d do differently. The story is uncomfortable to tell. They tell it anyway. Bonus signal: they refer to the client by industry rather than name (NDA discipline). Extra bonus: they describe a project they refunded or a contract they ended early.
A bad answer: “We’ve never had a failure” / “Every project is a learning experience” / “We had a difficult client once.” These are tells. Either the agency hasn’t shipped enough to have a failure, or they have but can’t acknowledge it. Both are reasons to walk away.
A real example from our side: in late 2024 we underestimated a RAG project for a fintech client by about 40%. The retrieval was working. The hallucination rate was not coming down. We pivoted to a hybrid retrieval setup with reranking, and we absorbed the additional engineering hours rather than billing the client. We tell that story because it’s true and because it’s what we wish other agencies would tell us when we’re evaluating partners. The fintech is still a client.
Question 3: The Eval Setup
“What does your eval setup look like for this project?”
If they don’t have a clear answer to this in the first call, they haven’t shipped reliable AI.
What you’re testing: technical depth and whether model output will actually meet your acceptance criteria.
A good answer: they describe a per-project eval set (real samples or synthetic), the metrics they’ll track (accuracy, latency, cost, hallucination rate, structured-output adherence), the tooling they use (LangSmith, Phoenix, Athina, custom eval scripts, LLM-as-judge with human verification), and the threshold below which they won’t ship. They might mention specific eval frameworks like the Evals library or Promptfoo. They probably show you a dashboard from a previous client.
A bad answer: “We test it with real users” / “We have an internal QA team” / “The model is GPT-4o, it just works.” These are not evaluation strategies. They’re excuses for not building one.
The eval question is the deepest technical filter on this list. Agencies that haven’t internalized AI evaluation will produce systems that work in demos and fail in production at exactly the moments the failure costs the most. There is no shortcut around this.
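For reference, a per-project eval harness doesn’t need to be elaborate to be real. Here’s a minimal sketch, assuming a JSONL eval set and a hypothetical `run_pipeline` wrapper around the system under test; real setups layer LangSmith, Promptfoo, or LLM-as-judge grading on top of the same idea:

```python
import json

SHIP_THRESHOLD = 0.95  # don't ship below this pass rate

def run_pipeline(input_text: str) -> str:
    # Placeholder: call your RAG / agent / extraction pipeline here.
    raise NotImplementedError

def run_evals(eval_set_path: str = "eval_set.jsonl") -> float:
    """Score the pipeline against a fixed set of {"input": ..., "expected": ...} cases."""
    with open(eval_set_path) as f:
        cases = [json.loads(line) for line in f]

    passed = 0
    for case in cases:
        output = run_pipeline(case["input"])
        # Exact substring grading for simplicity; swap in LLM-as-judge with
        # human spot checks for open-ended outputs, and track latency and cost per case.
        if case["expected"].strip().lower() in output.strip().lower():
            passed += 1

    pass_rate = passed / len(cases)
    print(f"{passed}/{len(cases)} cases passed ({pass_rate:.0%})")
    return pass_rate

if __name__ == "__main__":
    if run_evals() < SHIP_THRESHOLD:
        raise SystemExit("Below the ship threshold -- do not deploy.")
```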
Question 4: The Day-to-Day Engineer Assignment
“Which engineer will I work with day-to-day, and can I meet them on this call?”
The bait-and-switch is the most common AI agency failure pattern. The senior architect runs the discovery call. A junior engineer ships the project. The senior architect bills $200/hour for work the junior engineer does at $40/hour.
What you’re testing: whether the technical capacity sold to you is the capacity you actually receive.
A good answer: they tell you the named engineer assigned, their LinkedIn or GitHub, what they’ve shipped, and they’ll get them on the next call. The PM is named. The CTO or technical founder is part of the architecture review for the build, not just sales.
A bad answer: “We assign engineers based on availability when the project starts.” Or: “Our engineers are interchangeable, they all follow our process.” Both are red flags. AI engineering is not interchangeable. The decisions made in week one (model selection, retrieval architecture, eval design) are made by whoever sits in those meetings. If that person changes between sales and delivery, you’re buying a different product than the one demoed.
At Kalvium Labs, every project has a named lead engineer who is introduced at kickoff, joins every client call, and is supervised by me on architecture and by a Technical Project Manager on delivery. If we ever propose otherwise, that’s the moment to push back.
Question 5: Scope Creep Handling
“How do you handle scope creep mid-build?”
Every AI build has scope creep. The provider launches a better version of the model you started with. The client’s requirements shift after seeing the prototype. The retrieval pattern that worked at 100 documents fails at 10,000. How an agency handles those moments determines whether you finish on time and on budget.
What you’re testing: whether they have a process for the inevitable scope conversation.
A good answer: they describe a specific framework. Written scopes, change orders, sprint boundaries, and a defined line for what triggers a re-quote. They have a template. They’ve done this before.
A bad answer: “We’re flexible. We just absorb small changes.” Or: “We bill T&M after the original scope, so creep is fine.” Both fail you. The first means they won’t push back when scope expands and they’ll quietly miss deadlines. The second means you’ve signed a blank check.
Question 6: Post-Launch Operating Cost
“What’s your operating cost on a typical client product after launch?”
Build cost is one number. Operating cost (LLM API bills, vector database, observability tooling, infrastructure) is the line item most founders forget to ask about.
What you’re testing: whether the agency has built systems that actually run in production, and whether they understand unit economics.
A good answer: they quote ranges from real clients. Something like: “A chatbot at 1,000 daily users runs $400 to $500 per month on GPT-4o, or $25 per month on GPT-4o-mini. A multi-step agent product is more expensive, typically $1,500 to $4,000 per month at the scale our SaaS clients run.” They mention specific cost-reduction patterns (model routing, embedding caching, prompt compression, batch processing). (For context, those chatbot numbers are roughly what you’d expect from token math at ~5 calls per user per day, ~700 tokens per call, on the published GPT-4o and GPT-4o-mini pricing. Real bills can be 2-3x higher if no caching or model routing is in place, which is exactly the failure mode the founder story below describes.)
A bad answer: “It depends on usage.” Yes, it depends on usage. That’s not the question. The question is what the cost looks like at your usage level, and whether they’ve shipped systems where it stayed predictable.
A founder I talked to recently was running roughly $4,000/month on OpenAI for a product with a few hundred weekly active users. The previous agency had built it on GPT-4 with no caching, no model routing, and no eval-driven downsizing of prompts. Our intake review showed the same workload could run at roughly an order of magnitude lower cost with three changes: model routing to GPT-4o-mini for the routine requests (which were ~80% of traffic), embedding caching on the retrieval layer, and prompt compression on the high-volume endpoints. The build-time decision (use the smartest model everywhere) had quietly created an operating cost the founder hadn’t budgeted for. This is what you’re buying when you skip this question.
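If you want to sanity-check those chatbot numbers yourself, the token math fits in a few lines. The traffic assumptions and per-token prices below are the ones referenced above; check current published pricing before you quote anyone, since it changes:

```python
# Assumptions: 1,000 daily users, ~5 calls per user per day,
# ~500 input + ~200 output tokens per call (~700 total).
CALLS_PER_MONTH = 1_000 * 5 * 30        # 150,000 calls
INPUT_TOKENS = CALLS_PER_MONTH * 500    # 75M input tokens
OUTPUT_TOKENS = CALLS_PER_MONTH * 200   # 30M output tokens

def monthly_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    return (INPUT_TOKENS / 1e6) * input_price_per_m + (OUTPUT_TOKENS / 1e6) * output_price_per_m

gpt4o = monthly_cost(2.50, 10.00)       # GPT-4o:      ~$488/month
mini = monthly_cost(0.15, 0.60)         # GPT-4o-mini: ~$29/month

# Model routing: send ~80% of routine traffic to the small model,
# keep the large model for the ~20% of requests that need it.
routed = 0.8 * mini + 0.2 * gpt4o       # ~$121/month, before caching wins

print(f"GPT-4o everywhere: ${gpt4o:,.0f}/mo | mini everywhere: ${mini:,.0f}/mo | routed: ${routed:,.0f}/mo")
```

The same arithmetic, plus embedding caching and prompt compression, is how a $4,000/month bill becomes a few hundred dollars.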
We wrote a detailed breakdown of how AI operating costs accumulate over time if you want the full picture.
Question 7: The 2-Week Ship Test
“If I paid you to ship something to production in 2 weeks, could you?”
This is the speed test. The right answer isn’t always “yes,” but how they answer reveals how the agency thinks about scope versus ambition.
What you’re testing: whether they have an opinion about scoping for time, and whether they’ll push back when the timeline is impossible.
A good answer: “Depends on what you mean by production. We can ship a prototype that handles real users with eval coverage and a 95% success rate in 14 days. We can’t ship a product with full CRM integration, SSO, audit logs, and a model fine-tuning pipeline in 14 days. Tell me which version of ‘production’ you mean and I’ll tell you what we can do.”
A bad answer: “Sure, we can do anything in 2 weeks if you pay enough.” This means they’ll say yes to any scope at any timeline and quietly miss it. Or: “No, AI projects need at least 8 weeks.” This means they don’t know how to scope down.
The right opinion on speed exists. Most agencies don’t have one.
Summary Checklist
Use this table to score each agency. Three or more answers in the “Weak signal” column is a disqualifying pattern.
| # | Question | Strong signal | Weak signal |
|---|---|---|---|
| 1 | Show me production code | Repo/screen-share within 24h | “Let me put together a deck” |
| 2 | Last failed project | Specific story, specific fix | “Every project is a learning experience” |
| 3 | Eval setup | Named metrics, tooling, thresholds | “The client tested it” |
| 4 | Day-to-day engineer | Named person, meetable this week | “We assign based on availability” |
| 5 | Scope creep | Written scope, change-order template | “We absorb small changes” |
| 6 | Operating cost | Real ranges from real clients | “It depends on usage” |
| 7 | 2-week ship | Opinionated scope-down | “We can do anything if you pay enough” |
Applying This to Kalvium Labs
Run all seven on us. Specifically:
- Production code: I’ll send a sanitized repo or set up a screen-share within 24 hours of your first call. Pick the build you want to dig into (call analyzer, advanced RAG, assessment platform, or data analyst) and we’ll walk through the actual architecture.
- Last failed project: I told the RAG fintech story above. There are more. Ask.
- Eval setup: every project starts with an eval set before any prompt is written. We use a mix of LangSmith, Phoenix, and custom eval tooling depending on the surface area. We’ll show you the dashboards.
- Day-to-day engineer: named at kickoff, on every client call, supervised by me on architecture and by a TPM on delivery.
- Scope creep: every sprint has a written scope, every scope change is a tracked decision. We have a template for this.
- Operating cost: typical client product runs $300 to $2,500/month at the usage levels our SaaS clients see. We’ll quote against your usage estimates in the first proposal.
- 2-week ship: we run 72-hour prototype sprints for new projects. A working prototype with real input handling and 90%+ eval pass rate in 72 hours, then iterate from there.
We’d rather lose your deal in week one than two months in. This checklist is built to make that call fast.
FAQ
What should I ask an AI development agency before signing?
The seven questions above cover it. The three that matter most: “Show me production code from your last build,” “Walk me through your last failed project,” and “What does your eval setup look like?” Any agency that can’t answer all three with specifics hasn’t shipped reliable AI at scale. You’ll find out in one 30-minute call.
How much does it cost to hire an AI development agency?
For US and European agencies, expect $50K to $150K for a three-to-four-month engagement. Offshore agencies typically quote $15K to $40K for the same scope. But build cost is only half the picture. Operating costs (LLM API, vector database, infrastructure) run $300 to $4,000/month depending on traffic and model selection. A good agency will give you both numbers upfront. If they can only quote build cost, that’s a gap in their experience with production systems.
What’s the difference between an AI agency and an AI product studio?
An agency typically charges hourly or per-project and works across many client types. An AI product studio (what we are at Kalvium Labs) is opinionated about the stack, owns the architecture decisions, and often works across fewer, more intensive engagements. In practice, the difference shows up in the eval setup: agencies tend to defer evaluation to clients; product studios build evaluation into the process because they’re accountable for production performance, not just code delivery.
How do I know if an AI agency has real production experience?
Ask question 3 and question 6 from the checklist. Agencies with production experience will give you specific numbers: hallucination rates, latency p99, operating cost per 1,000 requests, token counts per call. Agencies without it will describe what they built but can’t tell you how it performed. The Evals library and Promptfoo are specific tools to ask about. If they don’t know either, that’s a signal.
How long does it typically take to build an AI product?
A well-scoped AI prototype with real user inputs, eval coverage, and a 90%+ success rate typically takes one to two weeks. A production-ready product with monitoring, fallback handling, and integration to your existing systems typically takes six to twelve weeks depending on scope. The 8-to-12-week range is common for mid-complexity builds. What extends timelines isn’t model performance but integration complexity: connecting to your data, your auth, your existing APIs. That’s where AI projects tend to slip.
Want a 30-minute call to run this checklist on our work specifically? Book one here and we’ll walk through production code, our last failure, our eval setup, and the operating cost ranges for your use case. No proposal, no SOW, no signature required.