In sprint 1, the RAG system we built for a document intelligence client cost roughly $140 per month to run. Modest corpus, 500 daily queries, GPT-4o-mini on retrieval, pgvector on a $50 VPS. The client was happy. The demo was solid.
By sprint 3, the same system cost $890 per month. User load had grown 2.6x. But the bill had grown 6.4x.
The extra $750 wasn’t one line item. It was five of them, each invisible in the sprint 1 scope. None of them showed up in the SOW. Most of them only become visible when you’re in month three and trying to explain to a founder why the invoice doesn’t match the original estimate.
This post is about those five line items.
It’s not a tutorial on how to build RAG. If you want that, our earlier post on RAG in production covers architecture, chunking strategies, and why we switched away from Pinecone. This post is specifically for founders who already have a RAG system running and want to understand why the operating cost is growing faster than the usage. Or for founders evaluating a RAG proposal and trying to stress-test the cost model before they sign.
What the Sprint 1 Cost Model Misses
A standard RAG system has four infrastructure components: an embedding model, a vector store, an LLM for generation, and application compute (API server, orchestration, etc.). Most sprint 1 proposals estimate each of these honestly, at sprint-1 scale.
The problem is that two of those components have non-linear cost curves. And one has a hidden labor cost that doesn’t show up in infra at all. Sprint 1 estimates are almost always written at a scale that sits in the flat part of each curve. By sprint 3, you’ve usually crossed at least one inflection point.
I’ve reviewed sprint 1 estimates from five different vendors over the last year. Four of them shared the same structural gap: the cost model was accurate for the demo load and wrong by sprint 3 for the reasons below.
Here’s what that looks like in practice.
Surprise 1: Embedding Re-Indexing Is Charged Per Token, Every Time
When you build a RAG system, you embed your corpus once and store the vectors. That’s the setup cost. After that, embeddings feel “free” because you’re just querying the index, not re-embedding.
But you will re-index. Almost every production RAG system goes through at least one full re-index between sprint 1 and sprint 3.
The most common triggers: chunking strategy changes (the 512-token chunks you started with turn out to be too coarse for your use case), embedding model upgrades (OpenAI releases a better or cheaper model, or you switch from text-embedding-3-small to text-embedding-3-large for better recall), and corpus expansions that require restructuring the metadata schema.
The math is straightforward and easy to forget: OpenAI’s text-embedding-3-small costs $0.02 per million tokens. At 500K documents, 400 words each, you’re at 200 million words, which is at least 200 million tokens per full re-index, since token counts run higher than word counts. That’s $4 or so per run. Sounds fine.
Now add metadata enrichment (you generate summaries or extract entities to store alongside the chunk), switch to text-embedding-3-large ($0.13/million tokens), and do this twice in a three-month period. You’re at $50-80 in embedding fees alone, from two events that weren’t in the original estimate.
If your corpus is larger (5M docs, which is not unusual for enterprise document intelligence), these numbers scale proportionally. We’ve seen sprint-to-sprint re-indexing costs hit $400-600 for clients in that range. I’d add a re-indexing reserve line item to any RAG budget that involves a document corpus likely to grow or evolve in schema.
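The per-run arithmetic above is easy to keep around as a small helper. This is a minimal sketch, not a billing tool: the function name `reindex_cost_usd` and its parameters are ours for illustration, the prices are the ones quoted above, and it treats each 400-word document as 400 tokens, which understates real token counts.

```python
def reindex_cost_usd(num_docs, tokens_per_doc, price_per_million_tokens, runs=1):
    """Embedding API cost for `runs` full re-indexes of a corpus."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens * runs


# Prices quoted in this post, in USD per million tokens.
SMALL = 0.02  # text-embedding-3-small
LARGE = 0.13  # text-embedding-3-large

# Assumption: 400 tokens per document (the post's round figure).
one_run_small = reindex_cost_usd(500_000, 400, SMALL)
two_runs_large = reindex_cost_usd(500_000, 400, LARGE, runs=2)
two_runs_large_5m = reindex_cost_usd(5_000_000, 400, LARGE, runs=2)

print(f"500K docs, small model, 1 run:  ${one_run_small:.2f}")
print(f"500K docs, large model, 2 runs: ${two_runs_large:.2f}")
print(f"5M docs, large model, 2 runs:   ${two_runs_large_5m:.2f}")
```

This covers embedding fees only. Any LLM calls used for metadata enrichment (summaries, entity extraction) would be billed on top and are not modeled here.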
What to ask your vendor: “What triggers a full re-index, and what does that cost at 2x our current corpus size?”
Surprise 2: k-Value Creep Multiplies Per-Query Token Spend
When you configure a RAG retriever, k is the number of chunks you retrieve per query. You pass those k chunks as context to the LLM for generation.
Sprint 1 teams often start at k=3. Three chunks, each 400-500 tokens, gives you 1,200-1,500 tokens of retrieved context per query, which keeps the prompt small.
By sprint 3, the team has bumped k to 6 or 8. Why? Because retrieval precision wasn’t great. Some queries were missing the relevant chunk. Bumping k improved recall. It feels like a configuration tweak, not a cost decision.
At k=8 with 500-token chunks, your context contribution is 4,000 tokens per query (before the system prompt and the question). With GPT-4o at $2.50 per million input tokens, that’s $0.01 per query. At 500 queries/day, that’s $5/day or ~$150/month, just on the context component. At k=3, the same math gives 1,500 tokens per query, or about $56/month.
That’s a 2.7x increase in per-query context cost from a config change.
In practice it’s not quite that clean, because your query and response tokens also scale, and you probably mixed in k=5 for a while. But the underlying dynamic is real: every increment in k is a linear multiplier on your input token bill. Nobody adds it to the sprint estimate because it feels like a tuning parameter, not a cost lever. When I review RAG systems for clients, k-value is the first config I check when the LLM bill looks higher than expected.
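To see k as a cost lever rather than a tuning knob, it helps to compute the context component directly. The sketch below is illustrative only: `monthly_context_cost_usd` is a name we made up, it assumes a 30-day month and fixed-size chunks, and it ignores the system prompt, the question, and output tokens.

```python
def monthly_context_cost_usd(k, chunk_tokens, queries_per_day,
                             price_per_million_input, days=30):
    """Monthly input-token cost of the retrieved context alone."""
    context_tokens = k * chunk_tokens
    per_query = context_tokens / 1_000_000 * price_per_million_input
    return per_query * queries_per_day * days


GPT_4O_INPUT = 2.50  # USD per million input tokens, as quoted in this post

cost_k3 = monthly_context_cost_usd(3, 500, 500, GPT_4O_INPUT)
cost_k8 = monthly_context_cost_usd(8, 500, 500, GPT_4O_INPUT)

print(f"k=3: ${cost_k3:.2f}/month")
print(f"k=8: ${cost_k8:.2f}/month")
print(f"ratio: {cost_k8 / cost_k3:.2f}x")
```

Because the cost is linear in k, the ratio between two settings is just the ratio of the k values. Running this for every k your team has tried is a quick way to put a dollar figure on each tuning step.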
What to ask your vendor: “What’s our current k-value, and what’s the per-query cost difference between k=3 and k=8?”
Surprise 3: Vector DB Pricing Hits Non-Linear Tiers
This one surprised us the first time we ran into it, because the pricing pages look linear until they’re not.
Most managed vector databases (Pinecone, Weaviate Cloud, Qdrant Cloud) have pricing that scales with the number of vectors stored plus query volume. The per-vector cost looks stable in the low tier.
What doesn’t look stable: the jump between starter and production tiers. Pinecone’s serverless pricing, for example, is cheap at low vector counts, but the cost structure changes meaningfully as you scale. Weaviate’s SaaS tiers have similar inflection points. The exact numbers shift as these companies update pricing, but the pattern doesn’t: there’s a meaningful cost jump when you cross the tier boundary, and most sprint 1 estimates are written inside the cheap tier.
If you start at 500K vectors (sprint 1) and grow to 8M vectors (sprint 3), you’ve likely crossed a pricing tier. The cost increase isn’t proportional to the vector count increase.
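A step function makes the tier effect concrete. The tiers in this sketch are entirely hypothetical and do not come from any vendor’s pricing page; swap in the real boundaries and prices from your own provider before drawing conclusions.

```python
# HYPOTHETICAL tiers for illustration only: (max vectors in tier, monthly USD).
# These are NOT real prices from Pinecone, Weaviate, Qdrant, or anyone else.
TIERS = [
    (1_000_000, 25.0),
    (5_000_000, 150.0),
    (20_000_000, 400.0),
]


def monthly_tier_cost(num_vectors, tiers=TIERS):
    """Return the flat monthly price of the first tier that fits the vector count."""
    for max_vectors, monthly_usd in tiers:
        if num_vectors <= max_vectors:
            return monthly_usd
    raise ValueError("vector count exceeds the largest tier")


def growth_factors(vectors_before, vectors_after, tiers=TIERS):
    """Compare growth in vector count with growth in monthly cost."""
    vector_growth = vectors_after / vectors_before
    cost_growth = (monthly_tier_cost(vectors_after, tiers)
                   / monthly_tier_cost(vectors_before, tiers))
    return vector_growth, cost_growth


# Crossing a boundary: about 1.2x more vectors, 6x the bill (under these made-up tiers).
vector_growth, cost_growth = growth_factors(900_000, 1_100_000)
print(f"vectors: {vector_growth:.2f}x, cost: {cost_growth:.2f}x")
```

The useful output is not the absolute price but where the boundaries fall relative to your projected vector count.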
The alternative is self-hosting (pgvector on Postgres, or a self-hosted Qdrant instance). We’ve used pgvector extensively for clients where the vector count is under 20M and query latency requirements are under 200ms. At that scale, a $50-100/month VPS with pgvector is meaningfully cheaper than managed options, and the operational complexity is lower than it used to be with modern pgvector HNSW indexing. But “switch to pgvector” is an architectural decision, not a configuration change. If your sprint 1 design was built around Pinecone’s metadata filtering, that migration isn’t free.
What to ask your vendor: “At 5x our current vector count, does the vector DB pricing tier change? What’s the next tier’s monthly cost?”
Surprise 4: Eval Pipeline Maintenance Is a Labor Cost
This one doesn’t show up on the infra bill at all. It shows up in billable hours.
In sprint 1, evaluation is usually informal: the team manually checks a sample of queries, the client reviews a demo, everyone agrees it looks good. This is fine for sprint 1.
By sprint 3, the system has been in production for two months. There are edge cases. A category of queries the retrieval logic handles poorly. A document format that produces bad chunks. To catch regressions before deployment, you need a test set. Which means someone has to build and maintain it.
A minimal eval pipeline for RAG has three components: a set of test queries with expected outputs (or at least expected source documents), a scoring function (RAGAS, custom scoring, or human eval), and a CI check that blocks deploys when scores drop below threshold. RAGAS automates a lot of this, but it still requires: initial test set curation (engineering time), threshold calibration (engineering time), and ongoing updates as the corpus changes (ongoing engineering time).
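The three components can be sketched in a few lines. This is a simplified stand-in, not RAGAS: the scoring function here is a plain retrieval hit rate, the test set and document IDs are hypothetical, and `fake_retrieve` is a placeholder for whatever retriever your system actually exposes.

```python
# HYPOTHETICAL test set: each query lists the source documents a good retrieval should surface.
TEST_SET = [
    {"query": "notice period for termination", "expected_sources": {"contract_07"}},
    {"query": "late payment penalty", "expected_sources": {"contract_02", "policy_11"}},
    {"query": "data retention window", "expected_sources": {"policy_03"}},
]


def fake_retrieve(query, k):
    """Placeholder retriever returning canned, ranked document IDs."""
    canned = {
        "notice period for termination": ["contract_07", "contract_01", "policy_03"],
        "late payment penalty": ["policy_11", "contract_09", "contract_02"],
        "data retention window": ["contract_05", "policy_08", "policy_03"],
    }
    return canned[query][:k]


def hit_rate(test_set, retrieve, k):
    """Fraction of queries where at least one expected source is in the top k."""
    hits = sum(
        1 for case in test_set
        if set(retrieve(case["query"], k)) & case["expected_sources"]
    )
    return hits / len(test_set)


def deploy_gate(score, threshold):
    """Exit code for CI: 0 lets the deploy proceed, 1 blocks it."""
    return 0 if score >= threshold else 1


score_k3 = hit_rate(TEST_SET, fake_retrieve, k=3)
score_k2 = hit_rate(TEST_SET, fake_retrieve, k=2)
exit_code = deploy_gate(score_k2, threshold=0.9)  # a CI script would pass this to sys.exit
print(f"k=3 hit rate: {score_k3:.2f}, k=2 hit rate: {score_k2:.2f}, exit code: {exit_code}")
```

The code itself is small. The recurring hours go into curating `TEST_SET`, calibrating the threshold, and keeping both current as the corpus changes.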
If this is done by your vendor, it’s billable. If it’s done by your internal team, it’s not on the invoice, but it’s a real cost. We’ve seen this run 4-8 hours per month for modest RAG systems. That’s not nothing, and it wasn’t in the sprint 1 scope.
What to ask your vendor: “Who maintains the eval pipeline after sprint 2, and what does ongoing maintenance look like?”
Surprise 5: Context Stuffing Cancels the Savings from Model Downgrades
This is the most counterintuitive one.
By sprint 2, the team has realized that GPT-4o is expensive. The natural response: switch to GPT-4o-mini. At $0.15/M input tokens vs $2.50/M, it looks like a 16x cost reduction.
The problem: GPT-4o-mini’s reasoning is weaker. For knowledge-intensive RAG queries (the kind where the answer requires synthesizing multiple chunks), it produces lower-quality answers than GPT-4o. The team’s response: increase the context window. Send more chunks. Add more explicit retrieval instructions in the system prompt. Maybe add a chain-of-thought prompt to compensate.
Each of these changes increases token count. By the time you’ve tuned GPT-4o-mini to produce outputs that are “good enough,” you’re often sending 2-3x more tokens than you were with GPT-4o. The per-token cost is 16x lower, but the token count is 3x higher. Net savings: roughly 5x. Better than nothing, but not the 16x the pricing page implied.
We’ve run this calculation for three separate clients. The actual savings from switching to a mini model, after accounting for context and prompt engineering overhead, came in at 3-6x, not 16x. The 3-6x is still worth doing if cost is the constraint. Just don’t plan around 16x.
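The net effect is one division. A rough sketch, assuming input tokens dominate the bill and using the prices quoted above; `net_savings_factor` and the token multipliers are illustrative, not measurements.

```python
def net_savings_factor(old_price_per_m, new_price_per_m, token_multiplier):
    """How many times cheaper the new setup is after accounting for extra tokens.

    token_multiplier is how many times more input tokens the cheaper model
    needs to reach acceptable output quality (1.0 means no change).
    """
    price_ratio = old_price_per_m / new_price_per_m
    return price_ratio / token_multiplier


GPT_4O = 2.50       # USD per million input tokens, as quoted in this post
GPT_4O_MINI = 0.15  # USD per million input tokens, as quoted in this post

headline = net_savings_factor(GPT_4O, GPT_4O_MINI, token_multiplier=1.0)
realistic = net_savings_factor(GPT_4O, GPT_4O_MINI, token_multiplier=3.0)

print(f"pricing-page savings: {headline:.1f}x")
print(f"with 3x more tokens:  {realistic:.1f}x")
```

Measuring the token multiplier on your own traffic, before and after the switch, is what turns this from a guess into a number you can plan around.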
What to ask your vendor: “If we switch to a smaller model, what’s the expected impact on context requirements and output quality, and what does the net cost change actually look like?”
The Full Sprint 1 to Sprint 3 Cost Breakdown
To put numbers on all of this: for the document intelligence client mentioned at the top, here’s where the cost actually went between sprint 1 and sprint 3.
| Line item | Sprint 1 (mo) | Sprint 3 (mo) | Change |
|---|---|---|---|
| Embedding queries (production) | $8 | $21 | 2.6x usage growth |
| Re-indexing (events during period) | $0 | $94 | Two full re-indexes (chunking + model upgrade) |
| Vector DB (Qdrant Cloud, managed) | $25 | $148 | Tier jump at 4M vectors |
| LLM generation (GPT-4o → mini with context increase) | $42 | $67 | k bumped 3→7, mini model, net 1.6x despite model downgrade |
| Eval pipeline (internal hours, not invoiced) | 0 hrs billed | ~6 hrs/mo | Not on bill, real cost |
| Application compute | $65 | $160 | 2.5x usage growth, added reranker |
| Total (recurring infra, excluding re-indexing) | $140 | $396 | 2.8x |
| Total including re-indexing events | $140 | $490 | 3.5x with events |
User load grew 2.6x. Recurring infra cost grew 2.8x. With re-indexing events counted: 3.5x. The gap between load growth and cost growth is those five surprises, not usage.
The client had budgeted for “roughly linear” cost growth with usage, which would have put the bill near $365/month (2.6 × $140). The actual $490 left a shortfall of about $125/month, which created a mid-sprint conversation about scope. Not catastrophic, but avoidable.
What a More Honest Sprint 1 Estimate Looks Like
When we scope RAG engagements now, the estimate has two cost columns: initial monthly cost at sprint 1 scale, and a 6-month cost model that includes re-indexing assumptions, a k-value range, the expected vector count growth trajectory, and a note on eval maintenance ownership.
It makes the proposal longer and less clean. But it’s more useful for a founder who’s trying to build a 12-month cost model, not just get through the demo.
If you’re evaluating a RAG proposal and it doesn’t include at least a ballpark on re-indexing frequency and vector DB tier assumptions, those are worth asking for explicitly. The vendor might not know. But asking forces them to think about it, which is better than finding out in sprint 3.
For what it’s worth: our Advanced RAG build story has more on the architecture choices. And if you’re still deciding between building RAG vs fine-tuning vs prompt engineering for a specific use case, that’s a different decision framework that starts with your data size and query distribution.
FAQ
How much does RAG cost to run in production per month?
Sprint 1 systems with modest usage (under 1,000 daily queries, under 1M vectors) typically run $100-300/month in infra. That number grows non-linearly as you scale: vector count, retrieval k-value, and re-indexing events each add cost independently of query volume. A system at 10M vectors with 5,000 daily queries and occasional re-indexes runs $600-1,500/month, depending on the vector DB choice and LLM tier. The wide range is because these variables interact.
Is it cheaper to build RAG with pgvector vs a managed vector database like Pinecone or Qdrant Cloud?
For corpora under 20M vectors and query latency requirements under 200ms, pgvector on a self-hosted Postgres instance is meaningfully cheaper than managed options. A $100/month VPS handles most small-to-medium RAG workloads. The tradeoff is operational complexity: you manage index maintenance, backups, and scaling yourself. Managed vector DBs are easier to operate but have tier-jump pricing that becomes significant at scale. For most startups below 10M vectors, pgvector is worth evaluating seriously.
How often do production RAG systems need to be re-indexed?
Most production systems go through 1-3 full re-indexes in the first six months: once for chunking strategy refinement, often once for an embedding model upgrade, and sometimes once for a metadata schema change. After that, partial re-indexing (new documents only) is the norm unless you change the embedding model again. Budget for 2-3 re-indexing events in your first year, at the embedding API cost for your full corpus size.
What is k-value in RAG, and how does it affect cost?
k is the number of document chunks returned by the retrieval step and passed as context to the LLM. Higher k improves recall (you’re less likely to miss the relevant chunk) but increases the input token count proportionally. At k=3 with 500-token chunks, you’re adding roughly 1,500 tokens of context per query. At k=8, you’re adding 4,000 tokens. The LLM input cost scales with k, so a change from k=3 to k=8 roughly doubles your per-query LLM cost (context tokens are the largest single input token source in a typical RAG query). Most teams don’t realize this is a cost decision when they make the change.
When should I switch from a managed vector database to self-hosted?
The main signal is hitting a pricing tier jump on your managed DB. If your vector count is crossing 5-10M and your managed DB bill is growing faster than your usage, the migration to self-hosted pgvector or Qdrant usually pays back within 2-3 months. The migration itself takes 2-3 engineer days for a straightforward system, more if you rely on managed-DB-specific features like multi-tenancy or filtered search. Run the math at your current corpus growth rate to find the crossover point.
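Finding the crossover point is a short calculation once you have your own numbers. Every input in this sketch is hypothetical, including the monthly bills and the engineer day rate, and `payback_months` is a name we made up. It also ignores the ongoing operational time that self-hosting adds.

```python
def payback_months(managed_monthly_usd, self_hosted_monthly_usd,
                   migration_days, engineer_day_rate_usd):
    """Months until migration cost is recovered, or None if self-hosting saves nothing."""
    monthly_savings = managed_monthly_usd - self_hosted_monthly_usd
    if monthly_savings <= 0:
        return None
    migration_cost = migration_days * engineer_day_rate_usd
    return migration_cost / monthly_savings


# HYPOTHETICAL inputs: replace with your actual bills and rates.
months = payback_months(
    managed_monthly_usd=900.0,
    self_hosted_monthly_usd=100.0,
    migration_days=3,
    engineer_day_rate_usd=800.0,
)
print(f"payback: {months:.1f} months")
```

If the managed bill is only slightly above the self-hosted cost, the payback period stretches out quickly, which is why the tier jump is the signal to watch.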
Running a RAG system in production and trying to figure out where the cost went? We run these cost audits as part of our sprint reviews. Book a 30-minute call and we’ll look at your current setup with you.