
Fine-Tuning vs RAG vs Prompt Engineering: When to Use What

When to use fine-tuning vs RAG vs prompt engineering in production. Decision framework, cost data, and real examples from 11 AI projects.

Anil Gulecha
Ex-HackerRank, Ex-Google
TL;DR
  • Start with prompt engineering. Add RAG when you need domain knowledge. Fine-tune only when you've proven the use case and need to optimize cost at volume.
  • Fine-tuning teaches a model how to respond, not what to know. Using it to inject domain knowledge is the most common and most expensive mistake.
  • RAG reduced our compliance system's prompt from 12,000 tokens to 3,000 and improved accuracy, because the model got richer context per rule instead of thin summaries of 40 rules.
  • Most production systems combine approaches. Our compliance pipeline uses all three: prompt routing, RAG for context retrieval, and a fine-tuned classifier for high-volume pass/fail decisions.

Three months ago, a client asked us to fine-tune GPT-4o on their sales transcripts. We told them not to.

Their data was good. 4,200 labeled call transcripts with compliance scores. They had the budget. The use case seemed like a textbook fit. But when we dug into what they actually needed, RAG with a well-structured prompt pipeline solved the problem at 1/10th the cost and shipped in two weeks instead of eight.

Fine-tuning vs RAG vs prompt engineering isn’t a quality ranking. It’s a trade-off space. Each approach solves a different failure mode, and picking wrong costs you weeks or months. Across 11 AI software development projects over the past year, we’ve used all three, sometimes in the same system. Here’s how we decide.

The Decision in 30 Seconds

| Approach | Best When | Cost to Build | Cost to Run | Time to Production |
|---|---|---|---|---|
| Prompt engineering | Starting out, fewer than 100 reference examples, general tasks | Low | High (long prompts) | Days |
| RAG | Knowledge changes, large corpus, need citations | Medium | Medium | 1-3 weeks |
| Fine-tuning | Consistent style/format, 500+ training examples, high volume | High | Low (shorter prompts) | 4-8 weeks |

If you’re a startup trying to validate an idea, start with prompt engineering. If you need accuracy on your own data, add RAG. Fine-tune only when you’ve proven the use case works and need to optimize cost or consistency at scale.

Prompt Engineering: Fast to Build, Fragile at Scale

Prompt engineering is the zero-infrastructure approach. No training data. No vector database. No fine-tuning jobs. You write instructions, add examples to the prompt, and call the API.

For a compliance scoring system, a prompt-only approach looks like this:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are a sales call compliance evaluator.

Score the following transcript against these rules:
1. Agent must state their name and company within the first 30 seconds
2. Agent must not make guarantees about returns or performance
3. Agent must read the risk disclosure before the call ends
4. Agent must ask for verbal consent before processing

For each rule, output:
- rule_id: (1-4)
- passed: (true/false)
- evidence: (exact quote from transcript, or "not found")
- confidence: (0.0-1.0)

Output as a JSON object with a "results" array."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Transcript:\n{transcript}"}
    ],
    response_format={"type": "json_object"}
)

This gets you to 80% accuracy on day one. No training data, no infrastructure. For a prototype, that’s enough.

Where It Breaks

Prompt engineering hits three walls in production:

Wall 1: Prompt length compounds cost. Every example, every rule, every edge case goes into the prompt. A compliance system with 40 rules and 10 few-shot examples burns 8,000-12,000 tokens per call just in the system prompt. At GPT-4o’s $2.50/million input tokens, that’s $0.025 per call. At 5,000 calls/day, you’re spending $125/day before the model even reads the transcript.
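That arithmetic is worth making explicit. A quick sketch, using the rates quoted above (not current pricing):

```python
def daily_prompt_cost(prompt_tokens: int, rate_per_million: float,
                      calls_per_day: int) -> float:
    """Cost of the static system prompt alone, before any transcript tokens."""
    per_call = prompt_tokens / 1_000_000 * rate_per_million
    return per_call * calls_per_day

# 10K-token system prompt at GPT-4o's $2.50/1M input tokens, 5,000 calls/day
cost = daily_prompt_cost(10_000, 2.50, 5_000)
print(f"${cost:.2f}/day")  # → $125.00/day
```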

Wall 2: Models drop instructions past a threshold. We measured this on our compliance system. Rules 1 through 10 were followed 94% of the time. Rules 11 through 20 dropped to 82%. Rules 21 through 35 hit 61%. Adding more instructions to a prompt can make the system less reliable, not more. We covered this failure mode in detail in our prompt architecture post.

Wall 3: Consistency degrades on subjective judgments. Temperature 0 doesn’t mean deterministic. The same transcript scored through the same prompt twice will occasionally produce different results. On judgments like “was the agent’s tone professional,” variance across runs hits 15-20%. For a prototype demo, nobody notices. For a production system processing 5,000 calls daily, that variance surfaces in QA reports.
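To quantify that variance, replay the same input N times and measure how often runs disagree with the majority verdict. A minimal sketch; the `runs` data is illustrative, not from our QA reports:

```python
from collections import Counter

def disagreement_rate(runs: list[bool]) -> float:
    """Fraction of runs that disagree with the majority verdict."""
    majority, count = Counter(runs).most_common(1)[0]
    return (len(runs) - count) / len(runs)

# Ten replays of the same transcript through the same prompt at temperature 0
runs = [True, True, False, True, True, True, False, True, True, True]
print(disagreement_rate(runs))  # → 0.2
```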

When to Stay Here

Prompt engineering alone is the right call when you have fewer than 20 rules or instructions, you’re validating whether the use case works at all, volume is under 500 requests/day, and your reference data fits in 5 few-shot examples.

RAG: When Your Data Is the Product

Retrieval-Augmented Generation solves a specific problem: the model needs information it wasn’t trained on. Your company’s internal docs. Your product catalog. Your compliance policies. Your customer’s historical data.

The architecture:

User query → Embed query → Search vector DB → Retrieve top-k chunks
    → Inject into prompt → Generate answer

For the compliance system, RAG changed the economics and the accuracy simultaneously. Instead of cramming 40 rules into the system prompt, we embedded all compliance rules and their interpretive guidance into Qdrant, retrieved only the 5-8 rules most relevant to each transcript, and passed those rules with full explanatory context to the model.

The prompt shrank from 12,000 tokens to 3,000 tokens. Accuracy went up because the model got richer context per rule instead of thin summaries of 40 rules. Cost per call dropped 60%.
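The assembly step is straightforward: the per-call prompt is built from the retrieved rules rather than the full rulebook. A hedged sketch (the chunk shape matches what a retrieval helper like the one in the next section returns; `build_prompt` is a hypothetical helper, not our production code):

```python
def build_prompt(chunks: list[dict], base_instructions: str) -> str:
    """Assemble a per-call system prompt from only the retrieved rules."""
    rules = "\n\n".join(
        f"[{c['source']}] (relevance {c['score']:.2f})\n{c['text']}"
        for c in chunks
    )
    return f"{base_instructions}\n\nApplicable rules:\n\n{rules}"

chunks = [
    {"text": "Agent must read the risk disclosure.", "source": "rule-3", "score": 0.88},
    {"text": "Agent must ask for verbal consent.", "source": "rule-4", "score": 0.81},
]
prompt = build_prompt(chunks, "You are a sales call compliance evaluator.")
```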

The Real RAG Stack

Here’s what we deploy in production, not the demo version with 50 documents:

from qdrant_client import QdrantClient
from openai import OpenAI

# Embedding model: text-embedding-3-small ($0.02/1M tokens)
# Not text-embedding-3-large. The quality difference on
# sub-2000 token chunks is ~2% on our eval set. Cost is 5x.
embed_client = OpenAI()
qdrant = QdrantClient(url="https://your-cluster.qdrant.io")

def retrieve_context(query: str, collection: str, top_k: int = 5):
    query_vec = embed_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    results = qdrant.search(
        collection_name=collection,
        query_vector=query_vec,
        limit=top_k,
        score_threshold=0.72  # Below this, retrieval quality tanks
    )

    return [
        {
            "text": hit.payload["text"],
            "source": hit.payload["source"],
            "score": hit.score
        }
        for hit in results
    ]

The score_threshold=0.72 isn’t arbitrary. We calibrated it on 200 test queries. Below 0.72, retrieved chunks are topically adjacent but not actually relevant. Above 0.85, you miss valid results because the threshold is too strict. The sweet spot varies by embedding model and domain, so run your own calibration.

Where RAG Breaks

Retrieval failure is silent. When the vector search returns irrelevant chunks, the model doesn’t say “I couldn’t find the answer.” It fabricates one using the irrelevant context as evidence. This is worse than a base-model hallucination because the user trusts the answer more since it appears to come from their data.

We mitigate this with a relevance gate:

def check_relevance(query: str, chunks: list[dict]) -> list[dict]:
    """Filter chunks below relevance threshold."""
    relevant = [c for c in chunks if c["score"] >= 0.72]
    if not relevant:
        return []  # Better to say "I don't know" than hallucinate
    return relevant

Chunking is the real engineering problem. Everyone focuses on the vector database and the embedding model. The actual hard part is splitting your documents correctly. Too small and you lose context. Too large and retrieval precision drops. We’ve settled on 400-600 token chunks with 50-token overlaps for most document types. Legal documents need 800-1000 token chunks because clauses reference each other across paragraphs. There’s no universal setting.
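A minimal chunker along those lines, using whitespace-split words as a rough stand-in for model tokens — a real pipeline should count with the model's tokenizer (e.g. tiktoken) instead:

```python
def chunk_words(words: list[str], size: int = 500, overlap: int = 50) -> list[str]:
    """Split a word list into overlapping chunks (word counts approximate tokens)."""
    chunks, step = [], size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = ("lorem " * 1200).split()
pieces = chunk_words(doc, size=500, overlap=50)
print(len(pieces))  # 1,200 words at 450-word steps → 3 chunks
```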

Latency adds up. Embedding the query: 50-100ms. Vector search: 20-50ms. Context injection and generation: 500-1500ms. Total: 600-1600ms per request. For real-time chat, acceptable. For batch processing 10,000 documents, that retrieval overhead adds 15-30 minutes to the pipeline compared to a prompt-only approach.

For a detailed comparison of vector databases in production, see our vector databases comparison.

When RAG Is the Right Choice

  • Your knowledge base exceeds what fits in a prompt (roughly 20+ documents or 50K+ tokens).
  • Information changes frequently.
  • You need citations or source attribution.
  • Accuracy on domain-specific facts matters more than output style consistency.
  • You don’t have 500+ labeled examples for fine-tuning.

Fine-Tuning: The Most Misunderstood Option

Fine-tuning changes the model’s weights. You’re not giving it instructions at inference time. You’re changing how it behaves by default.

This sounds powerful. It is. It’s also the most expensive approach to build, the slowest to iterate on, and the easiest to get wrong.

When Fine-Tuning Actually Makes Sense

We’ve fine-tuned models on three client projects. Two were justified. One was a mistake.

Justified: brand voice at scale. A client needed 12,000 product descriptions per month in a specific brand voice. The brand guidelines were 47 pages. Stuffing those guidelines into every prompt cost $0.03 per description in prompt tokens alone, $360/month just for the system prompt. We fine-tuned GPT-4o-mini on 800 approved examples. Results:

| Metric | Prompted (base model) | Fine-tuned |
|---|---|---|
| Prompt tokens per call | ~4,000 | ~600 |
| Cost per description | $0.035 | $0.004 |
| Monthly cost (12K descriptions) | $360 | $48 |
| Human review pass rate | 88% | 88% |

The fine-tuning job cost about $120 (800 examples, 3 epochs). Paid for itself in the first week.

Justified: proprietary taxonomy classification. A 200-category product taxonomy that didn’t map to any public classification system. Prompt engineering worked but required 5,800 tokens of category definitions per call. Fine-tuning on 2,000 labeled examples produced a model that classified at 91% accuracy with a 200-token prompt.

Mistake: domain Q&A. A client wanted to fine-tune on internal documentation so the model would “know” their product. This is the most common fine-tuning misconception. Fine-tuning teaches a model how to respond, not what to know. The fine-tuned model hallucinated about product features just as confidently as the base model. It just hallucinated in the brand voice. RAG was the right answer. Took us two weeks to realize the mistake and rebuild.

The Fine-Tuning Process

OpenAI’s fine-tuning API is the most mature option. Here’s what data preparation actually looks like:

import json

training_examples = [
    {
        "messages": [
            {
                "role": "system",
                "content": "Generate a product description."
            },
            {
                "role": "user",
                "content": "Product: Ergonomic standing desk, oak finish, "
                           "motorized height adjustment, 60x30 inches"
            },
            {
                "role": "assistant",
                "content": "The Oak 60 adjusts from sitting to standing "
                           "height in 4.2 seconds..."
            }
        ]
    },
    # ... 799 more examples
]

with open("training_data.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")

from openai import OpenAI
client = OpenAI()

file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune"
)

job = client.fine_tuning.jobs.create(
    training_file=file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={"n_epochs": 3}
)

Training data quality matters more than quantity. 200 high-quality examples beat 2,000 noisy ones. We spend more time curating training data than on any other part of the process. Every example goes through human review. Inconsistent examples teach the model to be inconsistent.

Eval before and after. Hold out 15% of examples for evaluation. If the fine-tuned model doesn’t outperform the prompted base model on the held-out set, don’t ship it. On one project, the first training run scored lower than the base model because the training data had contradictory examples. Two rounds of data cleaning fixed it.
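Wiring up that holdout comparison is simple. A sketch with accuracy as the metric; the model calls are replaced by placeholder predict functions, since the real ones are API calls:

```python
import random

def split_holdout(examples: list, frac: float = 0.15, seed: int = 42):
    """Hold out a fraction of examples for post-training evaluation."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - frac))
    return shuffled[:cut], shuffled[cut:]

def accuracy(predict, held_out: list[tuple[str, str]]) -> float:
    """Fraction of held-out examples where predict(input) matches the label."""
    return sum(predict(x) == y for x, y in held_out) / len(held_out)

examples = [(f"input-{i}", "pass" if i % 3 else "fail") for i in range(100)]
train, held_out = split_holdout(examples)
print(len(train), len(held_out))  # → 85 15

# Ship only if the fine-tuned model's accuracy on held_out beats the
# prompted base model's accuracy on the same set.
```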

The Cost Math

| Phase | GPT-4o-mini | GPT-4o |
|---|---|---|
| Training (1,000 examples, 3 epochs) | ~$25 | ~$80 |
| Inference input ($/1M tokens) | $0.30 | $5.00 |
| Inference output ($/1M tokens) | $1.20 | $15.00 |
| Time to first usable model | 2-3 hours | 3-5 hours |
| Time to production-quality model | 1-3 weeks | 1-3 weeks |

The training cost is trivial. The real cost is 1-3 weeks of data preparation, training iteration, and evaluation. That’s engineering time, not API spend.

The Hybrid Stack

Most production systems use more than one approach. Our compliance pipeline uses all three:

User query
    |
    v
Prompt routing (prompt engineering)
    |
    v
Retrieve relevant rules + transcripts (RAG)
    |
    v
Score compliance (fine-tuned classifier for pass/fail)
    |
    v
Generate report (prompt engineering with retrieved context)

The fine-tuned classifier handles the high-volume binary decision (compliant/non-compliant) at low cost. RAG provides the context. Prompt engineering orchestrates the pipeline and generates human-readable output.

| Pattern | When to Use |
|---|---|
| Prompt engineering + RAG | Most common combo. Knowledge-intensive tasks without training data. |
| Fine-tuning + prompt engineering | High-volume tasks with consistent output format. Fine-tune for style, prompt for task specifics. |
| Fine-tuning + RAG | Fine-tune for output behavior, RAG for dynamic knowledge. Most complex to maintain. |
| All three | Multi-model pipelines where each call is optimized differently. |
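The pipeline diagram above is plain function composition. A sketch of the data flow; every stage here is a hypothetical stand-in (stub lambdas) for the real router, retriever, classifier, and report generator:

```python
from dataclasses import dataclass

@dataclass
class ComplianceResult:
    compliant: bool
    context: list[str]
    report: str

def run_pipeline(transcript: str, route, retrieve, classify, report) -> ComplianceResult:
    """Compose the three approaches: prompt routing, RAG retrieval, fine-tuned scoring."""
    task = route(transcript)                 # prompt engineering: pick the workflow
    context = retrieve(transcript, task)     # RAG: pull only the relevant rules
    verdict = classify(transcript, context)  # fine-tuned classifier: cheap pass/fail
    return ComplianceResult(verdict, context, report(transcript, context, verdict))

# Stub stages, just to show the shape of each hand-off
result = run_pipeline(
    "Hi, this is Sam from Acme...",
    route=lambda t: "sales_call",
    retrieve=lambda t, task: ["rule-1: state name and company"],
    classify=lambda t, ctx: True,
    report=lambda t, ctx, v: "PASS: agent identified themselves.",
)
print(result.compliant)  # → True
```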

The Decision Framework

Here’s the decision tree we walk through at project kickoff:

Do you have 500+ labeled training examples?
|
+-- No --> Does the task need domain-specific knowledge?
|          |
|          +-- No --> Prompt engineering. Start here.
|          +-- Yes -> RAG. Build a retrieval pipeline.
|
+-- Yes -> Is the task about *style/format* or *knowledge*?
           |
           +-- Style/format --> Fine-tuning
           +-- Knowledge -----> RAG
           +-- Both? ---------> Fine-tuning + RAG hybrid

Volume matters too. Below 500 requests/day, prompt engineering handles almost anything. The per-request cost of a long prompt is negligible at low volume. Above 5,000 requests/day, the 85% token reduction from fine-tuning translates to real savings.
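The crossover point can be computed directly. A sketch using the brand-voice numbers from earlier in this post ($120 training job, $0.035 vs $0.004 per call, ~400 calls/day); the figures are illustrative, not current pricing:

```python
def break_even_days(train_cost: float, prompted_cost_per_call: float,
                    tuned_cost_per_call: float, calls_per_day: int) -> float:
    """Days until the fine-tuning cost is recovered by cheaper per-call prompts."""
    daily_saving = (prompted_cost_per_call - tuned_cost_per_call) * calls_per_day
    return train_cost / daily_saving

days = break_even_days(120, 0.035, 0.004, calls_per_day=400)
print(f"{days:.1f} days")  # → 9.7 days
```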

What We Got Wrong

Trying to fine-tune away hallucinations. One client believed that fine-tuning on correct answers would eliminate fabrication. It doesn’t. Fine-tuning adjusts the distribution of outputs. It doesn’t give the model access to information it hasn’t seen. If the base model doesn’t know your product’s pricing tiers, the fine-tuned model won’t know them either. It’ll just state wrong prices with more confidence. RAG is the answer to hallucination, not fine-tuning.

Over-engineering RAG for small corpora. For a knowledge base under 50 documents, skip the vector database. Stuff the documents into context. Claude 3.5 Sonnet handles 60-80K tokens reliably. Setting up Qdrant, writing chunking logic, tuning retrieval parameters: that’s 3-5 days of work to avoid a problem you don’t have yet. We spent a week building a RAG pipeline for a client with 23 policy documents. The context-stuffing approach we prototyped on day one worked better because the model could cross-reference between documents without retrieval boundaries. The “lost-in-the-middle” attention degradation is real, but for small document sets the cross-referencing benefit outweighs it.

Treating the first choice as permanent. Your first approach should be the cheapest to build and validate. Prompt engineering first. If accuracy is insufficient, add RAG. If cost is too high at volume, consider fine-tuning. This isn’t waterfall planning. It’s iterative architecture, and the best AI software development teams treat it that way.

FAQ

Can I fine-tune Claude or only GPT models?

Anthropic offers fine-tuning for Claude (currently through Amazon Bedrock rather than a first-party API), and Google provides fine-tuning for Gemini models through Vertex AI. OpenAI’s fine-tuning tooling is the most mature, with better evaluation integration and faster iteration cycles. For open-source models, you can fine-tune Llama 3.1 using LoRA adapters on a single A100 GPU at roughly $2-3/hour on cloud providers. Which model you fine-tune depends on which model you’re already running in production. Switching base models mid-project adds migration cost on top of the fine-tuning work.

For more on this, read our guide on LLM Selection for Production.

How much training data do I need for fine-tuning?

OpenAI’s minimum is 10 examples. The practical minimum for consistent results is 200-500. We’ve seen diminishing returns past 2,000 examples for most classification and generation tasks. Quality matters more than quantity. 300 carefully curated examples with consistent formatting outperform 3,000 noisy examples scraped from production logs. Budget 40-60 hours for data preparation on a fine-tuning project. That’s typically the largest time investment, not the training itself.

Is RAG still worth it now that context windows are 1M+ tokens?

Yes. Large context windows help when you need the model to reason across an entire document corpus simultaneously, like comparing provisions across 50 contracts. But for point lookups in large knowledge bases, retrieval is faster, cheaper, and more accurate. At 1M tokens of context, you’re paying $1.25-3.00 per query in input tokens alone depending on the model. Answer quality on specific factual questions also degrades past 40-80K tokens due to attention distribution effects. RAG with focused retrieval is both cheaper and more precise for most production workloads.

How do I evaluate which approach works for my use case?

Build a golden test set of 100-200 examples with expected outputs. Run all candidate approaches against the same test set. Measure accuracy (or your domain-specific quality metric), latency, and cost per request. The approach that meets your quality floor at the lowest cost per request wins. Don’t optimize for benchmark scores. Optimize for the quality threshold your use case actually requires. An approach hitting 85% accuracy at $0.002/request beats one at 92% accuracy and $0.05/request if your quality floor is 80%.
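That selection rule is easy to encode. A sketch, assuming each candidate approach reports (accuracy, cost per request) from the golden test set; the candidate numbers are illustrative:

```python
def pick_approach(candidates: dict[str, tuple[float, float]],
                  quality_floor: float) -> str:
    """Among approaches meeting the quality floor, pick the cheapest per request."""
    viable = {name: cost for name, (acc, cost) in candidates.items()
              if acc >= quality_floor}
    if not viable:
        raise ValueError("No approach meets the quality floor")
    return min(viable, key=viable.get)

candidates = {
    "prompt_only": (0.85, 0.002),
    "rag": (0.92, 0.050),
}
print(pick_approach(candidates, quality_floor=0.80))  # → prompt_only
```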

What’s the fastest path from idea to production?

Prompt engineering, always. Build a working prototype with prompts and few-shot examples in 1-3 days. This tells you whether the task is feasible at all. If accuracy is below your threshold, add RAG for domain-specific knowledge (1-2 weeks additional). If you’re past 5,000 requests/day and cost is the binding constraint, evaluate fine-tuning (4-8 weeks including data preparation). Each step should prove the previous one wasn’t sufficient before you invest in the next.


Choosing between fine-tuning, RAG, and prompt engineering is the first architecture decision in any AI build. Get it right and you save months. Book a 30-minute call and we’ll map your use case to the right approach before you write a line of code.
