
Model Cost Optimization: Cut LLM Bills 80% in Production

How to cut LLM API costs by 80% without degrading quality. Model routing, prompt compression, caching, and batching patterns from production systems.

Anil Gulecha
Ex-HackerRank, Ex-Google
TL;DR
  • Model routing alone cuts costs 60-75% on most workloads: use a cheap model for classification and retrieval, reserve GPT-4o or Sonnet for reasoning-heavy steps.
  • Semantic caching with pgvector returns near-duplicate query results from a cache instead of re-running inference. We see 25-35% cache hit rates on most chatbot workloads.
  • Prompt compression reduces input token counts by 30-40% on long-document tasks without measurable quality loss, using LLMLingua or selective attention scoring.
  • Batching async tasks overnight (10 PM to 6 AM) through batch API endpoints saves 40-50% on non-interactive workloads.

LLM API costs follow a predictable pattern across every startup we’ve worked with. Month one: “this is affordable.” Month two: “traffic is growing, costs are up 3x.” Month three: the founder sends a message asking why the API bill is now larger than engineering salaries.

The answer is almost always the same. They picked one model and called it for everything. GPT-4o or Claude Sonnet for every request: the three-word customer support reply, the 5,000-token document analysis, the simple classification step in a pipeline that runs 50,000 times a day.

This is the LLM equivalent of using a freight truck to deliver envelopes. The cost structure breaks before the product scales.

We’ve shipped AI systems for startups where the final cost is 15-20% of what a naive single-model approach would have cost, with the same output quality on every task that matters to users. These are the four techniques that do most of the work.

Model Routing: The 80/20 of Cost Reduction

The highest-leverage change is the simplest: stop calling expensive models for tasks that don’t need expensive models.

In most AI systems, 70-85% of inference requests are doing one of four things: classification, summarization of short inputs, structured extraction from well-formatted data, or generating short deterministic responses to common queries. These do not require frontier reasoning models. They require a model that can follow instructions reliably and return a consistent output format.

Here is the routing layer we build into every system:

from enum import Enum
from dataclasses import dataclass

class TaskComplexity(Enum):
    TRIVIAL = "trivial"      # Classification, short extraction, routing
    STANDARD = "standard"   # Summarization, Q&A over context, drafts
    COMPLEX = "complex"     # Reasoning chains, code generation, synthesis
    CRITICAL = "critical"   # Legal/compliance/medical, final outputs

# Current cost reference (per 1M tokens, input/output)
MODEL_COSTS = {
    "claude-3-5-haiku-20241022":    {"input": 0.80,  "output": 4.00},
    "claude-3-5-sonnet-20241022":   {"input": 3.00,  "output": 15.00},
    "gpt-4o-mini":                  {"input": 0.15,  "output": 0.60},
    "gpt-4o":                       {"input": 2.50,  "output": 10.00},
    "llama-3.1-8b":                 {"input": 0.06,  "output": 0.06},   # Groq
    "llama-3.1-70b":                {"input": 0.59,  "output": 0.79},   # Groq
}

ROUTING_TABLE = {
    TaskComplexity.TRIVIAL:   "gpt-4o-mini",           # $0.15/M input
    TaskComplexity.STANDARD:  "claude-3-5-haiku-20241022",  # $0.80/M input
    TaskComplexity.COMPLEX:   "claude-3-5-sonnet-20241022", # $3.00/M input
    TaskComplexity.CRITICAL:  "claude-3-5-sonnet-20241022", # Non-negotiable
}

def classify_task(task_type: str, input_tokens: int, has_tool_calls: bool) -> TaskComplexity:
    if task_type in ("classify", "extract_field", "route_intent") and input_tokens < 500:
        return TaskComplexity.TRIVIAL

    if task_type in ("summarize", "qa_retrieval", "generate_reply") and not has_tool_calls:
        if input_tokens < 2000:
            return TaskComplexity.STANDARD

    if task_type in ("code_generation", "multi_step_reasoning", "synthesis"):
        return TaskComplexity.COMPLEX

    if task_type in ("compliance_check", "legal_review", "medical_content"):
        return TaskComplexity.CRITICAL

    # Default to standard, not complex
    return TaskComplexity.STANDARD

async def route_and_call(task_type: str, messages: list, tools: list | None = None) -> dict:
    # estimate_tokens, call_model, and log_routing_event are project-specific helpers
    input_tokens = estimate_tokens(messages)
    has_tools = bool(tools)

    complexity = classify_task(task_type, input_tokens, has_tools)
    model = ROUTING_TABLE[complexity]

    response = await call_model(model=model, messages=messages, tools=tools)

    # Log for cost tracking and routing accuracy review
    await log_routing_event(task_type, model, complexity, input_tokens, response.usage)

    return response

The numbers behind this: a pipeline running 100,000 classification calls per day at roughly 1,000 input tokens each burns 100M input tokens daily. Routing those to gpt-4o-mini ($0.15/M) instead of gpt-4o ($2.50/M) is $15 versus $250 per day on input tokens alone. The output savings stack on top.
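The arithmetic is simple enough to script as a sanity check. The prices mirror the MODEL_COSTS table above; the per-call token count is an assumption you'd replace with your own telemetry:

```python
# Back-of-envelope daily input-token cost. Prices match the MODEL_COSTS
# table; the 1,000-tokens-per-call figure is an assumption.
PRICE_PER_M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}

def daily_input_cost(model: str, calls_per_day: int, avg_input_tokens: int) -> float:
    total_tokens = calls_per_day * avg_input_tokens
    return total_tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

# 100,000 calls at ~1,000 input tokens each:
#   gpt-4o-mini → ~$15/day, gpt-4o → ~$250/day
```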

Where routing fails. The classification step itself requires a model call, which costs tokens. If your tasks are all short and you’re routing to the same cheap model regardless, the overhead of the router might exceed the savings. We use routing when there’s genuine variance in task complexity in the same pipeline, not as a universal pattern. For agentic workloads specifically, per-step cost tracking is worth building in from the start: the patterns in our agentic AI production guide show how we log cost per step and use that data to make routing decisions at the task-class level.

One thing we got wrong in an early deployment: we routed too aggressively and sent complex reasoning tasks to Haiku. The output quality was fine 80% of the time and completely wrong 20% of the time. The client noticed. We added quality spot-checks to the router: sample 5% of tasks and verify the output with a reference model. When error rate on the cheap model crosses 3%, the router escalates that task type.
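A minimal sketch of that spot-check loop. The sampling rate, disagreement threshold, and exact-match comparison are illustrative stand-ins; in practice the reference check is itself a model call, and the counters would live in a database rather than process memory:

```python
import random
from collections import defaultdict

SAMPLE_RATE = 0.05       # Spot-check 5% of responses from the cheap tier
ERROR_THRESHOLD = 0.03   # Escalate a task type above 3% disagreement
MIN_SAMPLES = 50         # Don't escalate on a handful of data points

_stats = defaultdict(lambda: {"checked": 0, "errors": 0})

def should_sample() -> bool:
    return random.random() < SAMPLE_RATE

def record_spot_check(task_type: str, cheap_output: str, reference_output: str) -> None:
    # Exact match is a stand-in; real checks compare semantics, not strings
    stats = _stats[task_type]
    stats["checked"] += 1
    if cheap_output.strip() != reference_output.strip():
        stats["errors"] += 1

def should_escalate(task_type: str) -> bool:
    stats = _stats[task_type]
    if stats["checked"] < MIN_SAMPLES:
        return False
    return stats["errors"] / stats["checked"] > ERROR_THRESHOLD
```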

Semantic Caching: Pay for Inference Once

Users ask the same questions. Not identically, but semantically. “What’s the refund policy?” and “How do I get a refund?” are the same question. “Summarize this 8-page contract” from the same client twice a week is the same task.

Semantic caching stores the result of a query and retrieves it when a new query is semantically similar enough to skip inference. The implementation uses embedding similarity, not exact string matching.

import json

import asyncpg
from openai import AsyncOpenAI

SIMILARITY_THRESHOLD = 0.92  # Tune based on your domain

async def get_or_compute(
    query: str,
    compute_fn,
    pool: asyncpg.Pool,
    cache_ttl_hours: int = 24,
) -> dict:
    # Step 1: embed the query
    embedding_client = AsyncOpenAI()
    embedding_response = await embedding_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    # asyncpg has no built-in pgvector codec; pass the vector in its text form
    # (or register one with pgvector.asyncpg.register_vector)
    query_embedding = str(embedding_response.data[0].embedding)

    # Step 2: search cache with cosine similarity via pgvector
    # (<=> is cosine distance in pgvector; 1 - distance = similarity)
    cached = await pool.fetchrow("""
        SELECT result, 1 - (embedding <=> $1::vector) AS similarity
        FROM llm_cache
        WHERE created_at > NOW() - INTERVAL '1 hour' * $2
          AND 1 - (embedding <=> $1::vector) >= $3
        ORDER BY embedding <=> $1::vector
        LIMIT 1
    """, query_embedding, cache_ttl_hours, SIMILARITY_THRESHOLD)

    if cached:
        await log_cache_hit(query, cached["similarity"])
        return json.loads(cached["result"])  # asyncpg returns JSONB as text by default

    # Step 3: cache miss, run inference
    result = await compute_fn(query)

    # Step 4: store in cache
    await pool.execute("""
        INSERT INTO llm_cache (query, embedding, result, created_at)
        VALUES ($1, $2::vector, $3, NOW())
    """, query, query_embedding, json.dumps(result))

    return result

The schema for the cache table:

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE llm_cache (
    id          BIGSERIAL PRIMARY KEY,
    query       TEXT NOT NULL,
    embedding   vector(1536),          -- text-embedding-3-small dimensions
    result      JSONB NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON llm_cache USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);              -- Rule of thumb: rows/1000; sqrt(rows) past ~1M rows

CREATE INDEX ON llm_cache (created_at);  -- For TTL filtering

The similarity threshold is the most important parameter. Too low (0.80) and you get false hits: a question about “cancellation policy” returns a cached answer about “refund policy.” Too high (0.98) and you get almost no hits because natural language variance rarely produces that level of cosine similarity.

We’ve landed on 0.90-0.93 for most customer-facing chatbot workloads and 0.95+ for domain-specific technical queries where precision matters more. Measure your false hit rate by sampling cache hits and checking whether the returned result actually answers the new query. A 1% false hit rate on a customer support bot costs more in support tickets than the inference savings.
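One way to pick the threshold empirically: sample cache hits, label each as a true or false duplicate, and sweep candidate thresholds against that labeled set. A sketch, assuming (similarity, is_true_duplicate) tuples collected from production sampling:

```python
def hit_rate(pairs: list[tuple[float, bool]], threshold: float) -> float:
    """Fraction of sampled queries that would hit the cache at this threshold."""
    return sum(1 for sim, _ in pairs if sim >= threshold) / len(pairs)

def false_hit_rate(pairs: list[tuple[float, bool]], threshold: float) -> float:
    """Of the hits at this threshold, the fraction that were not true duplicates."""
    hits = [dup for sim, dup in pairs if sim >= threshold]
    if not hits:
        return 0.0
    return sum(1 for dup in hits if not dup) / len(hits)

# Sweep candidates and pick the lowest threshold whose false-hit rate
# stays under budget (e.g. 1% for a support bot):
# for t in (0.88, 0.90, 0.92, 0.94, 0.96):
#     print(t, hit_rate(pairs, t), false_hit_rate(pairs, t))
```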

Cache hit rates we see in production:

| Domain | Cache hit rate | Notes |
|---|---|---|
| Customer support chatbot | 28-35% | High repetition in support queries |
| Document Q&A (same corpus) | 40-55% | Users ask same questions about same docs |
| Code assistant | 12-18% | More variance in coding queries |
| Classification pipeline | 60-70% | Highly repetitive inputs |
| General-purpose chatbot | 15-25% | More variance, lower hit rate |

The embedding call itself costs $0.02/M tokens with text-embedding-3-small. For a 100-token query, that’s $0.000002. Compared to $0.001-0.015 for the inference call it replaces, the embedding cost is a rounding error.

pgvector is the right choice here if you’re already on Postgres. It’s available as an extension on Supabase, Railway, and most managed Postgres providers. We use it instead of a dedicated vector database for caches under 10M rows because it eliminates an infrastructure dependency and the query performance is indistinguishable for these workloads.

Prompt Compression: Fewer Tokens, Same Information

LLM APIs charge per token in and per token out. Input compression is cheap and runs locally. You don’t need a frontier-model call to strip tokens from a prompt; you need a scoring function that identifies which tokens carry the most information for the specific query.

LLMLingua is Microsoft’s open-source implementation of this idea. It uses a small language model (Llama 2 7B or similar) to score the importance of each token given the task, then removes low-importance tokens until you hit your target compression ratio.

from llmlingua import PromptCompressor

# LLMLingua-2 scores tokens with a small classifier model; the original
# LLMLingua algorithm uses a small causal LM (e.g. GPT-2) instead.
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    device_map="cpu",        # Works on CPU, just slower
    use_llmlingua2=True,     # Use the v2 algorithm (better precision)
)

def compress_context(
    instruction: str,
    context: str,
    question: str,
    target_ratio: float = 0.4,   # Compress context to 40% of original
) -> str:
    result = compressor.compress_prompt(
        context,
        instruction=instruction,
        question=question,
        rate=target_ratio,       # Fraction of tokens to keep
    )
    return result["compressed_prompt"]

The numbers from a RAG pipeline we optimized: context chunks were averaging 4,200 input tokens per query. After compression at a 0.4 ratio, they averaged 1,680 tokens. The response quality on our evaluation set (100 held-out questions with reference answers) dropped from 89.3% accuracy to 87.1%. The 2.2 percentage point quality drop saved 60% of context token costs on every query.

Whether that trade-off is acceptable depends on the task. For a customer support chatbot answering policy questions, 87% vs 89% accuracy is acceptable. For a compliance review system where errors have legal consequences, it is not.
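The evaluation loop behind those accuracy numbers is worth automating before deploying compression anywhere. A sketch with hypothetical stand-ins for the model call (answer_fn), the compressor (compress_fn), and the grader (grade_fn):

```python
def eval_compression(questions, answer_fn, compress_fn, grade_fn):
    """Average accuracy with and without context compression.

    questions:   dicts with "context", "question", "reference" keys
    answer_fn:   (context, question) -> model answer
    compress_fn: context -> compressed context
    grade_fn:    (answer, reference) -> score in [0, 1]
    """
    base = compressed = 0.0
    for q in questions:
        base += grade_fn(answer_fn(q["context"], q["question"]), q["reference"])
        compressed += grade_fn(
            answer_fn(compress_fn(q["context"]), q["question"]), q["reference"]
        )
    n = len(questions)
    return base / n, compressed / n
```

Run it on a held-out set (we use ~100 questions with reference answers) and compare the two averages before turning compression on for a task type.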

Cheaper alternatives to LLMLingua for simpler compression:

  • Sentence scoring with TF-IDF. For retrieval-augmented contexts where you know the query, score each sentence by its relevance to the query and drop the bottom 30%. No model dependency, runs in milliseconds. Less precise than LLMLingua but good enough for document summaries and knowledge base lookups. We covered the chunking and retrieval side of this in RAG in production if you want the full context pipeline.
  • Structural truncation. If your context is a long document, keep the first 30% (usually the intro/summary), the last 20% (conclusion), and any section headers. Middle sections contain most of the repetition. This is a heuristic but works surprisingly well for linear documents.
  • Redundancy elimination. In RAG systems, deduplicate retrieved chunks by cosine similarity before sending to the model. We frequently see the same paragraph retrieved 3-4 times from different chunks. Running a dedup pass at similarity > 0.85 removes 15-25% of input tokens with zero quality loss.
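The dedup pass in the last bullet is a few lines of greedy filtering. A sketch assuming you already have one embedding per retrieved chunk:

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def dedup_chunks(chunks: list[str], embeddings: list[list[float]],
                 threshold: float = 0.85) -> list[str]:
    """Greedy pass: keep a chunk only if it's not too similar to any kept chunk."""
    kept: list[int] = []
    for i in range(len(chunks)):
        if all(cosine(embeddings[i], embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]
```

Retrieval order is preserved, so the highest-ranked copy of a duplicated paragraph survives and later near-copies are dropped.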

Batching and Async Scheduling

Not all LLM workloads need real-time responses. Data pipelines, report generation, classification jobs, content creation, analytics summarization. These can run at any time.

Both Anthropic and OpenAI offer batch API endpoints with 50% cost discounts in exchange for a 24-hour response window.

import asyncio
import json

import anthropic

async def batch_classify_documents(documents: list[dict]) -> list[dict]:
    client = anthropic.AsyncAnthropic()

    # Prepare batch requests (max 10,000 per batch)
    requests = [
        {
            "custom_id": doc["id"],
            "params": {
                "model": "claude-3-5-haiku-20241022",
                "max_tokens": 100,
                "messages": [{
                    "role": "user",
                    "content": f"Classify this document. Categories: [contract, invoice, report, correspondence]. Return JSON: {{\"category\": \"...\", \"confidence\": 0.0}}\n\n{doc['text'][:3000]}"
                }]
            }
        }
        for doc in documents
    ]

    # Submit batch
    batch = await client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} submitted. Status: {batch.processing_status}")

    # Poll until complete (real usage: use a webhook or scheduled check)
    while batch.processing_status == "in_progress":
        await asyncio.sleep(300)  # Check every 5 minutes
        batch = await client.messages.batches.retrieve(batch.id)
        print(f"Status: {batch.processing_status}, {batch.request_counts}")

    # Collect results
    results = []
    async for result in await client.messages.batches.results(batch.id):
        if result.result.type == "succeeded":
            content = result.result.message.content[0].text
            results.append({
                "id": result.custom_id,
                "classification": json.loads(content)
            })
        else:
            results.append({
                "id": result.custom_id,
                "error": result.result.error.type
            })

    return results

The 50% discount compounds with model routing. If you route batch tasks to Haiku ($0.80/M input) and apply the batch discount, effective cost is $0.40/M input. Compare that to gpt-4o at real-time pricing: $2.50/M input. That’s an 84% cost difference for the same batch classification task.

When batching doesn’t work:

User-facing workloads with sub-2-second latency requirements can’t use batch APIs. Obvious, but worth stating: the discount is only useful for async pipelines where latency tolerance is high. The common pattern in our systems is to separate the pipeline into real-time paths (routing, retrieval, response generation) and async paths (analytics, content classification, report generation, training data labeling). The async paths go through batch APIs.

One tricky case: RAG pipelines where documents are embedded and classified on ingest. These look real-time (the user uploads a document and expects to search it shortly after) but tolerate 10-30 second processing delays. That’s short enough for real-time APIs but long enough to potentially batch with other pending documents. We use a small queue and flush it every 30 seconds, which gives us near-real-time UX without the real-time API cost.
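That 30-second flush queue is small enough to sketch in full. MicroBatcher is a hypothetical name, and process_batch stands in for whatever submits the accumulated documents to the batch endpoint:

```python
import asyncio

class MicroBatcher:
    """Collect items and flush them as one batch every flush_interval seconds,
    or sooner when the queue reaches max_batch."""

    def __init__(self, process_batch, flush_interval: float = 30.0, max_batch: int = 100):
        self.process_batch = process_batch   # async callable taking a list of items
        self.flush_interval = flush_interval
        self.max_batch = max_batch
        self.pending: list = []
        self._lock = asyncio.Lock()

    async def add(self, item) -> None:
        async with self._lock:
            self.pending.append(item)
            if len(self.pending) >= self.max_batch:
                await self._flush()

    async def run(self) -> None:
        # Background timer: flush whatever has accumulated each interval
        while True:
            await asyncio.sleep(self.flush_interval)
            async with self._lock:
                await self._flush()

    async def _flush(self) -> None:
        if self.pending:
            batch, self.pending = self.pending, []
            await self.process_batch(batch)
```

Start `run()` as a background task at app startup and call `add()` from the upload handler; users see their documents processed within one flush interval.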

Putting It Together: A Real Cost Calculation

A client came to us with an AI pipeline processing 50,000 requests per day across five task types: intent classification, retrieval Q&A, document summarization, code generation, and compliance checking.

Their starting configuration: everything on gpt-4o.

Daily cost estimate (pre-optimization):

| Task | Requests/day | Avg tokens (in+out) | Daily cost |
|---|---|---|---|
| Intent classification | 25,000 | 450 | $28.13 |
| Retrieval Q&A | 12,000 | 2,200 | $132.00 |
| Summarization | 8,000 | 3,500 | $140.00 |
| Code generation | 3,000 | 4,500 | $67.50 |
| Compliance check | 2,000 | 3,000 | $30.00 |
| Total | 50,000 | | $397.63/day |

After optimization (model routing + semantic caching + batching where applicable):

| Task | Model | Cache hit rate | Effective requests | Daily cost |
|---|---|---|---|---|
| Intent classification (batch) | gpt-4o-mini (50% batch discount) | 65% | 8,750 | $0.33 |
| Retrieval Q&A | claude-3-5-haiku | 32% | 8,160 | $3.67 |
| Summarization (batch) | claude-3-5-haiku (50% discount) | 40% | 4,800 | $1.34 |
| Code generation | claude-3-5-sonnet | 15% | 2,550 | $11.48 |
| Compliance check | claude-3-5-sonnet | 0% (no caching for compliance) | 2,000 | $9.00 |
| Total | | | | $25.82/day |

$397.63 → $25.82. A 93.5% cost reduction.

The compliance checks stayed on Sonnet with no caching because the client was in financial services and needed full auditability: every check runs fresh and gets logged with full context. No semantic similarity shortcuts for a task where being wrong has regulatory consequences.

What We Haven’t Solved Yet

Cold start on semantic cache. The cache is worthless for the first few weeks of a new deployment. You pay full inference costs until enough query patterns accumulate to generate hits. For new features, expect 4-6 weeks before caching delivers meaningful savings.

Quality drift detection. When you route tasks to cheaper models, you need to know when quality degrades. Our current approach: sample 3-5% of routed responses and run a reference check with a larger model. When the disagreement rate crosses 3%, escalate that task class back to the premium tier. It works but it’s not elegant. We’d rather have a continuous quality signal, and we don’t.

Prompt compression for structured extraction. LLMLingua is designed for document QA. When the task is extracting specific fields from a document (like parsing invoices or contracts), aggressive compression can drop the exact tokens containing the fields you need. We use compression conservatively (0.7 ratio, not 0.4) for extraction tasks, and we haven’t found a reliable way to automatically detect when compression would drop the relevant fields.

FAQ

What’s the actual cost difference between GPT-4o and GPT-4o Mini for real tasks?

At current pricing, GPT-4o costs $2.50/M input tokens and GPT-4o Mini costs $0.15/M. That’s a 16.7x difference on input tokens. On output, it’s $10.00 vs $0.60/M (16.7x again). For a task that uses 1,000 input + 500 output tokens, GPT-4o costs $0.0075 and GPT-4o Mini costs $0.00045. For tasks you run 100,000 times per day, the difference is $750/day vs $45/day. The question is never “which is cheaper” but “which is good enough.” For classification and short extraction, Mini is consistently good enough in our testing.

Does semantic caching work for RAG systems?

It works well when the underlying document corpus is stable. If you’re answering questions about a knowledge base that changes weekly, use a shorter TTL (4-12 hours instead of 24). If the corpus changes continuously (live data, news feeds), semantic caching adds complexity without much benefit because the answers go stale quickly. The highest-value use case is customer support over stable documentation: policy docs, product FAQs, pricing. These change rarely and the query patterns are highly repetitive.

How do I implement model routing without adding latency to every request?

The routing decision should be made before making the API call, not as a separate model call. Classify tasks at the API call site based on task type, input length, and whether tool calls are required. These are cheap heuristics that run in under 1ms. Avoid routing systems that call a model to decide which model to call. That doubles your minimum latency for every request and only makes sense when the task type genuinely cannot be determined statically.

When should I use the Anthropic Batch API vs real-time API?

Use the batch API for any pipeline where the user is not waiting for the response in real time. Report generation, document classification, data enrichment, training data labeling, analytics summarization. The 24-hour window sounds long, but most of these tasks run overnight anyway. The 50% discount is guaranteed and requires zero architectural changes beyond switching the API endpoint. If you have any significant async workload and are not using batch APIs, you are paying 2x the necessary cost.

Is prompt compression safe to use in production?

For informational tasks (summarization, Q&A over documents, content generation), yes. For high-stakes tasks (compliance, medical, legal, financial), the quality drop from compression is not acceptable. Our rule: if a wrong answer costs money or creates liability, use full prompts. If a slightly imprecise answer is fine, compress. And always run a quality evaluation on your specific task before deploying compression at scale. The 2.2% quality drop we saw on our RAG pipeline might be 8% on yours depending on your corpus and query distribution.


If your LLM costs are growing faster than your product, book a 30-minute call and we’ll pull up the numbers on your pipeline. Usually takes one look at the call distribution to find where 80% of the cost is going.

Tags: llm cost optimization · ai software development · model routing · prompt compression · llm caching · production AI


Written by

Anil Gulecha
Ex-HackerRank, Ex-Google

Anil reviews every architecture decision at Kalvium Labs. He's the engineer who still ships code — making technical trade-offs on RAG vs fine-tuning, model selection, and infrastructure choices. When a CTO evaluates us, Anil is the reason they trust the work.
