LLM API costs follow a predictable pattern across every startup we’ve worked with. Month one: “this is affordable.” Month two: “traffic is growing, costs are up 3x.” Month three: the founder sends a message asking why the API bill is now larger than engineering salaries.
The answer is almost always the same. They picked one model and called it for everything. GPT-4o or Claude Sonnet for every request: the three-word customer support reply, the 5,000-token document analysis, the simple classification step in a pipeline that runs 50,000 times a day.
This is the LLM equivalent of using a freight truck to deliver envelopes. The cost structure breaks before the product scales.
We’ve shipped AI systems for startups where the final cost is 15-20% of what a naive single-model approach would have cost, with the same output quality on every task that matters to users. These are the four techniques that do most of the work.
## Model Routing: The 80/20 of Cost Reduction
The highest-leverage change is the simplest: stop calling expensive models for tasks that don’t need expensive models.
In most AI systems, 70-85% of inference requests are doing one of four things: classification, summarization of short inputs, structured extraction from well-formatted data, or generating short deterministic responses to common queries. These do not require frontier reasoning models. They require a model that can follow instructions reliably and return a consistent output format.
Here is the routing layer we build into every system:
```python
from enum import Enum

class TaskComplexity(Enum):
    TRIVIAL = "trivial"    # Classification, short extraction, routing
    STANDARD = "standard"  # Summarization, Q&A over context, drafts
    COMPLEX = "complex"    # Reasoning chains, code generation, synthesis
    CRITICAL = "critical"  # Legal/compliance/medical, final outputs

# Current cost reference (per 1M tokens, input/output)
MODEL_COSTS = {
    "claude-3-5-haiku-20241022": {"input": 0.80, "output": 4.00},
    "claude-3-5-sonnet-20241022": {"input": 3.00, "output": 15.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "llama-3.1-8b": {"input": 0.06, "output": 0.06},    # Groq
    "llama-3.1-70b": {"input": 0.59, "output": 0.79},   # Groq
}

ROUTING_TABLE = {
    TaskComplexity.TRIVIAL: "gpt-4o-mini",                  # $0.15/M input
    TaskComplexity.STANDARD: "claude-3-5-haiku-20241022",   # $0.80/M input
    TaskComplexity.COMPLEX: "claude-3-5-sonnet-20241022",   # $3.00/M input
    TaskComplexity.CRITICAL: "claude-3-5-sonnet-20241022",  # Non-negotiable
}

def classify_task(task_type: str, input_tokens: int, has_tool_calls: bool) -> TaskComplexity:
    if task_type in ("classify", "extract_field", "route_intent") and input_tokens < 500:
        return TaskComplexity.TRIVIAL
    if task_type in ("summarize", "qa_retrieval", "generate_reply") and not has_tool_calls:
        if input_tokens < 2000:
            return TaskComplexity.STANDARD
    if task_type in ("code_generation", "multi_step_reasoning", "synthesis"):
        return TaskComplexity.COMPLEX
    if task_type in ("compliance_check", "legal_review", "medical_content"):
        return TaskComplexity.CRITICAL
    # Default to standard, not complex
    return TaskComplexity.STANDARD

async def route_and_call(task_type: str, messages: list, tools: list | None = None) -> dict:
    # estimate_tokens, call_model, and log_routing_event are app-specific helpers
    input_tokens = estimate_tokens(messages)
    has_tools = bool(tools)
    complexity = classify_task(task_type, input_tokens, has_tools)
    model = ROUTING_TABLE[complexity]
    response = await call_model(model=model, messages=messages, tools=tools)
    # Log for cost tracking and routing accuracy review
    await log_routing_event(task_type, model, complexity, input_tokens, response.usage)
    return response
```
The numbers behind this: for a pipeline running 100,000 classification calls per day at roughly 1,000 input tokens each, routing them to gpt-4o-mini ($0.15/M) versus gpt-4o ($2.50/M) is the difference between $15 and $250 per day on input tokens alone. The output savings stack on top.
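As a back-of-envelope sketch (the ~1,000 input tokens per call is an assumption for illustration, not a measured number):

```python
CALLS_PER_DAY = 100_000
TOKENS_PER_CALL = 1_000  # assumed average input tokens per classification call

def daily_input_cost(price_per_m_tokens: float) -> float:
    """Daily input-token spend at a given per-million-token price."""
    return CALLS_PER_DAY * TOKENS_PER_CALL / 1_000_000 * price_per_m_tokens

mini_cost = daily_input_cost(0.15)   # gpt-4o-mini: ~$15/day
gpt4o_cost = daily_input_cost(2.50)  # gpt-4o: ~$250/day
```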
Where routing fails. The classification step itself requires a model call, which costs tokens. If your tasks are all short and you’re routing to the same cheap model regardless, the overhead of the router might exceed the savings. We use routing when there’s genuine variance in task complexity in the same pipeline, not as a universal pattern. For agentic workloads specifically, per-step cost tracking is worth building in from the start: the patterns in our agentic AI production guide show how we log cost per step and use that data to make routing decisions at the task-class level.
One thing we got wrong in an early deployment: we routed too aggressively and sent complex reasoning tasks to Haiku. The output quality was fine 80% of the time and completely wrong 20% of the time. The client noticed. We added quality spot-checks to the router: sample 5% of tasks and verify the output with a reference model. When error rate on the cheap model crosses 3%, the router escalates that task type.
## Semantic Caching: Pay for Inference Once
Users ask the same questions. Not identically, but semantically. “What’s the refund policy?” and “How do I get a refund?” are the same question. “Summarize this 8-page contract” from the same client twice a week is the same task.
Semantic caching stores the result of a query and retrieves it when a new query is semantically similar enough to skip inference. The implementation uses embedding similarity, not exact string matching.
```python
import json

import asyncpg
from openai import AsyncOpenAI

SIMILARITY_THRESHOLD = 0.92  # Tune based on your domain

async def get_or_compute(
    query: str,
    compute_fn,
    pool: asyncpg.Pool,
    cache_ttl_hours: int = 24,
) -> dict:
    # Step 1: embed the query
    embedding_client = AsyncOpenAI()
    embedding_response = await embedding_client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    )
    query_embedding = embedding_response.data[0].embedding
    # pgvector accepts the text form '[0.1, 0.2, ...]', so serialize the list
    embedding_str = str(query_embedding)

    # Step 2: search cache with cosine similarity via pgvector
    # (<=> is cosine distance in pgvector; 1 - distance = similarity)
    cached = await pool.fetchrow("""
        SELECT result, 1 - (embedding <=> $1::vector) AS similarity
        FROM llm_cache
        WHERE created_at > NOW() - INTERVAL '1 hour' * $2
          AND 1 - (embedding <=> $1::vector) >= $3
        ORDER BY embedding <=> $1::vector
        LIMIT 1
    """, embedding_str, cache_ttl_hours, SIMILARITY_THRESHOLD)

    if cached:
        await log_cache_hit(query, cached["similarity"])  # app-specific metrics helper
        return json.loads(cached["result"])

    # Step 3: cache miss, run inference
    result = await compute_fn(query)

    # Step 4: store in cache (result serialized into the JSONB column)
    await pool.execute("""
        INSERT INTO llm_cache (query, embedding, result, created_at)
        VALUES ($1, $2::vector, $3::jsonb, NOW())
    """, query, embedding_str, json.dumps(result))
    return result
```
The schema for the cache table:
```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE llm_cache (
    id BIGSERIAL PRIMARY KEY,
    query TEXT NOT NULL,
    embedding vector(1536),  -- text-embedding-3-small dimensions
    result JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX ON llm_cache USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);  -- Tune lists to sqrt(row_count)
CREATE INDEX ON llm_cache (created_at);  -- For TTL filtering
```
The similarity threshold is the most important parameter. Too low (0.80) and you get false hits: a question about “cancellation policy” returns a cached answer about “refund policy.” Too high (0.98) and you get almost no hits because natural language variance rarely produces that level of cosine similarity.
We’ve landed on 0.90-0.93 for most customer-facing chatbot workloads and 0.95+ for domain-specific technical queries where precision matters more. Measure your false hit rate by sampling cache hits and checking whether the returned result actually answers the new query. A 1% false hit rate on a customer support bot costs more in support tickets than the inference savings.
Cache hit rates we see in production:
| Domain | Cache Hit Rate | Notes |
|---|---|---|
| Customer support chatbot | 28-35% | High repetition in support queries |
| Document Q&A (same corpus) | 40-55% | Users ask same questions about same docs |
| Code assistant | 12-18% | More variance in coding queries |
| Classification pipeline | 60-70% | Highly repetitive inputs |
| General-purpose chatbot | 15-25% | More variance, lower hit rate |
The embedding call itself costs $0.02/M tokens with text-embedding-3-small. For a 100-token query, that’s $0.000002. Compared to $0.001-0.015 for the inference call it replaces, the embedding cost is a rounding error.
pgvector is the right choice here if you’re already on Postgres. It’s available as an extension on Supabase, Railway, and most managed Postgres providers. We use it instead of a dedicated vector database for caches under 10M rows because it eliminates an infrastructure dependency and the query performance is indistinguishable for these workloads.
## Prompt Compression: Fewer Tokens, Same Information
LLM APIs charge per token in and per token out. Compressing the input is cheap and doesn't require a frontier-model call: you need a scoring function that identifies which tokens carry the most information for the specific query, and that can run locally on a small model.
LLMLingua is Microsoft’s open-source implementation of this idea. It uses a small language model (Llama 2 7B or similar) to score the importance of each token given the task, then removes low-importance tokens until you hit your target compression ratio.
```python
from llmlingua import PromptCompressor

# v1 algorithm with a small causal LM as the scorer. (LLMLingua-2 exists too,
# but it uses its own token-classification scoring model, not GPT-2.)
compressor = PromptCompressor(
    model_name="openai-community/gpt2",  # Small scoring model
    device_map="cpu",                    # Works on CPU, just slower
)

def compress_context(
    instruction: str,
    context: str,
    question: str,
    target_ratio: float = 0.4,  # Compress context to 40% of original
) -> str:
    result = compressor.compress_prompt(
        context,
        instruction=instruction,
        question=question,
        target_token=int(len(context.split()) * target_ratio * 1.3),  # tokens ≈ words * 1.3
        condition_compare=True,
        condition_in_question="after",
    )
    return result["compressed_prompt"]
```
The numbers from a RAG pipeline we optimized: context chunks were averaging 4,200 input tokens per query. After compression at a 0.4 ratio, they averaged 1,680 tokens. The response quality on our evaluation set (100 held-out questions with reference answers) dropped from 89.3% accuracy to 87.1%. The 2.2 percentage point quality drop saved 60% of context token costs on every query.
Whether that trade-off is acceptable depends on the task. For a customer support chatbot answering policy questions, 87% vs 89% accuracy is acceptable. For a compliance review system where errors have legal consequences, it is not.
Cheaper alternatives to LLMLingua for simpler compression:
- Sentence scoring with TF-IDF. For retrieval-augmented contexts where you know the query, score each sentence by its relevance to the query and drop the bottom 30%. No model dependency, runs in milliseconds. Less precise than LLMLingua but good enough for document summaries and knowledge base lookups. We covered the chunking and retrieval side of this in RAG in production if you want the full context pipeline.
- Structural truncation. If your context is a long document, keep the first 30% (usually the intro/summary), the last 20% (conclusion), and any section headers. Middle sections contain most of the repetition. This is a heuristic but works surprisingly well for linear documents.
- Redundancy elimination. In RAG systems, deduplicate retrieved chunks by cosine similarity before sending to the model. We frequently see the same paragraph retrieved 3-4 times from different chunks. Running a dedup pass at similarity > 0.85 removes 15-25% of input tokens with zero quality loss.
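That dedup pass is simple enough to sketch without dependencies (this assumes each retrieved chunk arrives with its embedding, which any RAG retriever already has on hand):

```python
from math import sqrt

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def dedupe_chunks(
    chunks: list[str],
    embeddings: list[list[float]],
    threshold: float = 0.85,
) -> list[str]:
    """Greedy dedup: keep a chunk only if it isn't near-identical to one already kept."""
    kept, kept_embs = [], []
    for text, emb in zip(chunks, embeddings):
        if all(cosine(emb, seen) < threshold for seen in kept_embs):
            kept.append(text)
            kept_embs.append(emb)
    return kept
```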
## Batching and Async Scheduling
Not all LLM workloads need real-time responses. Data pipelines, report generation, classification jobs, content creation, analytics summarization. These can run at any time.
Both Anthropic and OpenAI offer batch API endpoints with 50% cost discounts in exchange for a 24-hour response window.
```python
import anthropic
import asyncio
import json

async def batch_classify_documents(documents: list[dict]) -> list[dict]:
    client = anthropic.AsyncAnthropic()

    # Prepare batch requests (max 10,000 per batch)
    requests = [
        {
            "custom_id": doc["id"],
            "params": {
                "model": "claude-3-5-haiku-20241022",
                "max_tokens": 100,
                "messages": [{
                    "role": "user",
                    "content": (
                        "Classify this document. Categories: [contract, invoice, "
                        "report, correspondence]. Return JSON: "
                        '{"category": "...", "confidence": 0.0}\n\n'
                        f"{doc['text'][:3000]}"
                    ),
                }],
            },
        }
        for doc in documents
    ]

    # Submit batch
    batch = await client.messages.batches.create(requests=requests)
    print(f"Batch {batch.id} submitted. Status: {batch.processing_status}")

    # Poll until complete (real usage: use a webhook or scheduled check)
    while batch.processing_status == "in_progress":
        await asyncio.sleep(300)  # Check every 5 minutes
        batch = await client.messages.batches.retrieve(batch.id)
        print(f"Status: {batch.processing_status}, {batch.request_counts}")

    # Collect results
    results = []
    async for result in await client.messages.batches.results(batch.id):
        if result.result.type == "succeeded":
            content = result.result.message.content[0].text
            results.append({
                "id": result.custom_id,
                "classification": json.loads(content),
            })
        else:
            results.append({
                "id": result.custom_id,
                "error": result.result.error.type,
            })
    return results
```
The 50% discount compounds with model routing. If you route batch tasks to Haiku ($0.80/M input) and apply the batch discount, effective cost is $0.40/M input. Compare that to gpt-4o at real-time pricing: $2.50/M input. That’s an 84% cost difference for the same batch classification task.
When batching doesn’t work:
User-facing workloads with sub-2-second latency requirements can’t use batch APIs. Obvious, but worth stating: the discount is only useful for async pipelines where latency tolerance is high. The common pattern in our systems is to separate the pipeline into real-time paths (routing, retrieval, response generation) and async paths (analytics, content classification, report generation, training data labeling). The async paths go through batch APIs.
One tricky case: RAG pipelines where documents are embedded and classified on ingest. These look real-time (the user uploads a document and expects to search it shortly after) but tolerate 10-30 second processing delays. That’s short enough for real-time APIs but long enough to potentially batch with other pending documents. We use a small queue and flush it every 30 seconds, which gives us near-real-time UX without the real-time API cost.
## Putting It Together: A Real Cost Calculation
A client came to us with an AI pipeline processing 50,000 requests per day across five task types: intent classification, retrieval Q&A, document summarization, code generation, and compliance checking.
Their starting configuration: everything on gpt-4o.
Daily cost estimate (pre-optimization):
| Task | Requests/day | Avg tokens (in+out) | Daily cost |
|---|---|---|---|
| Intent classification | 25,000 | 450 | $28.13 |
| Retrieval Q&A | 12,000 | 2,200 | $132.00 |
| Summarization | 8,000 | 3,500 | $140.00 |
| Code generation | 3,000 | 4,500 | $67.50 |
| Compliance check | 2,000 | 3,000 | $30.00 |
| Total | 50,000 | | $397.63/day |
After optimization (model routing + semantic caching + batching where applicable):
| Task | Model | Cache hit rate | Effective requests | Daily cost |
|---|---|---|---|---|
| Intent classification (batch) | gpt-4o-mini (50% batch discount) | 65% | 8,750 | $0.33 |
| Retrieval Q&A | claude-3-5-haiku | 32% | 8,160 | $3.67 |
| Summarization (batch) | claude-3-5-haiku (50% discount) | 40% | 4,800 | $1.34 |
| Code generation | claude-3-5-sonnet | 15% | 2,550 | $11.48 |
| Compliance check | claude-3-5-sonnet | 0% (no caching for compliance) | 2,000 | $9.00 |
| Total | | | 26,260 | $25.82/day |
$397.63 → $25.82. A 93.5% cost reduction.
The compliance checks stayed on Sonnet with no caching because the client was in financial services and needed full auditability: every check runs fresh and gets logged with full context. No semantic similarity shortcuts for a task where being wrong has regulatory consequences.
## What We Haven’t Solved Yet
Cold start on semantic cache. The cache is worthless for the first few weeks of a new deployment. You pay full inference costs until enough query patterns accumulate to generate hits. For new features, expect 4-6 weeks before caching delivers meaningful savings.
Quality drift detection. When you route tasks to cheaper models, you need to know when quality degrades. Our current approach: sample 3-5% of routed responses and run a reference check with a larger model. When the disagreement rate crosses 3%, escalate that task class back to the premium tier. It works but it’s not elegant. We’d rather have a continuous quality signal, and we don’t.
Prompt compression for structured extraction. LLMLingua is designed for document QA. When the task is extracting specific fields from a document (like parsing invoices or contracts), aggressive compression can drop the exact tokens containing the fields you need. We use compression conservatively (0.7 ratio, not 0.4) for extraction tasks, and we haven’t found a reliable way to automatically detect when compression would drop the relevant fields.
## FAQ
### What’s the actual cost difference between GPT-4o and GPT-4o Mini for real tasks?
At current pricing, GPT-4o costs $2.50/M input tokens and GPT-4o Mini costs $0.15/M. That’s a 16.7x difference on input tokens. On output, it’s $10.00 vs $0.60/M (16.7x again). For a task that uses 1,000 input + 500 output tokens, GPT-4o costs $0.0075 and GPT-4o Mini costs $0.00045. For tasks you run 100,000 times per day, the difference is $750/day vs $45/day. The question is never “which is cheaper” but “which is good enough.” For classification and short extraction, Mini is consistently good enough in our testing.
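The arithmetic behind those numbers, as a sketch:

```python
def task_cost(in_tokens: int, out_tokens: int, in_price: float, out_price: float) -> float:
    """Cost of one call, with prices quoted per 1M tokens."""
    return in_tokens / 1e6 * in_price + out_tokens / 1e6 * out_price

gpt4o = task_cost(1_000, 500, 2.50, 10.00)  # ~$0.0075 per call
mini = task_cost(1_000, 500, 0.15, 0.60)    # ~$0.00045 per call
daily_gap = (gpt4o - mini) * 100_000        # ~$705/day at 100k calls
```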
### Does semantic caching work for RAG systems?
It works well when the underlying document corpus is stable. If you’re answering questions about a knowledge base that changes weekly, use a shorter TTL (4-12 hours instead of 24). If the corpus changes continuously (live data, news feeds), semantic caching adds complexity without much benefit because the answers go stale quickly. The highest-value use case is customer support over stable documentation: policy docs, product FAQs, pricing. These change rarely and the query patterns are highly repetitive.
### How do I implement model routing without adding latency to every request?
The routing decision should be made before making the API call, not as a separate model call. Classify tasks at the API call site based on task type, input length, and whether tool calls are required. These are cheap heuristics that run in under 1ms. Avoid routing systems that call a model to decide which model to call. That doubles your minimum latency for every request and only makes sense when the task type genuinely cannot be determined statically.
### When should I use the Anthropic Batch API vs real-time API?
Use the batch API for any pipeline where the user is not waiting for the response in real time. Report generation, document classification, data enrichment, training data labeling, analytics summarization. The 24-hour window sounds long, but most of these tasks run overnight anyway. The 50% discount is guaranteed and requires zero architectural changes beyond switching the API endpoint. If you have any significant async workload and are not using batch APIs, you are paying 2x the necessary cost.
### Is prompt compression safe to use in production?
For informational tasks (summarization, Q&A over documents, content generation), yes. For high-stakes tasks (compliance, medical, legal, financial), the quality drop from compression is not acceptable. Our rule: if a wrong answer costs money or creates liability, use full prompts. If a slightly imprecise answer is fine, compress. And always run a quality evaluation on your specific task before deploying compression at scale. The 2.2-point quality drop we saw on our RAG pipeline might be 8 points on yours depending on your corpus and query distribution.
If your LLM costs are growing faster than your product, book a 30-minute call and we’ll pull up the numbers on your pipeline. Usually takes one look at the call distribution to find where 80% of the cost is going.