
Cloudflare Workers AI: Production Deployment Checklist

29 items. The pre-launch pass we run before any Workers-based AI system goes live. Pulled from three production systems (RAG, conversational agents, and a streaming compliance analyzer) we operate in 2026.

Companion to the full write-up: Production AI on Cloudflare Workers: Architecture Guide.

01

Architecture & compute split

  • Workers is the orchestration layer, not the compute layer. LLM inference, embedding generation, OCR, and anything CPU-heavy runs in external services. Workers routes, validates, and streams.
  • Split your pipeline into stages with explicit boundaries. Classification, extraction, generation, and scoring should be separately callable and separately logged. Monolithic handlers become undebuggable within a month.
  • Route by task, not by "best model." Cheap tier (Haiku, 4o-mini) handles classification, extraction, routing. Expensive tier only for open-ended generation and nuanced reasoning. Aim for 70%+ of token volume on the cheap tier.
  • Decide on Vectorize vs external vector DB early. Native Vectorize is best under 2M vectors with simple metadata filters. For complex boolean filters or 5M+ vectors, self-host Qdrant or pgvector and proxy via Workers.
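The task-based routing rule above can be sketched as a small pure function. The `Task` union and model IDs here are placeholders for illustration, not specific version recommendations:

```typescript
// Illustrative task-based model router. Task names and model IDs are
// assumptions for this sketch; substitute your provider's real IDs.
type Task = "classify" | "extract" | "route" | "generate" | "reason";

const CHEAP_MODEL = "claude-haiku-placeholder";
const EXPENSIVE_MODEL = "claude-sonnet-placeholder";

// Cheap tier handles mechanical work; expensive tier is reserved for
// open-ended generation and nuanced reasoning.
function pickModel(task: Task): string {
  switch (task) {
    case "classify":
    case "extract":
    case "route":
      return CHEAP_MODEL;
    case "generate":
    case "reason":
      return EXPENSIVE_MODEL;
  }
}
```

Making the router a single function also gives you one place to log routing decisions, which feeds the 70%-cheap-tier target.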
02

CPU budget discipline

  • Know your budget. 10ms CPU on free, 30ms on paid. CPU time is synchronous JavaScript only; wait time on outbound calls is free.
  • Cap request body size before JSON.parse. An 18KB Deepgram transcript burned us for 4ms just on the initial parse. Slice or reject oversized payloads in the fetch handler, not after parsing.
  • No vector math in JavaScript. Cosine similarity, re-ranking, matrix ops on 1536-d vectors will eat 10-15ms. Offload to Workers AI (GPU-side) or skip the step.
  • Profile with performance.now() in staging. Measure the sync CPU cost of every non-trivial function. Anything above 3ms is a tail-risk candidate.
  • Parse the minimum you need. If the request body is large but you only need a query and tenantId, extract and cap them; don't deserialize the whole object to read two fields.
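One way to combine the size-cap and minimal-extraction rules above, assuming a raw body string read in the fetch handler. The 32 KB cap and field names (`query`, `tenantId`) are example values, not Workers limits:

```typescript
// Reject oversized payloads before JSON.parse ever runs, then pull out and
// cap only the two fields the handler needs. MAX_BODY_BYTES is an example
// threshold; capping the raw string bounds the parse cost.
const MAX_BODY_BYTES = 32 * 1024;

function parseQueryFields(raw: string): { query: string; tenantId: string } | null {
  if (raw.length > MAX_BODY_BYTES) return null; // reject before parsing
  try {
    const body = JSON.parse(raw);
    const query = String(body.query ?? "").slice(0, 2000); // cap field length
    const tenantId = String(body.tenantId ?? "");
    return tenantId ? { query, tenantId } : null;
  } catch {
    return null; // malformed JSON: reject rather than burn CPU on recovery
  }
}
```

Returning `null` lets the fetch handler short-circuit to a 400/413 without touching the rest of the pipeline.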
03

Streaming & response handling

  • Stream everything user-facing. First token in 200-400ms vs 3-8s blank screen. Use TransformStream to pipe the provider SSE directly to the client.
  • Convert Anthropic SSE format to OpenAI format if your frontend expects it. A small TransformStream maps content_block_delta to choices[0].delta.content.
  • Set content-type: text/event-stream and cache-control: no-cache on streaming responses; the runtime handles chunked transfer framing itself. Missing headers cause some middleboxes to buffer and break the stream.
  • Handle provider errors on the stream, not just the initial response. A 200 OK can still produce an error event mid-stream; surface it to the client.
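The core of the Anthropic-to-OpenAI conversion above is a per-event mapper; wrap it in a TransformStream that parses incoming `data:` lines and re-serializes the result. This sketch follows the two providers' documented streaming shapes (`content_block_delta` with a `text_delta`, versus `choices[0].delta.content`):

```typescript
// Maps one parsed Anthropic stream event to an OpenAI-style chunk, or null
// for event types the frontend does not need (message_start, ping, etc.).
interface AnthropicEvent {
  type: string;
  delta?: { type: string; text?: string };
}

function toOpenAIChunk(event: AnthropicEvent): object | null {
  if (event.type !== "content_block_delta" || event.delta?.type !== "text_delta") {
    return null;
  }
  return { choices: [{ delta: { content: event.delta.text ?? "" } }] };
}
```

Keeping the mapper pure makes it trivial to unit test, while the surrounding TransformStream stays a dumb pipe.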
04

Vectorize & RAG

  • Keep chunks under ~2,500 tokens so chunk text fits in Vectorize metadata (10KB limit). Otherwise store chunk IDs and fetch body from R2 or D1.
  • Always pass a tenantId filter (or equivalent) on every query. A missing filter on a multi-tenant index is a data leak.
  • Cap topK between 4 and 8. More matches rarely improve answer quality; they just cost context tokens. Re-rank only if you have an eval that shows it helps.
  • Plan for metadata complexity up front. If you need OR across fields, Vectorize can't express it. Migrating to Qdrant mid-project is a week of work.
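The tenant-filter and topK rules above can be enforced in one helper so no query path can skip them. `topK` and `filter` match the Vectorize query options; the 4-8 clamp is this checklist's guideline, not an API limit:

```typescript
// Builds Vectorize query options with two guards: a mandatory tenant filter
// and a topK clamped to the 4-8 range. Throwing on a missing tenantId turns
// a silent data leak into a loud error.
function buildQueryOptions(tenantId: string, topK = 5) {
  if (!tenantId) throw new Error("tenantId filter is required on every query");
  return {
    topK: Math.min(8, Math.max(4, topK)),
    filter: { tenantId }, // never query a multi-tenant index unfiltered
  };
}
```

Route every `VECTORIZE_INDEX.query()` call through this helper rather than building options inline.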
05

D1, KV & persistent state

  • Index (conversation_id, created_at DESC) on messages. Without it, message loads fan out across the whole table.
  • Handle read-your-own-write on D1 replicas. A user who just posted a message may hit a replica that hasn't synced. Cache the session object in the request context instead of re-reading D1.
  • Summarize conversations over 8K tokens with a Haiku-tier model. Run it via ctx.waitUntil so it doesn't block the current response.
  • Use KV for session tokens and idempotency keys. Reads are under 1ms; perfect for per-request auth checks without adding latency.
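The read-your-own-write fix above amounts to per-request memoization: load the session once, then reuse the same object instead of re-reading a possibly stale replica. A minimal sketch, with the session shape and loader left generic:

```typescript
// Per-request session cache. Construct one inside the fetch handler and pass
// it through the request context; never store it in a module-level global,
// since isolates are reused across requests.
class RequestSessionCache<T> {
  private cached: Promise<T> | null = null;

  constructor(private load: () => Promise<T>) {}

  get(): Promise<T> {
    // First call runs the loader (e.g. a D1 SELECT); later calls within the
    // same request reuse the in-flight or resolved promise.
    this.cached ??= this.load();
    return this.cached;
  }
}
```

Caching the promise (not the value) also de-duplicates concurrent reads within one request.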
06

Reliability & cost

  • Turn on prompt caching for any system prompt over ~1K tokens that's reused. 90% discount on Anthropic cached reads, 50% on OpenAI. Pays off within a handful of requests.
  • Build a 50-sample eval before migrating any step to a cheaper model. Run cheap vs expensive, score with a judge model or humans. No guessing.
  • Rate-limit upstream API usage per tenant using a KV token bucket. A single abusive tenant can exhaust your OpenAI rate limits in minutes.
  • Offload multi-step agents (5+ tool calls, ~20s of wall-clock time) to a separate Fly.io or Railway instance. Use Workers as the gateway. The 30s limit catches you eventually.
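The per-tenant token bucket above reduces to a small pure refill-and-consume step; the bucket state would live in KV under a per-tenant key. Capacity and refill rate here are example numbers, and since KV writes are last-write-wins, treat this as a soft limit rather than an exact one:

```typescript
// Token-bucket state as stored (JSON) in KV per tenant.
interface Bucket { tokens: number; lastRefillMs: number; }

const CAPACITY = 60;      // max burst size (example value)
const REFILL_PER_SEC = 1; // steady-state requests per second (example value)

// Refill based on elapsed time, then try to spend one token. Returns the
// updated bucket to write back to KV regardless of the decision.
function tryConsume(bucket: Bucket, nowMs: number): { allowed: boolean; bucket: Bucket } {
  const elapsedSec = Math.max(0, (nowMs - bucket.lastRefillMs) / 1000);
  const tokens = Math.min(CAPACITY, bucket.tokens + elapsedSec * REFILL_PER_SEC);
  if (tokens < 1) {
    return { allowed: false, bucket: { tokens, lastRefillMs: nowMs } };
  }
  return { allowed: true, bucket: { tokens: tokens - 1, lastRefillMs: nowMs } };
}
```

On `allowed: false`, return a 429 before any upstream API call is made.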
07

Deploy & observability

  • Secrets go in via wrangler secret put, never in wrangler.jsonc. Cloudflare encrypts at rest; config files get committed to git.
  • Log tokens, model, latency, and tenant on every LLM call. Aggregate daily. You'll need this the first time a cost spike needs explaining.
  • Ship a /health endpoint that touches D1, KV, and Vectorize. External uptime monitoring that doesn't exercise the bindings misses the real failure modes.
  • Gate the canary with Cloudflare's deployment percentages. Ship to 10% of traffic for 24h, watch error rates and p95 latency, then promote.
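The /health endpoint above can be a handler that runs one named check per binding. The check bodies here are placeholders; in a real Worker they would be the env-bound calls (a D1 `SELECT 1`, a KV read, a tiny Vectorize query):

```typescript
// Runs each named check, reports per-binding status, and returns 503 if any
// check fails so external monitoring sees binding-level failures.
type Check = () => Promise<unknown>;

async function health(checks: Record<string, Check>): Promise<Response> {
  const entries = await Promise.all(
    Object.entries(checks).map(async ([name, run]) => {
      try { await run(); return [name, "ok"] as const; }
      catch { return [name, "fail"] as const; }
    })
  );
  const ok = entries.every(([, status]) => status === "ok");
  return new Response(JSON.stringify(Object.fromEntries(entries)), {
    status: ok ? 200 : 503,
    headers: { "content-type": "application/json" },
  });
}
```

Reporting per-binding status in the body tells you which dependency broke without digging through logs.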
08

Starter wrangler.jsonc

The skeleton we start every Workers AI project with. Fill in the IDs with wrangler kv:namespace create, wrangler d1 create, and wrangler vectorize create.

{
  "name": "ai-assistant",
  "main": "src/index.ts",
  "compatibility_date": "2026-01-01",
  "compatibility_flags": ["nodejs_compat"],

  "kv_namespaces": [
    { "binding": "SESSIONS", "id": "<kv_id>" }
  ],

  "d1_databases": [
    { "binding": "DB", "database_name": "ai-conversations", "database_id": "<d1_id>" }
  ],

  "vectorize": [
    { "binding": "VECTORIZE_INDEX", "index_name": "documents" }
  ],

  "vars": { "ENVIRONMENT": "production" }
}

What to read next

Tuesday Build Notes · 3-min read

One engineering tradeoff, every Tuesday.

From the engineers actually shipping. What we tried, what broke, what we'd do differently. Zero "5 AI trends to watch." Unsubscribe in one click.

Issue #1 lands the moment you subscribe: how we cut a client's LLM bill 60% without losing quality. The 3 model-routing rules we now use on every project.

Architecting something on Workers?

Tell us the workload. We'll tell you honestly whether Workers is the right layer for it, or where it breaks.

Book a 30-minute technical call →