We hit the Cloudflare Workers CPU limit at 4ms into a JSON parsing operation.
The payload was 18KB: a Deepgram transcript with speaker diarization data. We were doing JSON.parse() on the whole thing before chunking it for embedding. Totally normal in a Node.js Lambda. On Workers, it consumed 40% of our available CPU time before we’d done anything useful.
That was the moment we understood: Workers isn’t a cheaper Lambda. It’s a different execution model that requires a different way of thinking about AI workloads. The constraints are real, but so are the advantages. Global deployment in 300+ data centers, cold starts under 100ms, and a development model that pushes you toward stateless request handling.
This is what we’ve learned building three production AI systems on Workers over the last year.
What Workers Actually Is (and Isn’t)
Workers runs on V8 isolates, not containers. Each isolate is a sandboxed JavaScript engine. There’s no filesystem. No native add-ons. No child_process. The runtime is closer to a browser service worker than to Node.js.
The implications for AI workloads:
Memory: 128MB per isolate. A naive LangChain chain with full conversation history and embedded documents will hit this. You need to externalize state.
CPU time: 10ms on the free plan, 30ms on paid (Workers Paid, $5/month). This is wall-clock CPU time, not total request duration. A request that waits for an external API call doesn’t consume CPU during the wait. But your synchronous JavaScript (parsing, encoding, chunking, sorting) all counts.
Request duration: Up to 30 seconds on paid plans for HTTP requests. Enough for most LLM calls, tight for multi-step agent pipelines.
No native modules: no C++ add-ons, and no npm packages that ship native binaries (sharp for image processing is the usual casualty). Web Crypto is built in, and the nodejs_compat flag covers much of node:crypto, but anything backed by native code is out. This also rules out the Python-based ML libraries people try to run server-side.
What Workers doesn’t constrain you on: outbound HTTP requests, WebSockets, streaming responses, KV reads, D1 SQL queries, R2 object reads, and Vectorize similarity search. All of those are handled efficiently by the Workers runtime.
Request Handling: The Right Split
The pattern that works for AI workloads: Workers handles routing, auth, and lightweight orchestration. LLM inference, embedding generation, and heavy computation happen in external services. Workers is the glue layer, not the compute layer. It’s a similar split to what we described in the prompt architecture post: a lightweight router makes decisions, and specialized components do the actual work.
Here’s the request flow we use for a RAG endpoint:
// src/index.ts
import { Env } from './types';
import { retrieveContext } from './retrieval';
import { streamCompletion } from './completion';
import { authenticate } from './auth';
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// Authentication is a KV lookup (fast, no CPU cost)
const authResult = await authenticate(request, env);
if (authResult.ok === false) {
return new Response('Unauthorized', { status: 401 });
}
// Parse only what we need, not the full body
const { query, tenantId } = await parseQuery(request);
// Retrieval: Vectorize similarity search
// This is an outbound HTTP call; doesn't count against CPU
const context = await retrieveContext(query, tenantId, env);
// Stream the LLM response back to the client
// TransformStream lets us proxy the OpenAI/Anthropic SSE stream
return streamCompletion(query, context, env);
}
};
async function parseQuery(request: Request): Promise<{ query: string; tenantId: string }> {
// Parse only the fields we need; don't deserialize the whole body if it's large
const body = await request.json() as Record<string, unknown>;
return {
query: String(body.query ?? '').slice(0, 2000), // hard cap
tenantId: String(body.tenantId ?? ''),
};
}
The key constraint in parseQuery: we only parse what we need and we cap input length immediately. On a previous project, we were passing the full user message (potentially megabytes of conversation history) into the Workers function before filtering it. That was consuming 2-4ms of CPU just on the initial parse.
Streaming LLM Responses Through Workers
Streaming is worth doing properly. A non-streaming LLM response on a 30-second timeout means the user stares at a blank response field for 3-8 seconds. With streaming, they see tokens appearing after 200-400ms.
Workers has first-class streaming support via TransformStream and ReadableStream. Here’s how we proxy an Anthropic streaming response:
// src/completion.ts
export async function streamCompletion(
query: string,
context: string,
env: Env
): Promise<Response> {
const anthropicResponse = await fetch('https://api.anthropic.com/v1/messages', {
method: 'POST',
headers: {
'x-api-key': env.ANTHROPIC_API_KEY,
'anthropic-version': '2023-06-01',
'content-type': 'application/json',
},
body: JSON.stringify({
model: 'claude-3-5-haiku-20241022',
max_tokens: 1024,
stream: true,
system: `You are a helpful assistant. Answer based on the following context:\n\n${context}`,
messages: [{ role: 'user', content: query }],
}),
});
if (anthropicResponse.ok === false || anthropicResponse.body == null) {
return new Response('LLM error', { status: 502 });
}
// Pipe the Anthropic SSE stream directly to the client
// Workers won't buffer this; chunks pass through as they arrive
return new Response(anthropicResponse.body, {
headers: {
'content-type': 'text/event-stream',
'cache-control': 'no-cache',
'transfer-encoding': 'chunked',
'access-control-allow-origin': '*',
},
});
}
One issue we hit: Anthropic’s SSE format doesn’t match OpenAI’s exactly. If your frontend uses a library that expects OpenAI’s format (like openai-streams or most ChatGPT-style UIs), you need a transform step. We built a lightweight TransformStream that converts content_block_delta events to choices[0].delta.content format:
function anthropicToOpenAITransform(): TransformStream<Uint8Array, Uint8Array> {
const encoder = new TextEncoder();
const decoder = new TextDecoder();
let buffer = '';
return new TransformStream({
transform(chunk, controller) {
buffer += decoder.decode(chunk, { stream: true });
const lines = buffer.split('\n');
buffer = lines.pop() ?? '';
for (const line of lines) {
if (line.startsWith('data: ') === false) continue;
const data = line.slice(6);
if (data === '[DONE]') {
controller.enqueue(encoder.encode('data: [DONE]\n\n'));
continue;
}
try {
const event = JSON.parse(data);
if (event.type === 'content_block_delta' && event.delta?.text) {
const openAIChunk = {
choices: [{ delta: { content: event.delta.text }, finish_reason: null }]
};
controller.enqueue(encoder.encode(`data: ${JSON.stringify(openAIChunk)}\n\n`));
}
} catch {
// skip malformed chunks
}
}
}
});
}
This runs inside the TransformStream so it consumes minimal CPU per chunk. The JSON parsing here is on small event objects (50-200 bytes), not the full transcript payload that burned us earlier.
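The per-event mapping at the core of that transform can also be factored out as a pure function, which makes it easy to unit-test outside the stream. A sketch (the function and type names are ours, not part of the code above):

```typescript
// Pure sketch of the per-event mapping used in the transform above: takes the
// payload of one "data: ..." SSE line and returns an OpenAI-shaped chunk, or
// null for events that shouldn't be forwarded. Names are illustrative.

type OpenAIChunk = {
  choices: Array<{ delta: { content: string }; finish_reason: null }>;
};

export function convertAnthropicEvent(data: string): OpenAIChunk | null {
  try {
    const event = JSON.parse(data);
    if (event.type === 'content_block_delta' && event.delta?.text) {
      return {
        choices: [{ delta: { content: event.delta.text }, finish_reason: null }],
      };
    }
  } catch {
    // malformed JSON: drop the event, mirroring the stream transform
  }
  return null;
}
```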
Vectorize: When It’s Right and When It Isn’t
Cloudflare Vectorize is the native vector store for Workers. No network hop to an external service, native binding, and pricing that’s reasonable for smaller corpora.
Vectorize limits as of 2026:
| Limit | Value |
|---|---|
| Max vectors per index | 5,000,000 |
| Max dimensions | 1536 |
| Max metadata per vector | 10KB |
| Max query topK | 100 |
| Indexes per account | 100 |
For a RAG system with under 2M documents at 1536 dimensions, Vectorize works well. Here’s the retrieval pattern:
// src/retrieval.ts
export async function retrieveContext(
query: string,
tenantId: string,
env: Env
): Promise<string> {
// Generate embedding: outbound HTTP, doesn't consume CPU while waiting
const embeddingResponse = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Authorization': `Bearer ${env.OPENAI_API_KEY}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'text-embedding-3-small',
input: query.slice(0, 8000), // token limit guard
dimensions: 1536,
}),
});
const { data } = await embeddingResponse.json() as { data: Array<{ embedding: number[] }> };
const embedding = data[0].embedding;
// Query Vectorize with metadata filter
const results = await env.VECTORIZE_INDEX.query(embedding, {
topK: 8,
filter: { tenantId },
returnMetadata: 'all',
});
if (results.matches == null || results.matches.length === 0) return '';
// Reconstruct context from metadata
// Store full chunk text in metadata since R2 reads add latency
return results.matches
.map(match => match.metadata?.content as string)
.filter(Boolean)
.join('\n\n---\n\n');
}
The pattern of storing chunk text in vector metadata is a tradeoff. Vectorize’s 10KB metadata limit means you need to keep chunks under roughly 2,500 tokens. For most RAG use cases that’s fine. If you need longer context windows per chunk, store chunk IDs in metadata and fetch the full text from R2 or D1 as a second step.
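If you take the chunk-ID route, the second step should be one batched D1 lookup, not one query per match. A sketch, with an assumed chunks(id, content) table and a minimal structural type standing in for the D1 binding (both are our illustrations, not the system above):

```typescript
// Hypothetical second step for the chunk-ID pattern: Vectorize metadata holds
// only IDs, and a single batched D1 query fetches the full chunk text.
// Table and column names are assumptions.

// Minimal structural type for the D1 binding so the sketch stands alone.
type D1Like = {
  prepare(sql: string): {
    bind(...values: unknown[]): { all<T>(): Promise<{ results: T[] }> };
  };
};

// Pure helper: a parameterized IN (...) clause for N ids.
export function buildChunkLookup(ids: string[]): { sql: string; params: string[] } {
  const placeholders = ids.map(() => '?').join(', ');
  return {
    sql: `SELECT id, content FROM chunks WHERE id IN (${placeholders})`,
    params: ids,
  };
}

export async function fetchChunkTexts(db: D1Like, ids: string[]): Promise<string[]> {
  if (ids.length === 0) return [];
  const { sql, params } = buildChunkLookup(ids);
  const { results } = await db.prepare(sql).bind(...params).all<{ id: string; content: string }>();
  // SQL doesn't guarantee row order, so restore Vectorize's ranking.
  const byId = new Map(results.map(r => [r.id, r.content]));
  return ids.map(id => byId.get(id)).filter((c): c is string => c !== undefined);
}
```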
For more on this, read our guide on Fine-Tuning vs RAG vs Prompt Engineering.
Where Vectorize falls short
Metadata filtering in Vectorize is limited compared to Qdrant or pgvector (we covered those trade-offs in depth in our vector database comparison). You can filter on equality and range conditions, but not complex boolean queries across multiple metadata fields. We hit this on a compliance system where we needed to filter by tenantId AND (documentType = 'policy' OR documentType = 'regulation') AND effectiveDate > '2025-01-01'. Vectorize couldn’t express the OR condition in its filter syntax at the time. We moved that index to a self-hosted Qdrant instance and proxied it through Workers with an outbound HTTP call.
The performance difference for that workload: Vectorize was returning results in 15-25ms from the Workers context. The Qdrant call added 30-45ms of network latency to the same Workers function (Qdrant hosted on Fly.io in the same region). Acceptable for our use case, but not zero-cost.
D1 for Conversation State and Agent Memory
Workers has no persistent memory across requests by default. Every request starts fresh. For conversational AI systems, you need somewhere to store session state.
Cloudflare D1 is the native SQLite-compatible database for Workers. It binds directly to the Workers runtime, so reads don’t add network latency beyond the D1 query execution time.
Schema for a basic conversation memory store:
-- D1 schema for conversation state
CREATE TABLE IF NOT EXISTS conversations (
id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL,
created_at INTEGER NOT NULL, -- Unix epoch seconds
updated_at INTEGER NOT NULL,
summary TEXT, -- rolling summary when history gets long
token_count INTEGER DEFAULT 0
);
CREATE TABLE IF NOT EXISTS messages (
id TEXT PRIMARY KEY,
conversation_id TEXT NOT NULL REFERENCES conversations(id),
role TEXT NOT NULL CHECK (role IN ('user', 'assistant', 'tool')),
content TEXT NOT NULL,
created_at INTEGER NOT NULL,
token_count INTEGER DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_messages_conversation
ON messages (conversation_id, created_at DESC);
CREATE INDEX IF NOT EXISTS idx_conversations_tenant
ON conversations (tenant_id, updated_at DESC);
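Reading history against that schema is a single indexed query: newest-first off the index, then reversed into chronological order before it goes into the prompt. A sketch, with a minimal structural type standing in for the D1 binding and an illustrative 50-message cap:

```typescript
// Sketch of loading conversation history against the schema above. The index
// serves newest-first; the prompt wants chronological, so we reverse.
// The 50-message cap is an illustrative choice, not a D1 limit.

type MessageRow = { role: string; content: string; created_at: number };

// Minimal structural type for the D1 binding so the sketch stands alone.
type D1Like = {
  prepare(sql: string): {
    bind(...values: unknown[]): { all<T>(): Promise<{ results: T[] }> };
  };
};

// Pure helper: index order (newest first) -> prompt order (oldest first).
export function toPromptOrder(rows: MessageRow[]): MessageRow[] {
  return [...rows].reverse();
}

export async function loadRecentMessages(
  db: D1Like,
  conversationId: string,
  limit = 50
): Promise<MessageRow[]> {
  const { results } = await db
    .prepare(
      'SELECT role, content, created_at FROM messages WHERE conversation_id = ? ORDER BY created_at DESC LIMIT ?'
    )
    .bind(conversationId, limit)
    .all<MessageRow>();
  return toPromptOrder(results);
}
```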
One pattern we use to keep token counts sane: when a conversation exceeds 8,000 tokens of history, we run a summarization pass using a cheap model (Claude 3.5 Haiku) and replace the old messages with a single summary message. The summarization happens as a background task using Workers’ waitUntil API, so it doesn’t block the current response:
export default {
async fetch(request: Request, env: Env, ctx: ExecutionContext): Promise<Response> {
const session = await loadConversation(request, env);
const response = await streamCompletion(session, env);
// Summarization runs after response is sent
// ctx.waitUntil keeps the Worker alive until the promise resolves
if (session.tokenCount > 8000) {
ctx.waitUntil(summarizeAndCompressHistory(session.id, env));
}
return response;
}
};
waitUntil is the right tool for anything you want to happen after the response but that shouldn’t delay it: logging, cache warming, background indexing. Don’t block the response on work that doesn’t affect the answer.
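The compression pass itself is simple in shape: keep the most recent turns verbatim, summarize everything older with the cheap model, replace the old rows with one summary message. A sketch of the testable core (the LLM call and D1 writes are elided; function names and the keepLast value are ours):

```typescript
// Hypothetical core of summarizeAndCompressHistory: decide which messages get
// folded into the summary and which recent turns survive verbatim.
// keepLast = 4 is illustrative.

type Msg = { role: string; content: string };

export function splitForSummarization(
  messages: Msg[],
  keepLast = 4
): { toSummarize: Msg[]; kept: Msg[] } {
  const cut = Math.max(0, messages.length - keepLast);
  return { toSummarize: messages.slice(0, cut), kept: messages.slice(cut) };
}

// Transcript format fed to the summarization model.
export function formatTranscript(messages: Msg[]): string {
  return messages.map(m => `${m.role}: ${m.content}`).join('\n');
}
```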
The CPU Budget in Practice
The 30ms CPU limit sounds generous until you account for all the synchronous work in a typical request.
On one of our current production systems, I profiled the CPU usage breakdown for a RAG request:
| Operation | CPU Time |
|---|---|
| Request parsing + auth KV check | 0.8ms |
| Input validation + sanitization | 0.3ms |
| Embedding response parsing (JSON.parse on ~4KB) | 1.2ms |
| Context assembly from Vectorize results | 0.9ms |
| D1 query result processing | 0.6ms |
| Response streaming setup | 0.2ms |
| Total synchronous CPU | 4.0ms |
That leaves 26ms of headroom on a paid plan. Comfortable. But an earlier version of the same system did a re-ranking step in JavaScript during context assembly: recalculating cosine similarity across the returned vectors to apply a custom scoring formula. That single operation cost 12ms of CPU for a top-8 result set at 1536 dimensions. We moved it to Cloudflare Workers AI (which runs GPU-side) and the problem disappeared.
The lesson: anything involving vectors, matrix operations, or processing many documents synchronously should happen outside the Workers CPU budget. Workers handles I/O-bound work efficiently. CPU-bound work needs to be offloaded or eliminated.
What We Still Get Wrong
Workers and D1 read replicas behave differently across data centers. D1 writes go to the primary (currently US-based). Reads are served from the nearest replica. For a globally-deployed AI assistant, a user in Singapore who writes a message to D1 and then immediately reads it back can sometimes miss their own write if the read hits a replica that hasn’t synced yet. This is documented in Cloudflare’s D1 consistency notes, but we didn’t internalize the implication until a client in Dubai reported intermittent conversation history issues.
The fix: for conversation state that a user just wrote, read from the same request context’s in-memory cache rather than re-querying D1. We maintain a session object in memory for the duration of the request. It’s obvious in retrospect.
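One way to sketch that fix: a session object that lives for exactly one request, records writes locally, and serves reads from memory. (Class and method names are ours; the actual D1 write is elided.)

```typescript
// Sketch of the request-scoped read-your-writes fix: appends go to memory
// first (the D1 INSERT is elided), so reads in the same request always see
// this request's own writes regardless of replica lag.

type Msg = { role: string; content: string };

export class RequestSession {
  private messages: Msg[];

  constructor(fromD1: Msg[]) {
    this.messages = [...fromD1]; // snapshot loaded at request start
  }

  append(msg: Msg): void {
    this.messages.push(msg);
    // In production: also enqueue the D1 INSERT (e.g. via ctx.waitUntil).
  }

  history(): Msg[] {
    // Always includes this request's own writes, replica lag or not.
    return [...this.messages];
  }
}
```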
There’s also no good answer yet for long-running agent pipelines on Workers. A multi-step agent that calls three tools, waits for external APIs, and generates a final synthesis response can easily take 15-20 seconds of wall-clock time. Workers’ 30-second limit handles this most of the time, but not always. For longer agent pipelines, we run the agent on a separate Fly.io instance and use Workers as the API gateway in front of it. This adds ~20ms of latency for the initial routing, which is acceptable for use cases that already have multi-second response times.
Decision Matrix: When to Use Workers for AI
Not every AI workload belongs on Workers. Here’s what the choice looks like in practice:
| Workload | Workers-native | External + Workers gateway |
|---|---|---|
| LLM inference (streaming) | Yes | n/a |
| RAG, < 2M vectors, simple filters | Yes (Vectorize) | n/a |
| RAG, 5M+ vectors or complex filters | No | Qdrant/pgvector |
| Conversational AI, < 100k sessions | Yes (D1) | n/a |
| Conversational AI, high-volume | No | Postgres |
| Single-step agent (1-2 tool calls) | Yes | n/a |
| Multi-step agent (5+ tool calls) | Marginal | Fly.io / Railway |
| Document processing (OCR, heavy parsing) | No | Lambda / Cloud Run |
| Image generation / multimodal inference | No | Replicate / Modal |
| Webhook ingestion + async processing | Yes | n/a |
The row that surprises people: single-step agents work fine on Workers if each tool call is an outbound HTTP request. The CPU limit applies to synchronous JavaScript, not to wait time on external calls. An agent that calls a search API, processes the results (under 5ms), and generates a response fits comfortably inside the execution model.
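That execution model can be sketched as one cheap synchronous decision plus injected async calls; the heuristic, names, and dependency-injection style here are illustrative, not a prescription:

```typescript
// Sketch of a single-step agent that fits the Workers model: one synchronous
// tool decision (cheap CPU), one outbound call (no CPU while waiting), one
// generation. The keyword heuristic and function signatures are assumptions.

export function needsSearch(query: string): boolean {
  // Illustrative heuristic: route time-sensitive questions to search.
  return /\b(latest|today|current|news|price)\b/i.test(query);
}

export async function answerWithOptionalSearch(
  query: string,
  searchFn: (q: string) => Promise<string>, // injected tool (outbound HTTP in prod)
  generateFn: (prompt: string) => Promise<string> // injected LLM call
): Promise<string> {
  const context = needsSearch(query) ? await searchFn(query) : '';
  const prompt = context ? `Context:\n${context}\n\nQuestion: ${query}` : query;
  return generateFn(prompt);
}
```

Injecting the tool and LLM calls as functions keeps the decision logic unit-testable without network access.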
Setting Up the Project
For new AI projects on Workers, we start with this structure:
src/
index.ts # Main fetch handler
retrieval.ts # Vectorize + embedding logic
completion.ts # LLM streaming
memory.ts # D1 conversation state
auth.ts # KV-backed session tokens
types.ts # Env interface, shared types
wrangler.jsonc # Worker config with bindings
schema.sql # D1 schema (applied with wrangler d1 execute)
The wrangler config with all bindings:
// wrangler.jsonc
{
"name": "ai-assistant",
"main": "src/index.ts",
"compatibility_date": "2026-01-01",
"compatibility_flags": ["nodejs_compat"],
"kv_namespaces": [
{ "binding": "SESSIONS", "id": "..." }
],
"d1_databases": [
{ "binding": "DB", "database_name": "ai-conversations", "database_id": "..." }
],
"vectorize": [
{ "binding": "VECTORIZE_INDEX", "index_name": "documents" }
],
"vars": {
"ENVIRONMENT": "production"
}
}
Secrets (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.) go in via wrangler secret put, not in the config file. Cloudflare encrypts them at rest and they’re available as env.SECRET_NAME in the Workers runtime.
FAQ
Can I run a Python AI framework like LangChain on Cloudflare Workers?
Not directly. Workers runs JavaScript and TypeScript (plus WebAssembly) in V8 isolates; there is no production Python runtime for this kind of workload. LangChain has a TypeScript port (langchain-ai/langchainjs) that works in Workers, though not all integrations are available and some internal operations hit the CPU limit on large document sets. For Python-based AI code, you need a separate compute layer (Fly.io, Cloud Run, or a Lambda function) with Workers as the API gateway in front of it.
How does the Cloudflare Workers AI service compare to calling OpenAI directly?
Workers AI runs inference on Cloudflare’s own GPU infrastructure, accessible from Workers with zero network latency. The tradeoff: the model selection is more limited than OpenAI or Anthropic, and the most capable models (Claude 3.5 Sonnet, GPT-4o) aren’t available there. We use Workers AI for embedding generation and lightweight classification tasks where latency matters more than capability. For final generation and complex reasoning, we still call OpenAI or Anthropic.
What’s the right approach for handling large documents on Workers?
Split the work. Use Workers as an ingestion endpoint that receives the document, validates it, uploads it to R2 for storage, and then triggers a Cloudflare Queue message for processing. The actual chunking, embedding, and Vectorize upsert happen in a separate Queue Consumer Worker, which has its own CPU budget and can process asynchronously. This pattern keeps the ingestion request fast (under 500ms) and decouples it from the processing time of large documents.
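The Queue Consumer's chunking step, for instance, can be a simple fixed-window pass; a sketch (window and overlap sizes are illustrative assumptions):

```typescript
// Illustrative chunker for the Queue Consumer Worker: fixed-size windows with
// overlap. Sizes are assumptions; real pipelines usually respect sentence or
// section boundaries instead of raw character offsets.

export function chunkText(text: string, maxChars = 4000, overlap = 200): string[] {
  if (maxChars <= overlap) throw new Error('maxChars must exceed overlap');
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += maxChars - overlap) {
    chunks.push(text.slice(start, start + maxChars));
  }
  return chunks;
}
```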
How do I handle rate limiting for LLM APIs in Workers?
Use Cloudflare’s Rate Limiting API for per-user or per-tenant limits. For provider-side LLM rate limits, keep a fixed-window counter in KV: check the count before making the API call, increment it on each request, and return 429 if the tenant has exceeded their per-minute or per-day limit. The KV operation takes under 1ms, which is acceptable overhead for every LLM request.
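That counter can be sketched as a fixed-window count keyed by tenant and window start. The key scheme and limits below are illustrative, and KV's eventual consistency makes this an approximate limiter, which is fine for protecting provider quotas but not for billing:

```typescript
// Sketch of the per-tenant KV counter. Key scheme, TTL, and limits are
// illustrative assumptions.

// Pure helper: one key per tenant per time window.
export function windowKey(tenantId: string, nowMs: number, windowSeconds = 60): string {
  const windowStart = Math.floor(nowMs / 1000 / windowSeconds) * windowSeconds;
  return `rl:${tenantId}:${windowStart}`;
}

// Minimal structural type for the KV binding so the sketch stands alone.
type KVLike = {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
};

export async function checkAndCount(kv: KVLike, tenantId: string, limit = 60): Promise<boolean> {
  const key = windowKey(tenantId, Date.now());
  const count = Number((await kv.get(key)) ?? '0');
  if (count >= limit) return false; // caller returns 429
  // KV writes are eventually consistent, so concurrent requests can slightly
  // overshoot the limit; acceptable for quota protection.
  await kv.put(key, String(count + 1), { expirationTtl: 120 });
  return true;
}
```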
Does the 128MB memory limit cause problems for RAG systems?
Not usually. The in-memory footprint of a RAG request is: the query embedding (1536 floats = ~6KB), the top-8 retrieved chunks (maybe 8KB of text), and the response streaming buffer (a few KB at a time). That’s well under 128MB. The memory limit becomes a problem if you try to hold large conversation histories in memory, cache large document sets per-request, or load npm packages with large dependency trees. Keep request-level state small and externalize everything else to D1, KV, or R2.
Building an AI product and evaluating whether Cloudflare Workers fits your architecture? Book a 30-minute technical call and I’ll walk through the right infrastructure choices for your workload and team.