Technical
· 15 min read

LLM Structured Output: JSON Mode vs Function Calling

JSON mode, function calling, and Pydantic tool use compared: failure rates, latency costs, and when each method actually holds up in production AI systems.

Anil Gulecha
Ex-HackerRank, Ex-Google
TL;DR
  • JSON mode guarantees valid JSON syntax but not schema conformance. We see 8-12% schema violation rates when relying on JSON mode alone without a schema definition.
  • Function calling (tool use) with a strict schema reduces schema violation rates to under 0.3% on GPT-4o and Claude Sonnet, but adds 80-150ms of latency per call.
  • The hardest failure mode is silent schema drift: the model returns valid JSON that matches your schema but with semantically wrong values. No error is raised. Your downstream code gets bad data.
  • Nested schemas with more than 3 levels of depth fail at disproportionately higher rates across all models. Flatten your schemas.

Six months ago we had a production chatbot returning malformed JSON about 4% of the time. Not a parsing error, not an exception. Just a JSON object with an extra field the client code wasn’t expecting, silently dropping data for one in twenty-five responses.

We found it because a client’s analytics dashboard had been showing wrong numbers for three weeks. The root cause: the model was occasionally including a reasoning field we’d never defined in our schema, the client’s parser was using strict mode, and those responses were being silently discarded instead of counted.

That’s the kind of structured output failure that doesn’t show up in your error logs. It shows up in your business metrics, weeks later. (It’s also worth noting: structured output is one of the most effective LLM output guardrail mechanisms because it physically prevents the model from generating off-schema content.)

This post covers the three approaches to getting structured output from LLMs, what actually breaks with each, and the patterns that hold across our production deployments.

The Three Methods

Before getting into failure modes, it’s worth being precise about what each approach actually does.

JSON mode (available on OpenAI, Mistral, Groq, and several others) constrains the model's token sampling during decoding so that it can only emit tokens that form valid JSON. It guarantees syntactic correctness: no trailing commas, no unquoted keys, balanced brackets. It does not guarantee schema conformance. The model can return valid JSON with arbitrary fields, missing required fields, or wrong types.

# OpenAI JSON mode
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract product info. Return JSON with name, price, and category."},
        {"role": "user", "content": product_description}
    ]
)
# Guaranteed: valid JSON
# Not guaranteed: has name/price/category, price is a number, category is from your enum

Function calling / tool use defines a JSON Schema in your API call and forces the model to fill it. OpenAI calls this “tools”, Anthropic calls it “tool use”. Both work the same way: you define what the output object should look like, and the model’s generation is constrained to that schema during decoding. (If you’re building agents that use tool calls for action execution rather than data extraction, the tool schema design patterns are different but the schema discipline is the same.)

# OpenAI function calling
response = client.chat.completions.create(
    model="gpt-4o",
    tools=[{
        "type": "function",
        "function": {
            "name": "extract_product",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "price": {"type": "number", "minimum": 0},
                    "category": {"type": "string", "enum": ["electronics", "clothing", "food", "other"]},
                },
                "required": ["name", "price", "category"],
                "additionalProperties": false,
            }
        }
    }],
    tool_choice={"type": "function", "function": {"name": "extract_product"}},
    messages=[{"role": "user", "content": product_description}]
)

OpenAI’s Structured Outputs (launched August 2024) is a stricter version of function calling. When you set strict: true on a function definition, the model is constrained even more tightly: it cannot add properties not in the schema, all required fields must be present, and enum values are enforced at the token level. This is the most constrained option available today.

# OpenAI Structured Outputs (strict mode)
"function": {
    "name": "extract_product",
    "strict": True,  # The difference
    "parameters": { ... }
}

Anthropic doesn’t have an exact equivalent of strict mode, but when you use tool_choice to force a specific tool and define the schema with additionalProperties: false, the behavior in practice is very close.

Failure Rate Data

We instrumented four production systems over three months, logging every structured output call, the schema definition, and whether the returned object was schema-valid when parsed with Pydantic.

| Method | Model | Schema Violation Rate | Parse Error Rate | Notes |
| --- | --- | --- | --- | --- |
| JSON mode (no schema) | GPT-4o | 9.3% | 0.1% | Schema violations mostly: extra fields, wrong types |
| JSON mode (no schema) | Claude Sonnet | 11.7% | 0.05% | Higher extra-field rate than GPT-4o |
| Function calling | GPT-4o | 1.2% | 0% | Violations: mostly enum mismatch on ambiguous inputs |
| Function calling | Claude Sonnet | 2.1% | 0% | Higher violation rate than GPT-4o on nested schemas |
| Structured Outputs (strict) | GPT-4o | 0.2% | 0% | Violations only on deeply nested nullable fields |
| Tool use + additionalProperties: false | Claude Sonnet | 0.3% | 0% | Close to GPT-4o strict |

Two things stand out. First, JSON mode without a schema definition is worse than I expected. Nearly 10% of responses fail schema validation even when the system prompt explicitly describes the expected structure. The model follows the spirit of the instructions but not the letter.

Second, strict function calling on GPT-4o reaches 0.2% violation rate. That’s 1 in 500 responses. For most production applications, that’s acceptable. For high-stakes pipelines (compliance scoring, financial data extraction), it still isn’t.

Where Function Calling Breaks

The 0.2-2.1% violation rate with function calling is concentrated in specific patterns. These aren’t random failures. They’re predictable.

Deeply Nested Schemas

Schemas with more than 3 levels of nesting fail at 3-5× the rate of flat schemas.

from pydantic import BaseModel

# This is fine: 2 levels
class Address(BaseModel):
    street: str
    city: str
    country: str

class Contact(BaseModel):
    name: str
    address: Address

# This breaks at higher rates: 4 levels
# (sub-models ordered so each is defined before it's referenced)
class Product(BaseModel):
    ...  # fields elided

class DiscountRule(BaseModel):
    ...  # fields elided

class Discount(BaseModel):
    rule: DiscountRule   # 4 (this level degrades noticeably)
    amount: float

class PricingDetail(BaseModel):
    base: float
    discounts: list[Discount]  # 3

class LineItem(BaseModel):
    product: Product        # 1
    quantity: int
    pricing: PricingDetail  # 2

We’ve confirmed this with systematic testing across 500 schemas of varying depth. At 3 levels deep, violation rates stay near the baseline. At 4 levels, violation rates roughly double. At 5+ levels, they can reach 8-12%, approaching JSON mode territory.

The fix: flatten your schemas wherever possible. If you need deeply nested data, consider breaking it into multiple extraction calls. Two clean extractions are more reliable than one deeply nested one.
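One concrete way to flatten: collapse the nested sub-models into prefixed top-level fields. A sketch (the field names are illustrative):

```python
from pydantic import BaseModel

# 4 levels of nesting collapsed into one flat extraction schema.
# Prefixes preserve the grouping that the nesting used to express.
class LineItemFlat(BaseModel):
    product_name: str
    quantity: int
    pricing_base: float
    discount_rule: str      # was Discount.rule, 4 levels deep
    discount_amount: float

item = LineItemFlat(
    product_name="Widget",
    quantity=2,
    pricing_base=10.0,
    discount_rule="bulk",
    discount_amount=1.5,
)
```

Flattening does lose the ability to express a list of discounts per line item; when you genuinely need repeated nested structures, that's the case for splitting into multiple extraction calls instead.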

Nullable Fields With Enums

The combination of nullable + enum is a consistent trouble spot:

from typing import Literal, Optional

from pydantic import BaseModel

class ProductStatus(BaseModel):
    status: Optional[Literal["active", "discontinued", "pending"]] = None

When the model is genuinely uncertain about the status, it sometimes returns "unknown" or "n/a" instead of null. These are semantically reasonable choices that fail schema validation.

We handle this with a post-processing validator that maps known near-misses before Pydantic validation:

STATUS_COERCION = {
    "unknown": None,
    "n/a": None,
    "not specified": None,
    "unavailable": None,
    "active_discontinued": "discontinued",  # One model kept generating this
}

def coerce_nullable_enum(value: str | None, valid_values: list[str]) -> str | None:
    if value is None:
        return None
    if value in valid_values:
        return value
    return STATUS_COERCION.get(value.lower(), None)  # Default to null, not error

This coercion layer alone cuts our enum-related violations by about 70%.
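The coercion can also be wired directly into the model with a `mode="before"` validator, so near-misses are mapped before Pydantic's enum check runs. A sketch of that wiring:

```python
from typing import Literal, Optional

from pydantic import BaseModel, field_validator

STATUS_COERCION = {
    "unknown": None,
    "n/a": None,
    "not specified": None,
    "active_discontinued": "discontinued",
}
VALID_STATUSES = {"active", "discontinued", "pending"}

class ProductStatus(BaseModel):
    status: Optional[Literal["active", "discontinued", "pending"]] = None

    @field_validator("status", mode="before")
    @classmethod
    def coerce_status(cls, v):
        # Runs before type/enum validation, so near-misses become null
        # (or a mapped valid value) instead of raising a ValidationError.
        if v is None or v in VALID_STATUSES:
            return v
        return STATUS_COERCION.get(str(v).lower())
```

The advantage over a standalone function is that every code path that constructs a `ProductStatus` gets the coercion for free.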

Long Documents With Many Fields

Extraction schemas with 15+ fields applied to long documents (3,000+ tokens) fail at elevated rates. The model starts “losing” fields toward the end of the schema. Later-defined required fields are more likely to be missing or wrong than earlier-defined ones.

This is a context attention problem, not a schema problem. The model attends to the earlier parts of its output more than the later parts when generating structured responses.

Two mitigations:

  1. Put your most important fields first in the schema definition. Required, non-nullable business-critical fields go at the top. Optional metadata goes at the bottom.

  2. Split large schemas into focused extractions. For a 20-field schema, run two 10-field extractions on the same document. The overhead is roughly one extra API call (150-250ms, $0.001-0.003), but the violation rate on each half-schema drops back to baseline.
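Merging the two half-schema results back into one record is mechanical. A minimal sketch (the extraction calls themselves are elided; the sample field names are illustrative):

```python
def merge_extractions(parts: list[dict]) -> dict:
    """Combine results from multiple focused extractions over the same document."""
    merged: dict = {}
    for part in parts:
        for key, value in part.items():
            if key in merged and merged[key] != value:
                # Same field extracted twice with different values:
                # treat it as a failure rather than silently picking one.
                raise ValueError(f"Conflicting values for {key!r}")
            merged[key] = value
    return merged

record = merge_extractions([
    {"name": "Widget", "price": 9.99},        # first half-schema extraction
    {"category": "other", "in_stock": True},  # second extraction
])
```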

The Silent Failure: Semantic Drift

All the metrics above measure schema conformance. The harder problem is semantic correctness: the model returns valid, schema-conformant JSON where the values are wrong.

No error is raised. Pydantic validation passes. Your application gets bad data.

This came up on a contract clause extraction pipeline. The schema had a risk_level: Literal["low", "medium", "high"] field. The model consistently returned schema-valid values, but on ambiguous clauses, different runs would return different risk levels. The issue wasn’t syntax. It was that the model’s notion of “medium risk” didn’t match the client’s legal team’s definition.

You can’t catch this with schema validation. You need semantic validation.

from typing import Literal

from pydantic import BaseModel, model_validator

class ContractClause(BaseModel):
    clause_text: str
    risk_level: Literal["low", "medium", "high"]
    justification: str  # Force the model to explain its reasoning

    # A model_validator runs after all fields are populated. (A field_validator
    # on risk_level couldn't see justification, because fields validate in
    # declaration order and justification comes later.)
    @model_validator(mode="after")
    def validate_risk_consistency(self):
        # If the model says high risk but gives a short justification,
        # something is off. Short justification = model is uncertain.
        if self.risk_level == "high" and len(self.justification) < 50:
            raise ValueError("High-risk classification requires substantive justification")
        return self

Adding a justification field serves two purposes. It forces the model to articulate its reasoning, which improves accuracy (this is just chain-of-thought embedded in structured output). And it gives you a signal to detect low-confidence outputs: short justifications correlate with uncertain classifications across our datasets.

We can’t quantify semantic drift with a single number because it’s application-specific. What we’ve found: adding a justification or confidence field to any classification output catches 30-40% of semantically wrong classifications before they propagate.

Latency Cost of Structured Output

Function calling isn’t free. Here’s what it costs on average across our production systems (measured over 30 days):

| Method | Schema Fields | Avg Latency Added | p95 Latency Added |
| --- | --- | --- | --- |
| JSON mode | N/A | ~0ms (decoding constraint) | ~10ms |
| Function calling | 5-10 fields | +80ms | +150ms |
| Function calling | 11-20 fields | +120ms | +220ms |
| Function calling | 21+ fields | +180ms | +350ms |
| Structured Outputs (strict) | 5-10 fields | +90ms | +160ms |
| Multi-call extraction | N/A | +250-400ms total | +500ms |

The latency cost is real but manageable for most chatbot and extraction workloads. A 120ms overhead on a 1.5 second response is an 8% increase, which most users don't notice.

Where it gets painful is high-throughput pipelines running 10,000+ extractions per hour. At that scale, the latency cost doesn’t matter (async processing absorbs it), but the cost overhead of function calling vs JSON mode does. Function calling typically processes fewer tokens per call, so it’s actually cheaper for extraction tasks where the output would otherwise include explanatory prose.

We’ve never chosen JSON mode over function calling for cost reasons. The schema violation rate difference (9% vs 0.2%) makes function calling the correct choice on every workload except rapid prototyping.

When JSON Mode Is Actually the Right Choice

Despite everything above, there are legitimate cases for JSON mode:

Exploratory extraction where schema is evolving. When you’re building a new extraction pipeline and the output schema changes every day, strict function calling adds friction. Define the rough shape in a system prompt and use JSON mode while you’re iterating. Migrate to strict function calling when the schema stabilizes.

Conversational response with partial structure. Some chatbot responses are mostly free text with structured metadata at the end. For example: a customer support response that needs a sentiment tag and an escalation flag, but the main response body is freeform prose. Mixing JSON mode with a short, well-defined schema in the prompt can handle this cleanly without the overhead of defining a full function schema.

Models that don’t support function calling. Some open-source models and hosted endpoints only support JSON mode. In those cases, you compensate with a stricter system prompt and a Pydantic validation layer that retries on failure.

The one thing I’d push back on: “we use JSON mode because it’s simpler” is not a good enough reason in production. The 8-12% schema violation rate will find you, usually when you’re not looking.

The Retry Pattern

Even with strict function calling, you get occasional violations (0.2-2.1%). The standard response is a retry with error feedback injected:

import logging

from pydantic import BaseModel, ValidationError

logger = logging.getLogger(__name__)

async def extract_with_retry(
    document: str,
    schema: type[BaseModel],
    max_retries: int = 2,
) -> BaseModel | None:

    for attempt in range(max_retries + 1):
        # call_with_function_calling: your provider wrapper, not shown here
        response = await call_with_function_calling(document, schema)
        
        try:
            result = schema.model_validate(response)
            return result
        except ValidationError as e:
            if attempt == max_retries:
                # Log and return None. Don't raise, don't return garbage.
                logger.error("Extraction failed after %d retries: %s", max_retries, e)
                return None
            
            # Inject the validation error into the next attempt
            error_feedback = format_validation_errors(e)
            document = f"{document}\n\n[Previous attempt returned: {response}. Validation errors: {error_feedback}. Please fix these.]"
    
    return None

A few notes on this pattern. First, the error feedback injection actually works. The model can read its previous output and the validation errors and usually corrects the issues on the second attempt. We see about 80% of first-attempt failures resolved on retry.

Second, cap retries at 2. A third retry almost never helps and doubles the latency cost for that request. If two attempts fail, the input is probably ambiguous or outside the expected distribution. Log it, return None, and investigate the pattern.

Third, never silently return invalid data. The temptation is to return the partially-valid object rather than None. Don’t. Downstream code that expects valid data will fail in strange ways. An explicit None is a bug you can see; a subtly wrong data object is a bug that hides for weeks.
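The retry code above calls a `format_validation_errors` helper without defining it. A minimal version, flattening Pydantic's error list into a string the model can act on:

```python
from pydantic import BaseModel, ValidationError

def format_validation_errors(exc: ValidationError) -> str:
    lines = []
    for err in exc.errors():
        # Each error carries a field path ("loc") and a human-readable message.
        loc = ".".join(str(part) for part in err["loc"]) or "<root>"
        lines.append(f"{loc}: {err['msg']}")
    return "; ".join(lines)

# Example: trigger a ValidationError to see the formatted output.
class Product(BaseModel):
    name: str
    price: float

try:
    Product.model_validate({"name": "Widget", "price": "free"})
    feedback = ""
except ValidationError as e:
    feedback = format_validation_errors(e)
```

Keeping the feedback short matters: a wall of raw Pydantic output eats context and dilutes the correction signal.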

The Part We Don’t Have Good Answers For

Schema versioning in production. We have clients running extraction pipelines where the source documents evolve (new fields appear, terminology shifts, edge cases accumulate) and the schema has to evolve with them. Updating the schema breaks historical records. Not updating it means the model makes up values for fields that don’t map cleanly to the new document format.

We’ve tried maintaining versioned schemas (v1, v2, v3 of the same extraction) with a routing layer that selects based on document age and type. It works, but it’s operationally messy. Every time you add a new version, you have to validate that older document types still extract correctly against older schemas.

Nobody has solved this cleanly. If you have a solution that works at scale, I’d genuinely like to hear it.

FAQ

When should I use JSON mode vs function calling for a production AI chatbot?

Function calling for anything in production where schema correctness matters. JSON mode is fine for prototyping and for cases where your schema changes daily. The schema violation rate difference (9-12% vs 0.2-2.1%) justifies the extra setup cost of defining a function schema. The one exception: if you’re using a model that doesn’t support function calling and can’t switch, compensate with a stricter system prompt and a Pydantic validation layer with retries.

How do I handle streaming responses with structured output?

Streaming and strict structured output don’t mix well. When you need the full JSON object before you can validate it, streaming gives you nothing until the final token anyway. For function calling, disable streaming (stream=False) and accept the slightly higher time-to-first-token. For JSON mode, you can stream and buffer the tokens, then parse the complete JSON on the final chunk. But streaming JSON mode with schema validation means you can’t catch schema violations until the very end, which is the same as not streaming.
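The buffer-then-parse approach for JSON mode is just accumulation; validation can only happen once the stream ends. A sketch (the chunk boundaries are illustrative of how a streaming API might split the text):

```python
import json

def assemble_streamed_json(chunks) -> dict:
    # Buffer every streamed text delta; nothing is parseable until the end.
    buffer = "".join(chunks)
    return json.loads(buffer)

obj = assemble_streamed_json(['{"name": "Wid', 'get", "pri', 'ce": 9.99}'])
```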

What’s the right way to handle extraction failures in a high-volume pipeline?

Log the failure, return None or a failure sentinel, and add the failed document to a review queue. Don’t raise exceptions in the hot path. Don’t retry more than twice. Don’t silently substitute default values. The failure queue should be monitored weekly: look for patterns in what’s failing (document type, length, source), because systematic failures usually indicate a schema design problem, not a model problem.
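A minimal in-memory version of that failure queue (a production system would back this with a database or message queue; all names here are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class ExtractionFailure:
    doc_id: str
    doc_type: str
    error: str
    timestamp: float = field(default_factory=time.time)

failure_queue: list[ExtractionFailure] = []

def record_failure(doc_id: str, doc_type: str, error: str) -> None:
    # Hot path does nothing but append; pattern analysis happens offline, weekly.
    failure_queue.append(ExtractionFailure(doc_id, doc_type, error))

record_failure("doc-123", "invoice", "missing required field: total")
```

Storing `doc_type` alongside the error is what makes the weekly review useful: systematic failures cluster by document type long before they show up in aggregate rates.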

Does schema complexity affect cost as well as latency?

Yes, but in ways that cut both directions. More complex schemas add tokens to the function definition in every request (these are input tokens, billed accordingly). A 20-field schema definition might add 200-400 input tokens per call. At GPT-4o pricing ($2.50/M input tokens), that's roughly $0.001 per call, or about a dollar per 1,000 extra calls. Negligible for most workloads. But complex schemas also tend to produce shorter output than equivalent free-text extraction, because the model isn't generating explanatory prose. For high-volume extraction, function calling is often cheaper than JSON mode + free-text output.
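The arithmetic behind that estimate, as a quick sketch:

```python
def schema_overhead_cost(extra_tokens_per_call: int, calls: int,
                         price_per_million: float = 2.50) -> float:
    """Input-token cost (USD) added by a larger schema definition."""
    return extra_tokens_per_call * calls * price_per_million / 1_000_000

# 300 extra schema tokens, 1,000 calls, at $2.50 per million input tokens
cost = schema_overhead_cost(extra_tokens_per_call=300, calls=1000)
```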

How does structured output work with open-source models through vLLM or Ollama?

vLLM supports JSON mode and a subset of function calling through its OpenAI-compatible API. Ollama (recent versions) supports JSON mode via format: "json". Neither supports the full strict-mode function calling that OpenAI and Anthropic offer. In practice: open-source models with JSON mode show higher violation rates than commercial models (we see 15-20% on Llama 3.1 70B vs 9-12% on GPT-4o JSON mode). To compensate, run a heavier Pydantic validation pass with retries, or use Outlines (a constrained decoding library that enforces schema at the token level for locally-run models).


Building AI pipelines where data reliability matters? Structured output architecture is one of the first things we get right, because downstream failures are the hardest to debug. Book a technical call and we can walk through the extraction pattern that fits your use case.
