
How We Built an Advanced RAG System for Documents

900 legal contracts, naive RAG at 0.61 precision. Here's what parent-child chunking, hybrid search, and reranking actually fixed. Build story.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • Naive 512-token chunking failed on legal contracts because clauses don't fit in 512 tokens. Parent-child chunking (small chunks for retrieval, full sections as context) took precision from 0.61 to 0.74 on its own.
  • Hybrid search combining BM25 and vector search, then reranked with Cohere, pushed retrieval precision from 0.61 to 0.87 on our golden test set.
  • Tables in PDFs are their own problem. PyMuPDF strips row/column structure. We routed table queries to a separate pdfplumber extraction pipeline.
  • The hardest debugging session: queries about payment terms kept returning jurisdiction clauses. Both contained the word 'applicable'. Fixed with parent-level metadata filters.
  • Final latency: 680ms median for single-contract queries, 2.1s for cross-contract aggregations across 900+ documents.

The first demo went badly.

The client asked: “What are the payment terms across our vendor contracts?” Simple question. We’d ingested 900+ contracts, chunked at 512 tokens with 50-token overlap, embedded with text-embedding-3-small, stored in pgvector. All the right pieces, at least on paper. The system returned four different answers. Two were from termination clauses. One came from a contract that wasn’t even a vendor agreement.

That’s when I realized we’d been treating document intelligence like a general knowledge base problem. It’s not the same thing.

What the Client Actually Needed

They had a legal operations team managing contracts with about 60 active vendors. Most ran 15-40 pages each. The team spent 4-6 hours per week searching manually for specific clauses: payment terms, liability caps, auto-renewal dates, governing law, SLA commitments.

The ask was simple: “Can we ask questions about our contracts in plain English and get accurate answers?” They set a concrete bar: 85%+ accuracy on a test set they’d build from 100 representative questions. Below that, they wouldn’t deploy.

We started with what seemed like a reasonable stack: PDF extraction with PyMuPDF, 512-token overlapping chunks, text-embedding-3-small embeddings, pgvector for retrieval, top-5 chunks as context, GPT-4o for generation. Got it running in three days. That’s when the demo happened.
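
In code, the first version was roughly this. A sketch, not our production pipeline: the helper names are ours for illustration, and the real chunker counted tokens with tiktoken rather than splitting on whitespace.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    # Fixed-width overlapping windows; words stand in for tokens here.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(texts: list[str]) -> list[list[float]]:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

def top_5_chunks(cur, query: str) -> list[str]:
    # pgvector cosine-distance search over a hypothetical chunks(content, embedding) table.
    [q] = embed([query])
    literal = "[" + ",".join(str(x) for x in q) + "]"
    cur.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
        (literal,),
    )
    return [row[0] for row in cur.fetchall()]
```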

Our precision on their 100-question golden test set: 0.61. Nowhere near the 85% bar they’d set.

Why Naive RAG Failed on Contracts

Legal documents have structure that matters at a different granularity than raw text.

A payment clause might span pages 8-9, covering 800 tokens total. Split it at 512 tokens and you’ve cut it mid-clause. The first chunk has “payment terms: net 60” but not the exceptions. The second chunk has the exceptions but not the framing. Retrieve either alone and you get a partial answer. Feed both to the LLM and it sometimes decides they’re contradictory, because the language across the break is ambiguous without the surrounding context.

Cross-references made it worse. Legal text constantly says “as defined in Section 3.2” or “subject to the provisions of Schedule B.” If Section 3.2 is in a different chunk from the reference, the retrieved passage is incomplete by design. The LLM can’t follow the reference.

The vocabulary problem was the most subtle. Legal text reuses words constantly: “applicable,” “party,” “agreement,” “obligations.” In a 512-token chunk, any query about payment terms that contains the word “applicable” will retrieve chunks from every other clause that also contains “applicable.” That’s why queries about payment terms kept pulling jurisdiction language: both clause types lean on nearly identical boilerplate around phrases like “applicable governing law.”

Anil covered the baseline RAG stack we start with in detail. The problem here wasn’t the stack’s components. It was that the default configuration was never designed for documents where semantic similarity and keyword matching both matter, at different levels of granularity.

Parent-Child Chunking: What Actually Fixed It

We rebuilt the chunking strategy entirely.

Instead of uniform 512-token windows, we switched to parent-child chunking:

  • Child chunks: 150-200 token windows with 50-token overlap, used only for embedding and vector retrieval.
  • Parent chunks: full contract sections, extracted by detecting heading and paragraph boundaries. Average size 600-1,200 tokens, roughly one clause per parent.

When retrieval finds a child chunk, we return the entire parent chunk to the LLM as context. The child chunk gives precision for the similarity match. The parent chunk gives enough surrounding text for the LLM to actually reason about the clause.
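
A minimal sketch of the mapping, assuming sections have already been extracted; the Chunk shape and the parents store are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # small window, used only for embedding and retrieval
    parent_id: str  # links the child back to its full contract section

def make_children(section_text: str, parent_id: str,
                  size: int = 175, overlap: int = 50) -> list[Chunk]:
    words = section_text.split()
    step = size - overlap
    return [Chunk(" ".join(words[i:i + size]), parent_id)
            for i in range(0, len(words), step)]

def context_for(matched: Chunk, parents: dict[str, str]) -> str:
    # The LLM never sees the child; it gets the whole section instead.
    return parents[matched.parent_id]
```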

This required reliably detecting section boundaries from PDFs that weren’t always consistent. Some contracts used numbered sections (1.1, 1.2.3). Others used titled sections (PAYMENT TERMS, CONFIDENTIALITY). We built a section detector that combined four signals: font size changes in the PDF metadata, bold text patterns, capitalization heuristics, and the table of contents when present.
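
The detector boiled down to scoring text spans on those signals, roughly like this. A simplified sketch using PyMuPDF’s span metadata; the thresholds are illustrative, and the real version also cross-checked the table of contents:

```python
import re
import fitz  # PyMuPDF

NUMBERED = re.compile(r"^\d+(\.\d+)*[.\s]")  # matches "1.1", "1.2.3", ...

def looks_like_heading(span: dict, body_size: float) -> bool:
    text = span["text"].strip()
    if not text:
        return False
    bold = bool(span["flags"] & 16)            # PyMuPDF sets bit 4 for bold
    bigger = span["size"] > body_size * 1.15   # illustrative threshold
    shouty = text.isupper() and len(text) > 3  # "PAYMENT TERMS"
    return bool(NUMBERED.match(text)) or shouty or (bold and bigger)

def heading_candidates(path: str, body_size: float = 10.0):
    doc = fitz.open(path)
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):   # image blocks have no lines
                for span in line["spans"]:
                    if looks_like_heading(span, body_size):
                        yield page.number, span["text"].strip()
```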

The section detector failed on about 8% of contracts. Those were almost all scanned-from-paper documents rather than born-digital PDFs. For those, we fell back to a semantic section splitter using a fine-tuned passage segmentation model from LlamaIndex. Not perfect, but better than fixed-width chunking on documents with no reliable heading structure.

After parent-child chunking: precision up to 0.74.

Hybrid Search and Reranking

Vector search is good at meaning. BM25 is good at exact terms. Legal documents need both: find chunks semantically about payment, but also containing the specific phrase “net 60” or the exact vendor name as written.

We added BM25-style keyword search via PostgreSQL’s full-text search alongside pgvector. The two searches ran in parallel, each returning its top 20, then merged with reciprocal rank fusion. Simple algorithm, surprisingly effective. The merged top 40 initially went to Cohere’s reranker.
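
Reciprocal rank fusion is only a few lines; k = 60 is the constant from the original RRF paper:

```python
def rrf_merge(bm25_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    # Each list contributes 1 / (k + rank) per document; sum, then sort descending.
    scores: dict[str, float] = {}
    for ranking in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```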

The reranker runs as a cross-encoder: it scores each query-passage pair together, which is more accurate than the dot-product comparison a vector index uses. The trade-off is latency. Reranking 40 passages adds 150-200ms per query.
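
The rerank call itself is small. A sketch against Cohere’s Python SDK; the model name is one of their rerank models, pinned here only for illustration:

```python
import cohere

co = cohere.Client()  # reads CO_API_KEY from the environment

def rerank(query: str, passages: list[str], top_n: int = 5) -> list[str]:
    # Cross-encoder scoring of every (query, passage) pair in one call.
    resp = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=passages, top_n=top_n)
    return [passages[r.index] for r in resp.results]
```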

We ended up reranking only the top-10 from the fusion step rather than all 40. Precision dropped slightly (0.87 vs 0.89 with full reranking) but query latency dropped by 140ms. For single-contract queries, 680ms felt responsive. Cross-contract aggregations across the full 900-document corpus ran at 2.1s median, which was inside the client’s acceptable threshold.

After hybrid search and reranking: precision 0.87 on the golden test set. Past the 85% bar.

Tables in PDFs: Still Partially Unsolved

The last major failure mode was contracts with payment schedules formatted as tables.

PyMuPDF extracts tables as flat text, stripping row/column structure. A payment schedule with three columns (Milestone, Amount, Due Date) comes through as a single string with all values concatenated. An embedding of that string doesn’t preserve the row relationships. Queries about specific milestone payments returned inconsistent answers. The model couldn’t tell which amount corresponded to which milestone from the flat text representation.

We built a table detection step: identify page regions that looked like tables by spacing patterns and PyMuPDF’s block layout metadata, then re-extract those regions with pdfplumber, which handles table structure better for most PDF formats. Table queries got routed to a separate extraction pipeline using GPT-4o with a table-aware prompt that explicitly described the row-column structure before asking the question.
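
The re-extraction step looked roughly like this. A sketch: the bbox comes from the detection pass and is assumed to already be in pdfplumber’s (x0, top, x1, bottom) coordinates.

```python
import pdfplumber

def extract_table_text(path: str, page_no: int, bbox: tuple[float, ...]) -> str:
    # Re-extract a detected table region with pdfplumber, which keeps
    # row/column structure, then serialize rows for the table-aware prompt.
    with pdfplumber.open(path) as pdf:
        region = pdf.pages[page_no].within_bbox(bbox)
        rows = []
        for table in region.extract_tables():
            for row in table:
                rows.append(" | ".join(cell or "" for cell in row))
    return "\n".join(rows)
```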

The routing logic was simple: if the query contained financial amounts, dates, or milestone keywords, run table extraction alongside the vector search, then merge both context streams before generation.
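
In practice, a regex over the query was enough. The patterns below are illustrative, not our production list:

```python
import re

TABLE_HINTS = re.compile(
    r"\$[\d,]+|milestone|installment|payment schedule|due date|invoice",
    re.IGNORECASE,
)

def needs_table_pipeline(query: str) -> bool:
    # Cheap router: false negatives still hit the normal vector path.
    return bool(TABLE_HINTS.search(query))
```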

I don’t have clean precision numbers for table queries specifically. We didn’t have enough table-specific examples in the golden set to measure separately. That’s on the list for the next iteration.

What the Numbers Looked Like

After the full stack (parent-child chunking, hybrid search, reranking, table routing):

  • Retrieval precision (100-question golden set): 0.61 naive → 0.87 final
  • Median query latency (single contract): 2.9s → 680ms
  • Median query latency (cross-contract aggregations): 2.1s (not in original scope, added later)
  • Queries returning results from the wrong contract: 22% → 3%

The 3% false positive rate still bothers me. I don’t have a great solution for it. The remaining errors are mostly cases where two different contracts use nearly identical boilerplate for different purposes, and the system retrieves the wrong document. Adding contract-level metadata filters (auto-apply vendor name and contract date as hard filters when the query mentions a specific vendor) helps, but requires knowing which vendor the user is asking about, and that’s not always explicit in the query.
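
When the vendor is explicit, the filter is just a WHERE clause in front of the vector ordering. A sketch over a hypothetical chunks table with vendor metadata:

```python
def search_vendor(cur, vendor: str, query_vec: list[float], limit: int = 10):
    # Hard metadata filter applied before the cosine-distance ordering.
    literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    cur.execute(
        "SELECT content FROM chunks"
        " WHERE vendor = %s"
        " ORDER BY embedding <=> %s::vector"
        " LIMIT %s",
        (vendor, literal, limit),
    )
    return [row[0] for row in cur.fetchall()]
```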

That’s an honest “we’re still working on it.”

FAQ

At what document volume does a custom RAG build make more sense than an off-the-shelf tool?

Roughly 200-300 documents is where custom starts making sense, assuming you have specific query types (clause extraction, cross-document comparison, metadata-filtered search) rather than general Q&A. Below that, a tool like Notion AI or ChatGPT document upload can handle the volume. Above 500 documents with compliance constraints that prevent third-party data sharing, custom is usually the right answer. The other hard trigger: when your query types don’t fit the tool’s assumption about what questions users ask. Most tools are built for “summarize this document,” not “compare payment terms across all contracts expiring in Q4.”

How much does building a RAG system for documents cost?

For a project at this scope (900 contracts, custom chunking, hybrid retrieval, table extraction), the build ran as a fixed-bid engagement in our medium range ($15-25K). Ongoing hosting costs under $200/month: a managed Postgres instance with pgvector, Cohere reranker API calls, and LLM API costs at roughly $0.002-0.005 per query depending on document length. Larger corpora (10,000+ documents) add indexing time and storage but minimal ongoing cost per query.

When does RAG work well for documents, and when doesn’t it?

RAG works well when your documents have consistent structure, your queries are specific and bounded, and you can define a golden test set before building. It struggles when documents are highly varied in format, when more than 30% are scanned images (OCR quality becomes the ceiling), or when queries require multi-hop reasoning across many documents simultaneously. If you’re mostly asking “find me the exact clause about X in contract Y,” RAG is the right tool. If you’re asking “compare how these 20 contracts differ on indemnification,” you probably need a hybrid extraction + aggregation approach, not pure retrieval.

How is this different from uploading contracts to ChatGPT?

Volume, precision, and compliance. ChatGPT document upload is reasonable for single-document Q&A on shorter files. For 900+ contracts with cross-document queries, a custom retrieval system with proper indexing, evaluation, and source attribution is the correct approach. More importantly: most legal operations teams can’t send contracts through a third-party consumer product for compliance and confidentiality reasons. On-premises or private-cloud deployment is a hard requirement for most enterprise contract workflows.


If you’re evaluating a document intelligence build for your team, the 30-minute version of this conversation is: what’s your document volume, what queries matter most, and what’s the acceptable error rate for your workflow. Book a call and we’ll tell you honestly whether it’s a 2-week project or a 2-month one.

#rag development · #custom ai solution · #document intelligence · #vector search · #hybrid search · #llm · #case study · #pgvector


Written by Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.
