
How We Built the Content Engine That Powers Fertilia Health

From 0 to 5,000 weekly Google impressions in 5 weeks. The architecture, the wrong turns, and the physician-review queue that made it work.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • Built a daily-publishing content engine for an Indian fertility practice. 0 to 5,000 weekly Google impressions in 5 weeks, $0 ad spend, 109 consultation clicks
  • First version was a one-shot LLM generator. Medical accuracy was 64%. We tore it down and rebuilt around a quality-gate pipeline that catches structural issues before the doctor sees them
  • The hardest engineering problem wasn't generation. It was the physician-review queue: how do you give a busy clinician 15 minutes per post and have her output land in production
  • Runs on free-tier Cloudflare Workers + Resend + Google Search Console API. Marginal compute cost is $0. The system pulls its own performance data and adjusts topic selection weekly
  • Five weeks in, ChatGPT started citing the posts as sources. We didn't optimize for that. Well-structured medical content with citations is what LLMs reach for

The first auto-published post from the Fertilia Content Engine hit Google’s index on a Wednesday. By that Friday it was ranking position 8 for “PCOS breakfast ideas” in India. By the end of week two it was at position 3. By the end of week five, top three for the core PCOS keywords, with 5,000 weekly impressions across the site.

The system that produced it had been running for 22 days. Dr. Suganya, the OB-GYN whose practice it serves, spent about 15 minutes a day reviewing drafts. Nothing else about the operation required her time.

This is the build story. What we tried first that didn’t work, the architecture that actually shipped, and the parts I’m still iterating on six months later. If you want the surface results, the case study page has the full numbers.

If you want the build-vs-buy framing for whether something like this makes sense for your business, Venkat covered that in a separate post. This post is the engineering view.

The Problem

A women’s health practice in Coimbatore. The doctor had built a 16,000-follower Instagram presence over years of authentic clinical content. But Instagram reach is rented (the algorithm decides who sees what), and meanwhile thousands of women per month were searching Google for exactly the topics she could speak to (PCOS diet, fertility after 35, ovulation patterns, postpartum recovery) and finding competitors instead.

The constraints were specific:

  1. Daily publishing cadence. Anything less and we wouldn’t compound fast enough to compete with established practices in a year.
  2. Medical accuracy. A single inaccurate health claim in a doctor’s name is a real problem. Generic AI health content is worse than no content.
  3. The doctor had no time for a content workflow. She has patients. Whatever we built had to need her for clinical judgment only, not for fixing structural problems a checklist could catch.
  4. Budget had to fit a single-practice clinic, not a venture-backed B2B SaaS company.

What We Tried First (That Didn’t Work)

The first version of the system was the obvious one. It was wrong, and tearing it down was the right call.

Wrong turn 1: One-shot LLM generation with a medical-claims lint.

Day three of the build. We had a prompt that took a topic and produced a 1,500-word draft, and a follow-up prompt that scored the draft for medical accuracy against a hand-written claims list. Output drafts were cleanish. Accuracy on a labeled ground-truth set: 64%.

That’s not a content system. That’s a draft generator with a janky checker bolted on.

The failures were specific. Confident-sounding medical claims with no citation. Therapy recommendations the doctor didn’t endorse. Cultural mismatches (recommending diet patterns that don’t exist in Indian households). And the lint pass missed most of these because it was scoring the same generation pass that had produced the errors. You can’t have one model grade its own homework reliably.

We rebuilt around a multi-stage pipeline where each stage has a different job and different prompts. The stages catch each other’s failures. That took the medical-accuracy score from 64% on the first version to 92% on the third revision. The remaining 8% is what the physician review queue is for.

Wrong turn 2: Building our own keyword tool because the public APIs felt slow.

We spent half a day scraping and rank-aggregating keyword data ourselves. It was a vanity engineering choice. Google’s Keyword Planner API returns volumes that are good enough for topic prioritization, and the Search Console API gives the actual rank-position-and-impression data once posts are live. We threw away the scraping code and used both as designed. Lost half a day, gained a defensible data pipeline.

The Architecture That Actually Shipped

The pipeline has seven stages. The boring stages are the ones that matter most.

Topic Selection → Brief Generation → Draft Generation → Quality Gates →
Physician Review Queue → Publishing → Performance Feedback Loop

Stage 1: Topic selection. Pulls keyword data from Google’s Keyword Planner API for the practice’s specialty cluster. Filters by search volume (minimum threshold, varies by niche), keyword difficulty, and topical relevance to the doctor’s services. Drops topics that are already covered (vector similarity search against existing posts, using text-embedding-3-small). Surfaces the top 20 candidates per week. Output is a ranked CSV the system reads from.
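The dedup check is the only non-obvious part of this stage. Here’s a minimal sketch of the idea, with toy vectors standing in for text-embedding-3-small output (function names and the threshold are illustrative, not our production code):

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def filter_covered(candidates, existing_embeddings, threshold=0.85):
    """Drop candidate topics too similar to an already-published post.
    candidates: list of (topic, embedding) pairs."""
    fresh = []
    for topic, emb in candidates:
        if all(cosine(emb, e) < threshold for e in existing_embeddings):
            fresh.append(topic)
    return fresh
```

A threshold like 0.85 is a starting point, not a recommendation; medical topics cluster tightly, so it needs tuning per niche or the queue starves.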

Stage 2: Brief generation. Each topic gets a brief: target keyword, suggested H2s based on the SERP top 10, a ‘must include’ list of citations from medical sources (PubMed, Cochrane, ICMR for India-specific content), and a ‘must avoid’ list of contraindications. The brief is what feeds the draft prompt.
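For concreteness, the brief is just a structured record that compiles into a prompt block for the next stage. A sketch under that assumption (field names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class Brief:
    target_keyword: str
    suggested_h2s: list                               # drawn from the SERP top 10
    must_include: list = field(default_factory=list)  # citations: PubMed, Cochrane, ICMR
    must_avoid: list = field(default_factory=list)    # contraindications

    def to_prompt_block(self) -> str:
        # Flatten the brief into the text block the draft prompt receives.
        lines = [f"Target keyword: {self.target_keyword}"]
        lines += [f"H2: {h}" for h in self.suggested_h2s]
        lines += [f"MUST CITE: {c}" for c in self.must_include]
        lines += [f"MUST AVOID: {a}" for a in self.must_avoid]
        return "\n".join(lines)
```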

Stage 3: Draft generation. Claude Sonnet, structured prompt, brief as input. The prompt is tuned for the doctor’s voice (we trained it on 40 of her existing Instagram captions and a clinical-tone reference she approved). Output is markdown with required citations.

Stage 4: Quality gates. This is where most of the engineering effort went. Before any draft reaches the doctor, it runs through automated checks:

  • Citation presence (every claim with a number or a clinical assertion must have a citation; we use a separate LLM pass for this scoring).
  • Brand tone consistency (cosine similarity against a reference embedding from her approved posts).
  • Scope boundary check (flags any draft that strays into territory outside her practice’s specialty, e.g., pediatric oncology or orthopedics, since those carry liability).
  • Structural validation (H1, H2 hierarchy, FAQ schema, internal-linking targets).
  • Sensitive-claims filter (specific phrases that need clinician review even if they pass other gates).

In the first audit of these gates, automated checks caught issues in roughly 40% of drafts before they reached the physician for review. Her review time gets spent on clinical judgment, not on triaging structural problems.
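Structurally, the gates are a flat list of independent checks that all run and all report, so a failing draft surfaces every problem at once instead of bouncing back one issue at a time. A toy sketch of that shape (the two example gates here are stand-ins, far simpler than the real ones):

```python
def run_quality_gates(draft, gates):
    """Run every gate; collect failures instead of stopping at the first.
    Each gate is a (name, check_fn) pair; check_fn returns a list of issues."""
    issues = []
    for name, check in gates:
        for problem in check(draft):
            issues.append(f"{name}: {problem}")
    return issues  # empty list means the draft proceeds to review

# Toy stand-in gates:
def citation_gate(draft):
    # Real version: a separate LLM pass scoring claim-by-claim citation coverage.
    return [] if "[" in draft and "]" in draft else ["claim without citation"]

def structure_gate(draft):
    # Real version: full H1/H2 hierarchy, FAQ schema, internal-link targets.
    return [] if draft.lstrip().startswith("# ") else ["missing H1"]
```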

Stage 5: Physician review queue. This was the hardest engineering problem and the one that took the most iteration. The doctor is a clinician, not a CMS user. Drafts don’t reach her as editor screens in a CMS; they reach her as clean PDF documents with three buttons: approve, edit, reject. Edits go back into the draft as tracked changes. Approval triggers publish. Reject sends the topic back to stage 1 with a reason code. Average review time stabilized at 11 minutes per post once the quality gates were tuned.
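Under the hood the queue is a small state machine, and the one invariant that matters is that a reject can never happen without a reason code, because the reason code is what stage 1 learns from. A sketch (state and action names are illustrative):

```python
TRANSITIONS = {
    ("in_review", "approve"): "publishing",
    ("in_review", "edit"):    "in_review",   # tracked changes merge back into the draft
    ("in_review", "reject"):  "topic_queue", # back to stage 1, with a reason code
}

def review_action(state, action, reason_code=None):
    # Enforce the invariant: a reject must carry a reason code.
    if action == "reject" and reason_code is None:
        raise ValueError("reject requires a reason code")
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"illegal action {action!r} from state {state!r}")
```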

Stage 6: Publishing. Approved drafts compile to static HTML, deploy to a Cloudflare Worker (Astro static output), and submit URLs to Google Search Console via the Indexing API. Sitemap regenerates and pings Bing Webmaster Tools too, though that auth path has been flaky for us (separate longstanding issue).

Stage 7: Performance feedback loop. Every Sunday, the system pulls the past 7 days of impression and click data from Search Console, joins it back to topic metadata, and adjusts next week’s topic priorities. Posts ranking 5–15 with rising impressions get expanded coverage; topics with zero traction after 4 weeks get retired; net-new opportunities surface from the rising-queries report.
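The weekly pass reduces to a few rules over Search Console rows joined with topic metadata. A sketch of the two rules named above (field names are illustrative; the real version also folds in the rising-queries report):

```python
def reprioritize(topics):
    """Weekly pass over topic performance rows. Each row is a dict with
    avg_position, impressions_trend, weeks_live, impressions, topic."""
    expand, retire, keep = [], [], []
    for t in topics:
        if 5 <= t["avg_position"] <= 15 and t["impressions_trend"] > 0:
            expand.append(t["topic"])    # rising mid-ranker: add supporting posts
        elif t["weeks_live"] >= 4 and t["impressions"] == 0:
            retire.append(t["topic"])    # zero traction after 4 weeks
        else:
            keep.append(t["topic"])
    return expand, retire, keep
```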

What’s Actually Running

The whole system runs on free-tier infrastructure. Marginal compute cost is $0 (the practice pays only the engagement fee, not the infrastructure). Specifically:

  • Hosting: Cloudflare Workers + Pages (free tier handles the impression volume comfortably).
  • Email/lead capture: Resend for the lead-magnet email flow (3,000/mo free, more than enough for current volume).
  • Database: Cloudflare D1 (SQLite at the edge) for the topic queue and review-state tracking.
  • LLM calls: Anthropic API for drafting + grading. Marginal cost per post is roughly $0.40 in tokens; at 7 posts/week (28 posts/month) that’s about $11.20/month, billed to us, not the client.

The engagement fee covers the build amortization, the model costs, and ongoing system maintenance. The “$0 infrastructure” line in the case study isn’t marketing copy, it’s literal: AWS or GCP would multiply this by 10–20x with no upside.

What I’m Still Iterating On

Six months in, two parts of the system still need work:

The brand-tone embedding drift. As Dr. Suganya edits drafts, her edits are subtly retraining the system’s tone reference. After about three months we noticed the cosine similarity threshold was passing drafts that felt ‘flatter’ than her original voice. We now retrain the reference embeddings monthly using the most recent 20 approved posts as the anchor set. Open question: should that be weekly?
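The monthly retrain itself is simple: re-anchor on the newest N approved posts and average their embeddings into a fresh reference vector. A sketch (the `embed` function stands in for the embedding API call; names are illustrative):

```python
def rebuild_tone_reference(approved_posts, embed, anchor_size=20):
    """Recompute the tone-reference vector from the newest approved posts.
    approved_posts: list of (published_at, text); embed: text -> vector."""
    recent = sorted(approved_posts, key=lambda p: p[0], reverse=True)[:anchor_size]
    vectors = [embed(text) for _, text in recent]
    dim = len(vectors[0])
    # Element-wise mean of the anchor set's embeddings.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Versioning each reference vector as it’s rebuilt is the cheap insurance mentioned at the end of this post: you can diff the anchor set month over month and see the drift instead of guessing.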

Topic selection feedback loop attribution. When a post drives a consultation click, attributing that back to the topic-selection algorithm requires the right join across page views, scroll depth, and WhatsApp click events. We’ve got it working, but the attribution window is fragile (we use 14 days; some patients read multiple posts over a longer cycle). I’m not sure 14 days is right.
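The join itself is last-touch within the window: credit the consultation click to the most recent post that visitor read in the past 14 days. A sketch under that assumption (event shapes are simplified; the real version also folds in scroll depth):

```python
from datetime import datetime, timedelta

def attribute_clicks(page_views, consult_clicks, window_days=14):
    """Last-touch attribution. page_views: (visitor_id, timestamp, topic);
    consult_clicks: (visitor_id, timestamp). Returns clicks credited per topic."""
    window = timedelta(days=window_days)
    credits = {}
    for visitor, clicked_at in consult_clicks:
        candidates = [
            (ts, topic) for v, ts, topic in page_views
            if v == visitor and timedelta(0) <= clicked_at - ts <= window
        ]
        if candidates:
            _, topic = max(candidates)   # most recent qualifying view wins
            credits[topic] = credits.get(topic, 0) + 1
    return credits
```

The fragility is visible right in the signature: `window_days=14` is a guess, and a patient who reads three posts over six weeks gets credited to none of the first two.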

Both are tractable. Neither is blocking results. They’re the kind of thing you don’t notice when you launch but matter at month six.

What Surprised Me

Five weeks after launch, ChatGPT started citing Fertilia’s posts as sources. We didn’t optimize for that. The posts have proper structured data, real citations, and consistent author attribution, which is what LLMs surface as authoritative. Now there are 40+ monthly visits from ChatGPT alone, with smaller numbers from Microsoft Copilot. AI search visibility wasn’t on the brief. It’s becoming a meaningful share of traffic anyway.

The other surprise: the publishing rhythm forced compounding even when individual posts didn’t take off. Five weeks of consistent daily publishing, with the feedback loop nudging topic selection toward what was working, beat any individual breakout post. Week one had 120 impressions. Week five had 5,025. The math is geometric, not linear, when topic selection is data-driven.

FAQ

How long did the build take, end to end?

Three weeks of focused engineering for the first running version, then two weeks of iteration on the quality gates and physician-review UI. The system went live for daily auto-publishing at week four. The first measurable impression curve started in week five.

Could you do this without the physician review step?

Technically, yes. The quality gates would still catch most structural and accuracy issues. We wouldn’t do it. For a healthcare practice, the doctor’s clinical sign-off isn’t a feature, it’s the entire trust model. For a non-healthcare niche where claims are less liability-loaded, the review step can be lighter (an editor sampling posts instead of approving every one). The architecture stays the same.

What’s the role of human writing in this system?

Roughly 11 minutes per post, all from the doctor, all clinical-judgment work. She corrects medical framing, sometimes adds a clinical observation she wants in, occasionally rejects a topic if it’s outside her practice approach. She doesn’t write paragraphs from scratch. Drafts that need that level of intervention go back into the queue with a reason code.

Why Cloudflare Workers instead of a more traditional stack?

Cloudflare Workers gives us global edge deployment on a free tier that handles the traffic comfortably, plus zero cold starts. For a daily-publishing site with growing organic traffic, the latency profile and the infrastructure cost both matter. We’ve used Vercel and Netlify for similar builds; for this one, Workers won on the free-tier ceiling and the request-volume math.

What would you do differently if you started this build today?

Two things. First, I’d build the physician-review queue before the generation pipeline, not after. We built generation first because it felt like the hard problem, but the review queue ended up being the bottleneck for getting to production. Build the constrained piece first. Second, I’d version the brand-tone reference embedding from day one, not from when we noticed drift. Cheap to do up front, painful to retrofit.


If this kind of system would help your business and you want to see what the keyword opportunity actually looks like in your niche, book a 30-minute call. We’ll pull live data during the call, and you’ll see the topic landscape before we discuss anything else.

Tags: ai content engine · content automation platform · case study · ai content marketing · seo automation · physician review


Written by Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.

You read the whole thing. That means you're serious about building with AI. Most people skim. You didn't. Let's talk about what you're building.


Kalvium Labs

AI products for startups

You've read the thinking.
The only thing left is a conversation.

Tell us your idea. We tell you honestly: can we prototype it in 72 hours, what would it cost, and is it worth building at all. No pitch. No deck.

Chat on WhatsApp

Usually reply within hours, max 12.

Prefer a scheduled call? Book 30 min →

Not ready to message? Describe your idea and get a free product spec first →

What happens on the call:

  1. You describe your AI product idea (5 min: vision, users, constraints)
  2. We ask the hard questions (10 min: what happens when the AI gets it wrong)
  3. We sketch a 72-hour prototype (10 min: architecture, scope, stack, cost)
  4. You decide if it’s worth pursuing (if AI isn’t the answer, we’ll say so)

Chat with us