The first auto-published post from the Fertilia Content Engine hit Google’s index on a Wednesday. By that Friday it was ranking position 8 for “PCOS breakfast ideas” in India. By the end of week two it was at position 3. By the end of week five, top three for the core PCOS keywords, with 5,000 weekly impressions across the site.
The system that produced it had been running for 22 days. Dr. Suganya, the OB-GYN whose practice it serves, spent about 15 minutes a day reviewing drafts. Nothing else about the operation required her time.
This is the build story. What we tried first that didn’t work, the architecture that actually shipped, and the parts I’m still iterating on six months later. If you want the surface results, the case study page has the full numbers.
If you want the build-vs-buy framing for whether something like this makes sense for your business, Venkat covered that in a separate post. This post is the engineering view.
The Problem
A women’s health practice in Coimbatore. The doctor had built a 16,000-follower Instagram presence over years of authentic clinical content. But Instagram reach is rented (the algorithm decides who sees what), and meanwhile thousands of women a month were searching Google for exactly the topics she could speak to (PCOS diet, fertility after 35, ovulation patterns, postpartum recovery) and finding competitors instead.
The constraints were specific:
- Daily publishing cadence. Anything less and we wouldn’t compound fast enough to compete with established practices in a year.
- Medical accuracy. A single inaccurate health claim in a doctor’s name is a real problem. Generic AI health content is worse than no content.
- The doctor had no time for a content workflow. She has patients. Whatever we built had to need her for clinical judgment only, not for fixing structural problems a checklist could catch.
- Budget had to fit a single-practice clinic, not a venture-backed B2B SaaS company.
What We Tried First (That Didn’t Work)
The first version of the system was the obvious one. It was wrong, and tearing it down was the right call.
Wrong turn 1: One-shot LLM generation with a medical-claims lint.
Day three of the build. We had a prompt that took a topic and produced a 1,500-word draft, and a follow-up prompt that scored the draft for medical accuracy against a hand-written claims list. Output drafts were cleanish. Accuracy on a labeled ground-truth set: 64%.
That’s not a content system. That’s a draft generator with a janky checker bolted on.
The failures were specific. Confident-sounding medical claims with no citation. Therapy recommendations the doctor didn’t endorse. Cultural mismatches (recommending diet patterns that don’t exist in Indian households). And the lint pass missed most of these because it was scoring the same generation pass that had produced the errors. You can’t have one model grade its own homework reliably.
We rebuilt around a multi-stage pipeline where each stage has a different job and different prompts. The stages catch each other’s failures. That took the medical-accuracy score from 64% on the first version to 92% on the third revision. The remaining 8% is what the physician review queue is for.
Wrong turn 2: Building our own keyword tool because the public APIs felt slow.
We spent half a day starting to scrape and rank-aggregate keyword data ourselves. This was a vanity engineering choice. Google’s Keyword Planner API returns volumes that are good enough for topic prioritization, and the Search Console API gives the actual rank-position-and-impression data once posts are live. We threw away the scraping code and used both as designed. Lost half a day, gained a defensible data pipeline.
The Architecture That Actually Shipped
The pipeline has seven stages. The boring stages are the ones that matter most.
Topic Selection → Brief Generation → Draft Generation → Quality Gates →
Physician Review Queue → Publishing → Performance Feedback Loop
Stage 1: Topic selection. Pulls keyword data from Google’s Keyword Planner API for the practice’s specialty cluster. Filters by search volume (minimum threshold, varies by niche), keyword difficulty, and topical relevance to the doctor’s services. Drops topics that are already covered (vector similarity search against existing posts, using text-embedding-3-small). Surfaces the top 20 candidates per week. Output is a ranked CSV the system reads from.
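The dedup step is the only non-obvious part of stage 1. Here is a minimal sketch of the similarity filter, assuming the OpenAI embeddings client; the 0.85 cut-off and helper names are illustrative, not the production values:

```python
# Sketch of the stage-1 dedup filter: drop candidate topics that are too
# similar to posts already published. Threshold and helper names are
# illustrative, not the production values.
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def dedupe_topics(candidates: list[str], published_titles: list[str],
                  threshold: float = 0.85) -> list[str]:
    cand_vecs = embed(candidates)
    pub_vecs = embed(published_titles)
    # Normalize so a dot product is a cosine similarity.
    cand_vecs /= np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    pub_vecs /= np.linalg.norm(pub_vecs, axis=1, keepdims=True)
    sims = cand_vecs @ pub_vecs.T      # candidates x published
    max_sim = sims.max(axis=1)         # closest existing post per candidate
    return [t for t, s in zip(candidates, max_sim) if s < threshold]
```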
Stage 2: Brief generation. Each topic gets a brief: target keyword, suggested H2s based on the SERP top 10, a ‘must include’ list of citations from medical sources (PubMed, Cochrane, ICMR for India-specific content), and a ‘must avoid’ list of contraindications. The brief is what feeds the draft prompt.
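The brief itself is just structured data. A rough sketch of its shape; field names and example values are placeholders, not the production schema:

```python
# Illustrative shape of a stage-2 brief; field names and placeholder values
# are ours, not the production schema. This is the only input the draft
# prompt sees.
from dataclasses import dataclass

@dataclass
class ContentBrief:
    target_keyword: str          # e.g. "PCOS breakfast ideas"
    suggested_h2s: list[str]     # derived from the SERP top 10
    must_include: list[str]      # citations from PubMed / Cochrane / ICMR
    must_avoid: list[str]        # contraindications the draft must not touch

brief = ContentBrief(
    target_keyword="PCOS breakfast ideas",
    suggested_h2s=["<H2 suggested from the SERP top 10>"],
    must_include=["<PubMed / Cochrane / ICMR citation>"],
    must_avoid=["<contraindication to keep out of the draft>"],
)
```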
Stage 3: Draft generation. Claude Sonnet, structured prompt, brief as input. The prompt is tuned for the doctor’s voice (we trained it on 40 of her existing Instagram captions and a clinical-tone reference she approved). Output is markdown with required citations.
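A minimal sketch of that call with the Anthropic Python SDK; the model string, token limit, and prompt wording are placeholders, and the real system prompt carries the voice-tuning material described above:

```python
# Sketch of the stage-3 draft call. Model string, max_tokens, and prompt
# wording are placeholders; the production system prompt encodes the
# doctor's voice from her approved reference material.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def draft_post(brief_markdown: str, voice_reference: str) -> str:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",   # placeholder; pin whichever Sonnet version you test against
        max_tokens=4096,
        system=(
            "You are drafting a medical blog post for a practising OB-GYN. "
            "Match the tone of the reference material. Every clinical claim "
            "must carry a citation from the brief's must-include list.\n\n"
            f"Voice reference:\n{voice_reference}"
        ),
        messages=[{"role": "user", "content": brief_markdown}],
    )
    return message.content[0].text          # markdown draft with citations
```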
Stage 4: Quality gates. This is where most of the engineering effort went. Before any draft reaches the doctor, it runs through automated checks (a sketch of the gate runner follows the list):
- Citation presence (every claim with a number or a clinical assertion must have a citation; we use a separate LLM pass for this scoring).
- Brand tone consistency (cosine similarity against a reference embedding from her approved posts).
- Scope boundary check (flags any draft that strays into territory outside her practice’s specialty, e.g., pediatric oncology or orthopedics, since those carry liability).
- Structural validation (H1, H2 hierarchy, FAQ schema, internal-linking targets).
- Sensitive-claims filter (specific phrases that need clinician review even if they pass other gates).
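Here is a rough sketch of how the gate runner chains these checks, with the structural and tone gates filled in and the LLM-scored gates left as the same pattern; the 0.80 tone threshold and function names are illustrative:

```python
# Sketch of the stage-4 gate runner: a draft only queues for physician review
# if every gate passes. Threshold values and names are illustrative.
import re
import numpy as np

GateResult = tuple[bool, str]   # (passed, failure reason)

def structure_gate(draft_md: str) -> GateResult:
    """Cheap structural check: exactly one H1 and at least three H2s."""
    h1s = re.findall(r"^# .+", draft_md, flags=re.M)
    h2s = re.findall(r"^## .+", draft_md, flags=re.M)
    if len(h1s) != 1:
        return False, f"expected 1 H1, found {len(h1s)}"
    if len(h2s) < 3:
        return False, f"expected >=3 H2s, found {len(h2s)}"
    return True, ""

def tone_gate(draft_vec: np.ndarray, reference_vec: np.ndarray,
              threshold: float = 0.80) -> GateResult:
    """Cosine similarity of the draft embedding against the approved-voice reference."""
    sim = float(draft_vec @ reference_vec /
                (np.linalg.norm(draft_vec) * np.linalg.norm(reference_vec)))
    if sim < threshold:
        return False, f"tone similarity {sim:.2f} below {threshold}"
    return True, ""

def run_gates(draft_md: str, draft_vec: np.ndarray, reference_vec: np.ndarray) -> list[str]:
    """Return failure reasons; an empty list means the draft can queue for review."""
    results = [structure_gate(draft_md), tone_gate(draft_vec, reference_vec)]
    # The citation, scope-boundary, and sensitive-claims gates (LLM-scored) chain the same way.
    return [reason for passed, reason in results if not passed]
```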
In the first audit of these gates, automated checks caught issues in roughly 40% of drafts before they reached the physician for review. Her review time gets spent on clinical judgment, not on triaging structural problems.
Stage 5: Physician review queue. This was the hardest engineering problem and the one that took the most iteration. The doctor is a clinician, not a CMS user. Drafts can’t reach her as wireframed editor screens. They reach her as clean PDF documents with three buttons: approve, edit, reject. Edits go back into the draft as tracked changes. Approval triggers publish. Reject sends the topic back to stage 1 with a reason code. Average review time stabilized at 11 minutes per post once the quality gates were tuned.
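One way to picture the queue's state handling, with state and reason-code names that are ours rather than the production schema:

```python
# Illustrative review-state model for the stage-5 queue. Approve triggers
# publishing; reject sends the topic back to stage 1 with a reason code the
# feedback loop can use. Names are placeholders, not the production schema.
from enum import Enum

class ReviewState(Enum):
    PENDING = "pending"
    APPROVED = "approved"      # triggers the publishing stage
    EDITED = "edited"          # tracked changes merged back before approval
    REJECTED = "rejected"      # topic returns to stage 1 with a reason code

class RejectReason(Enum):
    OUT_OF_SCOPE = "out_of_scope"
    CLINICAL_DISAGREEMENT = "clinical_disagreement"
    TOPIC_NOT_RELEVANT = "topic_not_relevant"

def apply_decision(state: ReviewState, decision: str,
                   reason: RejectReason | None = None) -> ReviewState:
    if state is not ReviewState.PENDING:
        raise ValueError(f"draft already resolved as {state.value}")
    if decision == "reject" and reason is None:
        raise ValueError("a rejection needs a reason code for the topic queue")
    return {"approve": ReviewState.APPROVED,
            "edit": ReviewState.EDITED,
            "reject": ReviewState.REJECTED}[decision]
```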
Stage 6: Publishing. Approved drafts compile to static HTML, deploy to a Cloudflare Worker (Astro static output), and submit URLs to Google Search Console via the Indexing API. Sitemap regenerates and pings Bing Webmaster Tools too, though that auth path has been flaky for us (separate longstanding issue).
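A stripped-down version of the index-submission call, assuming a service-account credential that has been granted access in Search Console; the file path and function name are placeholders:

```python
# Sketch of the stage-6 Indexing API notification for a freshly published URL.
# Assumes a Google service account with the indexing scope; paths are placeholders.
import google.auth.transport.requests
from google.oauth2 import service_account
import requests

SCOPES = ["https://www.googleapis.com/auth/indexing"]
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def notify_google(url: str, key_file: str = "service-account.json") -> int:
    creds = service_account.Credentials.from_service_account_file(key_file, scopes=SCOPES)
    creds.refresh(google.auth.transport.requests.Request())
    resp = requests.post(
        ENDPOINT,
        json={"url": url, "type": "URL_UPDATED"},
        headers={"Authorization": f"Bearer {creds.token}"},
        timeout=30,
    )
    return resp.status_code   # 200 means Google accepted the notification
```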
Stage 7: Performance feedback loop. Every Sunday, the system pulls the past 7 days of impression and click data from Search Console, joins it back to topic metadata, and adjusts next week’s topic priorities. Posts ranking 5–15 with rising impressions get expanded coverage; topics with zero traction after 4 weeks get retired; net-new opportunities surface from the rising-queries report.
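The Sunday pull is a plain Search Analytics query plus a classification pass. A simplified sketch, assuming google-api-python-client and leaving out the rising-impressions comparison against prior weeks; helper names are ours:

```python
# Sketch of the weekly feedback pull: last 7 days of page/query data from the
# Search Console API. The 5-15 "expand" band comes from the post; the
# trend comparison against earlier weeks is omitted here for brevity.
from datetime import date, timedelta
from googleapiclient.discovery import build

def weekly_pull(creds, site_url: str) -> list[dict]:
    service = build("searchconsole", "v1", credentials=creds)
    end = date.today()
    start = end - timedelta(days=7)
    resp = service.searchanalytics().query(
        siteUrl=site_url,
        body={
            "startDate": start.isoformat(),
            "endDate": end.isoformat(),
            "dimensions": ["page", "query"],
            "rowLimit": 5000,
        },
    ).execute()
    return resp.get("rows", [])

def classify(row: dict) -> str:
    """Map a Search Console row onto a next-week topic action."""
    pos, impressions = row["position"], row["impressions"]
    if 5 <= pos <= 15 and impressions > 0:
        return "expand"           # mid-ranker worth supporting coverage
    if impressions == 0:
        return "watch_or_retire"  # retired only after 4 weeks of zero traction
    return "hold"
```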
What’s Actually Running
The whole system runs on free-tier infrastructure. Marginal infrastructure cost is $0 (the practice pays only the engagement fee, not the infrastructure). Specifically:
- Hosting: Cloudflare Workers + Pages (free tier handles the impression volume comfortably).
- Email/lead capture: Resend for the lead-magnet email flow (3,000/mo free, more than enough for current volume).
- Database: Cloudflare D1 (SQLite at the edge) for the topic queue and review-state tracking.
- LLM calls: Anthropic API for drafting + grading. The marginal cost per post is roughly $0.40 in tokens; at 7 posts/week that’s about $11.20/month, billed to us, not the client.
The engagement fee covers the build amortization, the model costs, and ongoing system maintenance. The “$0 infrastructure” line in the case study isn’t marketing copy; it’s literal. An equivalent setup on AWS or GCP would cost 10–20x as much with no upside.
What I’m Still Iterating On
Six months in, two parts of the system still need work:
The brand-tone embedding drift. As Dr. Suganya edits drafts, her edits are subtly retraining the system’s tone reference. After about three months we noticed the cosine similarity threshold was passing drafts that felt ‘flatter’ than her original voice. We now retrain the reference embeddings monthly using the most recent 20 approved posts as the anchor set. Open question: should that be weekly?
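The monthly re-anchor itself is small: recompute the reference vector as the centroid of the embeddings of the 20 most recent approved posts (same embedding model as stage 1). A sketch, with names that are ours:

```python
# Sketch of the monthly tone-reference rebuild. Input is the embeddings of the
# 20 most recently approved posts, one row each; versioning this vector from
# day one is the part we wish we had done earlier.
import numpy as np

def rebuild_tone_reference(approved_post_vecs: np.ndarray) -> np.ndarray:
    vecs = approved_post_vecs / np.linalg.norm(approved_post_vecs, axis=1, keepdims=True)
    reference = vecs.mean(axis=0)                 # centroid of the recent approved voice
    return reference / np.linalg.norm(reference)  # unit vector the tone gate compares against
```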
Topic selection feedback loop attribution. When a post drives a consultation click, attributing that back to the topic-selection algorithm requires the right join across page views, scroll depth, and WhatsApp click events. We’ve got it working, but the attribution window is fragile (we use 14 days; some patients read multiple posts over a longer cycle). I’m not sure 14 days is right.
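A simplified version of the last-touch join, with the 14-day window made explicit; event shapes and field names are illustrative, and the real join also folds in scroll depth and multi-post reading sessions:

```python
# Sketch of the attribution join: credit a consultation (WhatsApp) click to the
# most recent page view by the same visitor within the attribution window.
# Field names and the event shapes are illustrative.
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(days=14)   # the value we are still unsure about

def attribute_click(whatsapp_click: dict, page_views: list[dict]) -> str | None:
    """Return the topic_id credited for a consultation click, or None if out of window."""
    click_time = datetime.fromisoformat(whatsapp_click["ts"])
    in_window = [
        v for v in page_views
        if v["visitor_id"] == whatsapp_click["visitor_id"]
        and timedelta(0) <= click_time - datetime.fromisoformat(v["ts"]) <= ATTRIBUTION_WINDOW
    ]
    if not in_window:
        return None
    # Last-touch attribution: the most recent qualifying page view gets the credit.
    return max(in_window, key=lambda v: v["ts"])["topic_id"]
```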
Both are tractable. Neither is blocking results. They’re the kind of thing you don’t notice when you launch but matter at month six.
What Surprised Me
Five weeks after launch, ChatGPT started citing Fertilia’s posts as sources. We didn’t optimize for that. The posts have proper structured data, real citations, and consistent author attribution, which is what LLMs surface as authoritative. Now there are 40+ monthly visits from ChatGPT alone, with smaller numbers from Microsoft Copilot. AI search visibility wasn’t on the brief. It’s becoming a meaningful share of traffic anyway.
The other surprise: the publishing rhythm forced compounding even when individual posts didn’t take off. Five weeks of consistent daily publishing, with the feedback loop nudging topic selection toward what was working, beat any individual breakout post. Week one had 120 impressions. Week five had 5,025, roughly 2.5x growth week over week. The math is geometric, not linear, when topic selection is data-driven.
FAQ
How long did the build take, end to end?
Three weeks of focused engineering for the first running version, then two weeks of iteration on the quality gates and physician-review UI. The system went live for daily auto-publishing at week four. The first measurable impression curve started in week five.
Could you do this without the physician review step?
Technically yes. The quality gates would still catch most structural and accuracy issues. But we wouldn’t run it that way. For a healthcare practice, the doctor’s clinical sign-off isn’t a feature, it’s the entire trust model. For a non-healthcare niche where claims are less liability-loaded, the review step can be lighter (an editor sampling instead of approving every post). The architecture stays the same.
What’s the role of human writing in this system?
Roughly 11 minutes per post, all from the doctor, all clinical-judgment work. She corrects medical framing, sometimes adds a clinical observation she wants in, occasionally rejects a topic if it’s outside her practice approach. She doesn’t write paragraphs from scratch. Drafts that need that level of intervention go back into the queue with a reason code.
Why Cloudflare Workers instead of a more traditional stack?
Cloudflare Workers gives us global edge deployment on a free tier that handles the traffic comfortably, plus zero cold starts. For a daily-publishing site with growing organic traffic, the latency profile and the infrastructure cost both matter. We’ve used Vercel and Netlify for similar builds; for this one, Workers won on the free-tier ceiling and the request-volume math.
What would you do differently if you started this build today?
Two things. First, I’d build the physician-review queue before the generation pipeline, not after. We built generation first because it felt like the hard problem, but the review queue ended up being the bottleneck for getting to production. Build the constrained piece first. Second, I’d version the brand-tone reference embedding from day one, not from when we noticed drift. Cheap to do up front, painful to retrofit.
If this kind of system would help your business and you want to see what the keyword opportunity actually looks like in your niche, book a 30-minute call. We’ll pull live data during the call, and you’ll see the topic landscape before we discuss anything else.