
How We Built an AI Education Content Creator

4-week build story: AI pipeline that cut EdTech course production from 3-4 weeks to 1 day. What broke, how we fixed Bloom's calibration, real numbers.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • Week 2 of the build, the client's content lead flagged 23 specific problems with the AI-generated courses. That spreadsheet ran my week 3.
  • Three fixes resolved the accuracy problem: reference material injection, content-length validation beyond schema compliance, and a 40-example Bloom's taxonomy calibration set.
  • Bloom's taxonomy tagging went from 65% correct to 87% after we gave the model labeled examples, not just a description of the taxonomy levels.
  • The hardest part wasn't the AI. It was LMS integration: two metadata fields were documented wrong, costing half a day each in week 4.
  • Result: course production from 3-4 weeks to roughly 1 day. Same content team, same quality bar, 95% less time per course.

Week 2 of the build, I was on a Friday afternoon call with the client’s lead content developer. She’d spent the morning reviewing two AI-generated courses and had put together a spreadsheet.

Twenty-three items.

“The content is accurate,” she said, “I just can’t teach from it.” She pulled up one example: a lesson on probability that was technically correct and used a coin flip as the worked example. “Our learners are software engineers. They’ve seen a coin flip explanation a hundred times.”

That spreadsheet ran my week 3. This is the story of what was in it and how we fixed it.

For the strategic overview of what this system does and why the architecture works, read Venkat’s writeup on the same project. This post is about the build experience: the specific failures, the calibration work, and what week 4 actually looked like.

What I Was Building

High-level: a 4-stage AI pipeline for an EdTech provider. Input is a topic brief plus reference materials. Output is a complete course (outline, lessons, assessments) with human review checkpoints at each stage before the next stage runs.
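The gating idea (no stage runs until the previous stage's draft has been approved) can be sketched in a few lines. This is a minimal illustration, not the project's code: the stage names, the `run_pipeline` function, and the callable-based reviewer are all hypothetical, and only three stand-in stages are shown for brevity.

```python
from typing import Callable, Dict, List, Tuple

# A stage takes the approved outputs so far and returns a draft for review.
Generate = Callable[[Dict[str, str]], str]
# A reviewer returns True to approve a draft, False to send it back.
Review = Callable[[str, str], bool]


def run_pipeline(stages: List[Tuple[str, Generate]], review: Review,
                 max_attempts: int = 3) -> Dict[str, str]:
    """Run stages in order; each draft must be approved before the next stage runs."""
    approved: Dict[str, str] = {}
    for name, generate in stages:
        for _ in range(max_attempts):
            draft = generate(approved)
            if review(name, draft):
                approved[name] = draft
                break
        else:
            # No attempt was approved: stop rather than build on an unreviewed draft.
            raise RuntimeError(f"stage {name!r} not approved after {max_attempts} attempts")
    return approved


# Stand-in stages (hypothetical names) and a reviewer that approves everything.
demo_stages = [
    ("outline", lambda done: "outline draft"),
    ("lessons", lambda done: f"lessons based on: {done['outline']}"),
    ("assessments", lambda done: f"assessments for: {done['lessons']}"),
]
result = run_pipeline(demo_stages, review=lambda name, draft: True)
```

In a real system the reviewer would be a person working in the review interface rather than a function call, so each stage would persist its draft and wait. The sketch only shows the ordering constraint.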

I owned the AI backend: Python and FastAPI, the generation logic, the JSON Schema validation layer, and the prompt architecture. The other two engineers handled the React review interface and the LMS integration. All of us touched the prompts at different points.

The full case study has the stack details and outcome numbers. This post zooms in on the week-by-week reality.

By day 5, the pipeline was running end-to-end. GPT-4o could generate educational content. Schema validation was passing. We had something to show.

Week 2 is when I found out what “running” actually meant.

The Week 2 Problem List

The client reviewed our first two complete courses and came back with 23 specific issues, sorted into three categories.

Category 1: Wrong examples (8 items). The model was defaulting to generic textbook examples instead of domain-specific ones. Coin flips for probability. Bubble sort for algorithm complexity. These examples work in a general CS curriculum. This client’s learners are working software engineers who already know bubble sort. The examples had to come from their actual domain, not from “what a textbook usually says.”

Category 2: Schema compliance gaps (9 items). The JSON Schema validation was passing for all of these, which was the problem. The schema enforced structure: a common_misconceptions field had to be an array of strings. It didn’t enforce meaningful content. The model was producing valid-but-useless entries: ["Students sometimes confuse X and Y"]. Technically a string. Practically nothing.

Category 3: Bloom’s taxonomy miscalibration (6 items). About 70% of everything the model generated came back tagged as “understand” or “apply.” A well-constructed course needs variety across all six levels. An entire course clustered at two levels doesn’t serve learners.

Categories 1 and 2 had straightforward fixes. Category 3 took most of week 3.

Fixing Category 1: Reference Material Injection

The coin-flip problem had an obvious cause. When the model generates a lesson with only a topic brief as input, it draws on its training data for examples. Training data skews toward textbook explanations and the most common pedagogical choices.

The fix: extract key concepts, definitions, and examples from the client’s reference materials before each generation call, then inject them as constraints. “Use these definitions. Reference these examples.” The generation prompt went from “write a lesson on probability” to “write a lesson on probability, using the following domain-specific context…”

This added a preprocessing step and some prompt complexity. It also moved first-pass accuracy on examples from “almost always wrong” to “usually right.” The subject matter expert (SME) spot-checks after this change took 20 minutes instead of two hours.
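A minimal sketch of the prompt-assembly half of that step, assuming the extraction has already produced lists of definitions and examples. The function name `build_lesson_prompt` and the sample definition and example are invented placeholders, not the client's material or the project's actual prompt.

```python
from typing import List


def build_lesson_prompt(topic: str, definitions: List[str], examples: List[str]) -> str:
    """Compose a generation prompt constrained to client-approved context."""
    lines = [f"Write a lesson on {topic}, using the following domain-specific context."]
    if definitions:
        lines.append("Use these definitions:")
        lines.extend(f"- {d}" for d in definitions)
    if examples:
        lines.append("Reference these examples:")
        lines.extend(f"- {e}" for e in examples)
    lines.append("Do not substitute generic textbook examples for the ones listed.")
    return "\n".join(lines)


# Placeholder context (invented for illustration).
prompt = build_lesson_prompt(
    "probability",
    definitions=["Flaky test: a test that passes or fails without any code change."],
    examples=["Estimating the chance that a 200-test suite has at least one flaky failure."],
)
```

Keeping the injected context as explicit bullet lists makes it easy for a reviewer to see which approved items a given lesson was asked to use.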

Fixing Category 2: Content-Length Validation

Schema validation catches structural failures. It doesn’t catch content quality failures.

The solution was a content-length check layer on top of schema validation. Any required text field under a minimum length threshold (we set it at 80 characters for descriptive fields, 200 for the worked example) triggers a retry with a stricter prompt that calls out specifically what the previous attempt got wrong.

For the common_misconceptions field, the retry prompt said: “The previous attempt provided a generic entry. Provide two specific misconceptions learners commonly have, with an explanation of why each is wrong.” That level of specificity in the retry prompt mattered. Generic retry instructions produced only marginally better output.
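The check-then-retry logic can be sketched as below. The thresholds (80 and 200 characters) and the `common_misconceptions` retry wording come from the text above; the `worked_example` field name, the per-entry check on array fields, and the function names are assumptions made for illustration.

```python
from typing import Dict, List

# 80 characters for descriptive fields, 200 for the worked example.
# "worked_example" is a hypothetical field name.
MIN_LENGTHS: Dict[str, int] = {"common_misconceptions": 80, "worked_example": 200}

RETRY_HINTS: Dict[str, str] = {
    "common_misconceptions": (
        "The previous attempt provided a generic entry. Provide two specific "
        "misconceptions learners commonly have, with an explanation of why each is wrong."
    ),
}


def short_fields(lesson: Dict[str, object]) -> List[str]:
    """Return the names of required text fields that are missing or too short."""
    failures = []
    for name, minimum in MIN_LENGTHS.items():
        value = lesson.get(name)
        # Array fields are checked entry by entry (an assumption about the rule).
        entries = value if isinstance(value, list) else [value]
        too_short = any(not isinstance(e, str) or len(e.strip()) < minimum for e in entries)
        if not entries or too_short:
            failures.append(name)
    return failures


def build_retry_prompt(base_prompt: str, failures: List[str]) -> str:
    """Append a field-specific correction for each failed field."""
    notes = [
        RETRY_HINTS.get(f, f"The previous attempt's '{f}' field was too short. Expand it with specifics.")
        for f in failures
    ]
    return base_prompt + "\n\n" + "\n".join(notes)


# A schema-valid but useless lesson: the array holds one short, generic string.
lesson = {
    "common_misconceptions": ["Students sometimes confuse X and Y"],
    "worked_example": "x" * 250,
}
failures = short_fields(lesson)
retry_prompt = build_retry_prompt("Write a lesson on probability.", failures)
```

Length is a crude proxy for quality. It catches the one-line filler entries described above, but a long entry can still be generic, so it complements SME review rather than replacing it.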

OpenAI’s structured outputs feature enforces schema compliance at generation time, which eliminates most structural failures. It doesn’t solve content quality issues. Those need validation logic on our side.

Fixing Category 3: Bloom’s Taxonomy Calibration

This took the most time, and the fix is worth documenting in detail because I’ve seen the same pattern on other projects.

Getting a language model to apply Bloom’s taxonomy consistently is genuinely hard. The levels (remember, understand, apply, analyze, evaluate, create) are subjective at the boundaries. “Apply” vs “analyze” depends on what the learner already knows, which the model doesn’t have. Giving the model a description of each level in the system prompt produced the distribution we saw: heavy clustering around the middle levels where the descriptions are most similar.

First attempt: explicit distribution targets. I added an instruction to the prompt specifying target percentages for each level across the course. “This course should have approximately 20% remember, 25% understand, 30% apply, 15% analyze, 7% evaluate, 3% create.”

This helped with distribution but not accuracy. The model would produce a distribution that matched the targets while tagging individual items incorrectly. An “analyze” task might get tagged as “create” to hit a quota.
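The gap between distribution and accuracy is easy to see with a toy example. The six-item data set below is invented for illustration: the model's tags have exactly the same level counts as the SME's tags, yet two items are mislabeled.

```python
from collections import Counter
from typing import Dict, List

LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


def distribution(tags: List[str]) -> Dict[str, float]:
    """Share of items tagged at each Bloom's level."""
    counts = Counter(tags)
    return {level: counts[level] / len(tags) for level in LEVELS}


def accuracy(predicted: List[str], expected: List[str]) -> float:
    """Fraction of items whose predicted tag matches the expected tag."""
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Toy data: the last two items have their labels swapped.
sme_tags = ["remember", "understand", "apply", "apply", "analyze", "create"]
model_tags = ["remember", "understand", "apply", "apply", "create", "analyze"]

same_distribution = distribution(model_tags) == distribution(sme_tags)
tag_accuracy = accuracy(model_tags, sme_tags)
```

Here `same_distribution` is true while `tag_accuracy` is 4 out of 6, which is why a distribution target alone cannot be used as the success metric.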

What actually worked: reference examples. I asked the client’s content team for 40 assessment items from their existing catalog, ones they’d manually tagged at the correct Bloom’s level. I included those examples in the system prompt, grouped by level, with a brief explanation of what made each one a good example of that level.

With the reference set, tagging accuracy went from about 65% to 87%.
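One way to assemble such a reference set into a system prompt is sketched below. The `LabeledItem` structure, the prompt wording, and the two sample items are hypothetical; the only parts taken from the text are the grouping by level and the short explanation attached to each example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]


@dataclass
class LabeledItem:
    text: str       # the assessment item as written by the content team
    level: str      # the Bloom's level the SMEs assigned
    rationale: str  # why this item is a good example of that level


def build_calibration_prompt(items: List[LabeledItem]) -> str:
    """Group labeled items by level and render them as few-shot reference examples."""
    by_level: Dict[str, List[LabeledItem]] = defaultdict(list)
    for item in items:
        if item.level not in LEVELS:
            raise ValueError(f"unknown Bloom's level: {item.level!r}")
        by_level[item.level].append(item)

    sections = ["Tag each assessment item with one Bloom's level. Reference examples by level:"]
    for level in LEVELS:  # keep the taxonomy's order, skip levels with no examples
        if not by_level[level]:
            continue
        sections.append(f"\n{level.upper()}")
        for item in by_level[level]:
            sections.append(f"- Item: {item.text}\n  Why: {item.rationale}")
    return "\n".join(sections)


# Invented sample items; a real set would come from the client's catalog.
calibration_prompt = build_calibration_prompt([
    LabeledItem("Compare two caching strategies for a read-heavy service.",
                "analyze", "Requires breaking down trade-offs, not executing a procedure."),
    LabeledItem("State the definition of idempotency.",
                "remember", "Pure recall of a definition."),
])
```

Picking examples that sit close to a boundary (for instance, an "apply" item and an "analyze" item on the same topic) is likely to teach the model more than examples that are obviously at one level.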

87% still requires review. But at 87%, the SME is confirming tags, not retagging from scratch. At 50 assessment items per course, that cuts review time by roughly 20 minutes per course. Small in absolute terms, but meaningful at the throughput they were targeting. The Vanderbilt Center for Teaching’s guide to Bloom’s taxonomy was useful for designing the example selection: its framework for distinguishing similar levels helped in picking reference items that clearly illustrated the boundaries.

The generalizable pattern: if a model needs to apply a subjective classification scheme, give it labeled examples, not just a description of the scheme.

Week 4: LMS Integration

Week 4 was supposed to be polish, documentation, and handoff. It was mostly debugging the LMS integration.

The client’s learning management system had documented metadata requirements for course import. Two fields were documented incorrectly.

The estimated_completion_minutes field was documented as accepting a string. Their actual import validation expected an integer. We found out when the first batch import failed with a cryptic type error that took half a day to trace back to that field.

The difficulty field was documented using the values “beginner”, “intermediate”, “advanced.” Their actual schema used a 1-3 numeric scale mapped to those labels. Another half day.

Neither of these was avoidable from the documentation alone. LMS API documentation is notoriously optimistic. We built a day of buffer into week 4 for exactly this kind of thing. It was right to do so. In hindsight, asking for a sample import file at project kickoff would have caught both issues before we’d written any integration code.
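A small normalization step at the integration boundary handles both mismatches. This is a sketch under stated assumptions: the field names come from the text above, but the function name is hypothetical and the label-to-number mapping (beginner = 1 through advanced = 3) is a guess, since the post only says the scale runs 1-3.

```python
from typing import Any, Dict

# Assumed mapping; the actual order used by the LMS is not stated in the post.
DIFFICULTY_SCALE = {"beginner": 1, "intermediate": 2, "advanced": 3}


def to_lms_metadata(course: Dict[str, Any]) -> Dict[str, int]:
    """Coerce the two mis-documented fields into the types the import expects."""
    minutes = course["estimated_completion_minutes"]
    if isinstance(minutes, str):
        minutes = minutes.strip()
    try:
        minutes_int = int(minutes)
    except (TypeError, ValueError) as exc:
        raise ValueError(
            f"estimated_completion_minutes cannot be converted to an integer: {minutes!r}"
        ) from exc

    label = str(course["difficulty"]).strip().lower()
    if label not in DIFFICULTY_SCALE:
        raise ValueError(f"unknown difficulty label: {course['difficulty']!r}")

    return {
        "estimated_completion_minutes": minutes_int,
        "difficulty": DIFFICULTY_SCALE[label],
    }


lms_fields = to_lms_metadata(
    {"estimated_completion_minutes": "45", "difficulty": "Intermediate"}
)
```

Failing loudly with the field name in the message is the point: a cryptic type error from a batch import is what cost the half days described above.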

The Numbers

4 weeks. 3 engineers plus a PM. GPT-4o for generation, Python and FastAPI for the backend, React for the review interface, PostgreSQL for storing generation history and calibration data, JSON Schema for output validation.

Post-deployment: course production time from 3-4 weeks to roughly 1 day. The 95% figure is real. For courses where the client already has strong reference materials, the SME review sometimes takes under 2 hours.
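As a quick check on the arithmetic, assuming a 5-day working week (the post does not say how a week is counted), going from 3-4 weeks to 1 day works out as follows:

```python
WORKING_DAYS_PER_WEEK = 5  # assumption, not stated in the post


def reduction(before_days: float, after_days: float) -> float:
    """Fractional reduction in production time."""
    return 1 - after_days / before_days


low = reduction(3 * WORKING_DAYS_PER_WEEK, 1)   # from 15 working days to 1
high = reduction(4 * WORKING_DAYS_PER_WEEK, 1)  # from 20 working days to 1
```

Under that assumption the reduction ranges from about 93% (3-week baseline) to 95% (4-week baseline), so the 95% figure corresponds to the upper end of the range.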

The content team didn’t shrink. Their job changed. Before: writing everything from scratch. After: reviewing AI drafts, improving templates when the model consistently gets something wrong, handling the edge cases outside the pipeline. That’s a different kind of work, and for a subject matter expert it’s a better use of their time than writing the same course structure for the eighth time.

What I’d Do Differently

Get the LMS integration spec on day 1. Every field, every data type, a sample import file. Documentation is always wrong about something. Finding that in week 1 is better than week 4.

Ask for calibration examples upfront. The 40-item Bloom’s taxonomy set that fixed category 3 took two days to collect in week 3 because we didn’t know we’d need it until we saw the problem. Those examples are project inputs, not a nice-to-have. They belong in the kickoff checklist.

Put the content-length checks in version one. We added them after seeing the week 2 problems. Schema validation alone doesn’t catch content quality failures. That lesson should have been obvious before the first client review.

FAQ

How long does it take to build an AI course content pipeline?

The build we described took 4 weeks with 3 engineers. Most of that time is calibration and integration, not the generation itself. The GPT-4o integration is fast. Getting first-pass accuracy high enough that SME review is editing rather than rewriting takes 2-3 weeks of iteration. Plan for at least one full calibration loop with your content team before committing to a timeline.

Does AI-generated educational content need human review?

Yes, for anything reaching learners. The pipeline we built routes every generated course through 4 review stages. As you accumulate data on where the model gets things consistently right, you can narrow review to the items that still need it. But skipping review entirely for educational content isn’t something we’d recommend. The cost of a learner studying from subtly wrong information is too high.

Which LLM works best for educational content generation?

GPT-4o is what we used. The structured output support and instruction-following reliability were the deciding factors. Claude 3.5 Sonnet is comparable and handles complex instructional constraints slightly better in my experience. The model choice matters less than the prompt architecture. Both can be steered to return JSON that follows a schema you define (OpenAI through structured outputs, Claude through tool definitions with an input schema), which is the feature you actually need.

What’s the risk of AI-generated educational content?

Plausible incorrectness. The model generates content that sounds right, passes surface-level review, but is subtly wrong in a way that confuses learners. This is more dangerous than obviously wrong content because it gets through review more easily. Reference material injection reduces this significantly: when you force the model to use your approved definitions and examples, it can’t fall back on plausible-but-wrong general knowledge.

Can this approach work outside EdTech?

Yes. The same architecture applies anywhere an expert produces structured, templated content at volume: legal document drafting, technical documentation, compliance reports, product descriptions at scale. The template structure changes. The core pattern doesn’t: define the schema, inject reference material, stage the human review, calibrate the model’s subjective judgments with labeled examples. What changes across domains is the accuracy threshold and the cost of getting it wrong.


We’ve built AI content pipelines for EdTech providers and run the same system on our own publishing. If you’re evaluating whether this approach fits your workflow, book a 30-minute call. We’ll walk you through the architecture and tell you where the calibration work usually shows up for your content type.

Written by

Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.
