Case Studies · 11 min read

How We Built AI-Powered Evaluation for an EdTech Platform

Build story: AI answer evaluation for an EdTech assessment platform. Rubric extraction, semantic scoring, calibration, and real accuracy numbers.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • Direct LLM scoring without a rubric is unreliable. On our first run, 23% of correct answers got scored wrong because the model applied implicit criteria the client never specified.
  • Rubric extraction is its own LLM call, separate from scoring. Running both in one prompt degrades accuracy on both tasks.
  • We built a human-agreement calibration loop: 100 manually scored samples, compare AI scores, fix the rubric until agreement is above 90%.
  • Semantic similarity (cosine over embeddings) handles the easy cases. LLM scoring only runs on answers that fall in the uncertain middle band.
  • Final accuracy: 94.3% agreement with expert graders on a held-out test set of 450 answers.

The first time the AI evaluator failed badly, it was on a student who’d written a genuinely good answer.

The question was about the causes of the French Revolution. The student had written three paragraphs covering economic inequality, Enlightenment ideas, and the political structure of the ancien régime. It was the kind of answer a teacher would mark highly. Our AI scored it 2 out of 5.

The reason: the model had implicitly decided that a “complete” answer required mentioning the food shortages of 1788. The client’s rubric said nothing about food shortages. We just hadn’t specified it, and the model filled in the gap with what it thought a complete answer should include.

That failure shaped the whole rest of the build. Here’s the full story.

What the Client Was Trying to Solve

An EdTech company running large-scale assessments for K-12 students. They had a working platform: students log in, take an assessment, get a score. The MCQ portion was easy to automate. The problem was the short-answer and essay-format questions, which their subject-matter reviewers were grading manually.

At peak, they had 12,000+ student submissions with open-ended answers during contest periods. A team of expert graders working through that volume took 10-14 days. By the time scores came back, the contest window had moved on. Students wanted faster feedback. The client wanted to cut grading time by at least 70%.

The requirement: AI-assisted evaluation that a human reviewer could audit, not one that replaced human judgment entirely. Every AI score came with a confidence level. Below a threshold, the answer went to a human grader. Above it, the AI score stood.

That “human-in-the-loop” framing turned out to be important. It freed us to build a system that was right 94% of the time rather than wasting months chasing 99%.

Why Direct Prompting Doesn’t Work

Our first prototype was simple: take the question, the marking guide, the student’s answer, and ask GPT-4o to score it on a 5-point scale with a brief explanation.

Against our initial test set of 100 manually graded answers (a mix of question types and difficulty levels), we got 71% agreement with the expert graders. That sounds reasonable until you look at what the disagreements were. They weren’t close calls. The AI was systematically wrong on certain answer patterns.

Three failure modes kept appearing:

Literal matching bias. The model rewarded answers that used the exact vocabulary from the marking guide. A student who wrote “laissez-faire economic policies” scored higher than a student who explained the same concept in different words. The expert graders cared about understanding, not vocabulary matching.

Implicit criteria. As with the French Revolution example, the model added evaluation criteria that weren’t in the rubric. It seemed to draw on its own knowledge of “what a complete answer looks like” and penalized answers that didn’t match.

Length bias. Longer answers scored higher, controlling for content quality. We noticed this after running a quick correlation on score vs. word count. Students who wrote more tended to get better AI scores, even when the extra text was repetitive or off-topic.

Any of these alone was manageable. All three together meant the AI was grading against a hidden rubric the client hadn’t written and didn’t know about.

The Architecture We Actually Shipped

The working system has three stages, each one a deliberate fix for the failure modes above.

Stage 1: Rubric extraction. Before any student answer gets evaluated, we run a separate LLM call that extracts the scoring rubric from the question and the teacher’s marking guide. This produces a structured JSON object: each scoring point with a weight, acceptable paraphrases, and common wrong answers.

{
  "question_id": "hist_q_14",
  "max_score": 5,
  "criteria": [
    {
      "id": "c1",
      "description": "Identifies at least two social/economic causes",
      "weight": 2,
      "valid_concepts": [
        "economic inequality", "tax burden on peasants",
        "debt crisis", "bread prices", "food shortage"
      ],
      "partial_credit": true
    },
    {
      "id": "c2",
      "description": "Identifies political/ideological causes",
      "weight": 2,
      "valid_concepts": [
        "enlightenment ideas", "absolute monarchy",
        "estates system", "lack of representation"
      ],
      "partial_credit": true
    },
    {
      "id": "c3",
      "description": "Demonstrates understanding of cause-effect relationship",
      "weight": 1,
      "valid_concepts": [],
      "partial_credit": false
    }
  ]
}

Separating rubric extraction from scoring was the single biggest improvement. Running both in one prompt made the model worse at both tasks. The extraction call uses Claude 3.5 Sonnet (it handles nuanced instruction-following well). The scoring call uses GPT-4o.
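Because the extracted rubric becomes the contract for everything downstream, it's worth sanity-checking before it enters the scoring path. Here's a minimal validator for the rubric shape shown above — the field names come from the JSON example, but the specific rules (weights summing to max_score, partial-credit criteria needing concepts) are illustrative, not the client's actual schema checks:

```python
# Minimal structural validation for an extracted rubric (a sketch).
# Field names match the JSON example above; the rules themselves
# are illustrative.
def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of problems; an empty list means the rubric looks sane."""
    problems = []
    for key in ("question_id", "max_score", "criteria"):
        if key not in rubric:
            problems.append(f"missing field: {key}")
    criteria = rubric.get("criteria", [])
    if not criteria:
        problems.append("no criteria extracted")
    # Criterion weights should account for the full score.
    total_weight = sum(c.get("weight", 0) for c in criteria)
    if criteria and total_weight != rubric.get("max_score"):
        problems.append(
            f"weights sum to {total_weight}, max_score is {rubric.get('max_score')}"
        )
    # A partial-credit criterion with no valid concepts can't be pre-screened.
    for c in criteria:
        if c.get("partial_credit") and not c.get("valid_concepts"):
            problems.append(f"criterion {c.get('id')}: partial credit but no concepts")
    return problems
```

Rubrics that fail validation can be re-extracted or flagged for manual review before any student answer is scored against them.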

Stage 2: Semantic pre-screening. For each criterion, we compute cosine similarity between the student’s answer (embedded with text-embedding-3-small) and the list of valid concepts. If similarity exceeds 0.85 for at least one concept, that criterion passes without the LLM call.

This handles about 60% of criterion checks (the unambiguous cases where the student clearly addressed a point or clearly missed it). The LLM only touches the uncertain middle band (similarity between 0.45 and 0.85). This cut our evaluation cost roughly in half and dropped latency from an average of 3.8 seconds per answer to 1.9 seconds.
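The banding logic itself is simple. A sketch, assuming the answer and each valid concept have already been embedded (e.g. with text-embedding-3-small); the function names and the auto-fail floor behavior are illustrative:

```python
# Stage 2 pre-screen sketch: route each criterion to pass / fail / LLM
# based on the best cosine similarity against its valid concepts.
import math

PASS_THRESHOLD = 0.85   # above this, the criterion auto-passes
FLOOR_THRESHOLD = 0.45  # below this, the criterion auto-fails

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def prescreen(answer_vec, concept_vecs):
    """Return 'pass', 'fail', or 'llm' for one criterion."""
    best = max(cosine(answer_vec, c) for c in concept_vecs)
    if best >= PASS_THRESHOLD:
        return "pass"   # clearly addressed: skip the LLM call
    if best < FLOOR_THRESHOLD:
        return "fail"   # clearly missed
    return "llm"        # uncertain middle band goes to LLM scoring
```

Only the "llm" bucket incurs a model call, which is where the cost and latency savings come from.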

Stage 3: LLM scoring on uncertain criteria. For criteria that didn’t resolve in the semantic stage, we run the structured scoring prompt with one change: the explicit rubric from Stage 1 replaces the raw marking guide. The model scores against what we extracted, not against what it thinks the answer should contain.

The prompt format matters more than I expected. We tried several before landing on one that worked: we give the model the criterion description, the list of valid concepts, the student’s answer, and a strict instruction to score only against the provided criteria. We explicitly tell it to ignore content that’s correct but outside scope. That last instruction cut the implicit criteria problem from 12% of answers to under 2%.
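To make that concrete, here's roughly the shape of the scoring prompt. The wording is illustrative, not the production prompt; the part that did the work was the explicit scope restriction at the end:

```python
# Sketch of the Stage 3 scoring prompt. Wording is illustrative;
# the scope-restriction instruction is the piece that cut the
# implicit-criteria problem.
def build_scoring_prompt(criterion: dict, answer: str) -> str:
    concepts = ", ".join(criterion["valid_concepts"]) or "(none listed)"
    return (
        "You are grading one criterion of a student answer.\n"
        f"Criterion: {criterion['description']}\n"
        f"Valid concepts (paraphrases count): {concepts}\n"
        f"Student answer:\n{answer}\n\n"
        "Score ONLY against the criterion above. Ignore content that is "
        "correct but outside its scope. Do not add criteria of your own. "
        "Reply with a score and a one-sentence justification."
    )
```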

Calibrating Against Human Graders

Getting to 94% agreement required a calibration loop we ran three times. A similar bias problem came up when we built a text-to-SQL data analyst: the model filled in column names it assumed rather than the ones in the schema. The fix was the same pattern: inject structured metadata, don’t let the model infer.

Round 1. Scored 100 answers with the initial system. Sent the same 100 answers to two expert graders independently. Compared results. Agreement: 81%.

Disagreement analysis. Categorized every mismatch: implicit criteria, length bias, vocabulary bias, partial credit handling, or genuine ambiguity. Most disagreements (about 70% of them) came from the partial credit logic in the rubric, not from the LLM evaluation itself. We’d extracted rubrics that treated partial credit as binary when the experts were applying it on a spectrum.

Rubric refinement. Updated the extraction prompt to capture degree language (“mentions”, “explains”, “demonstrates with an example” as three distinct thresholds). Re-ran calibration on 100 new answers.

Round 2. Agreement: 89%. Closer, but still falling short of 90% on the essay-format questions specifically. Short-answer questions were at 93%.

Root cause. For essay questions, the model was scoring each criterion independently and then summing. Expert graders weighted answers holistically: a student who partially addressed all criteria got a different score than one who fully addressed two and skipped the third, even if the point totals were the same. We added a whole-answer adjustment step: the LLM reviewed the summed score in the context of the full answer and could adjust by ±1 with a required explanation.
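The adjustment step itself is a small piece of glue code. A sketch, with illustrative names — the LLM proposes a delta and an explanation, and we clamp the delta and reject adjustments with no justification:

```python
# Whole-answer adjustment sketch: clamp the LLM's proposed delta to
# +/-1 and require a non-empty explanation. Names are illustrative.
def apply_adjustment(summed_score: int, max_score: int,
                     delta: int, explanation: str) -> int:
    if not explanation.strip():
        raise ValueError("adjustment requires an explanation")
    delta = max(-1, min(1, delta))                  # never move more than one point
    return max(0, min(max_score, summed_score + delta))
```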

Round 3. Agreement: 94.3% on a held-out test set of 450 answers. We stopped there. The remaining 5.7% of disagreements were genuine cases where the two human experts also disagreed with each other.

What Still Goes to a Human

Answers with overall confidence below 0.75 automatically flag for human review. That ends up being about 8-12% of submissions, depending on question type. Essay questions flag more often than short-answer ones.

There are also hard-coded bypass rules: any answer under 10 words gets a human review regardless of confidence (students sometimes write legitimate two-word answers that the semantic layer handles wrong), and any answer the rubric extraction flagged as ambiguous (a “partial_credit_ambiguous” field in the JSON) skips AI scoring entirely.
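Put together, the routing logic looks something like this — a sketch combining the confidence threshold, the short-answer bypass, and the ambiguity flag from rubric extraction (the field and parameter names are illustrative):

```python
# Routing sketch: decide whether an answer's score stands as-is
# or goes to a human grader. Thresholds are the ones described above.
CONFIDENCE_THRESHOLD = 0.75
MIN_WORDS = 10

def route(answer_text: str, confidence: float, rubric: dict) -> str:
    if rubric.get("partial_credit_ambiguous"):
        return "human"   # ambiguous rubric: skip AI scoring entirely
    if len(answer_text.split()) < MIN_WORDS:
        return "human"   # very short answers bypass AI regardless of confidence
    if confidence < CONFIDENCE_THRESHOLD:
        return "human"   # low-confidence AI score gets reviewed
    return "ai"
```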

The platform keeps both the AI score and the human-reviewed score when review happens, so over time we’re building a corrections dataset. We haven’t fine-tuned on it yet, mostly because the calibration loop approach has been cheaper to maintain than a fine-tuning pipeline. That might change once we have a few thousand corrections.

The Numbers

Metric                                        Value
AI-vs-expert agreement (held-out test set)    94.3%
Answers auto-routed to human review           ~10%
Average evaluation latency per answer         1.9s
Grading turnaround (before vs. after)         10-14 days → same day
Rubric extraction accuracy                    97.1% (tested against 70 manually written rubrics)

The 10-14 day grading window is now same-day for the auto-evaluated 90%. The 10% that go to human review come back within 48 hours.

The client’s original goal was 70% automation. We’re at 90%. More than they asked for, but the calibration process gave us enough confidence in the edge case handling that we felt comfortable extending the threshold.

What I’d Do Differently

Two things.

First: build the calibration loop before writing any scoring code. We treated it as a QA step at the end. It should be the design spec at the start. The rubric extraction format we ended up with would have been completely different if we’d seen the first-round disagreements before building the scoring layer.

Second: the semantic pre-screening threshold (0.85) took us three iterations to tune. We started at 0.80, which was too aggressive and auto-passed answers that were only loosely related to the criteria. We went to 0.90, which was too conservative and pushed most answers to the expensive LLM path. 0.85 was the number that matched human judgment best on borderline cases. I’d instrument this from day one rather than tuning it in production.
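"Instrument it from day one" can be as simple as a threshold sweep over a labeled sample: score each candidate threshold by how often its pass/fail decision matches the human grader's, and pick the winner. A sketch under those assumptions — data shapes and names are illustrative:

```python
# Offline threshold tuning sketch: pick the pre-screen threshold
# that best agrees with human pass/fail labels on a calibration set.
def best_threshold(similarities, human_passed, candidates=(0.80, 0.85, 0.90)):
    """similarities: list[float] per criterion check;
    human_passed: list[bool], whether the human marked it as met."""
    def agreement(t):
        return sum((s >= t) == h for s, h in zip(similarities, human_passed))
    return max(candidates, key=agreement)
```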

If you’re building something similar, the OpenAI embedding documentation has a useful section on cosine similarity thresholds for different semantic tasks. The short version: 0.80+ is high similarity for most domains, but you need to calibrate against your specific vocabulary. Education content is more domain-specific than general text. Our thresholds ended up higher than the defaults the docs suggest.

The full case study is on our assessment platform case study page.

FAQ

How accurate is AI evaluation compared to human graders?

On our system, 94.3% agreement with expert graders on the held-out test set. That’s after three calibration rounds. Early runs with a direct prompting approach were at 71%. The gap between “we used an LLM” and “we built a calibrated evaluation pipeline” is significant. Accuracy depends heavily on rubric quality. A vague marking guide produces vague AI scores.

Can AI evaluation replace human graders entirely?

For multiple choice and clearly defined short-answer questions, yes. For essay questions and any open-ended question where partial credit has nuance, human review is still needed for the edge cases. Our system auto-routes about 10% of answers to human review. That’s not a failure. It’s the system correctly identifying the cases where it’s uncertain.

How do you prevent the AI from adding criteria that weren’t in the original rubric?

Rubric extraction as a separate step. When you let the model infer criteria from the question text directly, it adds what it thinks a complete answer should include. When you give it a structured rubric extracted from the teacher’s marking guide, it scores against the explicit criteria. We also add an explicit prompt instruction to ignore content that’s correct but outside scope. That dropped implicit-criteria mistakes from 12% to under 2%.

What’s the latency for AI evaluation?

1.9 seconds average per answer end-to-end, with semantic pre-screening handling about 60% of criterion checks without an LLM call. Without pre-screening, we were at 3.8 seconds. For bulk grading, the evaluation runs in parallel across answers, so overall throughput is high. The bottleneck isn't latency, it's cost: LLM calls for 12,000 answers add up, which is why the semantic pre-screening layer matters.

What LLMs did you use?

Rubric extraction: Claude 3.5 Sonnet. Answer scoring: GPT-4o. Embeddings for semantic pre-screening: OpenAI text-embedding-3-small. We tried running rubric extraction on GPT-4o too, but its instruction-following for structured JSON output was less consistent than Sonnet's. Scoring on Sonnet was about equal to GPT-4o but slightly slower on the essay cases, so we kept GPT-4o for scoring.


Building an assessment product and need AI evaluation that doesn’t embarrass you in production? Book a 30-minute call. We’ll tell you whether your rubric structure is ready for automation.

#ai app development#ai evaluation#edtech#llm#case study#assessment#nlp


Written by Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.

