Case Studies
· 11 min read

How We Built a Coding Assessment Tool with AI Evaluation

Build story: coding assessment platform with a custom compiler engine, AI-generated test cases, and approach scoring. Real decisions, real tradeoffs.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • We started with a third-party code execution service (OneCompiler). At scale, the cost made no sense, so we built a custom sandboxed compiler engine from scratch instead.
  • AI test case generation saved us 3 weeks of manual work but introduced a class of problem we didn't expect: LLM-generated edge cases that were technically correct but rejected by the scoring logic because they tested behavior the spec didn't specify.
  • Approach scoring (partial credit for correct logic, wrong output) required a two-pass evaluation: first run the code, then run an LLM analysis on the code structure if the output check failed.
  • The hardest infrastructure problem was sandboxing. A student's malicious code crashing the execution server would take down assessments for everyone. We learned this the hard way.
  • This project started as a single-engineer web app. It turned into a 3-engineer, 6-month platform engagement because the first delivery was good enough that the client came back with the real scope.

We’d been running the platform for about six weeks when a student submitted a solution that spawned 10,000 threads.

Not maliciously. The student was trying to demonstrate concurrency. But the code had an infinite loop inside a thread factory, and our execution environment at the time had no thread count limits. The assessment server maxed out at 100% CPU, and for about 4 minutes, every active assessment in the system was frozen while the server recovered.

That’s when we knew the third-party code execution service we’d been using wasn’t going to scale with us.

How the Project Started

The client ran coding assessments for job applicants and internal training. They had a working setup that used OneCompiler, a third-party service that handles compilation and output generation through an API. It works fine at low volume. Their problem was cost: at scale, the per-execution fees were significant, and they had no visibility into what was happening inside the execution environment.

The initial ask was a student portal and admin module. Clean frontend, assessment management, results dashboards. One engineer, four weeks.

We built that. It worked well. Three months later, they came back with the bigger ask: “We want to own the code execution layer too.”

That’s when the project turned into something considerably more interesting.

Building the Compiler Engine

Replacing a managed code execution service with an in-house one is one of those tasks that sounds straightforward until you start listing the requirements.

You need to run arbitrary code submitted by users. That means sandboxing is non-negotiable. The code could be malicious (intentionally or by accident), slow, memory-hungry, or designed to access parts of the filesystem it shouldn’t. You need timeout enforcement, memory limits, process isolation, and network restrictions. All of this needs to be reliable under concurrent load.

We built the compiler engine on Firebase Cloud Functions with a Node.js orchestration layer. The execution itself runs in isolated containers with strict resource limits: 256MB memory cap, 5-second timeout for most languages (10 seconds for Java, which has JVM startup overhead), no outbound network access.
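At the process level, those limits look roughly like the sketch below. This is a minimal, POSIX-only illustration (the real engine applies limits inside containers, and `run_sandboxed` is a hypothetical helper, not our production code):

```python
import resource
import subprocess
import sys

def run_sandboxed(cmd, timeout_s=5, mem_bytes=256 * 1024 * 1024):
    """Run an untrusted command with a wall-clock timeout and a memory cap."""
    def apply_limits():
        # Cap address space: the child gets a MemoryError instead of
        # exhausting the host's memory when it over-allocates.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        # Cap CPU seconds as a second line of defence behind the timeout.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))

    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=apply_limits,
        )
        return proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return None, "", "time limit exceeded"

# A well-behaved submission completes normally under the limits.
code, out, err = run_sandboxed([sys.executable, "-c", "print('ok')"])
```

Container-level limits (cgroups, seccomp, no network namespace) sit on top of this; per-process rlimits alone are not a sandbox.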

The multi-language support took longer than expected. We shipped with Python, JavaScript, Java, and C++. Each language has a different compilation model:

  • Python: interpreted, so execution is direct. The challenge is subprocess isolation and output capture.
  • JavaScript: runs in a V8 context with restricted global scope. We had to block Node.js built-in modules by default and whitelist specific ones.
  • Java: JVM startup is slow (~800ms before any user code runs). We ended up keeping warm JVM instances in the container pool rather than cold-starting every execution. That brought average Java execution time down from 3.2 seconds to 1.1 seconds.
  • C++: requires compilation as a separate step, which adds a round-trip. We cache compilation artifacts for identical source code, which helps when many students submit similar solutions.
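The C++ artifact cache boils down to a content-addressed lookup: hash the source, compile only on a miss. A minimal sketch, where `compile_fn` stands in for the real compiler invocation:

```python
import hashlib

class CompileCache:
    """Cache compiled artifacts keyed by a hash of the source text."""
    def __init__(self, compile_fn):
        self._compile = compile_fn   # stand-in for invoking g++ and storing the binary
        self._cache = {}
        self.misses = 0

    def get(self, source: str):
        key = hashlib.sha256(source.encode()).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compile(source)
        return self._cache[key]

# Two students submitting byte-identical source trigger one compile.
cache = CompileCache(compile_fn=lambda src: f"/tmp/bin-{hash(src)}")
a = cache.get("int main() { return 0; }")
b = cache.get("int main() { return 0; }")
```

In practice the hash should also cover the language version and compiler flags, or a flag change will serve stale binaries.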

The sandboxing bug from the thread story happened early in this phase. Our fix was a combination of OS-level process limits (ulimit settings in the container) and application-level thread counting before execution starts. We also added an execution watchdog that monitors CPU usage every 500ms and kills any process that holds over 90% CPU for more than 2 consecutive seconds.
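The watchdog's kill decision reduces to a streak check over CPU samples. A simplified sketch of that logic, using the numbers above (90% threshold, 500ms sampling, 2-second hold):

```python
def should_kill(samples, threshold=90.0, hold_s=2.0, interval_s=0.5):
    """Decide whether to kill a process given a stream of CPU% samples
    taken every interval_s seconds: kill once usage stays above threshold
    for hold_s (here, 4 consecutive hot samples)."""
    needed = int(hold_s / interval_s)
    streak = 0
    for pct in samples:
        streak = streak + 1 if pct > threshold else 0
        if streak >= needed:
            return True
    return False
```

The streak reset matters: a legitimately CPU-heavy solution that dips below the threshold between test cases never accumulates enough consecutive hot samples to be killed.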

Test Case Generation: Where We Brought AI In

Before AI test case generation, creating a comprehensive test suite for a coding problem was manual work. The admin would write the problem statement, then manually define 8-15 test cases: typical inputs, boundary conditions, edge cases, performance tests.

This was the bottleneck for adding new problems to the assessment library. A well-tested problem with 12 test cases might take 2-3 hours to set up correctly, including verification that the test cases were actually solvable and that the expected outputs were correct.

We added an LLM-assisted test case generator. The admin provides the problem statement and at least one reference solution. The system:

  1. Parses the problem statement to extract the input/output specification
  2. Generates 20-25 candidate test cases covering: typical inputs, boundary conditions (empty input, single element, max size), performance cases (large n), and problem-specific edge cases
  3. Runs all candidate test cases against the reference solution to get expected outputs
  4. Flags any test case where the reference solution throws an error or returns an ambiguous result
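Steps 3 and 4 amount to vetting every candidate input against the reference solution. A minimal sketch, with `ref_sort` as a hypothetical reference that enforces the spec's non-negative assumption:

```python
def vet_candidates(candidates, reference):
    """Run LLM-generated candidate inputs through the reference solution;
    keep (input, expected) pairs, flag any input the reference rejects."""
    accepted, flagged = [], []
    for case in candidates:
        try:
            expected = reference(case)
        except Exception as exc:
            flagged.append((case, repr(exc)))
            continue
        accepted.append((case, expected))
    return accepted, flagged

# Hypothetical reference for "sort an array of non-negative integers".
def ref_sort(xs):
    if any(x < 0 for x in xs):
        raise ValueError("spec assumes non-negative input")
    return sorted(xs)

accepted, flagged = vet_candidates([[3, 1, 2], [], [-1, 5]], ref_sort)
```

Note this only catches cases where the reference errors out; the harder failure mode, where the reference silently returns a wrong answer for out-of-spec input, is what the human review gate below exists for.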

The time savings were real: what took 2-3 hours now takes about 20 minutes (mostly review time). But the AI introduced a problem we didn’t anticipate.

LLM-generated edge cases sometimes test behavior the problem statement doesn’t specify. A problem about “sorting an array of integers” might get a test case with duplicate values, negative numbers, or an empty array. Those are legitimate edge cases for a sorting algorithm. But if the problem statement was written assuming non-negative distinct integers, and the reference solution doesn’t handle negatives, the generated test case produces an incorrect expected output.

The fix was a human review gate. The system flags any generated test case where the expected output involves null, error states, or values outside the range defined in the problem statement. A human reviews those before they’re added to the live test suite. Even with this gate, bad test cases still slipped through about 8% of the time, which means the admin verification step can’t be skipped entirely. We tried making it optional. We reverted that after two weeks.

Scoring Code: Beyond Pass/Fail

The initial scoring model was simple: run the code against all test cases, count how many pass, divide by total. Binary per test case.

The client wanted something more nuanced for training assessments (not hiring assessments, where binary is appropriate). For training, partial credit matters. A student who understood the algorithm but made an off-by-one error should score higher than a student who brute-forced it correctly.

We built a two-pass evaluation:

Pass 1: Test case execution. Run the code, check outputs against expected. Straightforward.

Pass 2: Approach analysis (only runs if Pass 1 score is below 70%). Send the code to an LLM with the problem statement and rubric criteria. The LLM analyzes: Did the student choose an appropriate algorithm? Is the logic structure correct, even if the output isn’t? Are there identifiable patterns that show understanding of the problem?

The second pass produces a 0-30% bonus that can be added to the execution score. A student who got 6/10 test cases right but demonstrated correct algorithmic thinking might score 60% + 20% = 80%.
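The combination rule is small enough to sketch as one function (the 70% gate and 30% cap come from the two-pass setup described above; the function itself is illustrative, not our production code):

```python
def final_score(passed: int, total: int, approach_bonus_pct: float = 0.0) -> float:
    """Combine the Pass 1 execution score with a capped Pass 2 bonus."""
    exec_pct = 100.0 * passed / total
    if exec_pct >= 70.0:
        return exec_pct            # Pass 2 never runs above the threshold
    bonus = min(max(approach_bonus_pct, 0.0), 30.0)  # bonus capped at 30%
    return min(exec_pct + bonus, 100.0)

# 6/10 test cases plus a 20% approach bonus -> 80%.
score = final_score(6, 10, 20.0)
```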

This only runs on underperforming submissions to keep costs reasonable. About 40% of submissions in training assessments trigger the second pass. At around $0.004 per LLM analysis call, the cost per assessed submission is about $0.0016 averaged across all submissions.

The hardest part was defining the rubric criteria in a way the LLM applied consistently. We built a calibration loop similar to what we’d done on our earlier AI evaluation work: 100 manually graded submissions, compare AI approach scores to human scores, adjust the rubric prompt until agreement was above 85%. It took four rounds.

The Plagiarism Detection Problem We Didn’t Solve Well

The client asked for plagiarism detection. We built it. I’m not proud of how it works.

We compare AST (abstract syntax tree) structure across submissions for the same problem. Two submissions with identical ASTs but different variable names get flagged. We also check for code snippets that appear verbatim in Stack Overflow answers by sending the code to a search API and comparing.
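The name-insensitive comparison can be sketched with Python's `ast` module: rename every identifier to a canonical token, then compare the dumps. This is a simplified illustration of the idea, not the full detector:

```python
import ast

class Normalize(ast.NodeTransformer):
    """Rename every identifier to a canonical token so submissions that
    differ only in variable names produce identical ASTs."""
    def __init__(self):
        self.names = {}
    def _rename(self, name):
        return self.names.setdefault(name, f"v{len(self.names)}")
    def visit_Name(self, node):
        node.id = self._rename(node.id)
        return node
    def visit_arg(self, node):
        node.arg = self._rename(node.arg)
        return node
    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        node.name = self._rename(node.name)
        return node

def fingerprint(source: str) -> str:
    return ast.dump(Normalize().visit(ast.parse(source)))

# Same structure, different names: identical fingerprints.
a = fingerprint("def add(x, y):\n    return x + y")
b = fingerprint("def total(p, q):\n    return p + q")
```

A real detector also normalizes constants and statement reordering, or trivial edits defeat it; this sketch only catches the rename case.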

The AST comparison catches obvious copy-paste. It doesn’t catch collusion where two students discuss the solution and write separate implementations. The Stack Overflow check catches public solutions but not solutions from private code-sharing sites or AI-generated code.

We told the client clearly: this flags the obvious cases. It’s not a full academic integrity system. For high-stakes hiring assessments, they use proctoring software alongside this. For training assessments, they treat the tool as a nudge, not a verdict.

If you’re building a platform where plagiarism detection genuinely matters (university exams, certification programs), plan for something more rigorous than what we built here.

The Infrastructure Decision That Saved Us

Midway through the compiler engine work, we had a choice: Firebase Cloud Functions for execution orchestration, or a dedicated VPS for the execution environment.

Firebase was the path of least resistance because we were already using Firebase RTDB and Firestore for the assessment state. Cloud Functions for the orchestration layer made deployment simpler. But Firebase Cloud Functions have a cold start problem for execution-heavy workloads, and they don’t give you the fine-grained OS-level controls that sandboxing requires.

We split it: Firebase Cloud Functions for the API and orchestration layer (triggering execution, collecting results, updating scores), and a separate Dockerized execution service for actually running the code. It’s a similar separation of concerns to what we used when building the video auditing pipeline: keep the orchestration cheap and stateless, keep the heavy execution contained. The execution containers run on a small VM with our sandboxing configuration applied at the OS level, not the application level.

This turned out to be the right call. The execution service scaled independently from the Firebase functions, we had direct control over the execution environment, and cold starts stopped being a problem once we kept a minimum of 3 execution workers warm at all times.

What 6 Months of Being a Repeat Client Looks Like

We shipped the initial portal (1 engineer, 4 weeks) and it worked. The client came back 3 months later with the compiler engine scope (3 engineers, full platform build). They came back because the first delivery was solid: clean code, predictable timelines, no surprises at handoff.

The 6-month engagement wasn’t planned from the start. It’s what happens when the first version earns trust.

The platform is live in production. The assessment module handles active coding assessments for the client’s users. The AI evaluation layer (test case generation, approach scoring) is deployed and running on training assessments. The compiler engine is handling Python, JavaScript, Java, and C++, with Go on the roadmap.

We still don’t have a fully satisfying answer to the plagiarism detection problem. That’s a known limitation we’ve documented and communicated to the client. The rest of the system works as designed.

FAQ

How much does it cost to run a custom compiler engine vs a third-party service?

Depends heavily on volume. At the scale where this client was operating, the VM running our execution service cost around $80/month. OneCompiler’s equivalent volume would have cost approximately 4-6x that in API fees. The crossover point (where custom beats third-party on cost alone) is roughly 5,000-10,000 code executions per month. Below that, a managed service is cheaper when you factor in build and maintenance time.

What programming languages are hardest to sandbox?

Java is the most resource-intensive because of JVM overhead. C/C++ are the most security-sensitive because they can make direct system calls that need to be blocked at the OS level. Python is the most prone to sneaky resource usage (spawning subprocesses, importing heavy libraries). JavaScript in Node.js is relatively safe if you restrict the global scope and block built-in modules by default.

How do you handle time limit exceeded (TLE) errors fairly?

Students on different network connections submit to the same execution environment, but their code runs server-side, so network speed doesn’t affect runtime. The fairness issue is concurrent execution: if 50 students submit at the same time, execution queue depth can add wait time. We report both raw execution time and queue wait time separately so that results show code performance, not infrastructure latency.
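Separating the two timings is straightforward bookkeeping around the execution call. A sketch of the idea (`timed_execution` is a hypothetical helper):

```python
import time

def timed_execution(run_fn, enqueued_at: float) -> dict:
    """Report queue wait and raw execution time separately, so results
    reflect code performance rather than infrastructure latency."""
    started = time.monotonic()
    queue_wait = started - enqueued_at   # time spent waiting for a worker
    result = run_fn()
    exec_time = time.monotonic() - started
    return {"result": result, "queue_wait_s": queue_wait, "exec_time_s": exec_time}

t0 = time.monotonic()
report = timed_execution(lambda: sum(range(1000)), enqueued_at=t0)
```

Only `exec_time_s` is judged against the TLE limit; `queue_wait_s` is surfaced for transparency but never penalized.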

Can AI generate good test cases for all problem types?

It works well for standard algorithmic problems (sorting, searching, graph problems, dynamic programming). It works less well for problems with natural language processing components, problems that depend on external APIs, or problems where the “correct” output is inherently ambiguous. For those categories, we still rely primarily on manually written test cases with AI filling in the standard edge cases.

When does building a custom platform make sense vs buying LeetCode-style SaaS?

If you need to integrate deeply with your own user system, assessment flow, or LMS, off-the-shelf tools get painful fast. If you need custom scoring logic (partial credit, approach evaluation), most SaaS platforms don’t support it. If you’re running assessments at high volume and the per-execution cost of managed services starts adding up, custom can be cheaper within 6-12 months. The build cost is real, though: plan for 2-3 months minimum to get a production-grade platform.


Building a coding assessment platform, or an AI evaluation layer for your existing system? We’ve done both. Book a 30-minute call and we can walk through what the architecture would look like for your specific setup.

#ai app development #coding assessment #compiler #firebase #llm #case study #edtech
