
Meeting Intelligence Tool: 4-Hour Recording to 1-Page Brief

How we built a meeting intelligence tool that converts 4-hour recordings to a 1-page brief. Semantic chunking, multimodal frames, and what broke.

Abraham Jeron
AI products & system architecture — from prototype to production
TL;DR
  • Naive transcription of a 4-hour recording produces 41,000 words. The real problem is figuring out which 500 words matter.
  • Fixed-interval chunking loses context at split points. Semantic chunking follows actual topic boundaries and improved action item accuracy from 74% to 91%.
  • Multimodal processing (video frames alongside audio) solved the 'this number' problem: when a speaker points at a slide, the model can see what they're pointing at.
  • The 1-page brief has five sections: decisions, open questions, action items, numbers mentioned, and what needs a follow-up meeting.
  • Biggest failure in production: the model invented a revenue figure that was never mentioned. We added a citation constraint: every number must link to a timestamp.

The recording that broke our first version was 4 hours and 7 minutes long.

It was a quarterly business review. Six speakers, slides on screen, side conversations that the mic picked up. When the team uploaded it to the pipeline we’d built for weekly standups, they expected the same output: a summary, action items, decisions.

What came back was a 2,700-word wall of text. And buried in it was a $3.2M revenue figure that had never been mentioned in the meeting. The model had interpolated it from context it thought it understood.

That was day one of a three-week rebuild.

Why 4-Hour Recordings Are a Different Problem

A 45-minute standup has one topic thread, maybe a handful of decisions, a few action items. A 4-hour QBR has 12 agenda sections, speaker handoffs when someone shares their screen, and references to numbers from three months ago that get mentioned without re-stating the original context.

Our original architecture handled the 45-minute case well. It had three stages: transcription and diarization via Deepgram Nova-2, one LLM extraction pass asking for decisions and action items, and structured output. Processing time for a 45-minute meeting: about 3 minutes.
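
In sketch form, that pipeline was a single linear pass. The version below is condensed and illustrative, not the production code: the Deepgram pre-recorded call and the extraction prompt are simplified, and error handling is omitted.

```python
# Condensed sketch of the original three-stage pipeline (illustrative, not
# production code). Assumes DEEPGRAM_API_KEY and OPENAI_API_KEY are set.
import os
import requests
from openai import OpenAI

def transcribe(audio_url: str) -> str:
    # Stage 1: transcription + diarization via Deepgram Nova-2 (pre-recorded API)
    resp = requests.post(
        "https://api.deepgram.com/v1/listen",
        params={"model": "nova-2", "diarize": "true", "punctuate": "true"},
        headers={"Authorization": f"Token {os.environ['DEEPGRAM_API_KEY']}"},
        json={"url": audio_url},
    )
    return resp.json()["results"]["channels"][0]["alternatives"][0]["transcript"]

def extract(transcript: str) -> str:
    # Stage 2: one extraction pass over the whole transcript.
    # This is the stage that fell over at 4 hours.
    client = OpenAI()
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract decisions and action items with owners and timestamps."},
            {"role": "user", "content": transcript},
        ],
    )
    # Stage 3: structured output was parsed from this response
    return out.choices[0].message.content
```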

At 4 hours, the LLM extraction stage became the failure point. The full transcript was 35,000 to 42,000 tokens. We were sending that to GPT-4o with an 8K output limit. The model was compressing, prioritizing arbitrarily, and filling gaps with plausible-sounding details. The revenue hallucination wasn’t random. It was the model interpolating from context it thought it understood but didn’t have the full picture on.

A better prompt didn’t fix it. We needed a different architecture.

Chunking at Semantic Boundaries

The first fix I tried was fixed-interval chunking: split the transcript every 15 minutes, run extraction on each chunk, merge results. Seemed reasonable. It was wrong.

Conversations don’t follow 15-minute boundaries. A discussion about Q3 targets might start at minute 47, get interrupted by a headcount conversation at minute 51, then resume at minute 58. Fixed chunking drops that conversation into three separate processing windows. The model in each window doesn’t know the other two exist. You get partial decisions attributed to the wrong sections.

Semantic chunking works differently. The full transcript runs through an embedding pass first. We use OpenAI’s text-embedding-3-small to embed overlapping windows of transcript text, then identify where similarity drops below a threshold. High similarity means the same topic continuing. Low similarity means a topic shift or agenda transition. The chunks follow the actual conversation structure.
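
In sketch form, the boundary detection is an embedding pass plus a similarity threshold. The window size and threshold below are illustrative, not the values we tuned on the labeled set:

```python
# Simplified semantic-boundary detection. Window size and threshold are
# illustrative; production values were tuned against labeled QBR recordings.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def semantic_boundaries(sentences: list[str], window: int = 10, threshold: float = 0.75) -> list[int]:
    step = window // 2
    # Embed overlapping windows of transcript text
    windows = [" ".join(sentences[i:i + window]) for i in range(0, len(sentences), step)]
    vecs = embed(windows)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    # A drop in cosine similarity between adjacent windows marks a topic shift
    boundaries = [0]
    for i in range(1, len(vecs)):
        if float(vecs[i] @ vecs[i - 1]) < threshold:
            boundaries.append(i * step)  # index of the sentence where the topic shifted
    return boundaries
```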

The embedding pass on a 40K-word transcript takes about 90 seconds. Worth it. Action item attribution accuracy went from 74% (fixed chunking) to 91% (semantic chunking) on a labeled set of 20 QBR recordings we used for testing.

What Multimodal Actually Changes

Most meeting intelligence tools process audio only. This works for talk-heavy meetings. It breaks for anything with slides.

When a speaker says “we need to hit this number,” the number they mean is on their screen. In audio-only processing, “this number” is a reference with no referent. You get an action item like: “Team committed to hitting [unresolved reference].” That’s worse than no action item.

We added video frame extraction. At each semantic chunk boundary, and whenever the audio contains a slide-click sound or the transcript includes a phrase like “switching to” or “let me show you,” we extract a frame from the video. That frame goes into a multimodal prompt alongside the chunk transcript.

The prompt asks: what is visible on screen, and does it resolve any ambiguous references in this section?
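
A stripped-down version of that frame-grab-plus-prompt step looks roughly like this. The slide-click audio detector is omitted, and the trigger phrases and prompt wording are illustrative:

```python
# Simplified frame extraction + multimodal extraction pass. Trigger phrases,
# file paths, and prompt wording are illustrative, not the production values.
import base64
import subprocess
from openai import OpenAI

def mentions_slide_change(text: str) -> bool:
    # Cheap transcript heuristic: phrases like these usually mean a new slide
    return any(t in text.lower() for t in ("switching to", "let me show you"))

def grab_frame(video_path: str, timestamp_s: float, out_path: str = "/tmp/frame.jpg") -> str:
    # ffmpeg: seek to the timestamp and write a single frame
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(timestamp_s), "-i", video_path, "-frames:v", "1", out_path],
        check=True, capture_output=True,
    )
    return out_path

def extract_with_frame(chunk_text: str, video_path: str, timestamp_s: float) -> str:
    frame = grab_frame(video_path, timestamp_s)
    b64 = base64.b64encode(open(frame, "rb").read()).decode()
    client = OpenAI()
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "Here is one section of a meeting transcript and a frame from the shared "
                    "screen at that moment. What is visible on screen, and does it resolve any "
                    "ambiguous references ('this number', 'option B') in the text?\n\n" + chunk_text
                )},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return out.choices[0].message.content
```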

Two things improved immediately.

Numerical accuracy. Revenue figures, KPIs, targets: when the number is on a slide and in the audio at the same time, we can confirm it and cite the timestamp. Hallucination rate on numbers dropped from roughly 11% (audio-only) to under 2%.

Decision attribution. “We decided to go with option B” is much easier to extract correctly when the slide showing option A vs. option B is in the context window. The model doesn’t have to guess what options were on the table.

The frame extraction heuristic isn’t perfect. We catch about 87% of slide transitions. Good enough for production, and it’s the thing that makes this a custom build rather than something off-the-shelf.
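
The citation constraint behind the numbers improvement is mechanical: a figure only makes it into the brief if it can be matched back to a transcript segment with a timestamp. A simplified version of that check, with an illustrative regex and matching rule:

```python
# Simplified citation check: a number survives into the brief only if it can be
# traced to a transcript segment with a timestamp. Regex and matching are
# illustrative, not the production logic.
import re

NUMBER_RE = re.compile(r"\$?\d[\d,\.]*\s*(?:[MKBmkb]|million|billion|%)?")

def cite_number(candidate: str, segments: list[dict]) -> dict | None:
    """segments: [{"text": "...", "start": seconds}, ...] from the diarized transcript."""
    norm = candidate.replace(",", "").lower()
    for seg in segments:
        for match in NUMBER_RE.findall(seg["text"]):
            if match.replace(",", "").lower() == norm:
                return {"value": candidate, "timestamp": seg["start"]}
    # No citation found: the number is dropped or flagged, never printed bare
    return None
```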

The 1-Page Brief Format

The output is designed to fit on one printed page. It has five sections:

Decisions: what was agreed on, with the person who owns it. Example: “Migrate reporting dashboard by June 30 [timestamp 1:23:07].” No decisions without owners. If the meeting produced a direction without assigning someone, we flag it as an incomplete decision.

Open questions: things raised but not resolved, with what’s blocking them. These are the meeting’s unfinished business.

Action items: specific tasks with owners and any dates mentioned. Separate from decisions. A decision is a resolved direction; an action item is a next step someone needs to take.

Numbers mentioned: every specific number, with context and a timestamp. Revenue target, headcount, budget figure, timeline. All of them, all cited.

What needs another meeting: topics that didn’t get resolved and need more than a Slack thread to close.

This format came from running 40 recordings through the early prototype and watching what the team actually went back to read. They didn’t re-read the narrative summary. They looked up action items. They checked what numbers were mentioned. They searched for whether a specific decision had been made.

The brief is a lookup document, not a reading document.
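
In code, the brief is a fixed schema the extraction passes have to fill. The shape below is an approximation with simplified field names; the point is that decisions, action items, and numbers all carry timestamps:

```python
# Approximate shape of the 1-page brief (field names simplified). Every
# decision, action item, and number carries a timestamp back to the recording.
from dataclasses import dataclass, field

@dataclass
class Decision:
    text: str
    owner: str | None      # None means it gets flagged as an incomplete decision
    timestamp: str         # e.g. "1:23:07"

@dataclass
class ActionItem:
    task: str
    owner: str
    due: str | None
    timestamp: str

@dataclass
class CitedNumber:
    value: str             # "$3.2M", "14 hires", ...
    context: str
    timestamp: str

@dataclass
class Brief:
    decisions: list[Decision] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)
    numbers: list[CitedNumber] = field(default_factory=list)
    needs_follow_up: list[str] = field(default_factory=list)
```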

What Broke in Production

Timestamp drift. Our diarization and video frame extraction maintained separate clocks. Video uploads accumulate audio drift of 0.2 to 1.1 seconds over a 4-hour recording. Small drift, but it meant that “timestamp 1:23:07” in the transcript pointed to a different moment in the video than in the frame extractor. We added a sync pass after transcription that normalizes both streams against the video file’s encoded timestamps.
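
The sync pass itself is small. A much-simplified sketch of the idea, assuming the drift is roughly linear and correcting only the transcript side (the real pass normalizes both streams):

```python
# Simplified drift correction: rescale transcript timestamps so the transcript's
# end time lines up with the duration encoded in the video container. Assumes
# roughly linear drift; the production pass normalizes both streams.
import json
import subprocess

def video_duration(video_path: str) -> float:
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", video_path],
        capture_output=True, check=True, text=True,
    )
    return float(json.loads(out.stdout)["format"]["duration"])

def normalize(segments: list[dict], video_path: str) -> list[dict]:
    audio_end = max(seg["end"] for seg in segments)
    scale = video_duration(video_path) / audio_end
    return [{**seg, "start": seg["start"] * scale, "end": seg["end"] * scale} for seg in segments]
```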

Cross-speaker attribution. In a QBR, speakers frequently finish each other’s sentences or paraphrase without quoting. The model was attributing decisions to whoever spoke last about a topic, not to whoever made the call. We added a verification pass that checks each action item against the full section where the decision was formalized, not just the point where it was first mentioned.

The action items that were jokes. Someone says “maybe we should just shut it all down” during a frustrating section. The model would extract it as a real action item. We added a classifier prompt that checks whether a statement is rhetorical vs. an actual commitment. It makes mistakes on dry humor but catches the obvious ones.

The sarcasm filter is still the part I’m least confident in. I don’t have a good answer for it yet. If someone commits to something in a tone of weary resignation, the model sometimes reads it as sarcasm and filters it out. We tune the threshold conservatively: the classifier would rather keep a borderline item than drop a real commitment, and anything it marks as uncertain goes to a human review pass.
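
The classifier is just another prompt. A rough version, with illustrative wording; the production prompt includes more surrounding conversation than a single excerpt:

```python
# Rough sketch of the rhetorical-vs-commitment check. Prompt wording is
# illustrative; the production prompt carries more surrounding context.
from openai import OpenAI

def classify_commitment(statement: str, context: str) -> str:
    client = OpenAI()
    out = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "In the meeting excerpt below, is the quoted statement a genuine commitment, "
                "a rhetorical or joking remark, or uncertain? Answer with exactly one word: "
                "COMMITMENT, RHETORICAL, or UNCERTAIN.\n\n"
                f"Excerpt:\n{context}\n\nStatement: \"{statement}\""
            ),
        }],
    )
    return out.choices[0].message.content.strip().upper()

# COMMITMENT stays in the brief, RHETORICAL is dropped,
# UNCERTAIN goes to the human review pass.
```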

Where It Stands

We run this on our own team’s recordings and have built it for one external client so far: a B2B SaaS company with a weekly 2-hour product review and a monthly 4-hour strategy session.

Processing time for a 4-hour recording: about 8 minutes end-to-end. Output: a 1-page brief and a full searchable transcript with action items tagged to speaker and timestamp.

If you’re considering building something similar, the architecture isn’t the hard part. The hard part is defining what a “decision” means for your specific meeting format and building enough labeled data to validate that the model agrees with you. We used 40 recordings. That took two weeks.

The full case study is here if you want the product framing.


Considering building meeting intelligence for your team or product? Book a 30-minute call. We’ll tell you honestly whether this architecture fits your meeting format or whether you’d need a different approach.

FAQ

How long does it take to process a 4-hour recording?

About 8 minutes end-to-end: transcription and diarization via Deepgram (~3 min), semantic chunking and embedding (~90 seconds), multimodal extraction across all chunks (~3 min), brief assembly and citation verification (~1 min). This varies with recording quality and speaker count.

What tools does this use?

Deepgram Nova-2 for transcription and diarization, OpenAI text-embedding-3-small for semantic chunking, GPT-4o for extraction and the multimodal frame passes, Cloudflare Workers for pipeline orchestration. Video frame extraction uses FFmpeg with a simple slide-transition heuristic.

How is this different from Otter.ai or Fireflies?

Those tools focus on real-time transcription and meeting history search. They produce full-length summaries. This produces a 1-page brief specifically for long recordings where the standard summary would be 2,000+ words. The brief format is the product. It also processes recordings uploaded after the fact (board meeting from last quarter) rather than integrating with calendar software for live capture.

What meeting formats work best?

Long, structured meetings with multiple agenda sections: board meetings, QBRs, strategy sessions, design reviews. It works less well for unstructured brainstorming sessions. The brief format assumes decisions are being made in a recognizable pattern. Free-form brainstorms don’t produce decisions in a shape the model can reliably extract.

What would it cost to build something like this?

At Kalvium Labs, a version scoped for a specific meeting format and team size runs $8,000 to $15,000 depending on the integrations needed (Zoom or Meet recording API, Slack notifications, CRM sync). Real-time processing (a streaming pipeline) adds 30 to 40% to that estimate. The cost pays for itself quickly for any team running more than 2 hours of recorded meetings per week.



Written by Abraham Jeron

Abraham works closely with founders to design, prototype, and ship software products and agentic AI solutions. He converts product ideas into technical execution — architecting systems, planning sprints, and getting teams to deliver fast. He's built RAG chatbots, multi-agent content engines, agentic analytics layers with Claude Agent SDK and MCP, and scaled assessment platforms to thousands of users.
