Meeting Intelligence: AI-Powered Meeting Analysis
A multimodal AI tool that processes meeting recordings and generates structured intelligence reports — summaries, key decisions, discussion flow diagrams, and AI-selected key frames.
Why We Built This
Meetings generate decisions, action items, and context that everyone forgets 30 minutes later. Transcription tools exist, but a transcript isn't intelligence — it's a wall of text. We wanted something that actually understands what happened in a meeting and delivers structured, actionable output.
We built this internally as a tool we use ourselves — and as a demonstration of what multimodal AI can do when applied to a real workflow problem.
What It Does
Feed it a meeting recording (MP4). It produces:
- Meeting summary — concise overview of what was discussed, not a transcript dump
- Key decisions and action items — extracted from the conversation with speaker attribution
- Discussion flow diagram — a Mermaid flowchart showing how the conversation moved between topics, decisions, and action items
- AI-selected key frames — the tool analyzes video frames (screen shares, diagrams, whiteboard sketches) and picks the 3-5 most relevant ones
- Transcript excerpt — key moments with speaker labels, not the full transcript
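To make the discussion flow diagram concrete: a report for a hypothetical sprint-planning call might contain Mermaid code like the following (topics, decision, and action items are invented for illustration):

```mermaid
flowchart TD
    A[Topic: Q3 roadmap] --> B{Decision: ship v2 beta in August}
    B --> C[Action: draft the migration guide]
    A --> D[Topic: infra cost review]
    D --> E[Action: audit Cloud Run spend]
```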
What makes this different from transcription tools: This isn't Otter.ai. The tool analyzes audio AND video simultaneously — it sees what's on screen (code, diagrams, slides) and correlates it with what's being said. A screen share of an architecture diagram during a discussion about system design gets flagged as a key frame. A transcription tool would just give you the words.
How It Works
The pipeline has two stages:
Stage 1: Preprocessing — FFmpeg extracts the audio track as MP3 and captures keyframes from the video at one frame every 10 seconds.
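The preprocessing stage boils down to two FFmpeg invocations, sketched here in Python via `subprocess` (the function names and output layout are our own; the flags mirror standard FFmpeg usage):

```python
import subprocess
from pathlib import Path

def audio_cmd(video: str, out_mp3: str) -> list[str]:
    # -vn drops the video stream; libmp3lame encodes the audio track to MP3
    return ["ffmpeg", "-y", "-i", video, "-vn", "-acodec", "libmp3lame", out_mp3]

def frames_cmd(video: str, out_dir: str) -> list[str]:
    # fps=1/10 samples one frame every 10 seconds
    return ["ffmpeg", "-y", "-i", video, "-vf", "fps=1/10",
            str(Path(out_dir) / "frame_%04d.jpg")]

def preprocess(video: str, workdir: str) -> None:
    Path(workdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_cmd(video, str(Path(workdir) / "audio.mp3")), check=True)
    subprocess.run(frames_cmd(video, workdir), check=True)
```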
Stage 2: Intelligence Generation — Both the audio and the keyframes are uploaded to Gemini's Files API. The model processes them together — audio for conversation content, frames for visual context — and returns a structured JSON report conforming to a strict output schema.
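A minimal sketch of the generation stage, assuming the google-genai Python SDK and a `GEMINI_API_KEY` in the environment. The schema's field names and the prompt wording are illustrative guesses, not the production values:

```python
import json
from pathlib import Path

# JSON schema for the report. Field names follow the post's description;
# the exact production schema is an assumption.
REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "mermaid_code": {"type": "string"},
        "transcript_excerpt": {"type": "string"},
        "relevant_frame_names": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["summary", "mermaid_code", "transcript_excerpt",
                 "relevant_frame_names"],
}

def generate_report(audio_path: str, frame_dir: str) -> dict:
    """Upload audio + frames, then request a schema-constrained report."""
    from google import genai  # lazy import: google-genai SDK, not stdlib
    client = genai.Client()   # reads GEMINI_API_KEY from the environment
    parts = [client.files.upload(file=audio_path)]
    parts += [client.files.upload(file=str(p))
              for p in sorted(Path(frame_dir).glob("*.jpg"))]
    parts.append("Analyze this meeting: summarize it, chart the discussion "
                 "flow as Mermaid, quote key moments with speakers, and name "
                 "the most relevant frames.")
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=parts,
        config={"response_mime_type": "application/json",
                "response_schema": REPORT_SCHEMA},
    )
    return json.loads(resp.text)
```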
The structured output schema enforces that every report contains the same fields: summary, Mermaid code, transcript excerpt, and relevant frame names. No freeform text. No inconsistent formats. Every report is machine-parseable.
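Downstream consumers can hold the pipeline to that contract with a small validation step. A sketch, with field names assumed from the list above:

```python
import json

# Fields every report must carry (names are our assumption, per the post)
REQUIRED = ("summary", "mermaid_code", "transcript_excerpt",
            "relevant_frame_names")

def parse_report(raw: str) -> dict:
    # Reject any response that is not JSON or is missing a required field,
    # so renderers (email, dashboard) never see freeform text.
    report = json.loads(raw)
    missing = [k for k in REQUIRED if k not in report]
    if missing:
        raise ValueError(f"report missing fields: {missing}")
    return report
```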
Tech Stack
- Gemini 2.5 Flash — multimodal model that processes audio and images together in a single call
- FFmpeg — audio extraction and keyframe capture from video
- Worker pools — parallel frame upload and base64 encoding for performance
- Structured output (JSON schema) — enforced output format, no freeform LLM responses
- Docker — containerized for deployment to Cloud Run or any container platform
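The worker-pool item above can be sketched with Python's `ThreadPoolExecutor`. Here the per-frame work is stubbed as plain base64 encoding so the pattern is visible; in the real pipeline each worker would also push its frame to the Files API, which is where threads pay off, since that work is I/O-bound:

```python
import base64
from concurrent.futures import ThreadPoolExecutor

def encode_frame(data: bytes) -> str:
    # Base64-encode one frame. In the real pipeline this worker would also
    # upload the frame (function names here are illustrative).
    return base64.b64encode(data).decode("ascii")

def encode_frames(frames: list[bytes], workers: int = 8) -> list[str]:
    # Worker pool: process frames concurrently instead of one at a time.
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_frame, frames))
```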
Architecture Decision: Why Gemini
Most meeting intelligence tools use a two-step approach: transcribe with Whisper, then summarize with GPT. We chose a different path — Gemini processes the raw audio directly, alongside the video frames, in a single multimodal call.
Why: A two-step approach loses context. The transcription step strips out tone, emphasis, and timing. It also can't correlate what's being said with what's on screen. Gemini's native multimodal capability means the model hears the discussion about "this architecture" while simultaneously seeing the architecture diagram being shared on screen.
The Result
A 30-minute meeting recording is processed in under 3 minutes. The output is a structured JSON report that can be rendered as an email or dashboard card, or fed into any downstream system.
We use this tool internally — and we're integrating it into our sales process. Every discovery call with a prospect gets processed by Meeting Intelligence, and the prospect receives the AI-generated report as a follow-up. The call itself becomes a product demo.
Want something like this built?
Tell us the problem. We'll tell you what 72 hours can produce.