Meeting Intelligence: AI-Powered Meeting Analysis
A multimodal AI tool that processes meeting recordings and generates structured intelligence reports — summaries, key decisions, discussion flow diagrams, and AI-selected key frames.
Why We Built This
Meetings generate decisions, action items, and context that everyone forgets 30 minutes later. Transcription tools exist, but a transcript isn't intelligence — it's a wall of text. We wanted something that actually understands what happened in a meeting and delivers structured, actionable output.
We built this internally as a tool we use ourselves — and as a demonstration of what multimodal AI can do when applied to a real workflow problem.
What It Does
Feed it a meeting recording (MP4). It produces:
- Meeting summary — concise overview of what was discussed, not a transcript dump
- Key decisions and action items — extracted from the conversation with speaker attribution
- Discussion flow diagram — a Mermaid flowchart showing how the conversation moved between topics, decisions, and action items
- AI-selected key frames — the tool analyzes video frames (screen shares, diagrams, whiteboard sketches) and picks the 3-5 most relevant ones
- Transcript excerpt — key moments with speaker labels, not the full transcript
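To make the discussion flow diagram concrete: a report for a hypothetical sprint-planning call might contain Mermaid code like the following (topics, decision, and action items are invented for illustration):

```mermaid
flowchart TD
    A[Topic: Q3 roadmap] --> B{Decision: ship v2 beta in August}
    B --> C[Action: draft the migration guide]
    A --> D[Topic: infra cost review]
    D --> E[Action: audit Cloud Run spend]
```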
What makes this different from transcription tools: This isn't Otter.ai. The tool analyzes audio AND video simultaneously — it sees what's on screen (code, diagrams, slides) and correlates it with what's being said. A screen share of an architecture diagram during a discussion about system design gets flagged as a key frame. A transcription tool would just give you the words.
How It Works
The pipeline has two stages:
Stage 1: Preprocessing — FFmpeg extracts the audio track as MP3 and captures keyframes from the video at one frame every 10 seconds.
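The preprocessing stage boils down to two FFmpeg invocations, sketched here in Python via `subprocess` (the function names and output layout are our own; the flags mirror standard FFmpeg usage):

```python
import subprocess
from pathlib import Path

def audio_cmd(video: str, out_mp3: str) -> list[str]:
    # -vn drops the video stream; libmp3lame encodes the audio track to MP3
    return ["ffmpeg", "-y", "-i", video, "-vn", "-acodec", "libmp3lame", out_mp3]

def frames_cmd(video: str, out_dir: str) -> list[str]:
    # fps=1/10 samples one frame every 10 seconds
    return ["ffmpeg", "-y", "-i", video, "-vf", "fps=1/10",
            str(Path(out_dir) / "frame_%04d.jpg")]

def preprocess(video: str, workdir: str) -> None:
    Path(workdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(audio_cmd(video, str(Path(workdir) / "audio.mp3")), check=True)
    subprocess.run(frames_cmd(video, workdir), check=True)
```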
Stage 2: Intelligence Generation — Both the audio and the keyframes are uploaded to Gemini's Files API. The model processes them together — audio for conversation content, frames for visual context — and returns a structured JSON report conforming to a strict output schema.
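A minimal sketch of the generation stage, assuming the google-genai Python SDK and a `GEMINI_API_KEY` in the environment. The schema's field names and the prompt wording are illustrative guesses, not the production values:

```python
import json
from pathlib import Path

# JSON schema for the report. Field names follow the post's description;
# the exact production schema is an assumption.
REPORT_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "mermaid_code": {"type": "string"},
        "transcript_excerpt": {"type": "string"},
        "relevant_frame_names": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["summary", "mermaid_code", "transcript_excerpt",
                 "relevant_frame_names"],
}

def generate_report(audio_path: str, frame_dir: str) -> dict:
    """Upload audio + frames, then request a schema-constrained report."""
    from google import genai  # lazy import: google-genai SDK, not stdlib
    client = genai.Client()   # reads GEMINI_API_KEY from the environment
    parts = [client.files.upload(file=audio_path)]
    parts += [client.files.upload(file=str(p))
              for p in sorted(Path(frame_dir).glob("*.jpg"))]
    parts.append("Analyze this meeting: summarize it, chart the discussion "
                 "flow as Mermaid, quote key moments with speakers, and name "
                 "the most relevant frames.")
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=parts,
        config={"response_mime_type": "application/json",
                "response_schema": REPORT_SCHEMA},
    )
    return json.loads(resp.text)
```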
The structured output schema enforces that every report contains the same fields: summary, Mermaid code, transcript excerpt, and relevant frame names. No freeform text. No inconsistent formats. Every report is machine-parseable.
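Downstream consumers can hold the pipeline to that contract with a small validation step. A sketch, with field names assumed from the list above:

```python
import json

# Fields every report must carry (names are our assumption, per the post)
REQUIRED = ("summary", "mermaid_code", "transcript_excerpt",
            "relevant_frame_names")

def parse_report(raw: str) -> dict:
    # Reject any response that is not JSON or is missing a required field,
    # so renderers (email, dashboard) never see freeform text.
    report = json.loads(raw)
    missing = [k for k in REQUIRED if k not in report]
    if missing:
        raise ValueError(f"report missing fields: {missing}")
    return report
```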
Tech Stack
- Gemini 2.5 Flash — multimodal model that processes audio and images together in a single call
- FFmpeg — audio extraction and keyframe capture from video
- Worker pools — parallel frame upload and base64 encoding for performance
- Structured output (JSON schema) — enforced output format, no freeform LLM responses
- Docker — containerized for deployment to Cloud Run or any container platform
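The worker-pool item above can be sketched with Python's `ThreadPoolExecutor`. Here the per-frame work is stubbed as plain base64 encoding so the pattern is visible; in the real pipeline each worker would also push its frame to the Files API, which is where threads pay off, since that work is I/O-bound:

```python
import base64
from concurrent.futures import ThreadPoolExecutor

def encode_frame(data: bytes) -> str:
    # Base64-encode one frame. In the real pipeline this worker would also
    # upload the frame (function names here are illustrative).
    return base64.b64encode(data).decode("ascii")

def encode_frames(frames: list[bytes], workers: int = 8) -> list[str]:
    # Worker pool: process frames concurrently instead of one at a time.
    # pool.map preserves input order in its results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_frame, frames))
```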
Architecture Decision: Why Gemini
Most meeting intelligence tools use a two-step approach: transcribe with Whisper, then summarize with GPT. We chose a different path — Gemini processes the raw audio directly, alongside the video frames, in a single multimodal call.
Why: A two-step approach loses context. The transcription step strips out tone, emphasis, and timing. It also can't correlate what's being said with what's on screen. Gemini's native multimodal capability means the model hears the discussion about "this architecture" while simultaneously seeing the architecture diagram being shared on screen.
The Result
A 30-minute meeting recording is processed in under 3 minutes. The output is a structured JSON report that can be rendered as an email or dashboard card, or fed into any downstream system.
We use this tool internally — and we're integrating it into our sales process. Every discovery call with a prospect gets processed by Meeting Intelligence, and the prospect receives the AI-generated report as a follow-up. The call itself becomes a product demo.
Want something like this built?
Tell us the problem. We'll tell you what 72 hours can produce.