How It Works
When you ingest a video, Mixpeek runs a multi-stage pipeline:

- Chunking — Videos are split into segments using scene detection, silence detection, or fixed intervals
- Parallel Extraction — Multiple extractors run concurrently:
  - Transcription: Whisper extracts speech-to-text with timestamps
  - Visual Embeddings: A multimodal model generates embeddings from keyframes
  - Thumbnails: Representative frames are extracted for each segment
- Description & OCR — Gemini generates segment descriptions and extracts on-screen text
- Multi-Vector Indexing — Separate embeddings for transcription and visual content enable hybrid search
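The stages above can be sketched conceptually. This is an illustration only: the function names are placeholders, not Mixpeek's internal API.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch only; names are placeholders, not Mixpeek's API.
def chunk(video: str) -> list[str]:
    # Real chunking uses scene detection, silence detection, or fixed intervals.
    return [f"{video}#segment-{i}" for i in range(3)]

def transcribe(seg: str) -> dict:
    return {"segment": seg, "transcript": "..."}

def embed(seg: str) -> dict:
    return {"segment": seg, "embedding": [0.0] * 4}

def thumbnail(seg: str) -> dict:
    return {"segment": seg, "thumbnail": f"{seg}.jpg"}

def ingest(video: str) -> list[dict]:
    segments = chunk(video)
    with ThreadPoolExecutor() as pool:  # extractors run concurrently
        transcripts = list(pool.map(transcribe, segments))
        embeddings = list(pool.map(embed, segments))
        thumbs = list(pool.map(thumbnail, segments))
    # Multi-vector indexing: transcript and visual vectors stay separate per segment.
    return [{**t, **e, **th} for t, e, th in zip(transcripts, embeddings, thumbs)]

docs = ingest("talk.mp4")
print(len(docs))  # 3
```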
Feature Extractors
| Extractor | Outputs |
|---|---|
| video_extractor@v1 | Scene embeddings, keyframes, timestamps |
| audio_extractor@v1 | Transcription, speaker diarization |
| text_extractor@v1 | Text embeddings, OCR from frames |
| face_extractor@v1 | Face embeddings, bounding boxes |
1. Create a Bucket
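A bucket-creation request might look like the following. The endpoint path and field names are assumptions, not Mixpeek's documented schema; check the API reference for the real shape.

```python
import json

# Hypothetical request body -- field names are assumptions.
bucket = {
    "bucket_name": "video-library",
    "description": "Source videos for ingestion",
}
# Would be sent as something like:
#   requests.post(f"{BASE_URL}/v1/buckets", json=bucket,
#                 headers={"Authorization": f"Bearer {API_KEY}"})
print(json.dumps(bucket))
```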
2. Create Collections
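For scenes, the collection binds to the video extractor from the table above. The payload below is a sketch; the field names are assumptions.

```python
# Hypothetical collection definition -- only the extractor identifier
# (video_extractor@v1) comes from the table above.
scene_collection = {
    "collection_name": "video-scenes",
    "source": {"bucket": "video-library"},
    "feature_extractor": "video_extractor@v1",
}
```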
3. Ingest Videos
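An ingestion request might reference the bucket and the video URLs. The field names and URL here are illustrative, not Mixpeek's documented schema.

```python
# Hypothetical ingestion payload -- field names are assumptions.
ingest_request = {
    "bucket": "video-library",
    "objects": [
        {
            "url": "https://example.com/videos/keynote.mp4",
            "metadata": {"title": "Keynote"},
        },
    ],
}
```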
4. Process
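Processing is typically asynchronous, so a client usually polls a job status until it reaches a terminal state. The status values and polling pattern below are assumptions, not Mixpeek's documented job lifecycle.

```python
import time

def wait_for(job_status_fn, poll_s: float = 0.01, timeout_s: float = 1.0) -> str:
    # Poll until the (hypothetical) job reports a terminal state.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = job_status_fn()
        if status in ("completed", "failed"):
            return status
        time.sleep(poll_s)
    return "timeout"

# Stub standing in for a real status-endpoint call.
states = iter(["queued", "processing", "completed"])
print(wait_for(lambda: next(states)))  # completed
```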
5. Create a Hybrid Retriever
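Combine visual and transcript search in one retriever. The stage names, index names, weights, and fusion strategy in this sketch are illustrative assumptions.

```python
# Hypothetical retriever definition -- stage/index names are assumptions.
hybrid_retriever = {
    "retriever_name": "video-hybrid",
    "stages": [
        {"type": "vector_search", "index": "visual_embedding", "weight": 0.5},
        {"type": "vector_search", "index": "transcript_embedding", "weight": 0.5},
    ],
    "fusion": "reciprocal_rank",
}
```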
6. Search
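A text query goes to the retriever. The payload shape is assumed; the fusion function below is a client-side illustration of how two ranked lists can be merged (reciprocal rank fusion), not Mixpeek's implementation.

```python
# Hypothetical query payload -- field names are assumptions.
query = {
    "retriever": "video-hybrid",
    "query": "CEO discussing quarterly results",
    "limit": 5,
}

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal rank fusion: score = sum of 1/(k + rank) across rankings.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

visual = ["scene-2", "scene-7", "scene-9"]
transcript = ["scene-2", "scene-4", "scene-7"]
print(rrf([visual, transcript]))  # ['scene-2', 'scene-7', 'scene-4', 'scene-9']
```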
Moment-Level Search
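Filter by timestamp to find moments within a time range. The filter syntax below is an assumption; the real operators may differ.

```python
# Hypothetical moment query with a time-range filter (syntax assumed).
moment_query = {
    "retriever": "video-hybrid",
    "query": "product demo",
    "filters": {
        "start_time": {"gte": 120.0},  # seconds into the video
        "end_time": {"lte": 300.0},
    },
}
```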
Speaker-Specific Search
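With diarization enabled, segments carry speaker labels that can be used as filters. The field name and label format here are assumptions.

```python
# Hypothetical speaker-filtered query. Diarization labels come from
# audio_extractor@v1; the filter field name is an assumption.
speaker_query = {
    "retriever": "video-hybrid",
    "query": "pricing roadmap",
    "filters": {"speaker": "SPEAKER_01"},
}
```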
Output Example
Scene document from video_extractor@v1:
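The exact fields depend on your collection configuration; the shape below is illustrative only, and every field name is an assumption.

```json
{
  "document_id": "doc_abc123",
  "source_object": "keynote.mp4",
  "start_time": 12.4,
  "end_time": 31.9,
  "keyframes": ["kf_000.jpg", "kf_001.jpg"],
  "transcript": "Welcome everyone to our quarterly update...",
  "description": "A presenter on stage in front of a slide...",
  "embedding_refs": {
    "visual": "vec_visual_001",
    "transcript": "vec_text_001"
  }
}
```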
Parameters
| Parameter | Effect |
|---|---|
| scene_detection_threshold | Lower = more scenes (0.2-0.5) |
| keyframe_interval | Seconds between keyframes |
| max_scenes | Cap scenes per video |
| transcription_model | whisper-base (fast) or whisper-large-v3 (accurate) |
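These parameters might be passed together when processing a collection. The parameter names come from the table above; where and how they are supplied is an assumption.

```python
# Sketch of an extraction-parameter set using the names from the table above.
extraction_params = {
    "scene_detection_threshold": 0.3,   # lower catches more cuts (0.2-0.5 typical)
    "keyframe_interval": 5,             # one keyframe every 5 seconds
    "max_scenes": 200,                  # hard cap per video
    "transcription_model": "whisper-large-v3",  # slower but more accurate
}
```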

