Business Impact: Transform hours of video into searchable moments. Find “scenes where someone mentions pricing” across 10,000 videos in milliseconds. Enable visual search, auto-moderation, highlight reels, and compliance monitoring without watching a single frame.
vs Building It Yourself
| Task | Without Mixpeek | With Mixpeek |
|---|---|---|
| Video processing pipeline (FFmpeg, codecs) | 6-8 weeks | Instant |
| GPU cluster for CLIP embeddings | 4-6 weeks | Instant |
| Whisper transcription deployment | 3-4 weeks | Instant |
| Scene detection & keyframe extraction | 4-5 weeks | Config change |
| Multi-modal search (visual + audio) | 6-8 weeks | 1 hour |
| Distributed processing at scale | 8-12 weeks | Built-in |
Key Differentiator: Process scenes, audio, faces, and on-screen text in parallel with one API call. Ray handles distribution, GPU scheduling, and retries—you just configure extractors and query results.
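As an illustrative sketch (the field names and shape are assumptions, not the documented Mixpeek API), a single processing request could name all four extractors from the table below:

```python
# Hypothetical request payload for one processing call that fans out to
# multiple extractors in parallel. Field names are illustrative only.
def build_process_request(object_id: str, extractors: list[str]) -> dict:
    return {
        "object_id": object_id,
        "feature_extractors": [
            {"name": spec.split("@")[0], "version": spec.split("@")[1]}
            for spec in extractors
        ],
    }

request = build_process_request(
    "vid_123",
    [
        "video_extractor@v1",
        "audio_extractor@v1",
        "text_extractor@v1",
        "face_extractor@v1",
    ],
)
```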
Object Decomposition
Feature Extractors
| Extractor | Outputs | Use Cases |
|---|---|---|
| video_extractor@v1 | Scene embeddings (CLIP), keyframes, timestamps | Visual search, scene similarity, highlight detection |
| audio_extractor@v1 | Transcription (Whisper), audio embeddings, speaker diarization | Dialogue search, podcast indexing, accessibility |
| text_extractor@v1 | Text embeddings (E5), OCR text from frames | On-screen text search, subtitle generation |
| face_extractor@v1 | Face embeddings (ArcFace), bounding boxes | Character tracking, person search |
Implementation Steps
1. Create a Bucket for Videos
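A minimal sketch of a bucket definition for raw video objects; the field names here are assumptions for illustration, not the documented Mixpeek schema:

```python
# Validate an (illustrative) bucket config before sending it to the API.
def validate_bucket(cfg: dict) -> dict:
    required = {"bucket_name", "object_type"}
    missing = required - cfg.keys()
    if missing:
        raise ValueError(f"bucket config missing fields: {sorted(missing)}")
    return cfg

bucket = validate_bucket({
    "bucket_name": "product-videos",
    "object_type": "video",
    "description": "Raw video files for scene and transcript indexing",
})
```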
2. Define Collections for Different Modalities
Scene-Level Collection:
3. Register Video Objects
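Registration can be sketched as a payload that points the bucket at the video in S3 plus searchable metadata; the field names are hypothetical:

```python
# Build an (illustrative) object-registration payload for a bucket.
def register_object(bucket_name: str, url: str, metadata: dict) -> dict:
    if not url.startswith("s3://"):
        raise ValueError("expected an S3 URL")
    return {"bucket": bucket_name, "url": url, "metadata": metadata}

obj = register_object(
    "product-videos",
    "s3://my-media/launch-keynote.mp4",
    {"title": "Launch Keynote", "duration_s": 3600},
)
```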
4. Process with Batch
- Download the video from S3
- Run scene detection and extract keyframes
- Generate CLIP embeddings for each scene
- Transcribe audio with Whisper
- Create documents in both collections with lineage references
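The five steps above can be sketched as a skeleton pipeline. The download and ML calls are stubs standing in for S3, scene detection, CLIP, and Whisper, so only the control flow and lineage bookkeeping are real:

```python
def download(url):                  # step 1: fetch from S3 (stub)
    return b"raw-bytes"

def detect_scenes(video_bytes):     # step 2: scene detection (stub)
    return [(0.0, 12.5), (12.5, 30.0)]

def embed_scene(video_bytes, span): # step 3: CLIP embedding per scene (stub)
    return [0.0] * 4

def transcribe(video_bytes):        # step 4: Whisper transcription (stub)
    return [{"start": 0.0, "end": 5.0, "text": "welcome to the keynote"}]

def process_video(object_id, url):
    raw = download(url)
    scene_docs = [
        {"source_object_id": object_id, "modality": "scene",
         "start_time": s, "end_time": e, "embedding": embed_scene(raw, (s, e))}
        for s, e in detect_scenes(raw)
    ]
    transcript_docs = [
        {"source_object_id": object_id, "modality": "transcript", **seg}
        for seg in transcribe(raw)
    ]
    # step 5: every document keeps lineage back to the source object
    return scene_docs, transcript_docs

scenes, transcripts = process_video("vid_123", "s3://my-media/keynote.mp4")
```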
5. Build a Hybrid Retriever
Combine visual and textual search for comprehensive video queries:
6. Execute Searches
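The fusion strategy is not spelled out here; as one common approach, reciprocal rank fusion merges the visual and transcript rankings a hybrid retriever returns:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

visual = ["scene_7", "scene_2", "scene_9"]   # CLIP similarity ranking
textual = ["scene_2", "scene_5", "scene_7"]  # transcript match ranking
fused = rrf([visual, textual])               # scene_2 ranks first: high in both
```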
Text Query:
Model Evolution & A/B Testing
Test different scene detection thresholds, transcription models, and embedding versions without rebuilding your entire video catalog.
Test Scene Detection Parameters
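A toy model of the threshold's effect: treat each frame-to-frame difference score above the threshold as a scene cut, so a lower threshold yields more, finer-grained scenes (the scores are made up for illustration):

```python
def count_scenes(frame_diffs, threshold):
    # Each difference score above the threshold starts a new scene.
    return 1 + sum(d > threshold for d in frame_diffs)

diffs = [0.1, 0.45, 0.25, 0.6, 0.3, 0.05, 0.5]
v1 = count_scenes(diffs, threshold=0.4)  # stricter: fewer, coarser scenes
v2 = count_scenes(diffs, threshold=0.2)  # looser: finer granularity
```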
Compare Transcription Models
Measure Impact
- Scenes per video: v1 (12) vs v2 (28) → better granularity
- Transcription accuracy: v1 (92%) vs v2 (97%) → fewer search misses
- Processing cost: v1 (0.05 credits) vs v2 (0.12 credits) → 2.4x cost
- User satisfaction: v1 (3.2/5) vs v2 (4.1/5) → worth the upgrade
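The cost line follows directly from the credit figures above:

```python
v1 = {"scenes": 12, "accuracy": 0.92, "credits": 0.05, "csat": 3.2}
v2 = {"scenes": 28, "accuracy": 0.97, "credits": 0.12, "csat": 4.1}
cost_ratio = v2["credits"] / v1["credits"]  # 0.12 / 0.05 -> 2.4x per video
```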
Seamless Migration
Advanced Patterns
Highlight Generation
Use clustering to identify key moments:
Moment-Level Search
Filter by timestamp to find specific segments:
Speaker-Specific Search
If the audio extractor enables diarization:
Cross-Video Analysis
Search across multiple videos by omitting object-level filters:
Output Schema Example
Scene documents produced by video_extractor@v1:
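The document below is illustrative: its field names follow the outputs listed for video_extractor@v1 (scene embeddings, keyframes, timestamps) but are assumptions, not the exact schema:

```python
# Hypothetical scene document shape; embedding is truncated for display.
scene_doc = {
    "document_id": "doc_abc",
    "source_object_id": "vid_123",   # lineage back to the bucket object
    "collection": "video_scenes",
    "start_time": 12.5,
    "end_time": 30.0,
    "keyframe_url": "s3://derived/keyframes/vid_123_0001.jpg",
    "embedding": [0.12, -0.03, 0.88],
    "extractor": "video_extractor@v1",
}
```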
Performance Considerations
| Optimization | Impact |
|---|---|
| Scene detection threshold | Lower = more scenes but slower processing. Tune between 0.2-0.5 |
| Keyframe interval | Extract fewer frames (every 60s vs 30s) for faster processing |
| Max scenes limit | Cap scenes per video to control document count and cost |
| Transcription model | Use whisper-base for speed, whisper-large-v3 for accuracy |
| Batch size | Process 10-50 videos per batch for optimal throughput |
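Per the batch-size row, a simple chunking helper keeps each request in the 10-50 video range (sketch):

```python
def batches(items, size=25):
    """Yield fixed-size chunks of a video list for batch submission."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

video_ids = [f"vid_{n}" for n in range(120)]
chunks = list(batches(video_ids, size=25))  # 4 full batches + 1 partial
```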
Use Case Examples
Video Search Engine
Enable users to search video libraries by text (“find scenes with dogs”) or image (“find similar product shots”). Combine video_extractor scenes with audio_extractor transcripts for comprehensive coverage.
Content Moderation
Use image_extractor on keyframes to detect inappropriate content. Filter documents with taxonomy tags like “violence” or “adult_content” and flag them for review.
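As a sketch (assuming documents carry a taxonomy_tags list), the review filter is a set intersection:

```python
FLAGGED = {"violence", "adult_content"}

def needs_review(doc):
    """True if any of the document's taxonomy tags are in the flag list."""
    return bool(FLAGGED & set(doc.get("taxonomy_tags", [])))

docs = [
    {"id": "d1", "taxonomy_tags": ["outdoor", "violence"]},
    {"id": "d2", "taxonomy_tags": ["product"]},
]
flagged = [d["id"] for d in docs if needs_review(d)]
```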
Automated Captioning
Use audio_extractor transcripts to generate SRT/VTT caption files. Enrich with text_extractor OCR to capture on-screen text not spoken aloud.
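A minimal SRT renderer over transcript segments; the start/end/text field names are assumptions about the transcript output:

```python
def srt_timestamp(seconds):
    """Format seconds as the SRT HH:MM:SS,mmm timestamp."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(segments):
    """Render transcript segments as an SRT caption file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> "
            f"{srt_timestamp(seg['end'])}\n{seg['text']}"
        )
    return "\n\n".join(blocks) + "\n"

srt = to_srt([{"start": 0.0, "end": 2.5, "text": "Welcome back."}])
```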
Video Summarization
Cluster scenes with kmeans, select one keyframe per cluster, and use an llm_generation stage to create a textual summary based on transcript segments.
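A minimal k-means sketch over toy scene embeddings; production use would rely on a library implementation or the platform's clustering stage rather than this hand-rolled loop:

```python
def kmeans(vectors, k, iters=20):
    """Minimal k-means for small embedding lists (toy initialization)."""
    # Spread initial centroids across the input rather than sampling.
    step = max(1, len(vectors) // k)
    centroids = vectors[::step][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(v, centroids[c])))
            clusters[i].append(v)
        centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)]
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Toy scene embeddings forming two visually distinct groups.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
centroids, clusters = kmeans(embeddings, k=2)
# The scene nearest each centroid becomes a candidate summary keyframe.
```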
Character Tracking
Use face_extractor to identify characters across scenes. Group by face_embedding to track appearances and generate character timelines.
Next Steps
- Explore Feature Extractors for full parameter documentation
- Learn Hybrid Search fusion strategies
- Review Clusters for highlight generation workflows
- Check Batching Best Practices for efficient video ingestion

