View extractor details at api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
When to Use
| Use Case | Description |
|---|---|
| Video content libraries | Search and navigate video segments by content |
| Media platforms | Search across spoken and visual content |
| Educational content | Find moments in lectures and tutorials |
| Surveillance/security | Event detection in footage |
| Social media | Process user-generated video content |
| Broadcasting/streaming | Large video catalog management |
| Marketing analytics | Analyze video campaigns |
| Cross-modal search | Find videos/images using text queries |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Static image collections only | image_extractor |
| Audio-only content | audio_extractor |
| Very short videos (< 5 seconds) | Processing overhead not worth it |
| Real-time live streams | Specialized streaming extractors |
| 8K+ resolution video | Consider downsampling first |
Supported Input Types
| Input | Type | Description | Processing |
|---|---|---|---|
| video | string | URL or S3 path | Decomposed into segments |
| image | string | URL or S3 path | Direct embedding (no decomposition) |
| text | string | Plain text content | Direct embedding |
| gif | string | URL or S3 path | Treated as video, frame-by-frame |
Supported formats:
- Video: MP4, MOV, AVI, MKV, WebM, FLV
- Image: JPG, PNG, WebP, BMP
- GIF: Animated GIF
Input Schema
Provide one of the following inputs:
| Field | Type | Description |
|---|---|---|
| video | string | URL/S3 path to video file. Recommended: 720p-1080p, < 2 hours |
| image | string | URL/S3 path to image file. Recommended: < 10MB |
| text | string | Plain text for cross-modal embedding |
| gif | string | URL/S3 path to GIF file |
| custom_thumbnail | string | Optional custom thumbnail URL instead of auto-generated |
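The "provide one of" rule can be checked on the client before submitting an object. The following is a minimal sketch, not part of the Mixpeek SDK: `validate_input` is a hypothetical helper, and it assumes that exactly one of the four primary fields is expected while `custom_thumbnail` may accompany any of them.

```python
from typing import Any, Mapping

# Primary input fields listed in the Input Schema table.
PRIMARY_INPUTS = ("video", "image", "text", "gif")


def validate_input(payload: Mapping[str, Any]) -> str:
    """Return the name of the single primary input present in payload.

    Assumption: exactly one primary input is expected per object;
    custom_thumbnail is optional and is ignored by this check.
    """
    present = [name for name in PRIMARY_INPUTS if payload.get(name)]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {PRIMARY_INPUTS}, got {present or 'none'}"
        )
    return present[0]
```

For example, `validate_input({"video": "s3://bucket/clip.mp4"})` returns `"video"`, while an empty payload or one with both `video` and `image` raises `ValueError`.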
Output Schema
Each video segment produces one document with the following fields:
| Field | Type | Description |
|---|---|---|
| start_time | number | Segment start time in seconds |
| end_time | number | Segment end time in seconds |
| transcription | string | Transcribed audio content |
| description | string | AI-generated segment description |
| ocr_text | string | Text extracted from video frames |
| thumbnail_url | string | S3 URL of thumbnail image |
| source_video_url | string | Original source video URL |
| video_segment_url | string | URL of this specific segment |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Visual/multimodal embedding |
| multimodal_extractor_v1_transcription_embedding | float[1024] | Transcription text embedding |
Parameters
Video Splitting
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_method | string | "time" | Primary video splitting strategy: time, scene, or silence |
Split Methods
- time
- scene
- silence
time: Fixed interval splitting. Splits the video into segments of equal duration.
| Parameter | Type | Default | Description |
|---|---|---|---|
| time_split_interval | integer | 10 | Interval in seconds for each segment |
Characteristics:
- Predictable segment count: video_duration / interval
- Consistent chunk sizes for uniform processing
- May cut mid-sentence or mid-scene
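To make the segment-count arithmetic concrete, here is a minimal sketch of fixed-interval splitting. `time_segments` is a hypothetical helper, not the extractor's implementation; it assumes that a trailing partial interval is kept as a shorter final segment, which the source does not specify.

```python
import math


def time_segments(video_duration: float, interval: int = 10) -> list[tuple[float, float]]:
    """Return (start_time, end_time) pairs for fixed-interval splitting.

    Assumption: a remainder shorter than `interval` becomes its own
    final segment rather than being dropped or merged.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive number of seconds")
    count = math.ceil(video_duration / interval)
    return [
        (i * interval, min((i + 1) * interval, video_duration))
        for i in range(count)
    ]


# time_segments(25, 10) -> [(0, 10), (10, 20), (20, 25)]
```

With the default 10-second interval, a 60-second video yields six segments, which matches video_duration / interval when the duration divides evenly.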
Split Methods Comparison
| Method | Segments/Min | Predictability | Best For |
|---|---|---|---|
| time | 60 / interval_sec | High | General purpose, batch processing |
| scene | Variable (2-20) | Low | Movies, ads, dynamic visual content |
| silence | Variable (5-30) | Medium | Lectures, podcasts, spoken content |
Feature Extraction Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| run_transcription | boolean | true | Run Whisper transcription on audio |
| transcription_language | string | "en" | Language for transcription |
| run_transcription_embedding | boolean | true | Generate embeddings for transcriptions |
| run_multimodal_embedding | boolean | true | Generate Vertex multimodal embeddings |
| run_video_description | boolean | false | Generate AI descriptions (adds 1-2 min) |
| run_ocr | boolean | false | Extract text from video frames |
Thumbnail Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| enable_thumbnails | boolean | true | Generate thumbnail images |
| use_cdn | boolean | false | Use CloudFront CDN for thumbnails |
Description Generation Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| description_prompt | string | "Describe the video segment in detail." | Prompt for Gemini |
| generation_config.temperature | float | 0.7 | Randomness (higher = more creative) |
| generation_config.max_output_tokens | integer | 1024 | Maximum description length |
| generation_config.top_p | float | 0.8 | Nucleus sampling |
LLM Structured Extraction
| Parameter | Type | Default | Description |
|---|---|---|---|
| response_shape | string or object | null | Custom structured output schema |
Configuration Examples
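As an illustration, the parameters documented in the tables of this page can be collected into a single configuration object. This is a sketch under stated assumptions: the parameter names and defaults come from those tables, but `build_parameters` is a hypothetical helper, and where this object sits in an actual API request is not shown in this section.

```python
import json

# Defaults as listed in the parameter tables on this page.
DEFAULT_PARAMETERS = {
    "split_method": "time",
    "time_split_interval": 10,
    "run_transcription": True,
    "transcription_language": "en",
    "run_transcription_embedding": True,
    "run_multimodal_embedding": True,
    "run_video_description": False,
    "run_ocr": False,
    "enable_thumbnails": True,
    "use_cdn": False,
}


def build_parameters(**overrides):
    """Merge overrides onto the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(DEFAULT_PARAMETERS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULT_PARAMETERS, **overrides}


# Example: spoken content with on-screen text, such as a recorded lecture.
lecture_config = build_parameters(split_method="silence", run_ocr=True)
print(json.dumps(lecture_config, indent=2))
```

Starting from the defaults and overriding only what differs keeps slow features such as `run_ocr` and `run_video_description` off unless they are explicitly requested.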
Performance & Costs
Processing Speed
| Content Type | Speed |
|---|---|
| Video | 0.5-2x realtime (depends on features enabled) |
| Image | < 1 second |
| Text | < 100ms |
| Feature | Latency per Segment |
|---|---|
| Transcription | ~200ms per second of audio |
| Visual embedding | ~50ms |
| OCR | ~300ms |
| Description | ~2s |
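The per-feature figures can be combined into a rough per-segment estimate. The sketch below uses a hypothetical `segment_latency_ms` helper and assumes the stages run one after another so that their latencies add; the source does not say whether they run in parallel, so treat the result as an upper-bound guess rather than a measured value.

```python
def segment_latency_ms(
    segment_seconds: float,
    *,
    transcription: bool = True,
    visual_embedding: bool = True,
    ocr: bool = False,
    description: bool = False,
) -> float:
    """Estimate latency for one segment, in milliseconds.

    Assumption: features run sequentially, so their latencies sum.
    """
    total = 0.0
    if transcription:
        total += 200 * segment_seconds  # ~200ms per second of audio
    if visual_embedding:
        total += 50                     # ~50ms per segment
    if ocr:
        total += 300                    # ~300ms per segment
    if description:
        total += 2000                   # ~2s per segment
    return total
```

Under these assumptions, a default 10-second segment comes to about 2,050 ms, and enabling OCR and descriptions raises that to about 4,350 ms.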
Cost Estimates (per minute of video)
| Configuration | Cost |
|---|---|
| Minimal (transcription + embeddings) | $0.01 |
| Standard (+ OCR) | $0.05 |
| Full (+ descriptions) | $0.15 |
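For budgeting, the per-minute figures can be scaled to a catalog's total duration. This is a back-of-the-envelope sketch: `estimate_cost_usd` is a hypothetical helper, and it assumes cost scales linearly with duration, which the table implies but does not guarantee.

```python
# Per-minute estimates from the Cost Estimates table.
COST_PER_MINUTE_USD = {
    "minimal": 0.01,   # transcription + embeddings
    "standard": 0.05,  # + OCR
    "full": 0.15,      # + descriptions
}


def estimate_cost_usd(duration_seconds: float, configuration: str = "minimal") -> float:
    """Estimate processing cost, assuming linear scaling with duration."""
    if configuration not in COST_PER_MINUTE_USD:
        raise ValueError(f"unknown configuration: {configuration!r}")
    minutes = duration_seconds / 60
    return round(minutes * COST_PER_MINUTE_USD[configuration], 4)
```

At these rates a one-hour video costs roughly $0.60 with the minimal configuration and $9.00 with the full one.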
Vector Indexes
Multimodal Embedding
| Property | Value |
|---|---|
| Index name | multimodal_extractor_v1_multimodal_embedding |
| Dimensions | 1408 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | vertex_multimodal_embedding |
| Supported inputs | video, text, image |
Transcription Embedding
| Property | Value |
|---|---|
| Index name | multimodal_extractor_v1_transcription_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | multilingual_e5_large_instruct_v1 |
| Supported inputs | text, string |
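Both indexes use cosine distance, and their dimensions differ (1408 versus 1024), so vectors from one index cannot be compared against the other. The following sketch shows the underlying comparison in plain Python; it is illustrative only, since in practice the vector index performs this computation, and `cosine_similarity` is a hypothetical helper.

```python
import math
from typing import Sequence

MULTIMODAL_DIM = 1408     # multimodal_extractor_v1_multimodal_embedding
TRANSCRIPTION_DIM = 1024  # multimodal_extractor_v1_transcription_embedding


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

Cosine distance is commonly defined as 1 minus this similarity, so identical directions give a distance of 0 regardless of vector magnitude.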
Limitations
- Video duration: Recommend < 2 hours for optimal processing
- Resolution: 8K+ videos should be downsampled
- Real-time: Not suitable for live streaming
- Short videos: < 5 second videos have disproportionate overhead
- Audio quality: Transcription accuracy depends on audio clarity
- OCR/Description: Add significant processing time, enable only when needed
Collection-to-Collection Pipelines
The video_segment_url output enables decomposition chains:
- Initial collection: Time-based segments (5s intervals)
- Downstream collection: Scene detection within each segment
- Final collection: Enhanced processing with different models
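The data flow in such a chain can be sketched as follows: each segment document from the upstream collection supplies its `video_segment_url` as the `video` input of the next stage. `downstream_inputs` is a hypothetical helper, and how collections are actually linked through the API is not shown in this section; the sketch only illustrates the mapping between output and input fields.

```python
from typing import Any, Iterable


def downstream_inputs(segment_documents: Iterable[dict[str, Any]]) -> list[dict[str, str]]:
    """Turn upstream segment documents into video inputs for the next stage.

    Documents without a video_segment_url are skipped.
    """
    return [
        {"video": doc["video_segment_url"]}
        for doc in segment_documents
        if doc.get("video_segment_url")
    ]


# Illustrative upstream output; the URLs are placeholders.
upstream = [
    {"start_time": 0, "end_time": 5, "video_segment_url": "s3://bucket/seg-0.mp4"},
    {"start_time": 5, "end_time": 10, "video_segment_url": "s3://bucket/seg-1.mp4"},
]
next_stage = downstream_inputs(upstream)
```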

