*Figure: multimodal extractor pipeline, showing video splitting, parallel processing with Whisper and Vertex AI, and the output features.*
The multimodal extractor processes video, image, text, and GIF content using unified Vertex embeddings (1408D). Videos and GIFs are decomposed into segments with transcription (Whisper), visual embeddings, OCR, and descriptions. Images and text are embedded directly without decomposition.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/multimodal_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.

When to Use

| Use Case | Description |
| --- | --- |
| Video content libraries | Search and navigate video segments by content |
| Media platforms | Search across spoken and visual content |
| Educational content | Find moments in lectures and tutorials |
| Surveillance/security | Event detection in footage |
| Social media | Process user-generated video content |
| Broadcasting/streaming | Large video catalog management |
| Marketing analytics | Analyze video campaigns |
| Cross-modal search | Find videos/images using text queries |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Static image collections only | image_extractor |
| Audio-only content | audio_extractor |
| Very short videos (< 5 seconds) | Processing overhead not worth it |
| Real-time live streams | Specialized streaming extractors |
| 8K+ resolution video | Consider downsampling first |

Supported Input Types

| Input | Type | Description | Processing |
| --- | --- | --- | --- |
| video | string | URL or S3 path | Decomposed into segments |
| image | string | URL or S3 path | Direct embedding (no decomposition) |
| text | string | Plain text content | Direct embedding |
| gif | string | URL or S3 path | Treated as video, frame-by-frame |
Supported formats:
  • Video: MP4, MOV, AVI, MKV, WebM, FLV
  • Image: JPG, PNG, WebP, BMP
  • GIF: Animated GIF

Input Schema

Provide one of the following inputs:
```json
{
  "video": "s3://bucket/videos/lecture.mp4"
}
```

```json
{
  "image": "https://cdn.example.com/products/laptop.jpg"
}
```

```json
{
  "text": "High-performance laptop with M3 chip, perfect for developers"
}
```
| Field | Type | Description |
| --- | --- | --- |
| video | string | URL/S3 path to video file. Recommended: 720p-1080p, < 2 hours |
| image | string | URL/S3 path to image file. Recommended: < 10MB |
| text | string | Plain text for cross-modal embedding |
| gif | string | URL/S3 path to GIF file |
| custom_thumbnail | string | Optional custom thumbnail URL instead of auto-generated |
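A GIF input follows the same shape as a video input and can be paired with a custom thumbnail (the URLs below are illustrative placeholders):

```json
{
  "gif": "https://cdn.example.com/clips/demo.gif",
  "custom_thumbnail": "https://cdn.example.com/clips/demo_thumb.jpg"
}
```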

Output Schema

Each video segment produces one document with the following fields:
| Field | Type | Description |
| --- | --- | --- |
| start_time | number | Segment start time in seconds |
| end_time | number | Segment end time in seconds |
| transcription | string | Transcribed audio content |
| description | string | AI-generated segment description |
| ocr_text | string | Text extracted from video frames |
| thumbnail_url | string | S3 URL of thumbnail image |
| source_video_url | string | Original source video URL |
| video_segment_url | string | URL of this specific segment |
| multimodal_extractor_v1_multimodal_embedding | float[1408] | Visual/multimodal embedding |
| multimodal_extractor_v1_transcription_embedding | float[1024] | Transcription text embedding |
```json
{
  "start_time": 10.0,
  "end_time": 20.0,
  "transcription": "Welcome to today's lecture on machine learning fundamentals...",
  "description": "Instructor standing at whiteboard, introducing ML concepts",
  "ocr_text": "Machine Learning 101",
  "thumbnail_url": "s3://mixpeek-storage/ns_123/thumbnails/thumb_1.jpg",
  "multimodal_extractor_v1_multimodal_embedding": [0.023, -0.041, ...],
  "multimodal_extractor_v1_transcription_embedding": [0.018, -0.032, ...]
}
```

Parameters

Video Splitting

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_method | string | "time" | Primary video splitting strategy: time, scene, or silence |

Split Methods

time (fixed-interval splitting): splits the video into segments of equal duration.

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| time_split_interval | integer | 10 | Interval in seconds for each segment |

Characteristics:
  • Predictable segment count: video_duration / interval (a 10-minute video at the default 10-second interval yields 60 segments)
  • Consistent chunk sizes for uniform processing
  • May cut mid-sentence or mid-scene

Best for: general-purpose use, consistent chunking, and cases where you need a predictable segment count
```json
{
  "split_method": "time",
  "time_split_interval": 10
}
```
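The scene and silence strategies are selected the same way. This section documents tuning parameters only for the time method, so the examples below show just the method switch; any method-specific thresholds are omitted rather than guessed:

```json
{
  "split_method": "scene"
}
```

```json
{
  "split_method": "silence"
}
```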

Split Methods Comparison

| Method | Segments per minute | Predictability | Best For |
| --- | --- | --- | --- |
| time | 60 / interval_sec | High | General purpose, batch processing |
| scene | Variable (2-20) | Low | Movies, ads, dynamic visual content |
| silence | Variable (5-30) | Medium | Lectures, podcasts, spoken content |

Feature Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| run_transcription | boolean | true | Run Whisper transcription on audio |
| transcription_language | string | "en" | Language for transcription |
| run_transcription_embedding | boolean | true | Generate embeddings for transcriptions |
| run_multimodal_embedding | boolean | true | Generate Vertex multimodal embeddings |
| run_video_description | boolean | false | Generate AI descriptions (adds 1-2 min) |
| run_ocr | boolean | false | Extract text from video frames |
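For example, a parameters block that enables every extraction step (OCR and descriptions add the latency and cost noted under Performance & Costs):

```json
{
  "run_transcription": true,
  "transcription_language": "en",
  "run_transcription_embedding": true,
  "run_multimodal_embedding": true,
  "run_video_description": true,
  "run_ocr": true
}
```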

Thumbnail Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| enable_thumbnails | boolean | true | Generate thumbnail images |
| use_cdn | boolean | false | Use CloudFront CDN for thumbnails |
CDN benefits: Faster global delivery, permanent URLs, reduced bandwidth costs.
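To serve thumbnails through the CDN, enable both flags:

```json
{
  "enable_thumbnails": true,
  "use_cdn": true
}
```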

Description Generation Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| description_prompt | string | "Describe the video segment in detail." | Prompt for Gemini |
| generation_config.temperature | float | 0.7 | Randomness (higher = more creative) |
| generation_config.max_output_tokens | integer | 1024 | Maximum description length |
| generation_config.top_p | float | 0.8 | Nucleus sampling |
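The dotted names above suggest the generation settings nest under a generation_config object; a sketch under that assumption, with a domain-specific prompt swapped in for the default:

```json
{
  "run_video_description": true,
  "description_prompt": "Describe the products, people, and setting visible in this segment.",
  "generation_config": {
    "temperature": 0.7,
    "max_output_tokens": 1024,
    "top_p": 0.8
  }
}
```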

LLM Structured Extraction

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string \| object | null | Custom structured output schema |
Natural Language Mode:
```json
{
  "response_shape": "Extract product names, colors, materials, and aesthetic style labels from this fashion segment"
}
```
JSON Schema Mode:
```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "products": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "name": { "type": "string" },
            "category": { "type": "string" },
            "visibility_percentage": { "type": "integer", "minimum": 0, "maximum": 100 }
          }
        }
      },
      "aesthetic": { "type": "string" }
    }
  }
}
```

Configuration Examples

A minimal time-based configuration that maps the source video from the document payload:

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.video_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.video_id" }
    ],
    "parameters": {
      "split_method": "time",
      "time_split_interval": 10,
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "enable_thumbnails": true
    }
  }
}
```
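A fuller configuration that also enables the optional features (slower and costlier; see the cost table below). All parameter names are from the tables above:

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "payload.video_url"
    },
    "parameters": {
      "split_method": "scene",
      "run_transcription": true,
      "run_multimodal_embedding": true,
      "run_video_description": true,
      "run_ocr": true,
      "enable_thumbnails": true,
      "use_cdn": true
    }
  }
}
```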

Performance & Costs

Processing Speed

| Content Type | Speed |
| --- | --- |
| Video | 0.5-2x realtime (depends on features enabled) |
| Image | < 1 second |
| Text | < 100ms |
Example: 10-minute video → 5-20 minutes processing time
| Feature | Latency per Segment |
| --- | --- |
| Transcription | ~200ms per second of audio |
| Visual embedding | ~50ms |
| OCR | ~300ms |
| Description | ~2s |

For a 10-second segment with everything enabled, that is roughly 2s of transcription (10 × 200ms) + 50ms embedding + 300ms OCR + 2s description ≈ 4.4s; segments are processed in parallel, so wall-clock time is lower.

Cost Estimates (per minute of video)

| Configuration | Cost |
| --- | --- |
| Minimal (transcription + embeddings) | $0.01 |
| Standard (+ OCR) | $0.05 |
| Full (+ descriptions) | $0.15 |

Images: $0.001 per image. Text: $0.0001 per query.
For example, a 10-minute video processed with the Standard configuration costs roughly 10 × $0.05 = $0.50.

Vector Indexes

Multimodal Embedding

| Property | Value |
| --- | --- |
| Index name | multimodal_extractor_v1_multimodal_embedding |
| Dimensions | 1408 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | vertex_multimodal_embedding |
| Supported inputs | video, text, image |

Transcription Embedding

| Property | Value |
| --- | --- |
| Index name | multimodal_extractor_v1_transcription_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Inference model | multilingual_e5_large_instruct_v1 |
| Supported inputs | text, string |
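Because the multimodal index accepts text input, a plain-text query can retrieve video and image documents directly. The retriever request format is not covered on this page, so treat the shape below as a hypothetical illustration (the query and vector_index field names are assumptions), not the actual search API:

```json
{
  "query": "instructor explaining machine learning at a whiteboard",
  "vector_index": "multimodal_extractor_v1_multimodal_embedding"
}
```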

Limitations

  • Video duration: Recommend < 2 hours for optimal processing
  • Resolution: 8K+ videos should be downsampled
  • Real-time: Not suitable for live streaming
  • Short videos: < 5 second videos have disproportionate overhead
  • Audio quality: Transcription accuracy depends on audio clarity
  • OCR/Description: Add significant processing time, enable only when needed

Collection-to-Collection Pipelines

The video_segment_url output enables decomposition chains:
  1. Initial collection: Time-based segments (5s intervals)
  2. Downstream collection: Scene detection within each segment
  3. Final collection: Enhanced processing with different models
In the downstream collection, map the parent's segment URL as the video input:

```json
{
  "input_mappings": {
    "video": "video_segment_url"
  }
}
```
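Putting it together, the downstream collection's extractor block might look like this sketch, re-splitting each parent segment by scene (field names follow the configuration examples above):

```json
{
  "feature_extractor": {
    "feature_extractor_name": "multimodal_extractor",
    "version": "v1",
    "input_mappings": {
      "video": "video_segment_url"
    },
    "parameters": {
      "split_method": "scene"
    }
  }
}
```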