Video Understanding Pipeline

How It Works

When you ingest a video, Mixpeek runs a multi-stage pipeline:
  1. Chunking — Videos are split into segments using scene detection, silence detection, or fixed intervals
  2. Parallel Extraction — Multiple extractors run concurrently:
    • Transcription: Whisper transcribes speech to text with timestamps
    • Visual Embeddings: A multimodal model generates embeddings from keyframes
    • Thumbnails: Representative frames are extracted for each segment
  3. Description & OCR — Gemini generates segment descriptions and extracts on-screen text
  4. Multi-Vector Indexing — Separate embeddings for transcription and visual content enable hybrid search
At query time, the retriever searches across both visual and transcript embeddings, fusing results to find moments by what’s shown or what’s said.
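The fixed-interval chunking mode in step 1 can be sketched as a simple boundary computation. This is an illustration of the concept only, not the pipeline's actual implementation; the function name and signature are assumptions:

```python
def fixed_interval_segments(duration_s: float, interval_s: float) -> list[tuple[float, float]]:
    """Split a video of duration_s seconds into (start, end) segments
    of at most interval_s seconds each. The final segment is clipped
    to the video's end rather than padded."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# A 75-second video with 30-second intervals yields three segments,
# the last one shorter than the interval.
print(fixed_interval_segments(75.0, 30.0))
# → [(0.0, 30.0), (30.0, 60.0), (60.0, 75.0)]
```

Scene- and silence-based chunking replace the fixed boundaries with detected cut points, but produce the same kind of (start, end) segment list that downstream extractors consume.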

Feature Extractors

| Extractor | Outputs |
|---|---|
| video_extractor@v1 | Scene embeddings, keyframes, timestamps |
| audio_extractor@v1 | Transcription, speaker diarization |
| text_extractor@v1 | Text embeddings, OCR from frames |
| face_extractor@v1 | Face embeddings, bounding boxes |

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "video-catalog",
  "schema": {
    "properties": {
      "video_url": { "type": "url", "required": true },
      "title": { "type": "text" },
      "category": { "type": "text" }
    }
  }
}
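The schema above marks `video_url` as required, so objects missing it will be rejected at ingest. The check can be illustrated with a small validator; this is a sketch of the schema semantics, not Mixpeek's actual validation code:

```python
def validate_metadata(schema: dict, payload: dict) -> tuple[list[str], list[str]]:
    """Return (missing_required, unknown_keys) for a payload
    checked against a bucket schema's properties."""
    props = schema["properties"]
    missing = [name for name, spec in props.items()
               if spec.get("required") and name not in payload]
    unknown = [key for key in payload if key not in props]
    return missing, unknown

bucket_schema = {
    "properties": {
        "video_url": {"type": "url", "required": True},
        "title": {"type": "text"},
        "category": {"type": "text"},
    }
}

# A payload without video_url fails the required check.
print(validate_metadata(bucket_schema, {"title": "Demo"}))
# → (['video_url'], [])
```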

2. Create Collections

For scenes:
POST /v1/collections
{
  "collection_name": "video-scenes",
  "source": { "type": "bucket", "bucket_id": "bkt_videos" },
  "feature_extractor": {
    "feature_extractor_name": "video_extractor",
    "version": "v1",
    "input_mappings": { "video_url": "video_url" },
    "parameters": {
      "scene_detection_threshold": 0.3,
      "keyframe_interval": 30,
      "max_scenes": 100
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" }
    ]
  }
}
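The `field_passthrough` list copies selected fields from the source object's metadata onto every document the extractor produces, which is what makes them filterable at query time. A minimal sketch of that behavior (illustrative only, not the service's implementation):

```python
def apply_passthrough(source_metadata: dict, field_passthrough: list[dict]) -> dict:
    """Copy each configured source_path from the source object's
    metadata into the derived document's metadata, skipping fields
    the source object doesn't have."""
    return {
        fp["source_path"]: source_metadata[fp["source_path"]]
        for fp in field_passthrough
        if fp["source_path"] in source_metadata
    }

source = {"title": "Product Launch Q4 2025", "category": "marketing", "internal_id": 42}
passthrough = [{"source_path": "title"}, {"source_path": "category"}]

# Only the listed fields reach the derived document.
print(apply_passthrough(source, passthrough))
# → {'title': 'Product Launch Q4 2025', 'category': 'marketing'}
```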
For transcripts:
POST /v1/collections
{
  "collection_name": "video-transcripts",
  "source": { "type": "bucket", "bucket_id": "bkt_videos" },
  "feature_extractor": {
    "feature_extractor_name": "audio_extractor",
    "version": "v1",
    "input_mappings": { "audio_url": "video_url" },
    "parameters": {
      "transcription_model": "whisper-large-v3",
      "language": "en",
      "enable_diarization": true
    },
    "field_passthrough": [
      { "source_path": "title" },
      { "source_path": "category" }
    ]
  }
}

3. Ingest Videos

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/marketing/demos",
  "metadata": {
    "title": "Product Launch Q4 2025",
    "category": "marketing"
  },
  "blobs": [
    {
      "property": "video_url",
      "type": "video",
      "url": "s3://my-bucket/demos/product-launch.mp4"
    }
  ]
}

4. Process

POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_video_001"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit

5. Create a Hybrid Retriever

Combine visual and transcript search:
POST /v1/retrievers
{
  "retriever_name": "video-search",
  "collection_ids": ["col_video_scenes", "col_video_transcripts"],
  "input_schema": {
    "properties": {
      "query_text": { "type": "text", "required": true },
      "query_image": { "type": "url" },
      "category": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "hybrid_search",
      "version": "v1",
      "parameters": {
        "queries": [
          {
            "feature_address": "mixpeek://video_extractor@v1/scene_embedding",
            "input_mapping": { "image": "query_image" },
            "weight": 0.6
          },
          {
            "feature_address": "mixpeek://audio_extractor@v1/transcript_embedding",
            "input_mapping": { "text": "query_text" },
            "weight": 0.4
          }
        ],
        "fusion_method": "rrf",
        "limit": 20
      }
    },
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.category",
          "operator": "eq",
          "value": "{{inputs.category}}"
        }
      }
    }
  ]
}
Text query:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query_text": "someone explaining product features",
    "category": "marketing"
  },
  "limit": 10
}
Image query (find similar scenes):
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query_image": "s3://my-bucket/reference-scene.jpg",
    "query_text": "product demonstration"
  },
  "limit": 10
}
Filter by timestamp:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": { "query_text": "pricing discussion" },
  "filters": {
    "field": "segment_metadata.start_time",
    "operator": "gte",
    "value": 60.0
  }
}
With diarization enabled:
{
  "filters": {
    "field": "metadata.speaker_id",
    "operator": "eq",
    "value": "SPEAKER_001"
  }
}
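The filter objects above share one shape: a dotted `field` path, an `operator`, and a `value`. Their matching semantics can be sketched as follows (illustrative only; the operator set shown is limited to the two used above):

```python
def get_path(doc: dict, path: str):
    """Resolve a dotted path like 'segment_metadata.start_time',
    returning None if any segment is missing."""
    value = doc
    for part in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value

def matches(doc: dict, filt: dict) -> bool:
    """Evaluate a single filter clause against a document."""
    ops = {
        "eq": lambda a, b: a == b,
        "gte": lambda a, b: a is not None and a >= b,
    }
    return ops[filt["operator"]](get_path(doc, filt["field"]), filt["value"])

doc = {"segment_metadata": {"start_time": 45.2},
       "metadata": {"speaker_id": "SPEAKER_001"}}

# start_time 45.2 is below the 60.0 threshold, so the gte filter rejects it.
print(matches(doc, {"field": "segment_metadata.start_time",
                    "operator": "gte", "value": 60.0}))
# → False
```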

Output Example

Scene document from video_extractor@v1:
{
  "document_id": "doc_scene_123",
  "source_object_id": "obj_video_001",
  "metadata": {
    "title": "Product Launch Q4 2025",
    "scene_index": 3,
    "start_time": 45.2,
    "end_time": 58.7,
    "keyframe_url": "s3://my-bucket/keyframes/scene_003.jpg"
  }
}

Parameters

| Parameter | Effect |
|---|---|
| scene_detection_threshold | Lower values produce more scenes (typical range 0.2–0.5) |
| keyframe_interval | Seconds between extracted keyframes |
| max_scenes | Caps the number of scenes per video |
| transcription_model | whisper-base (faster) or whisper-large-v3 (more accurate) |