Feature extractors are Ray-powered workflows that read objects, run ML models, and write features into collection documents. Each extractor exposes a stable feature URI so retrievers, taxonomies, and clusters can reference the outputs with confidence.
Fetch the authoritative list of extractors at runtime with GET /v1/collections/features/extractors. The API returns extractor ids, versions, supported input types, schemas, and default parameters.
curl -s --request GET \
  --url "$MP_API_URL/v1/collections/features/extractors" \
  --header "Authorization: Bearer $MP_API_KEY" \
  --header "X-Namespace: $MP_NAMESPACE"
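When scripting against the catalog, filtering it by input type takes only a few lines. The entries below are illustrative stand-ins, not the documented response contract; adjust the field names to whatever the live endpoint actually returns.

```python
# Illustrative catalog entries; the real response shape may differ.
catalog = [
    {"feature_extractor_name": "text_extractor", "version": "v1",
     "supported_input_types": ["text"]},
    {"feature_extractor_name": "video_extractor", "version": "v1",
     "supported_input_types": ["video"]},
    {"feature_extractor_name": "splade_extractor", "version": "v1",
     "supported_input_types": ["text"]},
]

def extractors_for(input_type, entries):
    """Return extractor ids (name@version) that accept the given input type."""
    return [f"{e['feature_extractor_name']}@{e['version']}"
            for e in entries if input_type in e["supported_input_types"]]

print(extractors_for("text", catalog))
# → ['text_extractor@v1', 'splade_extractor@v1']
```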

Catalog Highlights

Extractor | Output | Key Models | Typical Latency | Use Cases
text_extractor@v1 | Dense text embeddings plus language metadata | multilingual-e5-large-instruct, gte-modernbert-base, OpenAI text-embedding-3-* | 10–50 ms/doc (GPU) | Semantic search, Q&A, classification
video_extractor@v1 | Scene slices, thumbnails, frame embeddings | CLIP, Pyannote, FFmpeg | 2–10 s/video | Video retrieval, moderation, scene detection
splade_extractor@v1 | Sparse lexical vectors (indices + weights) | naver/splade-v1 | 5–20 ms/doc | Hybrid search, rare term recall
colbert_extractor@v1 | Multi-vector token embeddings | colbert_ir/colbertv2 | 20–80 ms/doc | Late interaction search, passage QA
whisper_large_v3_turbo | Transcripts with timestamps | OpenAI Whisper | 500–2000 ms/min audio | Speech search, diarization
Need something custom or gated? Contact support to enable additional extractors or deploy bespoke models.

Where Extractors Run

  • Mixpeek submits Ray jobs tier-by-tier using manifests stored in S3.
  • Workers download per-extractor Parquet artifacts, execute model inference (GPU if available), and write results to Qdrant and MongoDB.
  • Engine pollers respect dependency tiers so downstream collections only run when upstream data is ready.
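The tier-ordering rule above can be pictured as a simple batch scheduler: everything in tier N may run in parallel, but tier N only starts once tier N−1 has finished. A minimal sketch (the collection ids and tiers here are hypothetical):

```python
from collections import defaultdict

# Hypothetical collections with their dependency tiers.
collections = [
    {"id": "col_source", "processing_tier": 1},
    {"id": "col_embeddings", "processing_tier": 2},
    {"id": "col_clusters", "processing_tier": 3},
    {"id": "col_keywords", "processing_tier": 2},
]

def tier_batches(cols):
    """Group collections by tier. Each inner list may run in parallel;
    batch N is submitted only after batch N-1 completes."""
    tiers = defaultdict(list)
    for c in cols:
        tiers[c["processing_tier"]].append(c["id"])
    return [sorted(tiers[t]) for t in sorted(tiers)]

print(tier_batches(collections))
# → [['col_source'], ['col_embeddings', 'col_keywords'], ['col_clusters']]
```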

Configuring Extractors

Extractors are attached to a collection via the singular feature_extractor field:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "product_text"
    },
    "field_passthrough": [
      { "source_path": "metadata.category" },
      { "source_path": "metadata.brand" }
    ],
    "parameters": {
      "model": "multilingual-e5-large-instruct",
      "normalize": true
    }
  }
}
  • Input mappings use JSONPath-like paths (metadata.category, $.payload.description) to map object fields to extractor inputs.
  • Field passthrough copies original object fields into the resulting documents.
  • Parameters set model-specific options (e.g., alternative embedding models, chunk strategies).
The collection immediately calculates an output_schema that merges passthrough fields and extractor outputs, so you can validate downstream systems before processing begins.
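A sketch of how such dotted, JSONPath-like paths resolve against an object. This resolver is an illustration of the mapping semantics described above, not Mixpeek's implementation:

```python
def resolve_path(obj, path):
    """Resolve a dotted path like 'metadata.category' or '$.payload.description'
    against a nested dict. The optional '$.' prefix denotes the object root."""
    if path.startswith("$."):
        path = path[2:]
    node = obj
    for key in path.split("."):
        node = node[key]
    return node

obj = {
    "product_text": "wireless earbuds",
    "metadata": {"category": "audio", "brand": "Acme"},
    "payload": {"description": "noise cancelling"},
}

# Apply the input_mappings from the config above.
input_mappings = {"text": "product_text"}
inputs = {k: resolve_path(obj, v) for k, v in input_mappings.items()}
print(inputs)                                      # {'text': 'wireless earbuds'}
print(resolve_path(obj, "$.payload.description"))  # noise cancelling
```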

Feature URIs

Every extractor output publishes a URI (mixpeek://{name}@{version}/{output}), for example:
  • mixpeek://text_extractor@v1/text_embedding
  • mixpeek://video_extractor@v1/scene_embedding
  • mixpeek://splade_extractor@v1/splade_vector
Use these URIs when configuring retrievers, taxonomies, clustering jobs, and analytics. Because the same URI identifies a feature at ingestion and at query time, retrieval configurations stay compatible with what was indexed.
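The URI format is regular enough to build and parse with a small helper. A convenience sketch following the `mixpeek://{name}@{version}/{output}` convention documented above:

```python
import re

FEATURE_URI = re.compile(r"^mixpeek://(?P<name>[^@]+)@(?P<version>[^/]+)/(?P<output>.+)$")

def build_feature_uri(name, version, output):
    """Assemble a feature URI from its three components."""
    return f"mixpeek://{name}@{version}/{output}"

def parse_feature_uri(uri):
    """Split a feature URI back into name, version, and output."""
    m = FEATURE_URI.match(uri)
    if not m:
        raise ValueError(f"not a feature URI: {uri}")
    return m.groupdict()

uri = build_feature_uri("text_extractor", "v1", "text_embedding")
print(uri)                     # mixpeek://text_extractor@v1/text_embedding
print(parse_feature_uri(uri))  # {'name': 'text_extractor', 'version': 'v1', 'output': 'text_embedding'}
```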

Pipeline Flow

  1. API flattens manifest → extractor row artifacts (Parquet) and stores them in S3.
  2. Ray poller discovers a pending batch and submits a job.
  3. Worker loads dataset, runs the extractor flow, and emits features + passthrough fields.
  4. QdrantBatchProcessor writes vectors/payloads to Qdrant, emits webhook events, and updates index signatures.
Each document records lineage metadata:
{
  "root_object_id": "obj_123",
  "source_collection_id": "col_source",
  "processing_tier": 2,
  "feature_address": "mixpeek://text_extractor@v1/text_embedding"
}
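These lineage fields make it straightforward to trace every feature document back to its source object. For example, with some hypothetical document records:

```python
# Hypothetical lineage records of the shape shown above.
docs = [
    {"doc_id": "doc_a", "root_object_id": "obj_123",
     "feature_address": "mixpeek://text_extractor@v1/text_embedding"},
    {"doc_id": "doc_b", "root_object_id": "obj_123",
     "feature_address": "mixpeek://splade_extractor@v1/splade_vector"},
    {"doc_id": "doc_c", "root_object_id": "obj_456",
     "feature_address": "mixpeek://text_extractor@v1/text_embedding"},
]

def derived_from(root_id, documents):
    """All feature addresses produced from a given root object."""
    return [d["feature_address"] for d in documents if d["root_object_id"] == root_id]

print(derived_from("obj_123", docs))
# → ['mixpeek://text_extractor@v1/text_embedding', 'mixpeek://splade_extractor@v1/splade_vector']
```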

Performance & Scaling

  • GPU workers deliver 5–10× higher throughput for embeddings, reranking, and video processing.
  • Ray Data handles batching, shuffling, and parallelization automatically.
  • Autoscaling policies maintain target utilization (0.7 CPU, 0.8 GPU by default).
  • Inference cache short-circuits repeated model calls when inputs match exactly.

Operational Tips

  1. Pin versions – upgrade by creating a new collection or reprocessing existing documents.
  2. Batch uploads – keep batches to manageable sizes (1k–10k objects) to maximize parallelism.
  3. Monitor analytics – /v1/analytics/extractors/performance exposes throughput and error metrics (stub available now; enable analytics to populate).
  4. Use passthrough fields – propagate metadata you’ll need for retrieval filters or taxonomies.
  5. Inspect features – GET /v1/collections/{id}/features lists the feature URIs, dimensions, and descriptions produced for that collection.
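Tip 2 above is just chunking the object list before submission. A minimal sketch; the 1k–10k range is a tuning guideline, not a hard limit:

```python
def batches(objects, size=1000):
    """Yield successive fixed-size batches from a list of objects."""
    for i in range(0, len(objects), size):
        yield objects[i:i + size]

objs = [{"id": f"obj_{i}"} for i in range(2500)]
sizes = [len(b) for b in batches(objs, size=1000)]
print(sizes)  # [1000, 1000, 500]
```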
Feature extractors are the heart of Mixpeek’s Engine. Configure them once, reuse the outputs everywhere, and rely on Ray + Qdrant to keep performance predictable at scale.