[Diagram: feature extractor pipeline showing objects processed through ML models to produce reusable features]
Feature extractors are Ray-powered workflows that read objects, run ML models, and write features into collection documents. Each extractor exposes a stable feature URI so retrievers, taxonomies, and clusters can reference the outputs with confidence.
View the live list of extractors at api.mixpeek.com/v1/collections/features/extractors or fetch programmatically with GET /v1/collections/features/extractors.
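As a minimal sketch, you can also fetch the list with any HTTP client. The bearer-token header and the response shape below are assumptions; check the API reference for the exact contract.

import requests

API_KEY = "YOUR_API_KEY"  # assumption: substitute your Mixpeek credential

# List the available feature extractors (endpoint documented above).
resp = requests.get(
    "https://api.mixpeek.com/v1/collections/features/extractors",
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumption: bearer-token auth
)
resp.raise_for_status()

# Assumption: the response body is a JSON list of extractor descriptors.
for extractor in resp.json():
    print(extractor)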

Fully Managed & Versioned

All feature extractors are built, maintained, and hosted by Mixpeek. Every extractor is versioned (text_extractor_v1, multimodal_extractor_v2, etc.), giving you:
  • Version pinning – Lock to a specific version for reproducible results across environments
  • Safe migrations – Swap extractor versions via namespace migrations without disrupting production
  • A/B testing – Run multiple versions in parallel namespaces to compare quality and performance
  • Zero maintenance – We handle model updates, infrastructure scaling, and GPU provisioning
When a new version ships, your existing collections continue using the pinned version until you explicitly migrate.
Enterprise customers can extend the extractor library with custom models for specialized use cases:
  • ONNX – Deploy optimized models exported from any framework
  • TensorFlow – Run SavedModel or Keras models directly
  • PyTorch – Load TorchScript or eager-mode models
  • Custom code – Write Python extractors with arbitrary preprocessing, postprocessing, or multi-model pipelines
Custom extractors run on the same Ray infrastructure with full GPU support, versioning, and observability. Contact your account team to enable BYO model deployment.

Available Extractors

Each extractor is listed with its description, input types, output dimensions, typical latency, and cost.
  • passthrough_extractor_v1 – Copies source fields without processing; use it to pass metadata or vectors between collections. Input types: any. Dimensions: N/A. Latency: <1ms. Cost: Free.
  • text_extractor_v1 – Dense embeddings via E5-Large multilingual. Supports chunking (characters, words, sentences, paragraphs, pages) and LLM extraction via response_shape. Input types: text, string. Dimensions: 1024. Latency: ~5ms/doc. Cost: Free.
  • multimodal_extractor_v1 – Unified embeddings for video, image, text, and GIF via Google Vertex AI. Videos are decomposed by time, scene, or silence with transcription, OCR, and thumbnails. Input types: video, image, text, string. Dimensions: 1408. Latency: 0.5-2x realtime. Cost: $0.01-0.15/min.
Need something custom? Enterprise customers can bring their own models or contact support to enable additional extractors.

Where Extractors Run

  • Mixpeek submits Ray jobs tier-by-tier using manifests stored in S3.
  • Workers download per-extractor Parquet artifacts, execute model inference (GPU if available), and write results to Qdrant and MongoDB.
  • Engine pollers respect dependency tiers so downstream collections only run when upstream data is ready.

Configuring Extractors

Extractors are attached to a collection via the singular feature_extractor field:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "product_text"
    },
    "field_passthrough": [
      { "source_path": "metadata.category" },
      { "source_path": "metadata.brand" }
    ],
    "parameters": {
      "model": "multilingual-e5-large-instruct",
      "normalize": true
    }
  }
}
  • Input mappings use JSONPath-like paths (metadata.category, $.payload.description) to map object fields to extractor inputs.
  • Field passthrough copies original object fields into the resulting documents.
  • Parameters set model-specific options (e.g., alternative embedding models, chunk strategies).
The collection immediately calculates an output_schema that merges passthrough fields and extractor outputs, so you can validate downstream systems before processing begins.
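As a hedged sketch, the same configuration can be submitted when creating a collection. The endpoint path, auth header, and collection_name field below are illustrative assumptions; only the feature_extractor block follows the structure documented above.

import requests

API_KEY = "YOUR_API_KEY"  # assumption: substitute your Mixpeek credential

payload = {
    "collection_name": "products",  # assumption: illustrative field name
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {"text": "product_text"},
        "field_passthrough": [
            {"source_path": "metadata.category"},
            {"source_path": "metadata.brand"},
        ],
        "parameters": {
            "model": "multilingual-e5-large-instruct",
            "normalize": True,
        },
    },
}

# Assumption: collections are created with POST /v1/collections.
resp = requests.post(
    "https://api.mixpeek.com/v1/collections",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
)
resp.raise_for_status()

# If the response echoes the collection, the merged output_schema can be
# inspected here before any objects are processed.
print(resp.json().get("output_schema"))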

Feature URIs

Every extractor output is published under a stable URI of the form mixpeek://{name}@{version}/{output}, for example:
  • mixpeek://text_extractor@v1/text_embedding
  • mixpeek://video_extractor@v1/scene_embedding
  • mixpeek://splade_extractor@v1/splade_vector
Use these URIs when configuring retrievers, taxonomies, clustering jobs, and analytics; they keep query-time processing compatible with what was produced at ingestion.
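As an illustrative sketch only, a retriever stage might reference the text embedding by its URI. The field names other than feature_address (which also appears in the lineage metadata below) are assumptions, not the exact retriever schema.

# Illustrative only: stage_name and limit are hypothetical field names.
knn_stage = {
    "stage_name": "knn_search",  # assumption: hypothetical stage name
    "feature_address": "mixpeek://text_extractor@v1/text_embedding",
    "limit": 25,  # assumption: hypothetical parameter
}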

Pipeline Flow

  1. API flattens manifest → extractor row artifacts (Parquet) and stores them in S3.
  2. Ray poller discovers a pending batch and submits a job.
  3. Worker loads dataset, runs the extractor flow, and emits features + passthrough fields.
  4. QdrantBatchProcessor writes vectors/payloads to Qdrant, emits webhook events, and updates index signatures.
Each document records lineage metadata:
{
  "root_object_id": "obj_123",
  "source_collection_id": "col_source",
  "processing_tier": 2,
  "feature_address": "mixpeek://text_extractor@v1/text_embedding"
}

Performance & Scaling

  • GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing.
  • Ray Data handles batching, shuffling, and parallelization automatically.
  • Autoscaling policies maintain target utilization (0.7 CPU, 0.8 GPU by default).
  • Inference cache short-circuits repeated model calls when inputs match exactly.

Operational Tips

  1. Pin versions – upgrade by creating a new collection or reprocessing existing documents.
  2. Batch uploads – keep batches to manageable sizes (1k–10k objects) to maximize parallelism.
  3. Monitor analytics – /v1/analytics/extractors/performance exposes throughput and error metrics (stub available now; enable analytics to populate).
  4. Use passthrough fields – propagate metadata you’ll need for retrieval filters or taxonomies.
  5. Inspect features – GET /v1/collections/{id}/features lists the feature URIs, dimensions, and descriptions produced for that collection (see the sketch after this list).
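A minimal sketch of tip 5, reusing the assumed auth header from the earlier sketches; the field names read from the response are illustrative.

import requests

API_KEY = "YOUR_API_KEY"  # assumption: substitute your Mixpeek credential
COLLECTION_ID = "col_source"  # example id reused from the lineage sample above

# List the feature URIs, dimensions, and descriptions for one collection.
resp = requests.get(
    f"https://api.mixpeek.com/v1/collections/{COLLECTION_ID}/features",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()

# Assumption: the response is a JSON list; the field names below are illustrative.
for feature in resp.json():
    print(feature.get("feature_uri"), feature.get("dimensions"))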
Feature extractors are the heart of Mixpeek’s Engine. Configure them once, reuse the outputs everywhere, and rely on Ray + Qdrant to keep performance predictable at scale.