Fetch the authoritative list of extractors at runtime with `GET /v1/collections/features/extractors`. The API returns extractor ids, versions, supported input types, schemas, and default parameters.
Catalog Highlights
| Extractor | Output | Key Models | Typical Latency | Use Cases |
|---|---|---|---|---|
| `text_extractor@v1` | Dense text embeddings plus language metadata | multilingual-e5-large-instruct, gte-modernbert-base, OpenAI text-embedding-3-* | 10–50 ms/doc (GPU) | Semantic search, Q&A, classification |
| `video_extractor@v1` | Scene slices, thumbnails, frame embeddings | CLIP, Pyannote, FFmpeg | 2–10 s/video | Video retrieval, moderation, scene detection |
| `splade_extractor@v1` | Sparse lexical vectors (indices + weights) | naver/splade-v1 | 5–20 ms/doc | Hybrid search, rare-term recall |
| `colbert_extractor@v1` | Multi-vector token embeddings | colbert_ir/colbertv2 | 20–80 ms/doc | Late-interaction search, passage QA |
| `whisper_large_v3_turbo` | Transcripts with timestamps | OpenAI Whisper | 500–2000 ms/min of audio | Speech search, diarization |
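The catalog above can also be fetched and filtered programmatically. The sketch below assumes a response shape (a list of objects with `name`, `version`, and `supported_input_types` keys); the actual schema is whatever `GET /v1/collections/features/extractors` documents.

```python
# Hedged sketch: filter a fetched extractor catalog by input type.
# The catalog shape here is an assumption for illustration only.
def extractors_for_input(catalog: list[dict], input_type: str) -> list[str]:
    """Return extractor ids (name@version) that accept the given input type."""
    return [
        f"{e['name']}@{e['version']}"
        for e in catalog
        if input_type in e.get("supported_input_types", [])
    ]

# A hypothetical catalog payload, mirroring the table above.
sample_catalog = [
    {"name": "text_extractor", "version": "v1", "supported_input_types": ["text"]},
    {"name": "splade_extractor", "version": "v1", "supported_input_types": ["text"]},
    {"name": "video_extractor", "version": "v1", "supported_input_types": ["video"]},
]
```

Filtering `sample_catalog` by `"text"` yields the two text-capable extractor ids in catalog order.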
Where Extractors Run
- Mixpeek submits Ray jobs tier-by-tier using manifests stored in S3.
- Workers download per-extractor Parquet artifacts, execute model inference (GPU if available), and write results to Qdrant and MongoDB.
- Engine pollers respect dependency tiers so downstream collections only run when upstream data is ready.
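The tier-by-tier ordering the pollers enforce can be sketched as a simple dependency grouping: a collection's tier is one more than its deepest upstream. The collection names and dependency graph below are hypothetical.

```python
from collections import defaultdict

def assign_tiers(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group collections into tiers so each runs only after all upstreams.

    deps maps collection name -> list of upstream collections it depends on.
    Tier 0 has no dependencies; tier N depends on something in tier N-1.
    """
    tier_of: dict[str, int] = {}

    def tier(name: str) -> int:
        if name not in tier_of:
            ups = deps.get(name, [])
            tier_of[name] = 0 if not ups else 1 + max(tier(u) for u in ups)
        return tier_of[name]

    grouped: dict[int, list[str]] = defaultdict(list)
    for name in deps:
        grouped[tier(name)].append(name)
    return [sorted(grouped[t]) for t in sorted(grouped)]

# Hypothetical graph: "captions" depends on "frames"; "index" on both.
example = {"frames": [], "captions": ["frames"], "index": ["frames", "captions"]}
```

Submitting jobs tier by tier (all of tier 0, then tier 1, and so on) guarantees downstream collections only see upstream data that is already written.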
Configuring Extractors
Extractors are attached to a collection via the singular `feature_extractor` field:
- Input mappings use JSONPath-like paths (`metadata.category`, `$.payload.description`) to map object fields to extractor inputs.
- Field passthrough copies original object fields into the resulting documents.
- Parameters set model-specific options (e.g., alternative embedding models, chunk strategies).
- The collection publishes an `output_schema` that merges passthrough fields and extractor outputs, so you can validate downstream systems before processing begins.
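Put together, a collection's extractor attachment might look like the sketch below. Only the `feature_extractor` field name comes from this page; the nested keys (`input_mappings`, `passthrough_fields`, `parameters`) and their values are illustrative assumptions, so check the API reference for the exact schema.

```python
# Hypothetical collection configuration attaching text_extractor@v1.
# Nested field names are assumptions for illustration, not the documented schema.
collection_config = {
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {
            # JSONPath-like paths into the source object
            "text": "$.payload.description",
            "category": "metadata.category",
        },
        # Original object fields copied into the resulting documents
        "passthrough_fields": ["metadata.category", "metadata.author"],
        # Model-specific options
        "parameters": {
            "model": "multilingual-e5-large-instruct",
            "chunk_strategy": "sentence",
        },
    }
}
```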
Feature URIs
Every extractor output publishes a URI (`mixpeek://{name}@{version}/{output}`), for example:
- `mixpeek://text_extractor@v1/text_embedding`
- `mixpeek://video_extractor@v1/scene_embedding`
- `mixpeek://splade_extractor@v1/splade_vector`
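Since the URI template is fixed, the three parts are easy to recover with a regular expression. A minimal sketch:

```python
import re

# Matches mixpeek://{name}@{version}/{output} per the template above.
FEATURE_URI = re.compile(r"^mixpeek://(?P<name>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$")

def parse_feature_uri(uri: str) -> dict[str, str]:
    """Split a mixpeek:// feature URI into its name, version, and output parts."""
    m = FEATURE_URI.match(uri)
    if not m:
        raise ValueError(f"not a feature URI: {uri}")
    return m.groupdict()
```

For example, `parse_feature_uri("mixpeek://text_extractor@v1/text_embedding")` returns `{"name": "text_extractor", "version": "v1", "output": "text_embedding"}`.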
Pipeline Flow
- API flattens manifest → extractor row artifacts (Parquet) and stores them in S3.
- Ray poller discovers a pending batch and submits a job.
- Worker loads dataset, runs the extractor flow, and emits features + passthrough fields.
- QdrantBatchProcessor writes vectors/payloads to Qdrant, emits webhook events, and updates index signatures.
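The hand-offs between those steps can be shown as plain functions. This is purely a data-flow sketch under invented shapes; the real system moves Parquet artifacts through S3, Ray, and Qdrant rather than Python lists.

```python
# Minimal sketch of the pipeline stages; all shapes here are hypothetical.
def flatten_manifest(objects):
    """Step 1: flatten objects into per-extractor row artifacts."""
    return [{"row_id": i, "input": o} for i, o in enumerate(objects)]

def poll_pending(artifacts):
    """Step 2: the poller discovers a pending batch (a no-op stand-in)."""
    return artifacts

def run_extractor(rows):
    """Step 3: emit a feature per row (a fake length-based embedding)."""
    return [{**r, "feature": [float(len(str(r["input"])))]} for r in rows]

def write_batch(features):
    """Step 4: stand-in for writing vectors/payloads; returns the count written."""
    return len(features)

written = write_batch(run_extractor(poll_pending(flatten_manifest(["a", "bb"]))))
```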
Performance & Scaling
- GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing.
- Ray Data handles batching, shuffling, and parallelization automatically.
- Autoscaling policies maintain target utilization (`0.7` CPU, `0.8` GPU by default).
- Inference cache short-circuits repeated model calls when inputs match exactly.
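The exact-match cache behavior amounts to memoizing on a hash of the model and input. The sketch below illustrates the idea only; it is not the service's actual cache implementation.

```python
import hashlib
import json

_cache: dict[str, list[float]] = {}

def cached_infer(model: str, text: str, infer) -> list[float]:
    """Short-circuit repeated model calls when (model, input) match exactly."""
    key = hashlib.sha256(json.dumps([model, text]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = infer(text)  # cache miss: run the model once
    return _cache[key]

# A fake model that records how many times it actually runs.
calls: list[str] = []
def fake_infer(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text))]
```

Calling `cached_infer("m", "hello", fake_infer)` twice returns the same vector both times but invokes `fake_infer` only once.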
Operational Tips
- Pin versions – upgrade by creating a new collection or reprocessing existing documents.
- Batch uploads – keep batches to manageable sizes (1k–10k objects) to maximize parallelism.
- Monitor analytics – `/v1/analytics/extractors/performance` exposes throughput and error metrics (stub available now; enable analytics to populate).
- Use passthrough fields – propagate metadata you'll need for retrieval filters or taxonomies.
- Inspect features – `GET /v1/collections/{id}/features` lists the feature URIs, dimensions, and descriptions produced for that collection.
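The batch-size tip above is a one-liner to enforce client-side. A sketch, with the default of 5,000 chosen arbitrarily from the recommended 1k–10k range:

```python
def chunked(objects: list, batch_size: int = 5000) -> list[list]:
    """Split an upload into batches of at most batch_size objects (1k-10k recommended)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]
```

Uploading 12,000 objects this way yields three requests of 5,000, 5,000, and 2,000 objects, keeping each batch small enough to parallelize well.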

