View the live list of extractors at api.mixpeek.com/v1/collections/features/extractors, or fetch it programmatically with `GET /v1/collections/features/extractors`.
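For example, a minimal Python sketch of the programmatic fetch (the `Authorization` header and Bearer scheme are assumptions; check the API reference for the exact auth requirements):

```python
import requests

# Fetch the current extractor catalog.
# The Bearer auth scheme here is an assumption, not a documented requirement.
resp = requests.get(
    "https://api.mixpeek.com/v1/collections/features/extractors",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
resp.raise_for_status()
print(resp.json())
```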
Fully Managed & Versioned
All feature extractors are built, maintained, and hosted by Mixpeek. Every extractor is versioned (`text_extractor_v1`, `multimodal_extractor_v2`, etc.), giving you:
- Version pinning – Lock to a specific version for reproducible results across environments
- Safe migrations – Swap extractor versions via namespace migrations without disrupting production
- A/B testing – Run multiple versions in parallel namespaces to compare quality and performance
- Zero maintenance – We handle model updates, infrastructure scaling, and GPU provisioning
Enterprise: Bring Your Own Models
Enterprise customers can extend the extractor library with custom models for specialized use cases:
- ONNX – Deploy optimized models exported from any framework
- TensorFlow – Run SavedModel or Keras models directly
- PyTorch – Load TorchScript or eager-mode models
- Custom code – Write Python extractors with arbitrary preprocessing, postprocessing, or multi-model pipelines
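As a rough illustration of the kind of model you might bring, here is a minimal ONNX Runtime inference sketch; the model file, input name, and single-output shape are all hypothetical, and the actual wiring into Mixpeek's extractor interface is defined during enterprise onboarding:

```python
import numpy as np
import onnxruntime as ort

# Load an exported ONNX model (the file name is hypothetical).
session = ort.InferenceSession("my_embedder.onnx")

def embed(input_ids: np.ndarray) -> np.ndarray:
    # Input/output names depend on how the model was exported;
    # "input_ids" and a single embedding output are assumptions here.
    (embeddings,) = session.run(None, {"input_ids": input_ids})
    return embeddings

vectors = embed(np.zeros((1, 128), dtype=np.int64))
print(vectors.shape)
```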
Available Extractors
| Extractor | Description | Input Types | Dimensions | Latency | Cost |
|---|---|---|---|---|---|
| `passthrough_extractor_v1` | Copies source fields without processing. Use for passing metadata or vectors between collections. | Any | N/A | <1ms | Free |
| `text_extractor_v1` | Dense embeddings via E5-Large multilingual. Supports chunking (characters, words, sentences, paragraphs, pages) and LLM extraction via `response_shape`. | text, string | 1024 | ~5ms/doc | Free |
| `multimodal_extractor_v1` | Unified embeddings for video, image, text, and GIF via Google Vertex AI. Videos decomposed by time, scene, or silence with transcription, OCR, and thumbnails. | video, image, text, string | 1408 | 0.5-2x realtime | $0.01-0.15/min |
Where Extractors Run
- Mixpeek submits Ray jobs tier-by-tier using manifests stored in S3.
- Workers download per-extractor Parquet artifacts, execute model inference (GPU if available), and write results to Qdrant and MongoDB.
- Engine pollers respect dependency tiers so downstream collections only run when upstream data is ready.
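The tier gating can be pictured with a toy scheduler; this is illustrative only, not the engine's actual Ray submission code:

```python
from collections import defaultdict

# Toy illustration of tier-by-tier execution: collections are grouped by
# dependency tier, and tier N+1 never starts before tier N completes.
collections = [
    {"id": "raw_docs", "tier": 0},
    {"id": "embeddings", "tier": 1},    # depends on raw_docs
    {"id": "rerank_index", "tier": 2},  # depends on embeddings
]

tiers = defaultdict(list)
for c in collections:
    tiers[c["tier"]].append(c["id"])

for tier in sorted(tiers):
    # The real system submits a Ray job per collection here and waits
    # for upstream completion before moving downstream.
    print(f"tier {tier}: running {tiers[tier]}")
```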
Configuring Extractors
Extractors are attached to a collection via the singular `feature_extractor` field:
- Input mappings use JSONPath-like paths (`metadata.category`, `$.payload.description`) to map object fields to extractor inputs.
- Field passthrough copies original object fields into the resulting documents.
- Parameters set model-specific options (e.g., alternative embedding models, chunk strategies).
- The collection exposes an `output_schema` that merges passthrough fields and extractor outputs, so you can validate downstream systems before processing begins.
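A hedged sketch of what attaching an extractor might look like; the endpoint, request shape, and parameter names below are assumptions extrapolated from the field names above, so consult the API reference for the authoritative schema:

```python
import requests

# Hypothetical collection definition pinning text_extractor_v1, with an
# input mapping, a passthrough field, and a chunking parameter.
collection = {
    "name": "articles",
    "feature_extractor": {
        "name": "text_extractor_v1",  # version-pinned for reproducibility
        "input_mappings": {"text": "$.payload.description"},
        "field_passthrough": ["metadata.category"],
        "parameters": {"chunk_strategy": "paragraphs"},
    },
}

resp = requests.post(
    "https://api.mixpeek.com/v1/collections",
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # auth scheme is an assumption
    json=collection,
)
resp.raise_for_status()
print(resp.json())
```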
Feature URIs
Every extractor output publishes a URI (`mixpeek://{name}@{version}/{output}`), for example:
- `mixpeek://text_extractor@v1/text_embedding`
- `mixpeek://video_extractor@v1/scene_embedding`
- `mixpeek://splade_extractor@v1/splade_vector`
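Because the format is fixed, URIs can be parsed mechanically; a small sketch:

```python
import re

# Split mixpeek://{name}@{version}/{output} into its three components.
FEATURE_URI = re.compile(r"^mixpeek://(?P<name>[^@]+)@(?P<version>[^/]+)/(?P<output>.+)$")

match = FEATURE_URI.match("mixpeek://text_extractor@v1/text_embedding")
assert match is not None
print(match.group("name"), match.group("version"), match.group("output"))
# -> text_extractor v1 text_embedding
```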
Pipeline Flow
- API flattens manifest → extractor row artifacts (Parquet) and stores them in S3.
- Ray poller discovers a pending batch and submits a job.
- Worker loads dataset, runs the extractor flow, and emits features + passthrough fields.
- QdrantBatchProcessor writes vectors/payloads to Qdrant, emits webhook events, and updates index signatures.
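The final step is an upsert of vectors plus payloads; as a generic illustration using the open-source `qdrant-client` (this is not Mixpeek's internal `QdrantBatchProcessor`, just the shape of the write it performs):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # placeholder instance

# Write one document's vector and its passthrough payload.
client.upsert(
    collection_name="articles",
    points=[
        PointStruct(
            id=1,
            vector=[0.1] * 1024,  # e.g. a 1024-dim text_extractor_v1 embedding
            payload={"category": "news"},
        )
    ],
)
```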
Performance & Scaling
- GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing.
- Ray Data handles batching, shuffling, and parallelization automatically.
- Autoscaling policies maintain target utilization (`0.7` CPU, `0.8` GPU by default).
- Inference cache short-circuits repeated model calls when inputs match exactly.
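The exact-match cache behaves like content-addressed memoization; a minimal sketch of the idea (illustrative, not Mixpeek's implementation):

```python
import hashlib
import json

_cache: dict[str, list[float]] = {}

def run_model(text: str) -> list[float]:
    return [float(len(text))]  # stand-in for real inference

def cached_embed(text: str, model_version: str) -> list[float]:
    # Key on the exact input plus the extractor version, so a version
    # bump never reuses stale vectors.
    key = hashlib.sha256(json.dumps([model_version, text]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = run_model(text)
    return _cache[key]

print(cached_embed("hello", "text_extractor_v1"))
```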
Operational Tips
- Pin versions – upgrade by creating a new collection or reprocessing existing documents.
- Batch uploads – keep batches to manageable sizes (1k–10k objects) to maximize parallelism; see the sketch after this list.
- Monitor analytics – `/v1/analytics/extractors/performance` exposes throughput and error metrics (stub available now; enable analytics to populate).
- Use passthrough fields – propagate metadata you'll need for retrieval filters or taxonomies.
- Inspect features – `GET /v1/collections/{id}/features` lists the feature URIs, dimensions, and descriptions produced for that collection.
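A hedged batching sketch for the upload tip above; the ingest endpoint and request body are assumptions, so check the API reference for the real object-ingest call:

```python
import requests

API = "https://api.mixpeek.com/v1"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}  # auth scheme is an assumption

def upload_in_batches(collection_id: str, objects: list[dict], batch_size: int = 5_000) -> None:
    # Split a large object list into 1k-10k chunks so the engine can
    # parallelize work across workers.
    for start in range(0, len(objects), batch_size):
        batch = objects[start:start + batch_size]
        resp = requests.post(
            f"{API}/collections/{collection_id}/objects",  # hypothetical endpoint
            headers=HEADERS,
            json={"objects": batch},
        )
        resp.raise_for_status()
```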

