Fetch the authoritative list of extractors at runtime with `GET /v1/collections/features/extractors`. The API returns extractor ids, versions, supported input types, schemas, and default parameters.
Catalog Highlights
| Extractor | Output | Key Models | Typical Latency | Use Cases |
|---|---|---|---|---|
| `text_extractor@v1` | Dense text embeddings plus language metadata | multilingual-e5-large-instruct, gte-modernbert-base, OpenAI text-embedding-3-* | 10–50 ms/doc (GPU) | Semantic search, Q&A, classification |
| `video_extractor@v1` | Scene slices, thumbnails, frame embeddings | CLIP, Pyannote, FFmpeg | 2–10 s/video | Video retrieval, moderation, scene detection |
| `splade_extractor@v1` | Sparse lexical vectors (indices + weights) | naver/splade-v1 | 5–20 ms/doc | Hybrid search, rare-term recall |
| `colbert_extractor@v1` | Multi-vector token embeddings | colbert_ir/colbertv2 | 20–80 ms/doc | Late-interaction search, passage QA |
| `whisper_large_v3_turbo` | Transcripts with timestamps | OpenAI Whisper | 500–2000 ms/min of audio | Speech search, diarization |
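The catalog above can also be fetched and filtered programmatically. The sketch below assumes a response shape (a list of objects with `name`, `version`, and `supported_input_types` keys); the actual schema is whatever `GET /v1/collections/features/extractors` documents.

```python
# Hedged sketch: filter a fetched extractor catalog by input type.
# The catalog shape here is an assumption for illustration only.
def extractors_for_input(catalog: list[dict], input_type: str) -> list[str]:
    """Return extractor ids (name@version) that accept the given input type."""
    return [
        f"{e['name']}@{e['version']}"
        for e in catalog
        if input_type in e.get("supported_input_types", [])
    ]

# A hypothetical catalog payload, mirroring the table above.
sample_catalog = [
    {"name": "text_extractor", "version": "v1", "supported_input_types": ["text"]},
    {"name": "splade_extractor", "version": "v1", "supported_input_types": ["text"]},
    {"name": "video_extractor", "version": "v1", "supported_input_types": ["video"]},
]
```

Filtering `sample_catalog` by `"text"` yields the two text-capable extractor ids in catalog order.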
Where Extractors Run
- Mixpeek submits Ray jobs tier-by-tier using manifests stored in S3.
- Workers download per-extractor Parquet artifacts, execute model inference (GPU if available), and write results to Qdrant and MongoDB.
- Engine pollers respect dependency tiers so downstream collections only run when upstream data is ready.
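The tier-by-tier ordering the pollers enforce can be sketched as a simple dependency grouping: a collection's tier is one more than its deepest upstream. The collection names and dependency graph below are hypothetical.

```python
from collections import defaultdict

def assign_tiers(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group collections into tiers so each runs only after all upstreams.

    deps maps collection name -> list of upstream collections it depends on.
    Tier 0 has no dependencies; tier N depends on something in tier N-1.
    """
    tier_of: dict[str, int] = {}

    def tier(name: str) -> int:
        if name not in tier_of:
            ups = deps.get(name, [])
            tier_of[name] = 0 if not ups else 1 + max(tier(u) for u in ups)
        return tier_of[name]

    grouped: dict[int, list[str]] = defaultdict(list)
    for name in deps:
        grouped[tier(name)].append(name)
    return [sorted(grouped[t]) for t in sorted(grouped)]

# Hypothetical graph: "captions" depends on "frames"; "index" on both.
example = {"frames": [], "captions": ["frames"], "index": ["frames", "captions"]}
```

Submitting jobs tier by tier (all of tier 0, then tier 1, and so on) guarantees downstream collections only see upstream data that is already written.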
Configuring Extractors
Extractors are attached to a collection via the singular `feature_extractor` field:
- Input mappings use JSONPath-like paths (`metadata.category`, `$.payload.description`) to map object fields to extractor inputs.
- Field passthrough copies original object fields into the resulting documents.
- Parameters set model-specific options (e.g., alternative embedding models, chunk strategies).
- The collection publishes an `output_schema` that merges passthrough fields and extractor outputs, so you can validate downstream systems before processing begins.
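Put together, a collection's extractor attachment might look like the sketch below. Only the `feature_extractor` field name comes from this page; the nested keys (`input_mappings`, `passthrough_fields`, `parameters`) and their values are illustrative assumptions, so check the API reference for the exact schema.

```python
# Hypothetical collection configuration attaching text_extractor@v1.
# Nested field names are assumptions for illustration, not the documented schema.
collection_config = {
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {
            # JSONPath-like paths into the source object
            "text": "$.payload.description",
            "category": "metadata.category",
        },
        # Original object fields copied into the resulting documents
        "passthrough_fields": ["metadata.category", "metadata.author"],
        # Model-specific options
        "parameters": {
            "model": "multilingual-e5-large-instruct",
            "chunk_strategy": "sentence",
        },
    }
}
```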
Feature URIs
Every extractor output publishes a URI (`mixpeek://{name}@{version}/{output}`), for example:
- `mixpeek://text_extractor@v1/text_embedding`
- `mixpeek://video_extractor@v1/scene_embedding`
- `mixpeek://splade_extractor@v1/splade_vector`
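Since the URI template is fixed, the three parts are easy to recover with a regular expression. A minimal sketch:

```python
import re

# Matches mixpeek://{name}@{version}/{output} per the template above.
FEATURE_URI = re.compile(r"^mixpeek://(?P<name>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$")

def parse_feature_uri(uri: str) -> dict[str, str]:
    """Split a mixpeek:// feature URI into its name, version, and output parts."""
    m = FEATURE_URI.match(uri)
    if not m:
        raise ValueError(f"not a feature URI: {uri}")
    return m.groupdict()
```

For example, `parse_feature_uri("mixpeek://text_extractor@v1/text_embedding")` returns `{"name": "text_extractor", "version": "v1", "output": "text_embedding"}`.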
Pipeline Flow
- API flattens manifest → extractor row artifacts (Parquet) and stores them in S3.
- Ray poller discovers a pending batch and submits a job.
- Worker loads dataset, runs the extractor flow, and emits features + passthrough fields.
- QdrantBatchProcessor writes vectors/payloads to Qdrant, emits webhook events, and updates index signatures.
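The hand-offs between those steps can be shown as plain functions. This is purely a data-flow sketch under invented shapes; the real system moves Parquet artifacts through S3, Ray, and Qdrant rather than Python lists.

```python
# Minimal sketch of the pipeline stages; all shapes here are hypothetical.
def flatten_manifest(objects):
    """Step 1: flatten objects into per-extractor row artifacts."""
    return [{"row_id": i, "input": o} for i, o in enumerate(objects)]

def poll_pending(artifacts):
    """Step 2: the poller discovers a pending batch (a no-op stand-in)."""
    return artifacts

def run_extractor(rows):
    """Step 3: emit a feature per row (a fake length-based embedding)."""
    return [{**r, "feature": [float(len(str(r["input"])))]} for r in rows]

def write_batch(features):
    """Step 4: stand-in for writing vectors/payloads; returns the count written."""
    return len(features)

written = write_batch(run_extractor(poll_pending(flatten_manifest(["a", "bb"]))))
```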
Performance & Scaling
- GPU workers deliver 5–10× faster throughput for embeddings, reranking, and video processing.
- Ray Data handles batching, shuffling, and parallelization automatically.
- Autoscaling policies maintain target utilization (`0.7` CPU, `0.8` GPU by default).
- Inference cache short-circuits repeated model calls when inputs match exactly.
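The exact-match cache behavior amounts to memoizing on a hash of the model and input. The sketch below illustrates the idea only; it is not the service's actual cache implementation.

```python
import hashlib
import json

_cache: dict[str, list[float]] = {}

def cached_infer(model: str, text: str, infer) -> list[float]:
    """Short-circuit repeated model calls when (model, input) match exactly."""
    key = hashlib.sha256(json.dumps([model, text]).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = infer(text)  # cache miss: run the model once
    return _cache[key]

# A fake model that records how many times it actually runs.
calls: list[str] = []
def fake_infer(text: str) -> list[float]:
    calls.append(text)
    return [float(len(text))]
```

Calling `cached_infer("m", "hello", fake_infer)` twice returns the same vector both times but invokes `fake_infer` only once.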
Operational Tips
- Pin versions – upgrade by creating a new collection or reprocessing existing documents.
- Batch uploads – keep batches to manageable sizes (1k–10k objects) to maximize parallelism.
- Monitor analytics – `/v1/analytics/extractors/performance` exposes throughput and error metrics (stub available now; enable analytics to populate).
- Use passthrough fields – propagate metadata you'll need for retrieval filters or taxonomies.
- Inspect features – `GET /v1/collections/{id}/features` lists the feature URIs, dimensions, and descriptions produced for that collection.
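The batch-size tip above is a one-liner to enforce client-side. A sketch, with the default of 5,000 chosen arbitrarily from the recommended 1k–10k range:

```python
def chunked(objects: list, batch_size: int = 5000) -> list[list]:
    """Split an upload into batches of at most batch_size objects (1k-10k recommended)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return [objects[i:i + batch_size] for i in range(0, len(objects), batch_size)]
```

Uploading 12,000 objects this way yields three requests of 5,000, 5,000, and 2,000 objects, keeping each batch small enough to parallelize well.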

