Collections convert objects into documents by running configured feature extractors on the Engine. They define input mappings, passthrough fields, and enrichment hooks. Documents inherit lineage back to the original object no matter how many tiers of collections you chain together.
Create a Collection
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-documents",
    "description": "Semantic embeddings for product copy",
    "source": {
      "type": "bucket",
      "bucket_ids": ["<bucket_id>"]
    },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_text"
      },
      "field_passthrough": [
        { "source_path": "metadata.category" },
        { "source_path": "metadata.brand" }
      ],
      "parameters": {
        "model": "multilingual-e5-large-instruct",
        "normalize": true
      }
    },
    "taxonomy_applications": [
      {
        "taxonomy_id": "tax_product_categories",
        "execution_mode": "materialize"
      }
    ]
  }'
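For programmatic creation, the same request body can be assembled before POSTing it with an HTTP client; a minimal Python sketch (the payload fields mirror the curl example above, while the client call itself is left to you):

```python
import json

# Collection-creation payload, mirroring the curl example above.
payload = {
    "collection_name": "product-documents",
    "description": "Semantic embeddings for product copy",
    "source": {"type": "bucket", "bucket_ids": ["<bucket_id>"]},
    "feature_extractor": {
        "feature_extractor_name": "text_extractor",
        "version": "v1",
        "input_mappings": {"text": "product_text"},
        "field_passthrough": [
            {"source_path": "metadata.category"},
            {"source_path": "metadata.brand"},
        ],
        "parameters": {"model": "multilingual-e5-large-instruct", "normalize": True},
    },
    "taxonomy_applications": [
        {"taxonomy_id": "tax_product_categories", "execution_mode": "materialize"}
    ],
}

# Serialize for the request body; POST to $MP_API_URL/v1/collections with the
# Authorization and X-Namespace headers shown above.
body = json.dumps(payload)
```
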
Key fields:
- source – either { "type": "bucket", "bucket_ids": [...] } or { "type": "collection", "collection_id": ... }.
- source.source_namespace_id – optional. References buckets from a different namespace in the same organization. See Cross-Namespace Sources below.
- feature_extractor – a singular object in v2 (a breaking change from v1). Configures the model, input mappings, passthrough fields, and runtime parameters.
- taxonomy_applications – optional list of taxonomies to materialize (materialize) or attach on demand (on_demand).
Deterministic Output Schema
Collections compute an output_schema immediately upon creation based on:
- Field passthrough definitions
- Feature extractor output schema
You can query it via GET /v1/collections/{collection_id} before any documents exist, which makes it easy to set up downstream systems ahead of time.
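To illustrate why a deterministic schema helps, the sketch below derives the field lists a downstream store would provision; the output_schema shape shown here is an illustrative assumption, not the documented response format:

```python
# Illustrative output_schema (an assumption -- the real response shape may
# differ): passthrough fields plus the extractor's declared outputs.
output_schema = {
    "metadata.category": {"type": "string"},
    "metadata.brand": {"type": "string"},
    "text_extractor_v1_embedding": {"type": "float_vector", "dims": 1024},
}

# Split fields so a downstream system can create vector indexes and scalar
# filters before a single document exists.
vector_fields = [k for k, v in output_schema.items() if v["type"] == "float_vector"]
scalar_fields = [k for k, v in output_schema.items() if v["type"] != "float_vector"]
```
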
Multi-Tier Processing
Collections can reference other collections as sources to build decomposition trees:
Bucket (videos) ──▶ Collection A (frames)
                        └─▶ Collection B (scenes)
                                └─▶ Collection C (scene embeddings)
- Tiered processing respects dependencies: Ray processes a tier only after upstream tiers finish.
- Every document stores root_object_id, source_collection_id, source_document_id, lineage_path, and processing_tier.
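The lineage fields above let you walk any document back to its tier-0 ancestor. A toy sketch with hypothetical document IDs (the field names come from the list above; the in-memory index is for illustration only):

```python
# Toy in-memory document index keyed by document ID.
docs = {
    "doc_frame_1": {"root_object_id": "obj_video_1",
                    "source_document_id": None, "processing_tier": 0},
    "doc_scene_1": {"root_object_id": "obj_video_1",
                    "source_document_id": "doc_frame_1", "processing_tier": 1},
    "doc_embed_1": {"root_object_id": "obj_video_1",
                    "source_document_id": "doc_scene_1", "processing_tier": 2},
}

def lineage(doc_id):
    """Follow source_document_id links until the tier-0 document is reached."""
    chain = [doc_id]
    while docs[chain[-1]]["source_document_id"]:
        chain.append(docs[chain[-1]]["source_document_id"])
    return chain

# lineage("doc_embed_1") -> ["doc_embed_1", "doc_scene_1", "doc_frame_1"]
```
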
Cross-Namespace Sources
Collections can process buckets from a different namespace within the same organization using source_namespace_id. Documents still land in the collection’s own namespace.
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $NAMESPACE_B" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "faces-from-shared-videos",
    "source": {
      "type": "bucket",
      "bucket_ids": ["<bucket_id_in_namespace_a>"],
      "source_namespace_id": "<namespace_a_id>"
    },
    "feature_extractor": {
      "feature_extractor_name": "face_identity_extractor",
      "version": "v1",
      "input_mappings": { "video": "video" }
    }
  }'
Use cases:
- Shared raw data – upload videos once in namespace A, run different extractors in namespaces B and C.
- Team isolation – each team owns a namespace but references a central data bucket.
- Environment promotion – process staging buckets from a production namespace.
source_namespace_id is only valid for bucket sources (type: "bucket"). Collection-to-collection sources must be within the same namespace.
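A client-side guard for this rule might look like the following sketch (the helper function is hypothetical, not part of the API):

```python
def validate_source(source):
    """Enforce the rule above: source_namespace_id is only valid when
    source type is "bucket"; collection sources must stay in-namespace."""
    if "source_namespace_id" in source and source.get("type") != "bucket":
        raise ValueError(
            "source_namespace_id is only valid for bucket sources"
        )
    return source

# Valid: cross-namespace bucket source.
validate_source({"type": "bucket", "bucket_ids": ["b1"],
                 "source_namespace_id": "ns_a"})
```

Running the check before submission surfaces the error locally instead of as an API rejection.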
Collection Operations
- Get collection: GET /v1/collections/{id}
- List collections: POST /v1/collections/list (filter by document count, taxonomy count, etc.)
- Update taxonomy applications, metadata, or the enabled flag: PATCH /v1/collections/{id}
- Delete collection: DELETE /v1/collections/{id}
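These routes can be centralized in a small client helper; a hypothetical sketch that maps each operation above to its method and path:

```python
def collection_request(op, collection_id=None):
    """Return the (HTTP method, path) pair for a collection operation,
    matching the routes listed above."""
    routes = {
        "get": ("GET", f"/v1/collections/{collection_id}"),
        "list": ("POST", "/v1/collections/list"),
        "update": ("PATCH", f"/v1/collections/{collection_id}"),
        "delete": ("DELETE", f"/v1/collections/{collection_id}"),
    }
    return routes[op]

# collection_request("update", "col_123") -> ("PATCH", "/v1/collections/col_123")
```
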
Documents Produced
Each document includes:
- Passthrough metadata (metadata.category, metadata.brand, etc.)
- Feature URI outputs (e.g., text_extractor_v1_embedding)
- source_blobs referencing the original object's blobs
- Optional document_blobs with extractor artifacts (thumbnails, intermediate results)
- internal_metadata.processing_history when include_processing_history is set on batch submission
Use /v1/collections/{collection_id}/documents/list for structured filters, pagination, and presigned URLs, or /v1/retrievers for pipeline-based retrieval.
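A request body for the documents list endpoint might look like the sketch below; the filter and pagination field names are assumptions for illustration, not confirmed parameter names:

```python
import json

# Hypothetical POST body for /v1/collections/{collection_id}/documents/list:
# filter on a passthrough field, paginate, and ask for presigned URLs.
list_request = {
    "filters": {"metadata.category": "shoes"},  # assumed filter shape
    "limit": 50,
    "offset": 0,
    "return_presigned_urls": True,  # assumed flag name
}

body = json.dumps(list_request)
```
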
Best Practices
- Keep collections focused: one extractor per collection (v2) encourages composable pipelines.
- Use passthrough fields to expose object metadata for taxonomy joins and retriever filters.
- Leverage source.type=collection for multi-tier workflows (e.g., video → frames → scenes).
- Attach taxonomies either as materialized enrichments or as retriever stages, depending on latency and cost requirements.
- Monitor document counts with GET /v1/collections/{id} to plan Qdrant shard sizing.