Create a Collection
- source: either { "type": "bucket", "bucket_id": ... } or { "type": "collection", "collection_id": ... }.
- feature_extractor: singular object in v2 (breaking change from v1). Configure model, input mappings, passthrough fields, and runtime parameters.
- taxonomy_applications: optional list of taxonomies to materialize (materialize) or attach on demand (on_demand).
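As a rough illustration of the three top-level fields, the sketch below assembles and validates a creation body in Python. Only source, feature_extractor, taxonomy_applications, and the two source shapes come from the text above. The keys inside feature_extractor and inside each taxonomy application (name, input_mappings, field_passthrough, parameters, taxonomy_id, execution_mode) are hypothetical placeholders, and the endpoint that accepts this body is not shown here.

```python
import json


def build_collection_payload(source, feature_extractor, taxonomy_applications=None):
    """Assemble a v2 collection body from the three documented top-level keys."""
    source_type = source.get("type")
    if source_type == "bucket":
        required_key = "bucket_id"
    elif source_type == "collection":
        required_key = "collection_id"
    else:
        raise ValueError(f"unsupported source type: {source_type!r}")
    if required_key not in source:
        raise ValueError(f"source of type {source_type!r} needs {required_key!r}")

    # v2 accepts a single extractor object (a breaking change from v1).
    if not isinstance(feature_extractor, dict):
        raise TypeError("feature_extractor must be a single object in v2")

    payload = {"source": dict(source), "feature_extractor": dict(feature_extractor)}
    if taxonomy_applications:
        payload["taxonomy_applications"] = list(taxonomy_applications)
    return payload


# Hypothetical inner keys, shown only to make the example concrete.
example_payload = build_collection_payload(
    source={"type": "bucket", "bucket_id": "bkt_123"},
    feature_extractor={
        "name": "text_extractor",                        # assumed key
        "input_mappings": {"text": "description"},       # assumed key
        "field_passthrough": ["category", "brand"],      # assumed key
        "parameters": {},                                # assumed key
    },
    taxonomy_applications=[
        {"taxonomy_id": "tax_123", "execution_mode": "materialize"}  # assumed keys
    ],
)
print(json.dumps(example_payload, indent=2))
```

Validating the source shape client-side catches a mismatched type and ID key before the request is sent.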
Deterministic Output Schema
Collections compute an output_schema immediately upon creation, based on:
- Field passthrough definitions
- Feature extractor output schema
Retrieve the schema with GET /v1/collections/{collection_id} before any documents exist, which is useful for setting up downstream systems.
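To see why the schema can be known before ingestion, consider this minimal sketch: the output depends only on configuration (passthrough fields plus the extractor's declared outputs), never on the documents. The merge rule, the collision check, and the type names ("string", "vector") are assumptions for illustration, not the service's actual schema format.

```python
def derive_output_schema(passthrough_fields, extractor_output_schema):
    """Merge passthrough field types with extractor outputs into one schema.

    Both arguments map field name -> type name. Keys are sorted so that
    identical configuration always yields an identical schema.
    """
    overlap = set(passthrough_fields) & set(extractor_output_schema)
    if overlap:
        # Assumed policy: refuse ambiguous names rather than pick a winner.
        raise ValueError(f"field names defined twice: {sorted(overlap)}")
    merged = {**passthrough_fields, **extractor_output_schema}
    return {name: merged[name] for name in sorted(merged)}


schema = derive_output_schema(
    passthrough_fields={"metadata.category": "string", "metadata.brand": "string"},
    extractor_output_schema={"text_extractor_v1_embedding": "vector"},
)
print(schema)
```

Because nothing here reads document data, a downstream system could provision tables or indexes from the same result as soon as the collection exists.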
Multi-Tier Processing
Collections can reference other collections as sources to build decomposition trees:
- Tiered processing respects dependencies. Ray processes a tier only after upstream tiers finish.
- Every document stores root_object_id, source_collection_id, source_document_id, lineage_path, and processing_tier.
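The dependency rule above can be sketched as a small tier assignment: a collection fed by a bucket sits in the first tier, and a collection fed by another collection sits one tier below its upstream. This is an illustrative model, not the scheduler itself. The zero-based numbering and the collection names are assumptions.

```python
def assign_processing_tiers(sources):
    """Map each collection to a tier number.

    `sources` maps collection_id -> upstream collection_id, or None when the
    collection reads directly from a bucket. Tier 0 is an assumed starting
    point for bucket-backed collections.
    """
    tiers = {}

    def tier_of(collection_id, seen=()):
        if collection_id in tiers:
            return tiers[collection_id]
        if collection_id in seen:
            raise ValueError(f"cycle detected at {collection_id!r}")
        upstream = sources[collection_id]
        if upstream is None:
            tier = 0
        else:
            tier = tier_of(upstream, seen + (collection_id,)) + 1
        tiers[collection_id] = tier
        return tier

    for collection_id in sources:
        tier_of(collection_id)
    return tiers


def processing_order(sources):
    """Group collections into batches; each batch runs after the previous one."""
    by_tier = {}
    for collection_id, tier in assign_processing_tiers(sources).items():
        by_tier.setdefault(tier, []).append(collection_id)
    return [sorted(by_tier[tier]) for tier in sorted(by_tier)]


# video -> frames -> scenes, with hypothetical collection names
order = processing_order({"videos": None, "frames": "videos", "scenes": "frames"})
print(order)
```

Each inner list holds collections that have no dependency on one another, so they could run together once the preceding list has finished.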
Collection Operations
- Get collection: GET /v1/collections/{id}
- List collections: POST /v1/collections/list (filter by document count, taxonomy count, etc.)
- Update taxonomy applications / metadata / enabled flag: PATCH /v1/collections/{id}
- Delete collection: DELETE /v1/collections/{id}
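The four operations map onto HTTP methods and paths as in the sketch below, which builds (but does not send) requests with Python's standard library. The base URL, the bearer-token header, and the {"enabled": False} body are assumptions. Only the methods and paths come from the list above.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint

ROUTES = {
    "get": ("GET", "/v1/collections/{id}"),
    "list": ("POST", "/v1/collections/list"),
    "update": ("PATCH", "/v1/collections/{id}"),
    "delete": ("DELETE", "/v1/collections/{id}"),
}


def collection_request(operation, collection_id=None, body=None, token="YOUR_API_KEY"):
    """Build a urllib Request for one of the documented collection operations."""
    method, path = ROUTES[operation]
    if "{id}" in path:
        if not collection_id:
            raise ValueError(f"{operation!r} requires a collection_id")
        path = path.replace("{id}", collection_id)

    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}  # assumed auth scheme
    if data is not None:
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


# Hypothetical body: the exact field name for the enabled flag is an assumption.
update_req = collection_request("update", "col_123", body={"enabled": False})
print(update_req.get_method(), update_req.full_url)
```

Passing the result to urllib.request.urlopen would send it. Keeping construction separate from sending makes the routing easy to test offline.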
Documents Produced
Each document includes:
- Passthrough metadata (metadata.category, metadata.brand, etc.)
- Feature URI outputs (e.g., text_extractor_v1_embedding)
- source_blobs referencing original object blobs
- Optional document_blobs with extractor artifacts (thumbnails, intermediate results)
- internal_metadata.processing_history when include_processing_history is set on batch submission
Use /v1/collections/{collection_id}/documents/list for structured filters, pagination, and presigned URLs, or /v1/retrievers for pipeline-based retrieval.
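A paginated listing loop might look like the following sketch. The request and response field names (filters, limit, cursor, results, next_cursor) are hypothetical, since the listing schema is not described here, and the transport is injected as a function so the loop can be exercised without network access.

```python
def iter_documents(fetch_page, collection_id, filters=None, page_size=100):
    """Yield documents one at a time, following cursors until none remain.

    `fetch_page(path, body)` is a caller-supplied function that performs the
    HTTP call and returns the decoded JSON response as a dict.
    """
    path = f"/v1/collections/{collection_id}/documents/list"
    cursor = None
    while True:
        body = {"filters": filters or {}, "limit": page_size}  # assumed keys
        if cursor is not None:
            body["cursor"] = cursor                            # assumed key
        page = fetch_page(path, body)
        yield from page.get("results", [])                     # assumed key
        cursor = page.get("next_cursor")                       # assumed key
        if not cursor:
            break


def fake_fetch(path, body):
    """Stand-in transport returning two canned pages."""
    pages = {
        None: {"results": [{"document_id": "d1"}, {"document_id": "d2"}],
               "next_cursor": "c1"},
        "c1": {"results": [{"document_id": "d3"}], "next_cursor": None},
    }
    return pages[body.get("cursor")]


document_ids = [doc["document_id"] for doc in iter_documents(fake_fetch, "col_123")]
print(document_ids)
```

Swapping fake_fetch for a function that calls the real endpoint leaves the loop unchanged.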
Best Practices
- Keep collections focused: one extractor per collection (v2) encourages composable pipelines.
- Use passthrough fields to expose object metadata for taxonomy joins and retriever filters.
- Leverage source.type=collection for multi-tier workflows (e.g., video → frames → scenes).
- Attach taxonomies either as materialized enrichments or retriever stages, depending on latency and cost requirements.
- Monitor document counts with GET /v1/collections/{id} to plan Qdrant shard sizing.

