Collections convert objects into documents by running configured feature extractors on the Engine. They define input mappings, passthrough fields, and enrichment hooks. Documents inherit lineage back to the original object no matter how many tiers of collections you chain together.

Create a Collection

curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-documents",
    "description": "Semantic embeddings for product copy",
    "source": {
      "type": "bucket",
      "bucket_id": "<bucket_id>"
    },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_text"
      },
      "field_passthrough": [
        { "source_path": "metadata.category" },
        { "source_path": "metadata.brand" }
      ],
      "parameters": {
        "model": "multilingual-e5-large-instruct",
        "normalize": true
      }
    },
    "taxonomy_applications": [
      {
        "taxonomy_id": "tax_product_categories",
        "execution_mode": "materialize"
      }
    ]
  }'
Key fields:
  • source – either { "type": "bucket", "bucket_id": ... } or { "type": "collection", "collection_id": ... }.
  • feature_extractor – singular object in v2 (breaking change from v1). Configure model, input mappings, passthrough fields, and runtime parameters.
  • taxonomy_applications – optional list of taxonomies to apply, either enriched at ingest time (execution_mode: materialize) or attached at query time (execution_mode: on_demand).
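For multi-tier pipelines, the source block can point at an existing collection instead of a bucket. A minimal sketch, reusing the extractor configuration from the example above (the collection name and upstream ID placeholder are illustrative):

```shell
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-summaries",
    "description": "Second-tier collection fed by product-documents",
    "source": {
      "type": "collection",
      "collection_id": "<upstream_collection_id>"
    },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_text"
      }
    }
  }'
```

Documents produced by this collection carry lineage back through the upstream collection to the original bucket object.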

Deterministic Output Schema

Collections compute an output_schema immediately upon creation based on:
  1. Field passthrough definitions
  2. Feature extractor output schema
You can query it via GET /v1/collections/{collection_id} before any documents exist, which makes it easy to wire up downstream systems in advance.
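A quick way to inspect the schema right after creation (the jq filter assumes the response exposes the schema under an output_schema key, consistent with the field name above):

```shell
curl -sS "$MP_API_URL/v1/collections/<collection_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  | jq '.output_schema'
```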

Multi-Tier Processing

Collections can reference other collections as sources to build decomposition trees:
Bucket (videos) ──▶ Collection A (frames)
                    └─▶ Collection B (scenes)
                           └─▶ Collection C (scene embeddings)
  • Tiered processing respects dependencies. Ray processes a tier only after upstream tiers finish.
  • Every document stores root_object_id, source_collection_id, source_document_id, lineage_path, and processing_tier.
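Putting the lineage fields together, a tier-2 document (a scene produced by Collection B above) might carry metadata shaped like this sketch; the ID values are illustrative:

```json
{
  "root_object_id": "obj_video_123",
  "source_collection_id": "col_frames_a",
  "source_document_id": "doc_frame_0042",
  "lineage_path": ["obj_video_123", "doc_frame_0042"],
  "processing_tier": 2
}
```

root_object_id always points at the original bucket object, while source_document_id points one tier up.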

Collection Operations

  • Get collection: GET /v1/collections/{id}
  • List collections: POST /v1/collections/list (filter by document count, taxonomy count, etc.)
  • Update taxonomy applications / metadata / enabled flag: PATCH /v1/collections/{id}
  • Delete collection: DELETE /v1/collections/{id}
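For example, a collection can be temporarily disabled via PATCH. This sketch assumes the request body mirrors the updatable fields listed above (the enabled flag); check the API reference for the exact body shape:

```shell
curl -sS -X PATCH "$MP_API_URL/v1/collections/<collection_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "enabled": false }'
```

Re-enabling is the same call with "enabled": true; taxonomy applications and metadata can be updated through the same endpoint.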

Documents Produced

Each document includes:
  • Passthrough metadata (metadata.category, metadata.brand, etc.)
  • Feature URI outputs (e.g., text_extractor_v1_embedding)
  • source_blobs referencing original object blobs
  • Optional document_blobs with extractor artifacts (thumbnails, intermediate results)
  • internal_metadata.processing_history when include_processing_history is set on batch submission
Use /v1/collections/{collection_id}/documents/list for structured filters, pagination, and presigned URLs, or /v1/retrievers for pipeline-based retrieval.
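A sketch of a filtered, paginated document listing. The filters and limit field names in the body are assumptions about the list-endpoint contract, not confirmed by this page; the endpoint path matches the one above:

```shell
curl -sS -X POST "$MP_API_URL/v1/collections/<collection_id>/documents/list" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": { "metadata.category": "apparel" },
    "limit": 50
  }'
```

Passthrough fields such as metadata.category are what make structured filters like this possible, which is why exposing them at collection-creation time matters.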

Best Practices

  1. Keep collections focused—one extractor per collection (v2) encourages composable pipelines.
  2. Use passthrough fields to expose object metadata for taxonomy joins and retriever filters.
  3. Leverage source.type=collection for multi-tier workflows (e.g., video → frames → scenes).
  4. Attach taxonomies either as materialized enrichments or retriever stages, depending on latency and cost requirements.
  5. Monitor document counts with GET /v1/collections/{id} to plan Qdrant shard sizing.