Collections convert objects into documents by running configured feature extractors on the Engine. They define input mappings, passthrough fields, and enrichment hooks. Documents inherit lineage back to the original object no matter how many tiers of collections you chain together.

Create a Collection

curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-documents",
    "description": "Semantic embeddings for product copy",
    "source": {
      "type": "bucket",
      "bucket_id": "<bucket_id>"
    },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_text"
      },
      "field_passthrough": [
        { "source_path": "metadata.category" },
        { "source_path": "metadata.brand" }
      ],
      "parameters": {
        "model": "multilingual-e5-large-instruct",
        "normalize": true
      }
    },
    "taxonomy_applications": [
      {
        "taxonomy_id": "tax_product_categories",
        "execution_mode": "materialize"
      }
    ]
  }'
Key fields:
  • source – either { "type": "bucket", "bucket_id": ... } or { "type": "collection", "collection_id": ... }.
  • feature_extractor – singular object in v2 (breaking change from v1). Configure model, input mappings, passthrough fields, and runtime parameters.
  • taxonomy_applications – optional list of taxonomies to apply, either enriched at ingest time (execution_mode: materialize) or attached at query time (execution_mode: on_demand).
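For multi-tier pipelines, the source block can point at an existing collection instead of a bucket. A minimal sketch, reusing the extractor configuration from the example above (the collection name and upstream ID placeholder are illustrative):

```shell
curl -sS -X POST "$MP_API_URL/v1/collections" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_name": "product-summaries",
    "description": "Second-tier collection fed by product-documents",
    "source": {
      "type": "collection",
      "collection_id": "<upstream_collection_id>"
    },
    "feature_extractor": {
      "feature_extractor_name": "text_extractor",
      "version": "v1",
      "input_mappings": {
        "text": "product_text"
      }
    }
  }'
```

Documents produced by this collection carry lineage back through the upstream collection to the original bucket object.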

Deterministic Output Schema

Collections compute an output_schema immediately upon creation based on:
  1. Field passthrough definitions
  2. Feature extractor output schema
You can query it via GET /v1/collections/{collection_id} before any documents exist, which makes it easy to wire up downstream systems in advance.
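A quick way to inspect the schema right after creation (the jq filter assumes the response exposes the schema under an output_schema key, consistent with the field name above):

```shell
curl -sS "$MP_API_URL/v1/collections/<collection_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  | jq '.output_schema'
```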

Multi-Tier Processing

Collections can reference other collections as sources to build decomposition trees:
Bucket (videos) ──▶ Collection A (frames)
                    └─▶ Collection B (scenes)
                           └─▶ Collection C (scene embeddings)
  • Tiered processing respects dependencies. Ray processes a tier only after upstream tiers finish.
  • Every document stores root_object_id, source_collection_id, source_document_id, lineage_path, and processing_tier.
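Putting the lineage fields together, a tier-2 document (a scene produced by Collection B above) might carry metadata shaped like this sketch; the ID values are illustrative:

```json
{
  "root_object_id": "obj_video_123",
  "source_collection_id": "col_frames_a",
  "source_document_id": "doc_frame_0042",
  "lineage_path": ["obj_video_123", "doc_frame_0042"],
  "processing_tier": 2
}
```

root_object_id always points at the original bucket object, while source_document_id points one tier up.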

Collection Operations

  • Get collection: GET /v1/collections/{id}
  • List collections: POST /v1/collections/list (filter by document count, taxonomy count, etc.)
  • Update taxonomy applications / metadata / enabled flag: PATCH /v1/collections/{id}
  • Delete collection: DELETE /v1/collections/{id}
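For example, a collection can be temporarily disabled via PATCH. This sketch assumes the request body mirrors the updatable fields listed above (the enabled flag); check the API reference for the exact body shape:

```shell
curl -sS -X PATCH "$MP_API_URL/v1/collections/<collection_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "enabled": false }'
```

Re-enabling is the same call with "enabled": true; taxonomy applications and metadata can be updated through the same endpoint.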

Documents Produced

Each document includes:
  • Passthrough metadata (metadata.category, metadata.brand, etc.)
  • Feature URI outputs (e.g., text_extractor_v1_embedding)
  • source_blobs referencing original object blobs
  • Optional document_blobs with extractor artifacts (thumbnails, intermediate results)
  • internal_metadata.processing_history when include_processing_history is set on batch submission
Use /v1/collections/{collection_id}/documents/list for structured filters, pagination, and presigned URLs, or /v1/retrievers for pipeline-based retrieval.
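A sketch of a filtered, paginated document listing. The filters and limit field names in the body are assumptions about the list-endpoint contract, not confirmed by this page; the endpoint path matches the one above:

```shell
curl -sS -X POST "$MP_API_URL/v1/collections/<collection_id>/documents/list" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "filters": { "metadata.category": "apparel" },
    "limit": 50
  }'
```

Passthrough fields such as metadata.category are what make structured filters like this possible, which is why exposing them at collection-creation time matters.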

Best Practices

  1. Keep collections focused—one extractor per collection (v2) encourages composable pipelines.
  2. Use passthrough fields to expose object metadata for taxonomy joins and retriever filters.
  3. Leverage source.type=collection for multi-tier workflows (e.g., video → frames → scenes).
  4. Attach taxonomies either as materialized enrichments or retriever stages, depending on latency and cost requirements.
  5. Monitor document counts with GET /v1/collections/{id} to plan Qdrant shard sizing.