Apply Taxonomy to Existing Documents

curl --request POST \
  --url https://api.mixpeek.com/v1/collections/{collection_identifier}/apply-taxonomy \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'X-Namespace: my-namespace' \
  --header 'Content-Type: application/json' \
  --data '
{
  "taxonomy_id": "<string>",
  "scroll_filters": {
    "must": [
      {
        "key": "metadata.category",
        "match": {
          "value": "products"
        }
      }
    ]
  },
  "batch_size": 1000,
  "parallelism": 4
}
'

{
  "task_id": "<string>",
  "status": "<string>",
  "collection_id": "<string>",
  "taxonomy_id": "<string>",
  "estimated_documents": 123
}

Apply a taxonomy to all existing documents in a collection retroactively.

This endpoint triggers distributed Ray processing to enrich existing documents with taxonomy data. Unlike automatic materialization (which happens during ingestion), this endpoint allows you to:

- Backfill enrichment for documents ingested before the taxonomy was created
- Re-apply taxonomy after configuration changes
- Process specific subsets using scroll_filters

⚙️ Processing Details:

- Uses Ray datasets with map_batches for parallel processing
- Scales horizontally across the Ray cluster
- Non-blocking: returns immediately with a task_id
- Monitor progress via the Tasks API
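
Because the call returns immediately, a client typically polls until the task reaches a final state. The sketch below shows one way to structure that loop. The Tasks API route is not documented on this page, so the status lookup is passed in as a hypothetical `fetch_status` callable, and treating "completed" and "failed" as the only terminal states is an assumption based on the status examples in the Response section.

```python
import time
from typing import Callable

# Assumed terminal states, taken from the status examples on this page.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_task(
    task_id: str,
    fetch_status: Callable[[str], str],
    interval_s: float = 2.0,
    max_polls: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task reports a terminal status, then return it.

    fetch_status is a caller-supplied function (hypothetical here) that
    looks up the current status string for a task_id via the Tasks API.
    """
    for _ in range(max_polls):
        status = fetch_status(task_id)
        if status in TERMINAL_STATUSES:
            return status
        sleep(interval_s)  # wait before asking again
    raise TimeoutError(f"task {task_id} did not finish after {max_polls} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without a live cluster.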

⚠️ Prerequisites:

- The taxonomy must exist and be valid
- The taxonomy must be in the collection's taxonomy_applications list
- The collection must contain documents

📊 Performance:

- Roughly 1,000 to 5,000 documents per second, depending on cluster size
- Parallel processing across multiple Ray workers
- Batch size and parallelism are configurable
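
The throughput range above gives a rough way to bound how long a backfill might take. The helper below is only back-of-the-envelope arithmetic: its name and parameters are illustrative, and the default rates simply restate the approximate figures quoted on this page, so actual durations will depend on the cluster.

```python
from typing import Tuple


def estimate_duration_seconds(
    document_count: int,
    min_rate: float = 1000.0,  # documents per second, lower bound quoted above
    max_rate: float = 5000.0,  # documents per second, upper bound quoted above
) -> Tuple[float, float]:
    """Return (best_case_seconds, worst_case_seconds) for a backfill."""
    if document_count < 0:
        raise ValueError("document_count must be non-negative")
    if min_rate <= 0 or max_rate < min_rate:
        raise ValueError("rates must satisfy 0 < min_rate <= max_rate")
    return (document_count / max_rate, document_count / min_rate)


best, worst = estimate_duration_seconds(1_000_000)
print(f"between {best:.0f}s and {worst:.0f}s")  # between 200s and 1000s
```

The estimated_documents value in the response, when present, is a natural input for document_count.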

🔍 Use Cases:

- Backfill: Apply new taxonomy to historical data
- Re-enrichment: Update after taxonomy changes
- Selective: Process filtered document subsets

See Collections API and Taxonomies API documentation for details.

POST /v1/collections/{collection_identifier}/apply-taxonomy

Headers

Authorization

string

REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.

Example:

"Bearer YOUR_API_KEY"

X-Namespace

string

REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'

Examples:

"ns_abc123def456"

"production"

"my-namespace"

Path Parameters

collection_identifier

string

required

Collection ID or name to apply taxonomy to
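
As an illustration of how the two headers and the path parameter fit together, the sketch below assembles the request with Python's standard library without sending it. The function name is hypothetical, and percent-encoding the identifier is a defensive assumption for collection names that contain characters that are not URL-safe.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.mixpeek.com/v1"


def build_apply_taxonomy_request(
    collection_identifier: str,
    taxonomy_id: str,
    api_key: str,
    namespace: str,
) -> urllib.request.Request:
    """Build (but do not send) the POST request described on this page."""
    # Percent-encode the identifier so names with spaces or slashes
    # still form a valid URL path segment.
    segment = urllib.parse.quote(collection_identifier, safe="")
    url = f"{BASE_URL}/collections/{segment}/apply-taxonomy"
    body = json.dumps({"taxonomy_id": taxonomy_id}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace,
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would send it; keeping construction separate makes the URL and headers easy to inspect first.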

Body

application/json

Request to apply a taxonomy to an existing collection.

This endpoint triggers retroactive taxonomy materialization on all documents in a collection using distributed Ray processing.

Use Cases:
- Apply taxonomy to documents that were ingested before the taxonomy was created
- Re-apply taxonomy after taxonomy configuration changes
- Backfill enrichment data for existing collections

Requirements:
- taxonomy_id: REQUIRED. Must be an existing, valid taxonomy
- The taxonomy must already be attached to the collection via taxonomy_applications
- Documents must exist in the collection

taxonomy_id

string

required

ID of the taxonomy to apply. REQUIRED. Must be an existing taxonomy (tax_*). The taxonomy must already be in the collection's taxonomy_applications list.

Examples:

"tax_abc123"

"tax_products"

scroll_filters

Scroll Filters · object

Optional Qdrant filters that limit which documents are enriched. If omitted, all documents in the collection are enriched. Use them to process specific subsets (e.g., documents missing enrichment).

Example:

{
  "must": [
    {
      "key": "metadata.category",
      "match": { "value": "products" }
    }
  ]
}
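
A filter of this shape can be generated from a plain mapping of field names to values. The helper below is a hypothetical convenience, not part of the API, and it covers only the must/key/match/value structure shown in the example above; other Qdrant filter clauses are out of scope for this sketch.

```python
from typing import Any, Dict, Mapping


def build_scroll_filters(exact_matches: Mapping[str, Any]) -> Dict[str, Any]:
    """Turn {field: value} pairs into a filter with one 'must' clause per pair."""
    return {
        "must": [
            {"key": key, "match": {"value": value}}
            for key, value in exact_matches.items()
        ]
    }


filters = build_scroll_filters({"metadata.category": "products"})
```

The resulting dictionary can be placed under scroll_filters in the request body.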

batch_size

integer

default: 1000

Number of documents to process in each parallel batch. Optional; defaults to 1000. Larger batches mean fewer Ray tasks but more memory per task; smaller batches mean more Ray tasks but less memory per task.

Required range: 100 <= x <= 5000

Examples:

1000

500

2000

parallelism

integer

default: 4

Number of parallel Ray workers to use for processing. Optional; defaults to 4. Higher parallelism means faster processing but uses more cluster resources. Set it based on available Ray cluster capacity.

Required range: 1 <= x <= 20

Examples:

4

8

2
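
The documented ranges for batch_size and parallelism can be checked on the client before the request is sent. The sketch below uses hypothetical helper names; the ranges and defaults are the ones listed above. The `batch_count` function assumes one Ray task per batch, which the batch_size description suggests but this page does not state outright.

```python
import math
from typing import Any, Dict, Optional

BATCH_SIZE_RANGE = (100, 5000)  # documented range for batch_size
PARALLELISM_RANGE = (1, 20)     # documented range for parallelism


def build_apply_taxonomy_body(
    taxonomy_id: str,
    batch_size: int = 1000,
    parallelism: int = 4,
    scroll_filters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Validate the tunable fields and return the JSON-ready request body."""
    if not taxonomy_id:
        raise ValueError("taxonomy_id is required")
    if not BATCH_SIZE_RANGE[0] <= batch_size <= BATCH_SIZE_RANGE[1]:
        raise ValueError(f"batch_size must be within {BATCH_SIZE_RANGE}")
    if not PARALLELISM_RANGE[0] <= parallelism <= PARALLELISM_RANGE[1]:
        raise ValueError(f"parallelism must be within {PARALLELISM_RANGE}")
    body: Dict[str, Any] = {
        "taxonomy_id": taxonomy_id,
        "batch_size": batch_size,
        "parallelism": parallelism,
    }
    if scroll_filters is not None:
        body["scroll_filters"] = scroll_filters  # omit the key to enrich everything
    return body


def batch_count(document_count: int, batch_size: int) -> int:
    """Number of batches needed, assuming one Ray task per batch."""
    return math.ceil(document_count / batch_size)
```

For 10,000 documents, `batch_count` gives 10 batches at a batch_size of 1000 and 20 at 500, which illustrates the trade-off between task count and memory per task.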

Response

Successful Response

Response from applying taxonomy to collection.

Returns the task identifier, its current status, and, when available, an estimate of the number of documents to process.

task_id

string

required

ID of the Ray task executing the materialization

status

string

required

Status of the materialization task

Examples:

"submitted"

"running"

"completed"

"failed"

collection_id

string

required

Collection ID where taxonomy is being applied

taxonomy_id

string

required

Taxonomy ID being applied

estimated_documents

integer | null

Estimated number of documents to process (if available)
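
On the client side, the response fields listed above map naturally onto a small typed record. The following is a minimal sketch with hypothetical class and function names; it treats the four fields marked required as mandatory and estimated_documents as nullable, as described on this page.

```python
import json
from dataclasses import dataclass
from typing import Optional

REQUIRED_FIELDS = ("task_id", "status", "collection_id", "taxonomy_id")


@dataclass(frozen=True)
class ApplyTaxonomyResponse:
    task_id: str
    status: str
    collection_id: str
    taxonomy_id: str
    estimated_documents: Optional[int] = None  # may be null or absent


def parse_apply_taxonomy_response(raw: str) -> ApplyTaxonomyResponse:
    """Parse the JSON response body and check that required fields are present."""
    data = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"response is missing required fields: {missing}")
    return ApplyTaxonomyResponse(
        task_id=data["task_id"],
        status=data["status"],
        collection_id=data["collection_id"],
        taxonomy_id=data["taxonomy_id"],
        estimated_documents=data.get("estimated_documents"),
    )
```

The task_id from the parsed record is the value to use when checking progress through the Tasks API.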
