Skip to main content
Taxonomies let you attach structured metadata to documents by matching them against a reference collection. They are implemented as retriever-powered joins and can run on demand or be materialized into collections.

Taxonomy Types

TypeStructureWhen to Use
FlatSingle-level reference collectionFace enrollment, entity linking, simple lookups
HierarchicalParent/child nodes with inheritanceOrg charts, product categories, multi-level labeling
Each node references a collection, retriever, and list of enrichment fields. Child nodes inherit parent properties automatically.

Flat Taxonomy: Product Catalog Recognition

Flat taxonomy showing multimodal documents matched against a product catalog reference collection
In a flat taxonomy, documents from any modality (video, image, audio, text) are matched against a single reference collection. Each document uses its appropriate feature embedding (CLIP for visual, text embeddings for audio transcripts) to find the best match. Enrichment fields (SKU, category, price) are attached when similarity exceeds the threshold.

Hierarchical Taxonomy: Media Content Classification

Hierarchical taxonomy showing media content classification with multiple levels from brand to campaign
In a hierarchical taxonomy, documents traverse multiple levels of progressive refinement. Starting from a broad brand classification (1 node), through content category (2 nodes), sport/style type (4 nodes), audience segmentation (5 nodes), to specific campaigns (6 nodes). Each level narrows the classification using different multimodal features—CLIP for brand detection, scene classification for categories, activity detection for sport types, demographic models for audiences, and campaign-specific patterns at the final level. Documents inherit all properties from parent nodes as they traverse down the tree.

Execution Modes

ModeDescriptionUse Case
on_demandEnrich documents at query time inside a retriever (taxonomy_enrich stage)Exploratory workflows, testing, dynamic reference data
materializeBatch enrichment after extraction; results persisted in the collectionProduction search, low-latency retrieval, analytics
retroactiveApply taxonomy to existing documents in a collectionBackfilling, taxonomy updates, schema migrations
Configure execution mode via a collection’s taxonomy_applications array or by adding a taxonomy stage to a retriever.

How Hierarchical Taxonomies Execute

Hierarchical taxonomies are executed like Common Table Expressions (CTEs) in SQL—each level builds on the results of the previous level, creating a recursive evaluation chain from root to leaf nodes.
Level 1 (Root)     →  Match against brand collection
    ↓ passes matched docs
Level 2            →  Match against category collection (filtered by L1 result)
    ↓ passes matched docs
Level 3            →  Match against subcategory collection (filtered by L2 result)
    ↓ passes matched docs
Level N (Leaf)     →  Final enrichment fields attached
At each level:
  1. Documents that matched the parent node are passed down
  2. The child node’s retriever executes against its reference collection
  3. Enrichment fields from matching nodes are accumulated
  4. Only documents exceeding the similarity threshold continue to child nodes
This CTE-style execution ensures that a document classified as “Nike → Athletic → Running” inherits enrichment fields from all three levels, not just the leaf node.

Application Methods

Hierarchical taxonomies can be applied through three methods:
MethodWhen It RunsUse Case
On-demandQuery time, as a retriever stageDynamic classification, A/B testing taxonomy versions, low-volume queries
MaterializedDuring collection processing (post-extraction)Production search requiring low latency, analytics dashboards
RetroactiveManually triggered via APIBackfilling existing documents, applying updated taxonomy versions
On-demand enrichment adds a taxonomy_enrich stage to your retriever pipeline:
{
  "stage_type": "apply",
  "stage_id": "taxonomy_enrich",
  "parameters": {
    "taxonomy_id": "tax_org_roles",
    "input_mappings": {
      "query_embedding": "{{DOC.mixpeek://face_detector@v2/face_embedding}}"
    }
  }
}
Materialized enrichment runs automatically after document extraction completes. Configure it in the collection’s taxonomy_applications:
{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_product_hierarchy",
      "execution_mode": "materialize"
    }
  ]
}
Retroactive enrichment applies a taxonomy to documents already in the collection:
curl -sS -X POST "$MP_API_URL/v1/collections/<collection_id>/apply-taxonomy" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_id": "tax_product_hierarchy",
    "filter": {
      "field": "metadata.needs_reclassification",
      "operator": "eq",
      "value": true
    }
  }'
Use retroactive application when:
  • You’ve updated a taxonomy and need to reclassify existing documents
  • You’re migrating from a flat taxonomy to a hierarchical one
  • You’ve added new reference data to taxonomy collections

Internals: JOIN Stage

Taxonomies reuse the join@v1 stage under the hood:
  • Direct join – key-based match (join_type: "direct").
  • Retriever join – similarity match using a nested retriever (join_type: "retriever").
  • Join strategiesreplace, enrich, left, or append control how fields merge.
Parallel execution (asyncio.gather) makes retrieval joins 10–50× faster than sequential lookups.

Create a Flat Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "employee_faces",
    "taxonomy_type": "flat",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "source_collection": {
      "collection_id": "col_employee_embeddings",
      "enrichment_fields": [
        { "field_path": "metadata.name", "merge_mode": "enrich" },
        { "field_path": "metadata.department", "merge_mode": "enrich" }
      ]
    }
  }'

Create a Hierarchical Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "org_roles",
    "taxonomy_type": "hierarchical",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "hierarchy": [
      {
        "node_id": "employees",
        "collection_id": "col_employee_embeddings",
        "enrichment_fields": ["metadata.employee_id", "metadata.department"]
      },
      {
        "node_id": "executives",
        "collection_id": "col_executives",
        "retriever_id": "ret_executive_face",
        "parent_node_id": "employees",
        "enrichment_fields": ["metadata.executive_level", "metadata.budget_authority"]
      }
    ]
  }'
Hierarchical nodes inherit parent enrichment properties; children can override or extend them.

Attach to a Collection

{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_employee_faces",
      "execution_mode": "materialize"
    },
    {
      "taxonomy_id": "tax_org_roles",
      "execution_mode": "on_demand"
    }
  ]
}
  • Materialized enrichment updates documents ~30 seconds after ingestion completes (debounced to avoid thrashing).
  • On-demand enrichment keeps documents untouched; retrievers call the taxonomy join at query time.

Test On Demand

curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "source_documents": [
      {
        "document_id": "doc_scene_123",
        "mixpeek://face_detector@v2/face_embedding": [0.12, 0.34, ...]
      }
    ],
    "mode": "on_demand"
  }'

Inference Strategies

  • Manual – Define nodes explicitly (IDs, collections, retrievers).
  • Schema-based – Infer nodes from existing collection schemas (planned).
  • Cluster-based – Create nodes from clustering output.
  • LLM-based – Generate hierarchical structure from sample documents.
Combining strategies is encouraged: bootstrap via inference, then fine-tune manually.

Monitoring

  • List taxonomies: POST /v1/taxonomies/list
  • Inspect hierarchy and node metadata: GET /v1/taxonomies/{id}?expand_nodes=true
  • Track materialized enrichment progress via webhook events (collection.documents.written)
  • Use retriever analytics to ensure taxonomy stages don’t dominate latency.

Best Practices

  1. Start flat for quick wins; layer hierarchies once value is proven.
  2. Keep enrichment minimal—copy only fields needed at query time.
  3. Cache taxonomy stages in retrievers when reference collections rarely change.
  4. Version taxonomies (via snapshots) before major structural changes.
  5. Combine with clusters to discover candidate nodes and measure coverage.
Taxonomies let you inject domain knowledge into multimodal search—link documents to canonical entities without relying on brittle key joins.