Skip to main content
Taxonomies let you attach structured metadata to documents by matching them against a reference collection. They are implemented as retriever-powered joins and can run on demand or be materialized into collections.

Taxonomy Types

TypeStructureWhen to Use
FlatSingle-level reference collectionFace enrollment, entity linking, simple lookups
HierarchicalParent/child nodes with inheritanceOrg charts, product categories, multi-level labeling
Each node references a collection, retriever, and list of enrichment fields. Child nodes inherit parent properties automatically.

Execution Modes

ModeDescriptionUse Case
on_demandEnrich documents at query time inside a retriever (taxonomy@v1 stage)Exploratory workflows, testing, dynamic reference data
materializeBatch enrichment after extraction; results persisted in the collectionProduction search, low-latency retrieval, analytics
Configure execution mode via a collection’s taxonomy_applications array or by adding a taxonomy stage to a retriever.

Internals: JOIN Stage

Taxonomies reuse the join@v1 stage under the hood:
  • Direct join – key-based match (join_type: "direct").
  • Retriever join – similarity match using a nested retriever (join_type: "retriever").
  • Join strategiesreplace, enrich, left, or append control how fields merge.
Parallel execution (asyncio.gather) makes retrieval joins 10–50× faster than sequential lookups.

Create a Flat Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "employee_faces",
    "taxonomy_type": "flat",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "source_collection": {
      "collection_id": "col_employee_embeddings",
      "enrichment_fields": [
        { "field_path": "metadata.name", "merge_mode": "enrich" },
        { "field_path": "metadata.department", "merge_mode": "enrich" }
      ]
    }
  }'

Create a Hierarchical Taxonomy

curl -sS -X POST "$MP_API_URL/v1/taxonomies" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "taxonomy_name": "org_roles",
    "taxonomy_type": "hierarchical",
    "retriever_id": "ret_face_matcher",
    "input_mappings": {
      "query_embedding": "mixpeek://face_detector@v2/face_embedding"
    },
    "hierarchy": [
      {
        "node_id": "employees",
        "collection_id": "col_employee_embeddings",
        "enrichment_fields": ["metadata.employee_id", "metadata.department"]
      },
      {
        "node_id": "executives",
        "collection_id": "col_executives",
        "retriever_id": "ret_executive_face",
        "parent_node_id": "employees",
        "enrichment_fields": ["metadata.executive_level", "metadata.budget_authority"]
      }
    ]
  }'
Hierarchical nodes inherit parent enrichment properties; children can override or extend them.

Attach to a Collection

{
  "taxonomy_applications": [
    {
      "taxonomy_id": "tax_employee_faces",
      "execution_mode": "materialize"
    },
    {
      "taxonomy_id": "tax_org_roles",
      "execution_mode": "on_demand"
    }
  ]
}
  • Materialized enrichment updates documents ~30 seconds after ingestion completes (debounced to avoid thrashing).
  • On-demand enrichment keeps documents untouched; retrievers call the taxonomy join at query time.

Test On Demand

curl -sS -X POST "$MP_API_URL/v1/taxonomies/<taxonomy_id>/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "source_documents": [
      {
        "document_id": "doc_scene_123",
        "mixpeek://face_detector@v2/face_embedding": [0.12, 0.34, ...]
      }
    ],
    "mode": "on_demand"
  }'

Inference Strategies

  • Manual – Define nodes explicitly (IDs, collections, retrievers).
  • Schema-based – Infer nodes from existing collection schemas (planned).
  • Cluster-based – Create nodes from clustering output.
  • LLM-based – Generate hierarchical structure from sample documents.
Combining strategies is encouraged: bootstrap via inference, then fine-tune manually.

Monitoring

  • List taxonomies: POST /v1/taxonomies/list
  • Inspect hierarchy and node metadata: GET /v1/taxonomies/{id}?expand_nodes=true
  • Track materialized enrichment progress via webhook events (collection.documents.written)
  • Use retriever analytics to ensure taxonomy stages don’t dominate latency.

Best Practices

  1. Start flat for quick wins; layer hierarchies once value is proven.
  2. Keep enrichment minimal—copy only fields needed at query time.
  3. Cache taxonomy stages in retrievers when reference collections rarely change.
  4. Version taxonomies (via snapshots) before major structural changes.
  5. Combine with clusters to discover candidate nodes and measure coverage.
Taxonomies let you inject domain knowledge into multimodal search—link documents to canonical entities without relying on brittle key joins.