Taxonomies in Mixpeek allow you to classify, organize, and enrich your content with structured metadata, functioning as the multimodal analogue to SQL JOIN operations.

Overview

A taxonomy is a specialized collection with a defined schema that you use to classify, organize, and enrich your data with structured metadata. It works like a JOIN in a traditional database, but it matches on feature similarity rather than exact keys, which makes it well suited to multimodal content. Retrievers associated with the taxonomy handle the logic for looking up reference data and enriching documents.
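
To make the JOIN analogy concrete, here is a minimal sketch of a similarity-based join in plain Python. It does not use the Mixpeek SDK; the function names, the `embedding` field, and the sample data are hypothetical, and cosine similarity with a 0.7 threshold is only an assumed matching rule.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_join(document, reference_docs, enrichment_fields, threshold=0.7):
    """Copy enrichment fields from the best reference scoring at or above the threshold."""
    best, best_score = None, threshold
    for ref in reference_docs:
        score = cosine_similarity(document["embedding"], ref["embedding"])
        if score >= best_score:
            best, best_score = ref, score
    enriched = dict(document)
    if best is not None:
        for field in enrichment_fields:
            if field in best:
                enriched[field] = best[field]
    return enriched

# Hypothetical reference data standing in for a source collection.
reference_docs = [
    {"embedding": [1.0, 0.0], "category": "Footwear", "department": "Apparel"},
    {"embedding": [0.0, 1.0], "category": "Laptops", "department": "Electronics"},
]
product = {"document_id": "doc_1", "embedding": [0.9, 0.1]}
enriched = similarity_join(product, reference_docs, ["category", "department"])
print(enriched["category"], enriched["department"])
```

Unlike an exact-key JOIN, a document with no sufficiently similar reference simply comes back without the enrichment fields.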

1. Select Taxonomy Type

Choose between flat or hierarchical taxonomy structure based on your organizational needs and data relationships.

2. Select Source Collection

Select the collection containing the reference data that will be used to enrich your documents.

3. Select Enrichment Fields

Choose which fields from the source collection will be added to enrich your target documents.

4. Select Retriever

Choose the retriever that will match documents from your target collection with the appropriate reference data.

5. Configure Input Fields

Select the input fields from the source collection for the retriever. These fields define what data is used for matching, and can include constants.

Content Classification

Create organized classification systems to categorize your multimodal content

Data Enrichment

Enrich documents with additional metadata based on their characteristics

Types of Taxonomies

Mixpeek supports two main types of taxonomies: flat and hierarchical.

Creating a Taxonomy

Flat Taxonomy

from mixpeek import Mixpeek

mp = Mixpeek(api_key="YOUR_API_KEY")

# Create a flat taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="product_categories",
    description="Product category classification",
    taxonomy_type="flat",
    retriever={
        "retriever_id": "ret_def456",  # Existing retriever for matching
        "threshold": 0.7               # Similarity threshold
    },
    source_collections=[
        {
            "collection_id": "col_categories",
            "enrichment_fields": ["category", "department", "tax_rate"]
        }
    ]
)

taxonomy_id = taxonomy["taxonomy_id"]
print(f"Created taxonomy: {taxonomy_id}")

Hierarchical Taxonomy

# Create a hierarchical taxonomy (reuses the `mp` client from the flat taxonomy example)
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="organization_structure",
    description="Company organizational structure",
    taxonomy_type="hierarchical",
    retriever={
        "retriever_id": "ret_ghi789"
    },
    hierarchical_config={
        "collection_nodes": [
            {
                "collection_id": "col_people",
                "parent_collection_id": None,  # Top-level collection
                "enrichment_fields": ["name", "basic_access"]
            },
            {
                "collection_id": "col_employees",
                "parent_collection_id": "col_people",  # Child of people
                "enrichment_fields": ["employee_id", "department"]
            },
            {
                "collection_id": "col_executives",
                "parent_collection_id": "col_employees",  # Child of employees
                "enrichment_fields": ["executive_level", "budget_authority"]
            }
        ]
    }
)

Applying Taxonomies

Once you’ve created a taxonomy, you can apply it to enrich documents in your collections.

Materialization Options

When applying taxonomies, you have two main materialization options:

Materialized

Creates enriched documents in a specified output collection

Benefits:

  • Faster query performance
  • Pre-computed enrichments
  • Historical enrichment tracking

Considerations:

  • Requires additional storage
  • Needs re-application when taxonomy changes

On-Demand

Computes enrichments during query execution

Benefits:

  • Always uses latest taxonomy
  • No duplicate storage required
  • Automatic updates with taxonomy changes

Considerations:

  • Higher query-time compute costs
  • Potentially slower query performance
  • Cannot track historical enrichment changes
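
The trade-off between the two options can be illustrated with a small plain-Python sketch. It is not Mixpeek code: the `enrich` helper, the `sku_prefix` field, and the dictionary-based taxonomy are hypothetical stand-ins, with an exact lookup replacing retriever matching to keep the example short.

```python
def enrich(document, taxonomy):
    """Hypothetical enrichment step; an exact lookup stands in for retriever matching."""
    extra = taxonomy.get(document["sku_prefix"], {})
    return {**document, **extra}

documents = [{"document_id": "doc_1", "sku_prefix": "SH"}]
taxonomy = {"SH": {"category": "Footwear"}}

# Materialized: enrichments are computed once and stored in an output collection.
materialized = [enrich(doc, taxonomy) for doc in documents]

# The taxonomy changes after materialization.
taxonomy["SH"] = {"category": "Shoes"}

# On-demand: enrichments are computed at query time from the current taxonomy.
on_demand = [enrich(doc, taxonomy) for doc in documents]

print(materialized[0]["category"])  # stored value, unchanged until re-applied
print(on_demand[0]["category"])     # reflects the latest taxonomy
```

The stored copy keeps the old value until the taxonomy is re-applied, while the on-demand result pays the enrichment cost on every query.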

Taxonomy Node Structure

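In a hierarchical taxonomy, each node is a document in one of the taxonomy's collections. A node carries the properties inherited from its ancestor collections, its own level-specific properties, and the embedding used for matching: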

// Example: Executive node in a hierarchical taxonomy
{
  "document_id": "doc_abc123",
  "collection_id": "col_executives",
  // Person properties (inherited from col_people)
  "name": "Jane Smith",
  "basic_access": true,
  
  // Employee properties (inherited from col_employees)
  "employee_id": "E12345",
  "department": "Marketing",
  
  // Executive-specific properties
  "executive_level": "VP",
  "budget_authority": 5000000,
  
  // Embedding for matching
  "face_embedding": [0.1, 0.2, 0.3, ...]
}

Property Inheritance

In hierarchical taxonomies, properties are inherited from parent to child nodes:

  • Child nodes inherit all properties from their parent nodes
  • Child nodes can override inherited properties with their own values
  • Inheritance follows the collection hierarchy defined in the taxonomy
  • Explicit hierarchies use defined parent-child links for inheritance. Implicit relationships pass enrichment data in a similar way between documents that are linked during retrieval (for example, a Scene document inheriting details from a matched Face document).
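
The inheritance rules above can be sketched as a walk up the parent chain followed by a root-first merge. This is an illustrative plain-Python sketch, not the Mixpeek implementation: `resolve_properties`, the `parents` map, and the `access_level` field are hypothetical, and properties are stored per collection only to keep the example small.

```python
def resolve_properties(collection_id, parents, values):
    """Merge properties from the root down so child values override inherited ones."""
    chain = []
    current = collection_id
    while current is not None:
        if current in chain:
            raise ValueError(f"cycle detected at {current}")
        chain.append(current)
        current = parents[current]
    resolved = {}
    for cid in reversed(chain):  # root first, requested collection last
        resolved.update(values.get(cid, {}))
    return resolved

# Parent links mirroring the people -> employees -> executives hierarchy.
parents = {
    "col_people": None,
    "col_employees": "col_people",
    "col_executives": "col_employees",
}
# Hypothetical property values defined at each level.
values = {
    "col_people": {"name": "Jane Smith", "access_level": "basic"},
    "col_employees": {"employee_id": "E12345", "department": "Marketing"},
    "col_executives": {"executive_level": "VP", "access_level": "elevated"},
}

executive = resolve_properties("col_executives", parents, values)
print(executive)
```

Because the merge runs root first, the executive level keeps the inherited `name` and `department` but replaces the inherited `access_level` with its own value.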

Enrichment and Retrieval Processes

The way taxonomies enrich and retrieve data differs between explicit and implicit hierarchies:

Comparison Table

Feature           | Explicit Hierarchy                      | Implicit Hierarchy
------------------|-----------------------------------------|----------------------------------------
Defined by        | Schema (parent, children, etc.)         | Data behavior or retriever chaining
Enrichment        | Rule-based, tree traversal              | Dynamic, input-driven enrichment chains
Retrieval         | IN subtree("X"), static filters         | Chained stages (face → scene → cluster)
Query Composition | One query with expanded filters         | Multi-step query pipeline
Examples          | Category trees, controlled vocabularies | Face → Scene → Episode, video tags
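
For the explicit case, "one query with expanded filters" can be pictured as expanding a subtree into a flat membership filter before the query runs. The sketch below is plain Python under stated assumptions: the `subtree` helper and the category names are hypothetical, and the `$in` operator is assumed by analogy with the `$exists` operator used elsewhere on this page rather than confirmed filter syntax.

```python
def subtree(root, children):
    """Collect a node and all of its descendants from a parent -> children map."""
    found, stack = [], [root]
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found

# Hypothetical category tree.
children = {
    "electronics": ["laptops", "phones"],
    "laptops": ["ultrabooks"],
}

# Expand subtree("electronics") into a single static filter.
category_filter = {"category": {"$in": sorted(subtree("electronics", children))}}
print(category_filter)
```

An implicit hierarchy has no such tree to expand, so the equivalent work happens as separate retrieval stages instead of one expanded filter.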

Using Enriched Documents

Once you’ve applied a taxonomy, you can use the enriched fields in searches and filters:

# Search using enriched taxonomy fields
results = mp.retrievers.execute(
    retriever_id="ret_jkl012",
    query={
        "text": "marketing presentation"
    },
    filters={
        "department": "Marketing",  # Taxonomy-enriched field
        "executive_level": {"$exists": True}  # Only match executive content
    }
)

Example Use Cases

Create a product taxonomy to automatically classify products and inherit category properties:

# Create product category taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="product_hierarchy",
    description="Product category hierarchy",
    taxonomy_type="hierarchical",
    retriever={
        "retriever_id": "ret_product_matcher"
    },
    hierarchical_config={
        "collection_nodes": [
            {
                "collection_id": "col_categories",
                "parent_collection_id": None,
                "enrichment_fields": ["category", "tax_group"]
            },
            {
                "collection_id": "col_subcategories",
                "parent_collection_id": "col_categories",
                "enrichment_fields": ["subcategory", "warranty_policy"]
            },
            {
                "collection_id": "col_product_types",
                "parent_collection_id": "col_subcategories",
                "enrichment_fields": ["product_type", "return_window"]
            }
        ]
    }
)

This creates a three-level product hierarchy where each level inherits properties from its parent.

Best Practices

1. Design Thoughtful Hierarchies

For hierarchical taxonomies, carefully plan the levels and inheritance relationships to avoid redundancy and maximize usability.

2. Optimize Retriever Performance

Create efficient retrievers for taxonomy matching, focusing on the most distinctive features for each node type.

3. Consider Materialization Strategy

Choose between materialized and on-demand enrichment based on your query patterns, update frequency, and storage constraints.

4. Test with Representative Data

Validate taxonomy performance with representative test data before applying to your full collection.

Taxonomies with many nodes or complex hierarchies can impact application performance. Optimize your taxonomy structure and retriever configuration for the best balance of accuracy and performance.

Implementation Patterns

Dynamic Classification

Apply taxonomies to automatically classify new content as it’s ingested, using a pipeline hook to trigger taxonomy application.

Enriched Search

Use taxonomies to enrich documents with additional metadata that can be leveraged for more precise filtering and faceting in search.

Hierarchical Navigation

Create user interfaces that leverage hierarchical taxonomy structures for browsing and navigating content collections.

Compliance Tagging

Use taxonomies to automatically apply compliance or policy tags to content based on its characteristics.

API Reference

For complete details on working with taxonomies, see our Taxonomies API Reference.

Retriever Input Sources

The inputs required by a taxonomy’s retriever can be sourced at different points in the data lifecycle:

  • During Ingestion (Feature Extraction):
    • Raw uploaded data: Directly from file contents (text, image frames, audio segments).
    • Output of previous extractors: Using results from other processing steps (e.g., text from OCR used for entity recognition).
    • Static metadata: Information associated with the source document (filename, timestamp, user-provided tags).
  • During Query / Retrieval:
    • User query inputs: Text, filters, or embeddings provided in the search request.
    • Intermediate outputs: Results from earlier stages in a multi-stage retrieval pipeline (e.g., a face_id found by one retriever stage is used as input for a scene taxonomy retriever).

Query pipelines can be configured to dynamically wire these inputs and outputs together, enabling complex, chained enrichment and retrieval workflows.
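
One way to picture this wiring is a small pipeline runner in which each stage declares where its inputs come from, either the original query or the output of an earlier stage. This is a plain-Python sketch, not Mixpeek's pipeline API: `run_pipeline`, `find_face`, `lookup_scene`, and all identifiers and sample values are hypothetical.

```python
def run_pipeline(stages, query_inputs):
    """Run stages in order, feeding each one from the shared context."""
    context = dict(query_inputs)
    for stage in stages:
        # Map each stage parameter to a value already present in the context.
        kwargs = {param: context[source] for param, source in stage["inputs"].items()}
        context.update(stage["run"](**kwargs))
    return context

def find_face(image_embedding):
    """Stand-in for a face retriever stage; always returns the same match."""
    return {"face_id": "face_7"}

def lookup_scene(face_id):
    """Stand-in for a scene taxonomy retriever keyed by face_id."""
    scenes = {"face_7": {"scene_id": "scene_3", "episode": "S01E02"}}
    return scenes.get(face_id, {})

stages = [
    # Input taken from the user query.
    {"run": find_face, "inputs": {"image_embedding": "query_embedding"}},
    # Input taken from the previous stage's output.
    {"run": lookup_scene, "inputs": {"face_id": "face_id"}},
]

result = run_pipeline(stages, {"query_embedding": [0.1, 0.2, 0.3]})
print(result["scene_id"], result["episode"])
```

Keeping the input mapping declarative means a stage can be rewired to a different source without changing the stage function itself.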