Taxonomies

Taxonomies in Mixpeek allow you to classify, organize, and enrich your content with structured metadata, functioning as the multimodal analogue to SQL JOIN operations.

Overview

Taxonomies are specialized structures for classifying, organizing, and enriching your data with structured metadata. They work like JOIN operations in traditional databases but match on feature similarity rather than exact keys, which makes them well suited to multimodal content. Conceptually, each taxonomy is a collection with a defined schema, and its associated retrievers handle the logic for looking up and enriching documents against it.

Select Taxonomy Type

Choose between flat or hierarchical taxonomy structure based on your organizational needs and data relationships.

Select Source Collection

Select the collection containing the reference data that will be used to enrich your documents.

Select Enrichment Fields

Choose which fields from the source collection will be added to enrich your target documents.

Select Retriever

Choose the retriever that will match documents from your target collection with the appropriate reference data.

Configure Input Fields

Define how the retriever gets its input data. This might involve mapping fields from source data, using outputs from previous processing steps, or incorporating query-time information.

Content Classification

Create organized classification systems to categorize your multimodal content

Data Enrichment

Enrich documents with additional metadata based on their characteristics

Types of Taxonomies

Mixpeek supports two main types of taxonomies:

Flat Taxonomies

Flat taxonomies are simple, single-level classification systems that enrich documents with metadata from a reference collection. Key characteristics:

Function as a join operation between collections
Add metadata or tags from one collection to another
No hierarchical relationships between categories

Use Cases:

Content & Topic Tagging

Product Attribute Enrichment

Compliance & Safety Classification

Example: Content Tagging

Create a flat taxonomy for automatically tagging content with relevant topics:

from mixpeek import Mixpeek

mp = Mixpeek(api_key="YOUR_API_KEY")

# Create content tagging taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="content_topics",
    description="Content topic classification",
    taxonomy_type="flat",
    retriever={
        "retriever_id": "ret_topic_matcher",
        "threshold": 0.6
    },
    source_collections=[
        {
            "collection_id": "col_topics",
            "enrichment_fields": ["topic", "subtopics", "audience_level"]
        }
    ]
)

Apply this taxonomy to automatically tag articles, blog posts, or other content with relevant topics.
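
To make the join analogy concrete, here is a minimal local sketch of what applying a flat taxonomy amounts to conceptually. It is not Mixpeek's implementation. It assumes every document and reference entry carries an embedding, uses cosine similarity as the match score, and copies the configured enrichment fields only when the best score clears the threshold. The helper names (`cosine`, `enrich_flat`) and the toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enrich_flat(doc, reference, enrichment_fields, threshold):
    """Copy enrichment fields from the most similar reference entry,
    but only if its similarity score clears the threshold."""
    best = max(reference, key=lambda ref: cosine(doc["embedding"], ref["embedding"]))
    score = cosine(doc["embedding"], best["embedding"])
    if score < threshold:
        return dict(doc)  # no match: return an unenriched copy
    enriched = dict(doc)
    enriched.update({f: best[f] for f in enrichment_fields if f in best})
    return enriched

# Toy reference collection (stand-in for "col_topics")
topics = [
    {"topic": "machine_learning", "audience_level": "advanced", "embedding": [0.9, 0.1, 0.0]},
    {"topic": "cooking", "audience_level": "beginner", "embedding": [0.0, 0.2, 0.9]},
]
article = {"title": "Intro to transformers", "embedding": [0.8, 0.2, 0.1]}
unrelated = {"title": "Tax filing tips", "embedding": [0.1, 0.9, 0.1]}

fields = ["topic", "audience_level"]
tagged = enrich_flat(article, topics, fields, threshold=0.6)
untagged = enrich_flat(unrelated, topics, fields, threshold=0.6)

print(tagged["topic"])      # machine_learning
print("topic" in untagged)  # False
```

Unlike an exact-key join, the second document matches nothing because its best similarity score falls below the threshold, so it passes through unenriched.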

Hierarchical Taxonomies

Hierarchical taxonomies organize categories in parent-child relationships within a single taxonomy, allowing for more complex classification systems. This is often referred to as intra-taxonomy hierarchy. Key characteristics:

Multi-level organization with parent-child relationships
Property inheritance from parent to child nodes
Support for complex nested categorization

Hierarchies can be implemented in two ways:

Explicit Hierarchy
- Parent-child relationships defined directly in configuration
- Clear, intentional structure specification
- Recommended for complex hierarchies with well-defined levels
Implicit Hierarchy
- Relationships are inferred dynamically based on data behavior, shared identifiers, feature similarity (like embeddings), or through the chaining of retrievers across different collections.
- This is more flexible but requires careful design of features and retriever logic.
- It enables powerful inter-taxonomy relationships (e.g., linking Faces to Scenes), effectively creating joins across different types of data.

Use Cases:

Multi-Level Data Organization

Model complex, structured relationships like product catalogs (Electronics > TV & Home Theater > TVs > OLED TVs), organizational charts (Company > Department > Team > Employee), or scientific classifications (Biology > Zoology > Mammalia). Properties can be inherited down the hierarchy.

Example 1: Product Categorization

Create a product taxonomy to automatically classify products and inherit category properties:

# Create product category taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="product_hierarchy",
    description="Product category hierarchy",
    taxonomy_type="hierarchical",
    retriever={
        "retriever_id": "ret_product_matcher"
    },
    hierarchical_config={
        "collection_nodes": [
            {
                "collection_id": "col_categories",
                "parent_collection_id": None,
                "enrichment_fields": ["category", "tax_group"]
            },
            {
                "collection_id": "col_subcategories",
                "parent_collection_id": "col_categories",
                "enrichment_fields": ["subcategory", "warranty_policy"]
            },
            {
                "collection_id": "col_product_types",
                "parent_collection_id": "col_subcategories",
                "enrichment_fields": ["product_type", "return_window"]
            }
        ]
    }
)

This creates a three-level product hierarchy where each level inherits properties from its parent.

Example 2: Personnel Recognition & Hierarchy

Create a hierarchical taxonomy for recognizing faces and applying appropriate metadata based on roles within an organization:

# Create face recognition taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="personnel_faces",
    description="Personnel face recognition",
    taxonomy_type="hierarchical",
    retriever={
        "retriever_id": "ret_face_matcher"
    },
    hierarchical_config={
        "collection_nodes": [
            {
                "collection_id": "col_people",
                "parent_collection_id": None,
                "enrichment_fields": ["name", "id_verified"]
            },
            {
                "collection_id": "col_employees",
                "parent_collection_id": "col_people",
                "enrichment_fields": ["employee_id", "department", "title"]
            },
            {
                "collection_id": "col_visitors",
                "parent_collection_id": "col_people",
                "enrichment_fields": ["visitor_type", "access_level"]
            }
        ]
    }
)

This taxonomy can be used to automatically identify and tag people in images or videos based on their organizational role.

Linking Disparate Datasets

Creating a Taxonomy

Flat Taxonomy

from mixpeek import Mixpeek

mp = Mixpeek(api_key="YOUR_API_KEY")

# Create a flat taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="product_categories",
    description="Product category classification",
    taxonomy_type="flat",
    retriever={
        "retriever_id": "ret_def456",  # Existing retriever for matching
        "threshold": 0.7               # Similarity threshold
    },
    source_collections=[
        {
            "collection_id": "col_categories",
            "enrichment_fields": ["category", "department", "tax_rate"]
        }
    ]
)

taxonomy_id = taxonomy["taxonomy_id"]
print(f"Created taxonomy: {taxonomy_id}")

Hierarchical Taxonomy

# Create a hierarchical taxonomy
taxonomy = mp.taxonomies.create(
    namespace_id="ns_abc123",
    name="organization_structure",
    description="Company organizational structure",
    taxonomy_type="hierarchical",
    retriever={
        "retriever_id": "ret_ghi789"
    },
    hierarchical_config={
        "collection_nodes": [
            {
                "collection_id": "col_people",
                "parent_collection_id": None,  # Top-level collection
                "enrichment_fields": ["name", "basic_access"]
            },
            {
                "collection_id": "col_employees",
                "parent_collection_id": "col_people",  # Child of people
                "enrichment_fields": ["employee_id", "department"]
            },
            {
                "collection_id": "col_executives",
                "parent_collection_id": "col_employees",  # Child of employees
                "enrichment_fields": ["executive_level", "budget_authority"]
            }
        ]
    }
)
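
Before sending a `hierarchical_config`, it can help to check locally that the `collection_nodes` list forms a valid tree. The sketch below is a hypothetical helper, not part of the Mixpeek SDK. It assumes the node shape shown above (`collection_id`, `parent_collection_id`) and reports each collection's depth, raising an error on duplicate IDs, unknown parents, or cycles.

```python
def validate_collection_nodes(nodes):
    """Return {collection_id: depth}.

    Raises ValueError on duplicate IDs, unknown parents, or cycles.
    """
    parents = {}
    for node in nodes:
        cid = node["collection_id"]
        if cid in parents:
            raise ValueError(f"duplicate collection_id: {cid}")
        parents[cid] = node["parent_collection_id"]

    depths = {}
    for cid in parents:
        seen = set()
        depth, current = 0, cid
        # Walk up to the root, counting levels along the way
        while parents[current] is not None:
            if current in seen:
                raise ValueError(f"cycle detected at: {current}")
            seen.add(current)
            parent = parents[current]
            if parent not in parents:
                raise ValueError(f"unknown parent: {parent}")
            current = parent
            depth += 1
        depths[cid] = depth
    return depths

collection_nodes = [
    {"collection_id": "col_people", "parent_collection_id": None},
    {"collection_id": "col_employees", "parent_collection_id": "col_people"},
    {"collection_id": "col_executives", "parent_collection_id": "col_employees"},
]
depths = validate_collection_nodes(collection_nodes)
print(depths)  # {'col_people': 0, 'col_employees': 1, 'col_executives': 2}
```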

Applying Taxonomies

Once you’ve created a taxonomy, you can apply it to enrich documents in your collections.

Materialization Options

When applying taxonomies, you have two main materialization options:

Materialized

Creates enriched documents in a specified output collection.

Benefits:

Faster query performance
Pre-computed enrichments
Historical enrichment tracking

Considerations:

Requires additional storage
Needs re-application when taxonomy changes

On-Demand

Computes enrichments during query execution.

Benefits:

Always uses latest taxonomy
No duplicate storage required
Automatic updates with taxonomy changes

Considerations:

Higher query-time compute costs
Potentially slower query performance
Cannot track historical enrichment changes
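
The trade-off between the two options can be illustrated with a small local sketch. This is a conceptual model only: an exact-key lookup stands in for the retriever match, and the names (`enrich`, `query_on_demand`, `category_key`) are hypothetical. It shows why materialized output goes stale after a taxonomy change while on-demand enrichment does not.

```python
def enrich(doc, taxonomy):
    """Attach the taxonomy entry for the document's category key.
    (An exact-key lookup stands in for a retriever match here.)"""
    return {**doc, **taxonomy.get(doc["category_key"], {})}

taxonomy = {"tv": {"department": "Electronics"}}
docs = [{"document_id": "doc_1", "category_key": "tv"}]

# Materialized: compute once and store enriched copies in an output collection.
materialized = [enrich(d, taxonomy) for d in docs]

# On-demand: keep only the raw documents and enrich at query time.
def query_on_demand():
    return [enrich(d, taxonomy) for d in docs]

# The taxonomy changes after materialization.
taxonomy["tv"] = {"department": "Home Entertainment"}

print(materialized[0]["department"])       # Electronics (stale until re-applied)
print(query_on_demand()[0]["department"])  # Home Entertainment
```

The materialized copy answers queries without recomputation but needs re-application to pick up the change. The on-demand path pays the enrichment cost on every query.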

Taxonomy Node Structure

In a hierarchical taxonomy, nodes are documents in collections with specific structures:

// Example: Executive node in a hierarchical taxonomy
{
  "document_id": "doc_abc123",
  "collection_id": "col_executives",
  // Person properties (inherited from col_people)
  "name": "Jane Smith",
  "basic_access": true,
  
  // Employee properties (inherited from col_employees)
  "employee_id": "E12345",
  "department": "Marketing",
  
  // Executive-specific properties
  "executive_level": "VP",
  "budget_authority": 5000000,
  
  // Embedding for matching
  "face_embedding": [0.1, 0.2, 0.3, ...]
}

Property Inheritance

In hierarchical taxonomies, properties are inherited from parent to child nodes:

Child nodes inherit all properties from their parent nodes
Child nodes can override inherited properties with their own values
Inheritance follows the collection hierarchy defined in the taxonomy
While explicit hierarchies use defined parent-child links for inheritance, implicit relationships facilitate a similar flow of enrichment data between linked documents identified during the retrieval process (e.g., a Scene document inheriting details from a matched Face document).
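
The inheritance rules for explicit hierarchies can be sketched as a walk up the parent chain followed by a root-first merge, so that values defined nearer the child win. This is a simplified, collection-level illustration under stated assumptions rather than Mixpeek's actual resolution logic. The `resolve_properties` helper and the `badge_color` property are hypothetical.

```python
def resolve_properties(collection_id, parents, properties):
    """Merge properties from the root down to collection_id.
    Levels nearer the child override levels nearer the root."""
    chain = []
    current = collection_id
    while current is not None:
        chain.append(current)
        current = parents[current]
    resolved = {}
    for cid in reversed(chain):  # root first, so children overwrite parents
        resolved.update(properties.get(cid, {}))
    return resolved

parents = {
    "col_people": None,
    "col_employees": "col_people",
    "col_executives": "col_employees",
}
properties = {
    "col_people": {"basic_access": True, "badge_color": "white"},
    "col_employees": {"department": "Marketing", "badge_color": "blue"},
    "col_executives": {"executive_level": "VP"},
}

resolved = resolve_properties("col_executives", parents, properties)
print(resolved["basic_access"])  # True (inherited from col_people)
print(resolved["badge_color"])   # blue (col_employees overrides col_people)
```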

Enrichment and Retrieval Processes

The way taxonomies populate data (Enrichment) and how that populated data is subsequently used for filtering and searching (Retrieval) differs significantly based on whether the relationships are explicit (intra-taxonomy trees) or implicit (inter-taxonomy links or dynamic connections).

Explicit Hierarchy Enrichment & Retrieval

Implicit Relationship Enrichment & Retrieval

Comparison Table

Feature	Explicit Hierarchy (Intra-Taxonomy)	Implicit Relationships (Inter-Taxonomy / Dynamic)
Defined by	Schema (`parent`, structure)	Data behavior, shared IDs, retriever chaining
Enrichment	Rule-based, tree traversal	Dynamic, input-driven enrichment chains
Retrieval	`IN subtree("X")`, static filters	Chained stages (e.g., Face → Scene → Cluster)
Query Composition	Single query, expanded filters	Multi-step query pipeline, composite queries
Analogy	Folder structure / Tree	SQL JOINs / Graph Traversal
Examples	Category trees, org charts	Face → Scene → Episode links, Product Mentions
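
For the explicit case, the `IN subtree("X")` style of retrieval in the table can be read as expanding one category into itself plus all of its descendants and then filtering on that set. The sketch below illustrates the idea locally. The `subtree` helper, the document shape, and the "Cameras" category are assumptions for illustration, not Mixpeek query syntax.

```python
def subtree(root, parents):
    """Return root plus every descendant, given a child -> parent mapping."""
    children = {}
    for child, parent in parents.items():
        children.setdefault(parent, []).append(child)
    result, stack = set(), [root]
    while stack:
        node = stack.pop()
        result.add(node)
        stack.extend(children.get(node, []))
    return result

parents = {
    "Electronics": None,
    "TV & Home Theater": "Electronics",
    "TVs": "TV & Home Theater",
    "OLED TVs": "TVs",
    "Cameras": "Electronics",
}
docs = [
    {"document_id": "doc_1", "category": "OLED TVs"},
    {"document_id": "doc_2", "category": "Cameras"},
]

# Expand the subtree once, then apply it as a static filter
allowed = subtree("TV & Home Theater", parents)
hits = [d["document_id"] for d in docs if d["category"] in allowed]
print(hits)  # ['doc_1']
```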

Using Enriched Documents

Once you’ve applied a taxonomy, you can use the enriched fields in searches and filters:

# Search using enriched taxonomy fields
results = mp.retrievers.execute(
    retriever_id="ret_jkl012",
    query={
        "text": "marketing presentation"
    },
    filters={
        "department": "Marketing",  # Taxonomy-enriched field
        "executive_level": {"$exists": True}  # Only match executive content
    }
)
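
The filter above combines an equality match with an `$exists` check. As a rough illustration of those semantics, here is a local sketch that assumes a MongoDB-style meaning for `$exists`. The `matches` helper is hypothetical and is not part of the Mixpeek SDK, and server-side filter evaluation may differ.

```python
def matches(doc, filters):
    """Return True if doc satisfies every filter condition.
    Supports plain equality and {"$exists": bool} only."""
    for field, cond in filters.items():
        if isinstance(cond, dict) and "$exists" in cond:
            if (field in doc) != cond["$exists"]:
                return False
        elif doc.get(field) != cond:
            return False
    return True

enriched_docs = [
    {"document_id": "doc_1", "department": "Marketing", "executive_level": "VP"},
    {"document_id": "doc_2", "department": "Marketing"},
    {"document_id": "doc_3", "department": "Sales", "executive_level": "SVP"},
]
filters = {"department": "Marketing", "executive_level": {"$exists": True}}

selected = [d["document_id"] for d in enriched_docs if matches(d, filters)]
print(selected)  # ['doc_1']
```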

Best Practices

Design Thoughtful Hierarchies

For hierarchical taxonomies, carefully plan the levels and inheritance relationships to avoid redundancy and maximize usability.

Optimize Retriever Performance

Create efficient retrievers for taxonomy matching, focusing on the most distinctive features for each node type.

Consider Materialization Strategy

Choose between materialized and on-demand enrichment based on your query patterns, update frequency, and storage constraints.

Test with Representative Data

Validate taxonomy performance with representative test data before applying to your full collection.

Taxonomies with many nodes or complex hierarchies can impact application performance. Optimize your taxonomy structure and retriever configuration for the best balance of accuracy and performance.

Implementation Patterns

Dynamic Classification

Apply taxonomies to automatically classify new content as it’s ingested, using a pipeline hook to trigger taxonomy application.

Enriched Search

Use taxonomies to enrich documents with additional metadata that can be leveraged for more precise filtering and faceting in search.

Hierarchical Navigation

Create user interfaces that leverage hierarchical taxonomy structures for browsing and navigating content collections.

Compliance Tagging

Use taxonomies to automatically apply compliance or policy tags to content based on its characteristics.

API Reference

For complete details on working with taxonomies, see our Taxonomies API Reference.

🧩 Where Do Retriever Inputs Get Their Values?

There are two key sources for these inputs, depending on when the retriever is being executed:

1. During Ingestion (a.k.a. feature extraction time)

Here, values for retriever inputs come from:

Raw uploaded data (e.g., file contents, video frames, text)
Output of previous extractors (e.g., OCR → text → entity recognition)
Static metadata from the doc (e.g., file name, timestamp)

You often see this in the config like:

feature_extractors:
  - type: taxonomy_extraction
    taxonomy: "scene_taxonomy"

In this case, a retriever associated with the taxonomy might pull its input from:

{
  "text": "<output of OCR>",
  "image": "<scene frame>",
  ...
}
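
One way to picture how such an input object gets assembled at ingestion time is as a mapping from retriever input names to the places their values live. The sketch below is purely illustrative. The `"source.field"` path convention, the `build_retriever_inputs` helper, and the sample values are assumptions, not Mixpeek configuration syntax.

```python
def build_retriever_inputs(mapping, sources):
    """Resolve each retriever input name from a 'source.field' path."""
    inputs = {}
    for input_name, path in mapping.items():
        source, field = path.split(".", 1)
        inputs[input_name] = sources[source][field]
    return inputs

# Values available while a document is being ingested
sources = {
    "raw": {"frame": "frame_0042.jpg"},                 # raw uploaded data
    "extractors": {"ocr_text": "EXIT 12 - Downtown"},   # output of a previous extractor
    "metadata": {"file_name": "dashcam.mp4"},           # static document metadata
}
mapping = {"text": "extractors.ocr_text", "image": "raw.frame"}

inputs = build_retriever_inputs(mapping, sources)
print(inputs)  # {'text': 'EXIT 12 - Downtown', 'image': 'frame_0042.jpg'}
```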

2. During Query / Retrieval Time

When someone is searching, retriever inputs can come from:

User query inputs (text, filters, embeddings)
Intermediate outputs in the retrieval pipeline (e.g., a face_id found from a face query used as input into a scene_taxonomy retriever)
Query pipelines can be pre-defined to wire these together dynamically, as in:

{
  "face_input": "img_of_brad_pitt.jpg",
  "scene_query": "fighting scene"
}

Then one retriever resolves the face to a canonical ID, which is used as a pre-filter in another retriever’s execution.
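
The two-stage flow just described can be sketched locally as follows. This is a toy model under stated assumptions: an exact dictionary lookup stands in for face matching, keyword containment stands in for scene search, and the function names, IDs, and scene records are all hypothetical.

```python
def resolve_face(face_input, face_index):
    """Stage 1: map a face image reference to a canonical person ID.
    (An exact lookup stands in for similarity-based face matching.)"""
    return face_index.get(face_input)

def search_scenes(scene_query, scenes, person_id):
    """Stage 2: pre-filter scenes by person ID, then keep the ones
    whose description contains every word of the query."""
    words = scene_query.lower().split()
    candidates = [s for s in scenes if person_id in s["person_ids"]]
    return [
        s["scene_id"]
        for s in candidates
        if all(w in s["description"].lower() for w in words)
    ]

face_index = {"img_of_brad_pitt.jpg": "person_001"}
scenes = [
    {"scene_id": "scene_1", "person_ids": ["person_001"], "description": "Rooftop fighting scene at night"},
    {"scene_id": "scene_2", "person_ids": ["person_002"], "description": "Fighting scene in a bar"},
    {"scene_id": "scene_3", "person_ids": ["person_001"], "description": "Quiet dinner conversation"},
]

query = {"face_input": "img_of_brad_pitt.jpg", "scene_query": "fighting scene"}
person_id = resolve_face(query["face_input"], face_index)
results = search_scenes(query["scene_query"], scenes, person_id)
print(results)  # ['scene_1']
```

The pre-filter removes `scene_2` even though its description matches the text query, which is the point of chaining: the output of the first stage narrows the search space of the second.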

🔄 Enrichment on Retrieval

When we say “fields get enriched on retrieval execution,” it means:

A taxonomy’s retriever will take the inputs and try to compute or resolve field values (like resolving "John Wick" → character_id)
This can happen “on demand” for a query result if the field wasn’t stored pre-computed

So the same retriever logic might be used both:

During ingestion (to store the enriched field)
During query (to enrich or filter results live)

🧠 Bonus Analogy

Think of a taxonomy collection as a smart join table:

Each row is an entity
The retrievers act like lookup functions that populate columns
The inputs to those functions come either from raw docs, previous retrievers, or user queries
