Clustering groups similar documents to power discovery, analytics visuals, and taxonomy bootstrapping. Mixpeek runs clustering as an engine pipeline, stores artifacts (Parquet) for scale, and exposes APIs to create, execute, list, stream, and apply enrichments. New: Automated triggers let you schedule clustering with cron expressions, intervals, events, or conditions—keeping your clusters fresh without manual intervention.
Overview
Structure Discovery Find groups in one or more collections; optional hierarchical metadata
Scalable Visualization Landmark-based UMAP generates 2D/3D coords for millions of points
Artifacts & Scale Results saved as Parquet (centroids, members) for WebGL/Arrow pipelines
LLM Labeling Optionally name clusters, summaries, and keywords
Enrichment Write back cluster_id membership or create derived collections
Automated Triggers Schedule clustering with cron, intervals, events, or conditions
Event-Driven Auto-recluster when documents added or data changes significantly
Distributed Processing Ray map_batches processes millions of points with bounded memory
How it works
Trigger or Execute
Start clustering via manual API call or automated trigger (cron, interval, event, conditional)
Preprocess
Optional normalization and dimensionality reduction (UMAP, t-SNE, PCA)
Cluster
Algorithms like KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Mean Shift, OPTICS
Postprocess
Compute centroids and stats; optional LLM labeling and hierarchical metadata
Persist & Notify
Parquet artifacts saved per run_id; webhook events emitted for monitoring
Enrich (optional)
Write cluster_id membership and labels back to documents
Multimodal example
Create a cluster definition
API: Create Cluster
Method: POST
Path: /v1/clusters
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"collection_ids": ["col_products_v1"],
"cluster_name": "products_clip_hdbscan",
"cluster_type": "vector",
"vector_config": {
"feature_extractor_name": "clip_vit_l_14",
"clustering_method": "hdbscan",
"hdbscan_parameters": {"min_cluster_size": 10, "min_samples": 5}
},
"llm_labeling": {"enabled": true, "model_name": "gpt-4"}
}'
Execute clustering
API: Execute Clustering
Method: POST
Path: /v1/clusters/execute
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/execute \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"collection_ids": ["col_products_v1"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 8, "max_iter": 300},
"feature_vector": {"feature_address": {"extractor": "clip_vit_l_14", "version": "1.0.0"}},
"normalize_features": true,
"dimensionality_reduction": {"method": "umap", "n_components": 2}
},
"sample_size": 10000,
"store_results": true,
"include_members": false,
"compute_metrics": true,
"save_artifacts": true
}'
The response includes run_id, metrics, and centroid summaries. Artifacts are written under a per-run S3 prefix.
Automated clustering with triggers
Instead of manually executing clustering, you can define triggers that automatically run clustering jobs based on schedules, events, or conditions. This ensures your clusters stay fresh without manual intervention.
Triggers are perfect for production workflows where data constantly changes—nightly reclustering, event-driven updates, or condition-based refreshes.
Why use triggers?
Stay Fresh Keep clusters up-to-date as new documents arrive
Hands-Off Set once, runs automatically—no manual execution needed
Resource Efficient Schedule during off-peak hours or when data changes significantly
Production Ready Built-in failure handling, webhooks, and execution history
Trigger types
Mixpeek supports four trigger types, each suited for different use cases:
Cron Triggers
Interval Triggers
Event Triggers
Conditional Triggers
Cron triggers execute clustering at specific times using cron expressions. Perfect for:
Nightly reclustering at 2am
Weekly batch processing every Sunday
Month-end analytics
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_products"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 10}
}
},
"trigger_type": "cron",
"schedule_config": {
"cron_expression": "0 2 * * *",
"timezone": "America/New_York"
},
"description": "Nightly product clustering at 2am EST"
}'
Common cron expressions:
"0 2 * * *"
- Daily at 2:00am
"0 */6 * * *"
- Every 6 hours
"0 0 * * 0"
- Every Sunday at midnight
"30 14 1 * *"
- First day of month at 2:30pm
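Interval triggers follow the same request shape as the cron example above; only schedule_config changes. The sketch below assumes an interval_seconds field (the field name is an assumption, not confirmed here, so check the API reference), and respects the 300-second minimum interval from the quotas section.
# Hypothetical interval trigger; schedule_config field names are assumptions
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_products"],
"config": {"algorithm": "hdbscan"}
},
"trigger_type": "interval",
"schedule_config": {
"interval_seconds": 21600
},
"description": "Recluster every 6 hours"
}'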
Managing triggers
Pause and resume triggers
Temporarily stop a trigger without deleting it:
# Pause trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/pause \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
# Resume trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/resume \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
When resumed, the next execution time is recalculated from the current time.
Update a trigger's schedule
Modify schedule configuration without recreating the trigger:
curl -X PATCH https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"schedule_config": {
"cron_expression": "0 3 * * *"
},
"description": "Updated to 3am"
}'
Note: Trigger type is immutable—delete and recreate to change type.
View execution history
Track all executions of a trigger with detailed metrics:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/history \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"offset": 0,
"limit": 50,
"status_filter": "completed"
}'
Response includes:
Job IDs and execution times
Status (completed/failed)
Execution duration
Cluster count and document count
List and filter triggers
Find triggers by cluster, type, or status:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/list \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_xyz789",
"trigger_type": "cron",
"status": "active",
"offset": 0,
"limit": 50
}'
Trigger lifecycle and status
Active
Trigger is enabled and will fire according to schedule. This is the normal operating state.
Paused
Trigger is temporarily disabled but retains configuration. Use pause endpoint to enter this state.
Failed
Trigger automatically disabled after 5 consecutive failures. Requires manual resume after fixing issues.
Disabled
Trigger soft-deleted via DELETE endpoint. No longer executes but history is preserved.
Failure handling and recovery
Triggers include built-in resilience:
Single failures: Logged but trigger continues
Consecutive failures: Tracked in trigger metadata
5 consecutive failures: Trigger status changes to failed, requiring manual resume
Webhook notifications: Sent for each failure with error details
Recovery steps:
Check the last_execution_error field via the GET endpoint
Fix the underlying issue (e.g., invalid config, missing resources)
Update the trigger if needed with the PATCH endpoint
Resume the trigger with a POST to the /resume endpoint
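Put together, recovery is three calls. A minimal sketch, assuming the trigger detail endpoint lives at GET /v1/clusters/triggers/{trigger_id} (path assumed from the resource pattern above; the PATCH and /resume endpoints are shown elsewhere on this page):
# 1. Inspect the failure (GET path assumed)
curl https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
# ...check last_execution_error in the response

# 2. Fix the config if needed (body shown is illustrative)
curl -X PATCH https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{"execution_config": {"collection_ids": ["col_products"]}}'

# 3. Resume
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/resume \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"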
Webhook events
Triggers emit lifecycle events you can subscribe to:
trigger.created - New trigger created
trigger.fired - Trigger fired and created clustering job
trigger.execution.completed - Clustering job completed successfully
trigger.execution.failed - Clustering job failed
trigger.paused / trigger.resumed - State changes
trigger.deleted - Trigger removed
Subscribe via the webhooks API to build custom workflows.
Example: Complete automation workflow
Here’s a production-ready setup for a product catalog:
Create nightly reclustering
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_catalog"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 20},
"normalize_features": true,
"llm_labeling": {"enabled": true}
}
},
"trigger_type": "cron",
"schedule_config": {
"cron_expression": "0 2 * * *",
"timezone": "America/New_York"
},
"description": "Daily product reclustering with labels"
}'
Add event-based trigger for rapid changes
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_catalog"],
"config": {"algorithm": "hdbscan"}
},
"trigger_type": "event",
"schedule_config": {
"event_type": "documents_added",
"event_threshold": 500,
"cooldown_seconds": 3600
},
"description": "Quick recluster after 500 new products"
}'
Monitor with webhooks
Subscribe to trigger.execution.completed events to:
Track clustering performance over time
Alert on cluster count changes
Trigger downstream enrichment pipelines
Apply enrichment automatically
When a trigger completes, use the enrichment API to write cluster_id back to documents for filtering and faceting, as sketched below.
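For example, a webhook consumer could call the enrichment endpoint (documented later on this page) when it receives trigger.execution.completed. The run id and target collection name here are illustrative placeholders; in practice they come from the event payload and your own naming:
# Hypothetical follow-up after a trigger.execution.completed event
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"clustering_ids": ["cl_run_abc"],
"source_collection_id": "col_catalog",
"target_collection_id": "col_catalog_enriched",
"batch_size": 1000
}'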
Quotas and limits
Max active triggers: 50 per namespace
Min interval: 300 seconds (5 minutes)
Default cooldown: 300 seconds (5 minutes)
Polling interval: 60 seconds
Max consecutive failures: 5 (auto-disables trigger)
Stream artifacts for UI
API: Stream Cluster Data
Method: POST
Path: /v1/clusters/{cluster_id}/data
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/CLUSTER_ID/data \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_123",
"include_centroids": true,
"include_members": true,
"limit": 1000,
"offset": 0
}'
Use this to load centroids and members for visualizations (2D/3D reductions, partitions by cluster_id).
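Since the request body accepts limit and offset, large memberships can be pulled page by page. A minimal sketch (page size and loop bounds are arbitrary; adjust to your cluster size):
# Page through members 1,000 at a time and save each page locally
for offset in 0 1000 2000 3000; do
curl -X POST https://api.mixpeek.com/v1/clusters/CLUSTER_ID/data \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d "{\"cluster_id\": \"cl_123\", \"include_members\": true, \"limit\": 1000, \"offset\": $offset}" \
> "members_$offset.json"
done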
Scalable visualization with dimensionality reduction
For large clusters, Mixpeek provides engine-powered visualization that generates 2D/3D coordinates for millions of points using landmark-based dimensionality reduction. This enables interactive scatter plots, WebGL renderers, and spatial exploration at scale.
Visualization uses landmark-based UMAP with Nyström approximation—fitting on 2% of points and interpolating the rest. This scales to millions of points while maintaining quality.
Why use engine visualization?
Scales to Millions Landmark-based DR processes 1M+ points in minutes
Distributed Processing Ray map_batches processes 5k points per batch across cluster
Cached & Reusable S3-cached coordinates with signature-based keys—no redundant computation
Production Ready Hard cap at 10k forces engine delegation for safety
How it works
Request visualization
Client requests coordinates via API. If cluster has >10k points, automatically delegates to engine.
Compute signature
Engine generates cache key from cluster ID, method, parameters—ensures idempotence.
Check cache
If coordinates already exist in S3 with matching signature, skip computation and return cached URLs.
Fit on landmarks
Sample 2% of points (capped at 50k) as landmarks. Fit UMAP or incremental PCA on landmarks only.
Transform with Ray
Use Ray map_batches to apply dimensionality reduction to all points in 5k batches—distributed and memory-efficient.
Cache and return
Save coordinates to S3 as Parquet. Return URLs to API, which loads and serves to client.
Generate visualization
For datasets with >10k points, visualization is generated automatically when you request artifacts. For smaller datasets, call the engine endpoint directly.
Automatic (>10k points)
Manual (<10k points)
# For large clusters, visualization is automatic
curl https://api.mixpeek.com/v1/clusters/cl_abc123/artifacts?include_members=true \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
Response includes coordinates:
{
  "centroids": [ ... ],
  "members": [
    {
      "point_id": "doc_1",
      "cluster_id": "cluster_0",
      "x": 1.5,
      "y": 2.3,
      "payload": {}
    }
  ]
}
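For the manual path (smaller clusters, under 10k points), you call the engine endpoint directly and then fetch artifacts. This is a sketch based on the POST /clusters/visualization request shown in Troubleshooting below; the /v1 prefix and exact response handling are assumptions, so verify against the API reference:
# Generate coordinates explicitly, then request artifacts
curl -X POST https://api.mixpeek.com/v1/clusters/visualization \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_abc123",
"method": "umap",
"n_components": 2
}'

curl https://api.mixpeek.com/v1/clusters/cl_abc123/artifacts?include_members=true \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"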
Configuration options
Dimensionality reduction methods
UMAP (Uniform Manifold Approximation and Projection)
Best for: Preserving both local and global structure
Quality: Excellent cluster separation
Speed: Moderate (60s for 1M points)
Use when: Visualization quality matters most
Incremental PCA
Best for: Fast linear projections
Quality: Good for high-dimensional data
Speed: Fast (30s for 1M points)
Use when: Speed matters more than non-linear structure
{
  "method": "umap",     // or "ipca"
  "n_components": 2     // 2D or 3D
}
Controls how many points are used as landmarks for fitting:
sample_pct: Percentage of points to use as landmarks (default: 0.02 = 2%)
max_landmarks: Cap on landmark count (default: 50,000)
k_landmarks: Nearest landmarks for interpolation (default: 15)
Examples:
10k points → 200 landmarks (2%)
100k points → 2k landmarks (2%)
1M points → 20k landmarks (2%)
10M points → 50k landmarks (capped)
{
  "sample_pct": 0.02,
  "max_landmarks": 50000,
  "k_landmarks": 15
}
Tuning guide:
More landmarks = better quality, slower processing
Fewer landmarks = faster processing, lower quality
Sweet spot: 0.02 (2%) for most use cases
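The counts above follow min(max_landmarks, n_points × sample_pct); this is the rule implied by the examples, not an official formula. A quick sanity check for 1M points:
python3 -c "n=1_000_000; sample_pct=0.02; max_landmarks=50_000; print(min(max_landmarks, int(n*sample_pct)))"
# -> 20000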
Fine-tune UMAP behavior for different visualization needs:
n_neighbors: Balance local vs global structure (default: 15)
min_dist: Minimum distance between points (default: 0.1)
metric: Distance metric (default: "cosine")
Common configurations:
Tight clusters (detail view):
{
  "umap_n_neighbors": 30,
  "umap_min_dist": 0.01,
  "umap_metric": "cosine"
}
Spread out (overview):
{
  "umap_n_neighbors": 10,
  "umap_min_dist": 0.3,
  "umap_metric": "euclidean"
}
High-dimensional embeddings:
{
  "umap_n_neighbors": 20,
  "umap_min_dist": 0.1,
  "umap_metric": "cosine"   // Best for embeddings
}
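Putting the pieces together, a single visualization request might combine the method, landmark, and UMAP settings. Whether these fields sit flat in one body as shown here is an assumption (the endpoint and field names come from the snippets on this page), so verify against the API reference before relying on it:
# Hypothetical combined visualization config (flat field layout assumed)
curl -X POST https://api.mixpeek.com/v1/clusters/visualization \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_abc123",
"method": "umap",
"n_components": 2,
"sample_pct": 0.02,
"max_landmarks": 50000,
"k_landmarks": 15,
"umap_n_neighbors": 15,
"umap_min_dist": 0.1,
"umap_metric": "cosine"
}'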
Caching and recomputation
Visualization results are cached in S3 with signature-based keys.
Signature includes:
Cluster ID
Feature name
DR method and parameters
Component count
Behavior:
Same parameters → same signature → cached result
Different parameters → new signature → new computation
Set force_recompute: true to bypass cache
{
  "force_recompute": false   // Use cached result if available
}
Cache location:
s3://bucket/{internal_id}/{namespace_id}/engine_cluster_build/{cluster_id}/visualization/
  coords_{signature}.parquet
  reducer_{signature}.pkl
Processing Times
Memory Usage
Quality Metrics
Approximate times for landmark-based UMAP on typical hardware:
Dataset Size | Landmarks | Fit Time | Transform Time | Total
1k points    | 100       | 2s       | 0.5s           | 2.5s
10k points   | 500       | 5s       | 2s             | 7s
100k points  | 2k        | 15s      | 10s            | 25s
1M points    | 20k       | 60s      | 60s            | 2m
10M points   | 50k       | 120s     | 300s           | 7m
Notes:
Fit time scales with landmark count squared
Transform time scales with point count × k_landmarks
Times assume 16-core machine
WebGL Scatter Plots Use Three.js, Plotly, or deck.gl to render millions of points with GPU acceleration
Observable Plots Load Parquet directly in Observable notebooks for interactive exploration
Apache Arrow Use PyArrow or Arrow.js to stream coordinates without full deserialization
Tile-Based Rendering Bin coordinates into tiles for progressive loading in zoom-enabled UIs
Example: Load in JavaScript
import * as THREE from 'three';
import { tableFromIPC } from 'apache-arrow'; // optional: for streaming Arrow/Parquet instead of JSON

// Fetch cluster members (with reduced coordinates) from the API
const response = await fetch('/v1/clusters/cl_abc/artifacts?include_members=true');
const { members } = await response.json();

// Render with Three.js (Plotly, deck.gl, etc. work similarly);
// getColorForCluster is your own cluster_id -> color mapping
const scene = new THREE.Scene();
members.forEach(point => {
  const geometry = new THREE.SphereGeometry(0.05);
  const material = new THREE.MeshBasicMaterial({
    color: getColorForCluster(point.cluster_id)
  });
  const sphere = new THREE.Mesh(geometry, material);
  sphere.position.set(point.x, point.y, point.z || 0);
  scene.add(sphere);
});
Troubleshooting
Error: "Dataset has X points. Client should request visualization generation via engine."
Cause: Cluster has fewer than 10k points, no pre-generated visualization.
Solution:
# Generate visualization first
POST /clusters/visualization
{
  "cluster_id": "cl_abc123",
  "method": "umap",
  "n_components": 2
}
# Then request artifacts
GET /clusters/cl_abc123/artifacts?include_members=true
Visualization quality is poor
Symptoms: Clusters overlap, structure is unclear, points are too spread out.
Solutions:
Increase landmarks: Set sample_pct: 0.03 or 0.05
Adjust n_neighbors: Try 20-30 for more global structure
Tighten clusters: Set min_dist: 0.01 for less overlap
Check metric: Use "cosine" for embeddings, "euclidean" for raw features
Try 3D: Set n_components: 3 for complex structures
Visualization is too slow
Symptoms: Visualization takes too long for interactive use.
Solutions:
Reduce landmarks: Set sample_pct: 0.01 (1%)
Use IPCA: Set method: "ipca" for a 2x speedup
Check Ray cluster: Ensure sufficient CPUs allocated
Verify caching: Ensure force_recompute: false to use the cache
Increase batch size: Larger batches mean less overhead (tune in engine config)
Engine visualization failed
Symptoms: Engine returns error during visualization generation.
Debugging steps:
Check engine logs for detailed error message
Verify members.parquet exists in S3 for the cluster
Try with force_recompute: true to bypass cache
Check Ray cluster health and resource availability
Verify feature vectors are present in member data
Common causes:
Missing or corrupted member artifacts
Insufficient memory for landmark count
Ray actor failures (check Ray dashboard)
Apply cluster enrichment
API: Apply Cluster Enrichment
Method: POST
Path: /v1/clusters/enrich
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"clustering_ids": ["cl_run_abc"],
"source_collection_id": "col_products_v1",
"target_collection_id": "col_products_enriched_v1",
"batch_size": 1000,
"parallelism": 4
}'
This writes cluster_id membership (and optional labels) back to documents.
Manage clusters
Cluster Operations
Trigger Management
Config building blocks
Algorithms: kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics
Artifacts (Parquet)
Centroids: columns include cluster_id, centroid_vector, num_members, variance, label, summary, keywords, feature_name, feature_dimensions, parent_cluster_id, hierarchy_level, reduction_method, parameters, algorithm, run_id, and timestamps
Members: partitioned by cluster_id; includes point_id, reduced coordinates (x, y, optional z), and an optional payload slice for filtering
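Because artifacts are plain Parquet, you can inspect them locally with any Parquet reader. A sketch using the DuckDB CLI; the local file name is whatever you downloaded the centroid artifact as (illustrative here), and the columns come from the list above:
# Peek at the largest clusters in a downloaded centroid artifact
duckdb -c "SELECT cluster_id, label, num_members FROM 'centroids.parquet' ORDER BY num_members DESC LIMIT 10;"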
Best practices
Start with samples
Use sample_size for quick exploration before full runs. Test clustering algorithms and parameters on a subset before scaling up.
Automate with triggers
Set up cron triggers for nightly reclustering and event triggers for rapid data changes. This keeps clusters fresh without manual work.
Label pragmatically
Enable LLM labeling after validating cluster separation. Labels are expensive—ensure your clusters are meaningful first.
Monitor with webhooks
Subscribe to trigger execution events to track performance, alert on anomalies, and chain downstream workflows automatically.
Persist visuals
Save artifacts and stream reduced coordinates for UI at scale. Parquet format enables fast loading in WebGL and data viz tools.
Close the loop
Convert stable clusters into taxonomies to enrich downstream search. Apply enrichment to write cluster_id back for filtering and faceting.
See also