Clustering in Mixpeek is the multimodal equivalent of a SQL GROUP BY operation: it lets you group documents by feature similarity rather than exact field matches.

Overview

Clustering enables you to organize and group documents based on their feature similarity. Unlike traditional SQL GROUP BY operations that group rows based on exact field matches, clustering uses similarity metrics to group documents that share similar characteristics.
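
To make the analogy concrete, here is a minimal, self-contained sketch in plain Python and NumPy (not the Mixpeek API); the documents, field names, and similarity threshold are illustrative.

```python
# Exact-match grouping (SQL GROUP BY style) vs. similarity-based grouping.
from collections import defaultdict
import numpy as np

docs = [
    {"id": "a", "category": "shoes",   "embedding": np.array([0.90, 0.10])},
    {"id": "b", "category": "shoes",   "embedding": np.array([0.88, 0.12])},
    {"id": "c", "category": "apparel", "embedding": np.array([0.10, 0.95])},
]

# GROUP BY: documents share a group only when the field value matches exactly.
by_category = defaultdict(list)
for doc in docs:
    by_category[doc["category"]].append(doc["id"])

# Clustering: documents share a group when their vectors are similar enough.
def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

clusters = []  # each cluster keeps a representative vector and its member ids
for doc in docs:
    for cluster in clusters:
        if cosine(doc["embedding"], cluster["rep"]) > 0.9:
            cluster["ids"].append(doc["id"])
            break
    else:
        clusters.append({"rep": doc["embedding"], "ids": [doc["id"]]})

print(dict(by_category))             # {'shoes': ['a', 'b'], 'apparel': ['c']}
print([c["ids"] for c in clusters])  # [['a', 'b'], ['c']]
```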

1. Select Clustering Approach

Choose between two clustering methods:

  • Vector-Based Clustering: Groups documents based on semantic similarity using embedding vectors and clustering algorithms like HDBSCAN or K-means.
  • Attribute-Based Grouping: Groups documents based on specific metadata attributes like categories, dates, or custom fields.

2. Configure Options

Set specific parameters for your chosen approach. For vector-based clustering, select the embedding model and the clustering algorithm (e.g., HDBSCAN). For attribute-based grouping, select the fields to group by.
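
The exact request shape depends on your Mixpeek API version; the following is a hypothetical sketch of the two configuration shapes, with field names chosen for illustration rather than taken from the official schema.

```python
# Hypothetical configuration shapes -- illustrative only, not the official Mixpeek schema.

# Vector-based clustering: choose the stored embedding to cluster on and the algorithm.
vector_clustering_config = {
    "type": "vector",
    "feature": "multimodal_embedding",   # which embedding feature to use
    "algorithm": "hdbscan",              # e.g. "hdbscan" or "kmeans"
    "parameters": {
        "min_cluster_size": 25,          # HDBSCAN: smallest group treated as a cluster
    },
}

# Attribute-based grouping: choose the metadata fields to group on (exact matches).
attribute_grouping_config = {
    "type": "attribute",
    "group_by": ["metadata.category", "metadata.language"],
}
```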

3. Configure Cluster Naming

Set up automatic cluster naming:

  • Enable: Yes/No
  • Generative Model: GPT-4o
  • Method: Centroid
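
One way to read the Centroid method: take the documents closest to each cluster's centroid and ask the generative model for a short label. The sketch below illustrates that idea with the OpenAI SDK and GPT-4o; it is not Mixpeek's internal implementation, and the prompt and helper name are invented for the example.

```python
# Illustrative centroid-based naming: label a cluster from its most central documents.
import numpy as np
from openai import OpenAI

def name_cluster(embeddings: np.ndarray, titles: list[str], k: int = 5, max_chars: int = 50) -> str:
    centroid = embeddings.mean(axis=0)
    # Indices of the k documents closest to the centroid.
    nearest = np.argsort(np.linalg.norm(embeddings - centroid, axis=1))[:k]
    examples = "\n".join(titles[i] for i in nearest)

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"Write a label of at most {max_chars} characters for a cluster "
                       f"of documents like these:\n{examples}",
        }],
    )
    # Cluster labels are capped at 50 characters (see Limitations below).
    return response.choices[0].message.content.strip()[:max_chars]
```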

4. Select Target Collection

Choose which collection the cluster information will be added to. This determines which documents will be grouped together.

5. Set Execution Schedule

Determine when clustering will run:

  • One-time execution
  • Scheduled at a defined cadence (daily, weekly, etc.)
  • Trigger-based (when new documents are added)
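
How a schedule is expressed depends on the API version; as a hypothetical sketch, the three modes might look like this (field names are illustrative):

```python
# Hypothetical schedule configurations -- field names are illustrative.
run_once      = {"schedule": {"mode": "once"}}
run_weekly    = {"schedule": {"mode": "recurring", "cadence": "weekly"}}       # daily, weekly, ...
run_on_ingest = {"schedule": {"mode": "trigger", "event": "document_added"}}   # re-cluster on new documents
```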

Clustering Types

Mixpeek supports various clustering approaches through the Grouper interface:

Vector Clustering

Groups documents based on embedding similarity using algorithms like K-means or DBSCAN. Perfect for finding visually or semantically similar content.
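
As a minimal sketch of this type, the snippet below clusters synthetic 128-dimensional "embeddings" with scikit-learn's K-means and DBSCAN; real document embeddings would come from your extraction pipeline.

```python
# Vector clustering on stand-in embeddings with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(50, 128)),   # documents about one topic
    rng.normal(loc=1.0, scale=0.1, size=(50, 128)),   # documents about another topic
])

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
dbscan_labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(embeddings)  # -1 marks noise
```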

Time-based Clustering

Groups documents based on temporal proximity, such as grouping video scenes by time ranges or documents by creation date.
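
A sketch of the idea, assuming scenes carry start and end times in seconds: consecutive scenes are merged into one group whenever the gap between them stays under a threshold.

```python
# Group video scenes by temporal proximity (gap-based merging).
scenes = [
    {"id": "scene-1", "start": 0.0,  "end": 12.4},
    {"id": "scene-2", "start": 12.9, "end": 30.0},
    {"id": "scene-3", "start": 95.0, "end": 110.2},
]

gap_threshold = 5.0  # seconds between scenes before a new group starts
groups = []
for scene in sorted(scenes, key=lambda s: s["start"]):
    if groups and scene["start"] - groups[-1][-1]["end"] <= gap_threshold:
        groups[-1].append(scene)
    else:
        groups.append([scene])
# -> [[scene-1, scene-2], [scene-3]]
```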

Categorical Clustering

Groups documents based on detected categories, objects, or topics. Useful for organizing content by subject matter.
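
A small sketch, assuming each document carries a list of detected topics: a document with several topics is indexed under each of them.

```python
# Group documents by detected category/topic labels.
from collections import defaultdict

docs = [
    {"id": "vid-1", "topics": ["sports", "news"]},
    {"id": "vid-2", "topics": ["sports"]},
    {"id": "vid-3", "topics": ["music"]},
]

by_topic = defaultdict(list)
for doc in docs:
    for topic in doc["topics"]:
        by_topic[topic].append(doc["id"])
# {'sports': ['vid-1', 'vid-2'], 'news': ['vid-1'], 'music': ['vid-3']}
```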

Custom Clustering

Supports custom grouping criteria through the Grouper interface for domain-specific clustering needs.
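
The Grouper interface itself is defined by Mixpeek; the sketch below only illustrates the general shape a custom grouping criterion could take, with an invented protocol and example class.

```python
# Hypothetical custom-grouping sketch (not the actual Mixpeek Grouper interface).
from typing import Protocol

class Grouper(Protocol):
    def group(self, documents: list[dict]) -> dict[str, list[dict]]:
        """Return a mapping from group key to the documents assigned to it."""
        ...

class FileSizeGrouper:
    """Example domain-specific criterion: bucket documents by rough file size."""

    def group(self, documents: list[dict]) -> dict[str, list[dict]]:
        buckets: dict[str, list[dict]] = {"small": [], "large": []}
        for doc in documents:
            key = "large" if doc.get("size_bytes", 0) > 10_000_000 else "small"
            buckets[key].append(doc)
        return buckets
```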

Implementation Details

Dimensional Reduction and Visualization

Before applying clustering algorithms, high-dimensional feature vectors (like embeddings) are reduced to lower dimensions using techniques like UMAP or t-SNE. This process serves two important purposes:

  1. Improves Clustering Performance: Reduces computational complexity and often leads to better cluster separation
  2. Enables Visual Inspection: Allows you to visualize feature relationships in 2D or 3D space

Figure: t-SNE visualization showing document clusters by content type and category

This visualization step is crucial for:

  • Validating clustering quality before applying
  • Identifying outliers or unexpected groupings
  • Determining optimal cluster parameters
  • Communicating patterns to stakeholders
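
The reduction and visualization step can be reproduced with off-the-shelf tooling. The sketch below uses t-SNE from scikit-learn and matplotlib on synthetic embeddings; UMAP (via the umap-learn package) slots in the same way.

```python
# Reduce high-dimensional embeddings to 2D and plot them for inspection.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(100, 128)),
    rng.normal(1.0, 0.1, size=(100, 128)),
])

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("Documents projected to 2D with t-SNE")
plt.show()
```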

Clustering Process

1. Feature Selection

Choose which features to use for clustering (e.g., embeddings, timestamps, categories)

2. Dimensional Reduction

Apply UMAP or t-SNE to reduce high-dimensional features to 2D or 3D for visualization and improved clustering

3. Algorithm Configuration

Configure clustering parameters (e.g., number of clusters, similarity thresholds)

4. Grouping Execution

Apply the clustering algorithm to group similar documents

5. Result Storage

Store cluster assignments and metadata for future retrieval
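
The five steps above fit into a short end-to-end sketch. It uses scikit-learn only (version 1.3+ for HDBSCAN), with PCA standing in for UMAP/t-SNE to keep dependencies minimal and synthetic embeddings in place of real documents.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import HDBSCAN

# 1. Feature selection: the embeddings to cluster on (synthetic here).
rng = np.random.default_rng(0)
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(200, 128)),
    rng.normal(1.0, 0.1, size=(200, 128)),
])
doc_ids = [f"doc-{i}" for i in range(len(embeddings))]

# 2. Dimensional reduction (UMAP or t-SNE in practice; PCA for brevity).
reduced = PCA(n_components=2, random_state=0).fit_transform(embeddings)

# 3 + 4. Algorithm configuration and grouping execution.
labels = HDBSCAN(min_cluster_size=25).fit_predict(reduced)  # -1 marks noise/outliers

# 5. Result storage: keep the cluster assignment alongside each document id.
assignments = [{"document_id": d, "cluster_id": int(c)} for d, c in zip(doc_ids, labels)]
```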

Use Cases

Content Organization

Automatically organize large collections of documents into logical groups

Similar Content Discovery

Find related content by exploring documents within the same cluster

Batch Processing

Process similar documents together for efficiency

Analytics

Analyze patterns and trends within document clusters

Limitations

  • Document Threshold: Maximum of 100,000 documents per clustering operation
  • Algorithm Constraints: K-means limited to 1,000 clusters; HDBSCAN limited to 50,000 documents
  • Visualization Limits: t-SNE and UMAP visualizations limited to 10,000 points for performance reasons
  • Reclustering Frequency: Scheduled clustering can run at most once per hour
  • Feature Compatibility: Not all feature types can be used for clustering (e.g., some binary features)
  • Processing Time: Large-scale clustering operations may take several minutes to complete
  • Cluster Storage: Cluster assignments persist for 90 days by default before requiring refresh
  • Naming Generation: Automatic cluster naming limited to 50 characters per cluster label