Clusters
Clustering in Mixpeek is the multimodal equivalent of a SQL GROUP BY operation: it groups documents by feature similarity rather than by exact field matches.
Overview
Clustering organizes documents into groups based on feature similarity. Whereas a traditional SQL GROUP BY matches rows on exact field values, clustering uses similarity metrics to group documents that share similar characteristics.
Select Clustering Approach
Choose between two clustering methods:
- Vector-Based Clustering: groups documents by semantic similarity, using embedding vectors and clustering algorithms such as HDBSCAN or K-Means.
- Attribute-Based Grouping: groups documents by specific metadata attributes such as categories, dates, or custom fields.
Configure Options
Set specific parameters for your chosen approach. For vector-based clustering, select the embedding model and algorithm (e.g., HDBSCAN). For attribute-based grouping, select the fields to group by.
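For a sense of what each approach needs, the dictionaries below are an illustrative sketch only; the field names are placeholders, not the actual Mixpeek configuration schema.

```python
# Hypothetical configuration sketches -- field names are illustrative,
# not the real Mixpeek API schema.

# Vector-based clustering: group documents by embedding similarity.
vector_clustering_config = {
    "type": "vector",
    "feature": "multimodal_embedding",   # which embedding to cluster on
    "algorithm": "hdbscan",              # or "kmeans"
    "params": {"min_cluster_size": 25},  # algorithm-specific parameters
}

# Attribute-based grouping: group documents by exact metadata values.
attribute_grouping_config = {
    "type": "attribute",
    "group_by": ["metadata.category", "metadata.created_at"],
}
```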
Configure Cluster Naming
Set up automatic cluster naming:
Automatic Naming
- Enable: Yes/No
- Generative Model: GPT-4o
- Method: Centroid
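Centroid-based naming generally means summarizing the documents closest to each cluster's centroid. The sketch below illustrates that idea only; `generate_label` is a placeholder for a call to the configured generative model (e.g., GPT-4o), not a real Mixpeek function.

```python
import numpy as np

def nearest_to_centroid(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k documents closest to the cluster centroid."""
    centroid = embeddings.mean(axis=0)
    distances = np.linalg.norm(embeddings - centroid, axis=1)
    return np.argsort(distances)[:k]

# Hypothetical: pass the most central documents to the generative model and
# ask for a short label (cluster labels are capped at 50 characters).
# representative = [doc_texts[i] for i in nearest_to_centroid(cluster_embeddings)]
# label = generate_label(representative)  # placeholder for a model call
```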
Select Target Collection
Choose which collection the cluster information will be added to. This determines which documents will be grouped together.
Set Execution Schedule
Determine when clustering will run:
- One-time execution
- Scheduled at a defined cadence (daily, weekly, etc.)
- Trigger-based (when new documents are added)
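Whatever form your setup takes, the schedule reduces to one of these three shapes. The dictionaries below are purely illustrative placeholders, not the Mixpeek API schema.

```python
# Hypothetical schedule settings -- illustrative only.
one_time  = {"mode": "once"}
recurring = {"mode": "scheduled", "cadence": "daily"}       # daily, weekly, ...
on_ingest = {"mode": "trigger", "event": "document_added"}  # run on new documents
```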
Clustering Types
Mixpeek supports various clustering approaches through the Grouper interface:
Vector Clustering
Groups documents based on embedding similarity using algorithms like K-means or DBSCAN. Perfect for finding visually or semantically similar content.
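Mixpeek performs this step for you; as a rough illustration of what vector clustering does, here is a scikit-learn sketch on placeholder embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

# One embedding row per document (stand-in data here).
embeddings = np.random.rand(1000, 768).astype(np.float32)

# K-means: you choose the number of clusters up front.
kmeans_labels = KMeans(n_clusters=10, random_state=0).fit_predict(embeddings)

# DBSCAN: density-based, no cluster count needed; label -1 marks outliers.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5, metric="cosine").fit_predict(embeddings)
```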
Time-based Clustering
Groups documents based on temporal proximity, such as grouping video scenes by time ranges or documents by creation date.
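A minimal sketch of the idea, assuming each scene carries a start timestamp in seconds:

```python
from collections import defaultdict

# Group video scenes into fixed 60-second buckets by start time.
scenes = [
    {"id": "scene-1", "start_s": 3.2},
    {"id": "scene-2", "start_s": 47.9},
    {"id": "scene-3", "start_s": 75.4},
]

bucket_size_s = 60
buckets: dict[int, list[str]] = defaultdict(list)
for scene in scenes:
    buckets[int(scene["start_s"] // bucket_size_s)].append(scene["id"])

# buckets -> {0: ['scene-1', 'scene-2'], 1: ['scene-3']}
```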
Categorical Clustering
Groups documents based on detected categories, objects, or topics. Useful for organizing content by subject matter.
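A sketch of categorical grouping, assuming each document carries a list of detected labels; note that a document can then belong to more than one group.

```python
from collections import defaultdict

docs = [
    {"id": "img-1", "detected_objects": ["car", "person"]},
    {"id": "img-2", "detected_objects": ["person"]},
    {"id": "img-3", "detected_objects": ["car"]},
]

by_object: dict[str, list[str]] = defaultdict(list)
for doc in docs:
    for label in doc["detected_objects"]:
        by_object[label].append(doc["id"])

# by_object -> {'car': ['img-1', 'img-3'], 'person': ['img-1', 'img-2']}
```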
Custom Clustering
Supports custom grouping criteria through the Grouper interface for domain-specific clustering needs.
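The Grouper interface itself belongs to Mixpeek; the protocol below is only a hypothetical sketch of the shape a custom grouper might take, not the actual API.

```python
from typing import Protocol

class Grouper(Protocol):
    """Hypothetical grouper shape -- not the actual Mixpeek interface."""
    def group(self, documents: list[dict]) -> dict[str, list[dict]]:
        """Return a mapping of group label -> documents in that group."""
        ...

class ByFileExtension:
    """Example: a domain-specific grouper that buckets assets by file type."""
    def group(self, documents: list[dict]) -> dict[str, list[dict]]:
        groups: dict[str, list[dict]] = {}
        for doc in documents:
            ext = doc.get("filename", "").rsplit(".", 1)[-1].lower()
            groups.setdefault(ext, []).append(doc)
        return groups
```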
Implementation Details
Dimensional Reduction and Visualization
Before applying clustering algorithms, high-dimensional feature vectors (like embeddings) are reduced to lower dimensions using techniques like UMAP or t-SNE. This process serves two important purposes:
- Improves Clustering Performance: Reduces computational complexity and often leads to better cluster separation
- Enables Visual Inspection: Allows you to visualize feature relationships in 2D or 3D space
Figure: t-SNE visualization showing document clusters by content type and category.
This visualization step is crucial for:
- Validating clustering quality before applying
- Identifying outliers or unexpected groupings
- Determining optimal cluster parameters
- Communicating patterns to stakeholders
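As a rough sketch of the reduction step described above (scikit-learn's t-SNE shown; UMAP from the separate umap-learn package is used the same way):

```python
import numpy as np
from sklearn.manifold import TSNE
# UMAP alternative (separate `umap-learn` package):
#   import umap; coords_2d = umap.UMAP(n_components=2).fit_transform(embeddings)

embeddings = np.random.rand(2000, 768).astype(np.float32)  # stand-in data

# Project high-dimensional embeddings to 2D for plotting and inspection.
coords_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

# coords_2d has shape (2000, 2); scatter-plot it, colored by cluster label,
# to spot outliers and judge separation before committing to parameters.
```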
Clustering Process
Feature Selection
Choose which features to use for clustering (e.g., embeddings, timestamps, categories)
Dimensional Reduction
Apply UMAP or t-SNE to reduce high-dimensional features to 2D or 3D for visualization and improved clustering
Algorithm Configuration
Configure clustering parameters (e.g., number of clusters, similarity thresholds)
Grouping Execution
Apply the clustering algorithm to group similar documents
Result Storage
Store cluster assignments and metadata for future retrieval
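Put together, the process looks roughly like the sketch below. scikit-learn is used purely for illustration, since Mixpeek runs these steps internally; PCA stands in for the reduction step and K-means for the clustering algorithm.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# 1. Feature selection: use document embeddings as the clustering feature.
doc_ids = [f"doc-{i}" for i in range(500)]
embeddings = np.random.rand(500, 768).astype(np.float32)  # stand-in data

# 2. Dimensional reduction (PCA for brevity; UMAP/t-SNE fit the same slot).
reduced = PCA(n_components=50, random_state=0).fit_transform(embeddings)

# 3-4. Algorithm configuration and grouping execution.
labels = KMeans(n_clusters=8, random_state=0).fit_predict(reduced)

# 5. Result storage: persist cluster assignments keyed by document id.
assignments = {doc_id: int(label) for doc_id, label in zip(doc_ids, labels)}
```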
Use Cases
Content Organization
Automatically organize large collections of documents into logical groups
Similar Content Discovery
Find related content by exploring documents within the same cluster
Batch Processing
Process similar documents together for efficiency
Analytics
Analyze patterns and trends within document clusters
Best Practices
Limitations
- Document Threshold: Maximum of 100,000 documents per clustering operation
- Algorithm Constraints: K-means limited to 1,000 clusters; HDBSCAN limited to 50,000 documents
- Visualization Limits: t-SNE and UMAP visualizations limited to 10,000 points for performance reasons
- Reclustering Frequency: Scheduled clustering can run at most once per hour
- Feature Compatibility: Not all feature types can be used for clustering (e.g., some binary features)
- Processing Time: Large-scale clustering operations may take several minutes to complete
- Cluster Storage: Cluster assignments persist for 90 days by default before requiring refresh
- Naming Generation: Automatic cluster naming limited to 50 characters per cluster label