Clustering groups similar documents to power discovery, analytics visuals, and taxonomy bootstrapping. Mixpeek runs clustering as an engine pipeline, stores artifacts (Parquet) for scale, and exposes APIs to create, execute, list, stream, and apply enrichments. New: Automated triggers let you schedule clustering with cron expressions, intervals, events, or conditions—keeping your clusters fresh without manual intervention.
Overview
Structure Discovery Find groups in one or more collections; optional hierarchical metadata
Scalable Visualization Landmark-based UMAP generates 2D/3D coords for millions of points
Artifacts & Scale Results saved as Parquet (centroids, members) for WebGL/Arrow pipelines
LLM Labeling Optionally name clusters, summaries, and keywords
Enrichment Write back cluster_id membership or create derived collections
Automated Triggers Schedule clustering with cron, intervals, events, or conditions
Event-Driven Auto-recluster when documents added or data changes significantly
Distributed Processing Ray map_batches processes millions of points with bounded memory
How it works
Trigger or Execute
Start clustering via manual API call or automated trigger (cron, interval, event, conditional)
Preprocess
Optional normalization and dimensionality reduction (UMAP, t-SNE, PCA)
Cluster
Algorithms like KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Mean Shift, OPTICS
Postprocess
Compute centroids and stats; optional LLM labeling and hierarchical metadata
Persist & Notify
Parquet artifacts saved per run_id; webhook events emitted for monitoring
Enrich (optional)
Write cluster_id membership and labels back to documents
Multimodal example
Create a cluster definition
API: Create Cluster
Method: POST
Path: /v1/clusters
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"collection_ids": ["col_products_v1"],
"cluster_name": "products_clip_hdbscan",
"cluster_type": "vector",
"vector_config": {
"feature_extractor_name": "clip_vit_l_14",
"clustering_method": "hdbscan",
"hdbscan_parameters": {"min_cluster_size": 10, "min_samples": 5}
},
"llm_labeling": {"enabled": true, "model_name": "gpt-4"}
}'
Execute clustering
API: Execute Clustering
Method: POST
Path: /v1/clusters/execute
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/execute \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"collection_ids": ["col_products_v1"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 8, "max_iter": 300},
"feature_vector": {"feature_address": {"extractor": "clip_vit_l_14", "version": "1.0.0"}},
"normalize_features": true,
"dimensionality_reduction": {"method": "umap", "n_components": 2}
},
"sample_size": 10000,
"store_results": true,
"include_members": false,
"compute_metrics": true,
"save_artifacts": true
}'
The response includes run_id, metrics, and centroid summaries. Artifacts are written under a per-run S3 prefix.
Automated clustering with triggers
Instead of manually executing clustering, you can define triggers that automatically run clustering jobs based on schedules, events, or conditions. This ensures your clusters stay fresh without manual intervention.
Triggers are perfect for production workflows where data constantly changes—nightly reclustering, event-driven updates, or condition-based refreshes.
Why use triggers?
Stay Fresh Keep clusters up-to-date as new documents arrive
Hands-Off Set once, runs automatically—no manual execution needed
Resource Efficient Schedule during off-peak hours or when data changes significantly
Production Ready Built-in failure handling, webhooks, and execution history
Trigger types
Mixpeek supports four trigger types, each suited for different use cases:
Cron Triggers
Interval Triggers
Event Triggers
Conditional Triggers
Cron triggers execute clustering at specific times using cron expressions. Perfect for:
Nightly reclustering at 2am
Weekly batch processing every Sunday
Month-end analytics
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_products"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 10}
}
},
"trigger_type": "cron",
"schedule_config": {
"cron_expression": "0 2 * * *",
"timezone": "America/New_York"
},
"description": "Nightly product clustering at 2am EST"
}'
Common cron expressions:
"0 2 * * *"
- Daily at 2:00am
"0 */6 * * *"
- Every 6 hours
"0 0 * * 0"
- Every Sunday at midnight
"30 14 1 * *"
- First day of month at 2:30pm
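Interval triggers follow the same request shape as the cron example above; only schedule_config changes. The sketch below assumes an interval_seconds field (the field name is an assumption, not confirmed here, so check the API reference), and respects the 300-second minimum interval from the quotas section.
# Hypothetical interval trigger; schedule_config field names are assumptions
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_products"],
"config": {"algorithm": "hdbscan"}
},
"trigger_type": "interval",
"schedule_config": {
"interval_seconds": 21600
},
"description": "Recluster every 6 hours"
}'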
Managing triggers
Pause and resume triggers
Temporarily stop a trigger without deleting it:
# Pause trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/pause \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
# Resume trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/resume \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
When resumed, the next execution time is recalculated from the current time.
Update a trigger's schedule
Modify schedule configuration without recreating the trigger:
curl -X PATCH https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"schedule_config": {
"cron_expression": "0 3 * * *"
},
"description": "Updated to 3am"
}'
Note: Trigger type is immutable—delete and recreate to change type.
View execution history
Track all executions of a trigger with detailed metrics:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/history \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"offset": 0,
"limit": 50,
"status_filter": "completed"
}'
Response includes:
Job IDs and execution times
Status (completed/failed)
Execution duration
Cluster count and document count
List and filter triggers
Find triggers by cluster, type, or status:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/list \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_xyz789",
"trigger_type": "cron",
"status": "active",
"offset": 0,
"limit": 50
}'
Trigger lifecycle and status
Active
Trigger is enabled and will fire according to schedule. This is the normal operating state.
Paused
Trigger is temporarily disabled but retains configuration. Use pause endpoint to enter this state.
Failed
Trigger automatically disabled after 5 consecutive failures. Requires manual resume after fixing issues.
Disabled
Trigger soft-deleted via DELETE endpoint. No longer executes but history is preserved.
Failure handling and recovery
Triggers include built-in resilience:
Single failures: Logged but trigger continues
Consecutive failures: Tracked in trigger metadata
5 consecutive failures: Trigger status changes to failed, requiring manual resume
Webhook notifications: Sent for each failure with error details
Recovery steps:
Check the last_execution_error field via the GET endpoint
Fix the underlying issue (e.g., invalid config, missing resources)
Update the trigger if needed with the PATCH endpoint
Resume the trigger with a POST to the /resume endpoint
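Put together, recovery is three calls. A minimal sketch, assuming the trigger detail endpoint lives at GET /v1/clusters/triggers/{trigger_id} (path assumed from the resource pattern above; the PATCH and /resume endpoints are shown elsewhere on this page):
# 1. Inspect the failure (GET path assumed)
curl https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
# ...check last_execution_error in the response

# 2. Fix the config if needed (body shown is illustrative)
curl -X PATCH https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{"execution_config": {"collection_ids": ["col_products"]}}'

# 3. Resume
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/resume \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"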
Webhook events
Triggers emit lifecycle events you can subscribe to:
trigger.created - New trigger created
trigger.fired - Trigger fired and created clustering job
trigger.execution.completed - Clustering job completed successfully
trigger.execution.failed - Clustering job failed
trigger.paused / trigger.resumed - State changes
trigger.deleted - Trigger removed
Subscribe via the webhooks API to build custom workflows.
Example: Complete automation workflow
Here’s a production-ready setup for a product catalog:
Create nightly reclustering
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_catalog"],
"config": {
"algorithm": "kmeans",
"algorithm_params": {"n_clusters": 20},
"normalize_features": true,
"llm_labeling": {"enabled": true}
}
},
"trigger_type": "cron",
"schedule_config": {
"cron_expression": "0 2 * * *",
"timezone": "America/New_York"
},
"description": "Daily product reclustering with labels"
}'
Add event-based trigger for rapid changes
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"execution_config": {
"collection_ids": ["col_catalog"],
"config": {"algorithm": "hdbscan"}
},
"trigger_type": "event",
"schedule_config": {
"event_type": "documents_added",
"event_threshold": 500,
"cooldown_seconds": 3600
},
"description": "Quick recluster after 500 new products"
}'
Monitor with webhooks
Subscribe to trigger.execution.completed events to:
Track clustering performance over time
Alert on cluster count changes
Trigger downstream enrichment pipelines
Apply enrichment automatically
When a trigger completes, use the enrichment API to write cluster_id back to documents for filtering and faceting, as sketched below.
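For example, a webhook consumer could call the enrichment endpoint (documented later on this page) when it receives trigger.execution.completed. The run id and target collection name here are illustrative placeholders; in practice they come from the event payload and your own naming:
# Hypothetical follow-up after a trigger.execution.completed event
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_products" \
-H "Content-Type: application/json" \
-d '{
"clustering_ids": ["cl_run_abc"],
"source_collection_id": "col_catalog",
"target_collection_id": "col_catalog_enriched",
"batch_size": 1000
}'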
Quotas and limits
Max active triggers: 50 per namespace
Min interval: 300 seconds (5 minutes)
Default cooldown: 300 seconds (5 minutes)
Polling interval: 60 seconds
Max consecutive failures: 5 (auto-disables trigger)
Stream artifacts for UI
API: Stream Cluster Data
Method: POST
Path: /v1/clusters/{cluster_id}/data
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/CLUSTER_ID/data \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_123",
"include_centroids": true,
"include_members": true,
"limit": 1000,
"offset": 0
}'
Use this to load centroids and members for visualizations (2D/3D reductions, partitions by cluster_id).
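Since the request body accepts limit and offset, large memberships can be pulled page by page. A minimal sketch (page size and loop bounds are arbitrary; adjust to your cluster size):
# Page through members 1,000 at a time and save each page locally
for offset in 0 1000 2000 3000; do
curl -X POST https://api.mixpeek.com/v1/clusters/CLUSTER_ID/data \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d "{\"cluster_id\": \"cl_123\", \"include_members\": true, \"limit\": 1000, \"offset\": $offset}" \
> "members_$offset.json"
done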
Scalable visualization with dimensionality reduction
For large clusters, Mixpeek provides engine-powered visualization that generates 2D/3D coordinates for millions of points using landmark-based dimensionality reduction. This enables interactive scatter plots, WebGL renderers, and spatial exploration at scale.
Visualization uses landmark-based UMAP with Nyström approximation—fitting on 2% of points and interpolating the rest. This scales to millions of points while maintaining quality.
Why use engine visualization?
Scales to Millions Landmark-based DR processes 1M+ points in minutes
Distributed Processing Ray map_batches processes 5k points per batch across cluster
Cached & Reusable S3-cached coordinates with signature-based keys—no redundant computation
Production Ready Hard cap at 10k forces engine delegation for safety
How it works
Request visualization
Client requests coordinates via API. If cluster has >10k points, automatically delegates to engine.
Compute signature
Engine generates cache key from cluster ID, method, parameters—ensures idempotence.
Check cache
If coordinates already exist in S3 with matching signature, skip computation and return cached URLs.
Fit on landmarks
Sample 2% of points (capped at 50k) as landmarks. Fit UMAP or incremental PCA on landmarks only.
Transform with Ray
Use Ray map_batches to apply dimensionality reduction to all points in 5k batches—distributed and memory-efficient.
Cache and return
Save coordinates to S3 as Parquet. Return URLs to API, which loads and serves to client.
Generate visualization
For datasets with >10k points, visualization is generated automatically when you request artifacts. For smaller datasets, call the engine endpoint directly.
Automatic (>10k points)
Manual (<10k points)
# For large clusters, visualization is automatic
curl https://api.mixpeek.com/v1/clusters/cl_abc123/artifacts?include_members=true \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"
Response includes coordinates:
{
  "centroids": [ ... ],
  "members": [
    {
      "point_id": "doc_1",
      "cluster_id": "cluster_0",
      "x": 1.5,
      "y": 2.3,
      "payload": {}
    }
  ]
}
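For the manual path (smaller clusters, under 10k points), you call the engine endpoint directly and then fetch artifacts. This is a sketch based on the POST /clusters/visualization request shown in Troubleshooting below; the /v1 prefix and exact response handling are assumptions, so verify against the API reference:
# Generate coordinates explicitly, then request artifacts
curl -X POST https://api.mixpeek.com/v1/clusters/visualization \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_abc123",
"method": "umap",
"n_components": 2
}'

curl https://api.mixpeek.com/v1/clusters/cl_abc123/artifacts?include_members=true \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123"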
Configuration options
Dimensionality reduction methods
UMAP (Uniform Manifold Approximation and Projection)
Best for: Preserving both local and global structure
Quality: Excellent cluster separation
Speed: Moderate (60s for 1M points)
Use when: Visualization quality matters most
Incremental PCA
Best for: Fast linear projections
Quality: Good for high-dimensional data
Speed: Fast (30s for 1M points)
Use when: Speed matters more than non-linear structure
{
  "method": "umap",     // or "ipca"
  "n_components": 2     // 2D or 3D
}
Controls how many points are used as landmarks for fitting:
sample_pct: Percentage of points to use as landmarks (default: 0.02 = 2%)
max_landmarks: Cap on landmark count (default: 50,000)
k_landmarks: Nearest landmarks for interpolation (default: 15)
Examples:
10k points → 200 landmarks (2%)
100k points → 2k landmarks (2%)
1M points → 20k landmarks (2%)
10M points → 50k landmarks (capped)
{
  "sample_pct": 0.02,
  "max_landmarks": 50000,
  "k_landmarks": 15
}
Tuning guide:
More landmarks = better quality, slower processing
Fewer landmarks = faster processing, lower quality
Sweet spot: 0.02 (2%) for most use cases
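The counts above follow min(max_landmarks, n_points × sample_pct); this is the rule implied by the examples, not an official formula. A quick sanity check for 1M points:
python3 -c "n=1_000_000; sample_pct=0.02; max_landmarks=50_000; print(min(max_landmarks, int(n*sample_pct)))"
# -> 20000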
Fine-tune UMAP behavior for different visualization needs:
n_neighbors: Balance local vs global structure (default: 15)
min_dist: Minimum distance between points (default: 0.1)
metric: Distance metric (default: "cosine")
Common configurations:
Tight clusters (detail view):
{
  "umap_n_neighbors": 30,
  "umap_min_dist": 0.01,
  "umap_metric": "cosine"
}
Spread out (overview):
{
  "umap_n_neighbors": 10,
  "umap_min_dist": 0.3,
  "umap_metric": "euclidean"
}
High-dimensional embeddings:
{
  "umap_n_neighbors": 20,
  "umap_min_dist": 0.1,
  "umap_metric": "cosine"   // Best for embeddings
}
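Putting the pieces together, a single visualization request might combine the method, landmark, and UMAP settings. Whether these fields sit flat in one body as shown here is an assumption (the endpoint and field names come from the snippets on this page), so verify against the API reference before relying on it:
# Hypothetical combined visualization config (flat field layout assumed)
curl -X POST https://api.mixpeek.com/v1/clusters/visualization \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"cluster_id": "cl_abc123",
"method": "umap",
"n_components": 2,
"sample_pct": 0.02,
"max_landmarks": 50000,
"k_landmarks": 15,
"umap_n_neighbors": 15,
"umap_min_dist": 0.1,
"umap_metric": "cosine"
}'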
Caching and recomputation
Visualization results are cached in S3 with signature-based keys.
Signature includes:
Cluster ID
Feature name
DR method and parameters
Component count
Behavior:
Same parameters → same signature → cached result
Different parameters → new signature → new computation
Set force_recompute: true to bypass cache
{
  "force_recompute": false   // Use cached result if available
}
Cache location:
s3://bucket/{internal_id}/{namespace_id}/engine_cluster_build/{cluster_id}/visualization/
  coords_{signature}.parquet
  reducer_{signature}.pkl
Processing Times
Memory Usage
Quality Metrics
Approximate times for landmark-based UMAP on typical hardware:
Dataset Size | Landmarks | Fit Time | Transform Time | Total
1k points    | 100       | 2s       | 0.5s           | 2.5s
10k points   | 500       | 5s       | 2s             | 7s
100k points  | 2k        | 15s      | 10s            | 25s
1M points    | 20k       | 60s      | 60s            | 2m
10M points   | 50k       | 120s     | 300s           | 7m
Notes:
Fit time scales with landmark count squared
Transform time scales with point count × k_landmarks
Times assume 16-core machine
WebGL Scatter Plots Use Three.js, Plotly, or deck.gl to render millions of points with GPU acceleration
Observable Plots Load Parquet directly in Observable notebooks for interactive exploration
Apache Arrow Use PyArrow or Arrow.js to stream coordinates without full deserialization
Tile-Based Rendering Bin coordinates into tiles for progressive loading in zoom-enabled UIs
Example: Load in JavaScript
import * as THREE from 'three';
import { tableFromIPC } from 'apache-arrow'; // optional: for streaming Arrow/Parquet instead of JSON

// Fetch cluster members (with reduced coordinates) from the API
const response = await fetch('/v1/clusters/cl_abc/artifacts?include_members=true');
const { members } = await response.json();

// Render with Three.js (Plotly, deck.gl, etc. work similarly);
// getColorForCluster is your own cluster_id -> color mapping
const scene = new THREE.Scene();
members.forEach(point => {
  const geometry = new THREE.SphereGeometry(0.05);
  const material = new THREE.MeshBasicMaterial({
    color: getColorForCluster(point.cluster_id)
  });
  const sphere = new THREE.Mesh(geometry, material);
  sphere.position.set(point.x, point.y, point.z || 0);
  scene.add(sphere);
});
Troubleshooting
Error: "Dataset has X points. Client should request visualization generation via engine."
Cause: Cluster has fewer than 10k points, no pre-generated visualization.
Solution:
# Generate visualization first
POST /clusters/visualization
{
  "cluster_id": "cl_abc123",
  "method": "umap",
  "n_components": 2
}
# Then request artifacts
GET /clusters/cl_abc123/artifacts?include_members=true
Visualization quality is poor
Symptoms: Clusters overlap, structure is unclear, points are too spread out.
Solutions:
Increase landmarks: Set sample_pct: 0.03 or 0.05
Adjust n_neighbors: Try 20-30 for more global structure
Tighten clusters: Set min_dist: 0.01 for less overlap
Check metric: Use "cosine" for embeddings, "euclidean" for raw features
Try 3D: Set n_components: 3 for complex structures
Visualization is too slow
Symptoms: Visualization takes too long for interactive use.
Solutions:
Reduce landmarks: Set sample_pct: 0.01 (1%)
Use IPCA: Set method: "ipca" for a 2x speedup
Check Ray cluster: Ensure sufficient CPUs allocated
Verify caching: Ensure force_recompute: false to use the cache
Increase batch size: Larger batches mean less overhead (tune in engine config)
Engine visualization failed
Symptoms: Engine returns error during visualization generation.
Debugging steps:
Check engine logs for detailed error message
Verify members.parquet exists in S3 for the cluster
Try with force_recompute: true to bypass cache
Check Ray cluster health and resource availability
Verify feature vectors are present in member data
Common causes:
Missing or corrupted member artifacts
Insufficient memory for landmark count
Ray actor failures (check Ray dashboard)
Apply cluster enrichment
API: Apply Cluster Enrichment
Method: POST
Path: /v1/clusters/enrich
Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
-H "Authorization: Bearer $API_KEY " \
-H "X-Namespace: ns_123" \
-H "Content-Type: application/json" \
-d '{
"clustering_ids": ["cl_run_abc"],
"source_collection_id": "col_products_v1",
"target_collection_id": "col_products_enriched_v1",
"batch_size": 1000,
"parallelism": 4
}'
This writes cluster_id membership (and optional labels) back to documents.
Manage clusters
Cluster Operations
Trigger Management
Config building blocks
Algorithms: kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics
Artifacts (Parquet)
Centroids: columns include cluster_id, centroid_vector, num_members, variance, label, summary, keywords, feature_name, feature_dimensions, parent_cluster_id, hierarchy_level, reduction_method, parameters, algorithm, run_id, and timestamps
Members: partitioned by cluster_id; includes point_id, reduced coordinates (x, y, optional z), and an optional payload slice for filtering
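Because artifacts are plain Parquet, you can inspect them locally with any Parquet reader. A sketch using the DuckDB CLI; the local file name is whatever you downloaded the centroid artifact as (illustrative here), and the columns come from the list above:
# Peek at the largest clusters in a downloaded centroid artifact
duckdb -c "SELECT cluster_id, label, num_members FROM 'centroids.parquet' ORDER BY num_members DESC LIMIT 10;"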
Best practices
Start with samples
Use sample_size for quick exploration before full runs. Test clustering algorithms and parameters on a subset before scaling up.
Automate with triggers
Set up cron triggers for nightly reclustering and event triggers for rapid data changes. This keeps clusters fresh without manual work.
Label pragmatically
Enable LLM labeling after validating cluster separation. Labels are expensive—ensure your clusters are meaningful first.
Monitor with webhooks
Subscribe to trigger execution events to track performance, alert on anomalies, and chain downstream workflows automatically.
Persist visuals
Save artifacts and stream reduced coordinates for UI at scale. Parquet format enables fast loading in WebGL and data viz tools.
Close the loop
Convert stable clusters into taxonomies to enrich downstream search. Apply enrichment to write cluster_id back for filtering and faceting.
See also