Clustering groups similar documents to power discovery, analytics visuals, and taxonomy bootstrapping. Mixpeek runs clustering as an engine pipeline, stores artifacts (Parquet) for scale, and exposes APIs to create, execute, list, stream, and apply enrichments.
Overview
- Structure Discovery: Find groups in one or more collections, with optional hierarchical metadata
- Artifacts & Scale: Results are saved as Parquet (centroids, members) for WebGL/Arrow pipelines
- LLM Labeling: Optionally generate cluster names, summaries, and keywords
- Enrichment: Write cluster_id membership back to documents, or create derived collections
How it works
1. Preprocess: Optional normalization and dimensionality reduction (UMAP, t-SNE, etc.)
2. Cluster: Algorithms such as KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Mean Shift, and OPTICS
3. Postprocess: Compute centroids and stats; optional LLM labeling and hierarchical metadata
4. Persist: Parquet artifacts saved per run_id; list and stream via API
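The four steps above can be sketched end to end with a toy, NumPy-only version (PCA standing in for UMAP/t-SNE, and a plain k-means loop standing in for the engine's algorithms; the actual Mixpeek pipeline and its parameters are not shown here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy embeddings: two well-separated blobs in 8 dimensions.
X = np.vstack([rng.normal(0.0, 0.1, (50, 8)), rng.normal(5.0, 0.1, (50, 8))])

# 1. Preprocess: center and reduce to 2D with PCA (stand-in for UMAP/t-SNE).
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # reduced (x, y) coordinates

# 2. Cluster: a few iterations of plain k-means with k=2.
k = 2
centroids = coords[rng.choice(len(coords), k, replace=False)]
for _ in range(10):
    d2 = ((coords[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    labels = d2.argmin(axis=1)
    centroids = np.array([
        coords[labels == i].mean(axis=0) if (labels == i).any() else centroids[i]
        for i in range(k)
    ])

# 3. Postprocess: per-cluster stats analogous to the centroids dataset.
stats = [
    {"cluster_id": i,
     "num_members": int((labels == i).sum()),
     "variance": float(coords[labels == i].var())}
    for i in range(k)
]
# 4. Persist: the engine writes centroids/members as Parquet per run_id.
```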
Multimodal example
Create a cluster definition
- API: Create Cluster
- Method: POST
- Path:
/v1/clusters
- Reference: API Reference
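A minimal request sketch for this endpoint; the base URL and all payload field names (`cluster_name`, `collection_ids`, `algorithm`, `parameters`) are illustrative assumptions, not the verified schema — consult the API Reference for exact fields.

```python
import json
import urllib.request

# Hypothetical payload for POST /v1/clusters -- field names are assumptions.
payload = {
    "cluster_name": "product_images",        # assumed field
    "collection_ids": ["col_123"],           # assumed field
    "algorithm": "hdbscan",                  # one of the supported algorithms
    "parameters": {"min_cluster_size": 10},  # assumed shape
}

req = urllib.request.Request(
    "https://api.mixpeek.com/v1/clusters",   # base URL assumed
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
# resp = urllib.request.urlopen(req)  # uncomment with real credentials
```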
Execute clustering
- API: Execute Clustering
- Method: POST
- Path:
/v1/clusters/execute
- Reference: API Reference
Returns a run_id, metrics, and centroid summaries. Artifacts are written under a per-run S3 prefix.
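A sketch of building the execute call, including the `sample_size` option mentioned under best practices; the body fields and base URL are assumptions — check the API Reference for the real schema.

```python
import json
import urllib.request

API = "https://api.mixpeek.com"  # base URL assumed


def execute_clustering(cluster_id, sample_size=None):
    """Build the POST /v1/clusters/execute request. Field names are assumptions."""
    body = {"cluster_id": cluster_id}
    if sample_size is not None:
        body["sample_size"] = sample_size  # quick exploration before full runs
    return urllib.request.Request(
        f"{API}/v1/clusters/execute",
        data=json.dumps(body).encode(),
        headers={"Authorization": "Bearer YOUR_API_KEY",
                 "Content-Type": "application/json"},
        method="POST",
    )


req = execute_clustering("clu_abc", sample_size=1000)
# resp = json.load(urllib.request.urlopen(req))  # response carries run_id, metrics
```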
Stream artifacts for UI
- API: Stream Cluster Data
- Method: POST
- Path:
/v1/clusters/{cluster_id}/data
- Reference: API Reference
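The path is templated on `cluster_id`; a request-building sketch follows. The body (requesting only the reduced coordinates needed for a scatter plot) is an assumed shape, not the documented schema.

```python
import json
import urllib.request

cluster_id = "clu_abc"
# Hypothetical body: fetch just the reduced 2D coordinates for a UI scatter plot.
body = {"fields": ["point_id", "x", "y"], "limit": 10_000}  # assumed fields

req = urllib.request.Request(
    f"https://api.mixpeek.com/v1/clusters/{cluster_id}/data",  # base URL assumed
    data=json.dumps(body).encode(),
    headers={"Authorization": "Bearer YOUR_API_KEY",
             "Content-Type": "application/json"},
    method="POST",
)
```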
Apply cluster enrichment
- API: Apply Cluster Enrichment
- Method: POST
- Path:
/v1/clusters/enrich
- Reference: API Reference
Writes cluster_id membership (and optional labels) back to documents.
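A hypothetical request body for the enrichment call; every field name here is an assumption chosen to mirror the description above, not the verified schema.

```python
# Illustrative body for POST /v1/clusters/enrich -- field names are assumptions.
payload = {
    "cluster_id": "clu_abc",
    "run_id": "run_123",       # which execution's assignments to apply
    "write_membership": True,  # write cluster_id onto each member document
    "write_labels": True,      # also copy LLM-generated labels, if present
}
```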
Manage clusters
Config building blocks
- Algorithms: kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics
- Reduction & Normalization
- LLM Labeling
- Hierarchy
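Putting the building blocks together, a configuration might look like the following sketch; the algorithm names come from the list above, but the keys and nesting are assumptions — see the API Reference for the exact schema.

```python
# Illustrative clustering config combining the building blocks above.
# Keys and nesting are assumptions, not the verified schema.
config = {
    "algorithm": "hdbscan",
    "parameters": {"min_cluster_size": 25},
    "reduction": {"method": "umap", "n_components": 2},  # or "tsne"
    "normalize": True,
    "llm_labeling": {"enabled": True,
                     "generate": ["label", "summary", "keywords"]},
    "hierarchy": {"enabled": True, "max_levels": 2},
}
```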
Artifacts (Parquet)
Centroids dataset
Columns include: cluster_id, centroid_vector, num_members, variance, label, summary, keywords, feature_name, feature_dimensions, parent_cluster_id, hierarchy_level, reduction_method, parameters, algorithm, run_id, timestamps

Members dataset
Partitioned by cluster_id; includes point_id, reduced coordinates (x, y, optional z), and an optional payload slice for filtering

Best practices
1. Start with samples: Use sample_size for quick exploration before full runs
2. Label pragmatically: Enable LLM labeling after validating cluster separation
3. Persist visuals: Save artifacts and stream reduced coordinates for UI at scale
4. Close the loop: Convert stable clusters into taxonomies to enrich downstream search
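For the "persist visuals" practice, streamed member rows can be shaped client-side into one trace per cluster for a WebGL scatter layer. The row shape below is an assumption based on the members dataset columns (point_id, cluster_id, reduced x/y):

```python
from collections import defaultdict

# Rows as they might arrive from the stream endpoint (shape is an assumption).
rows = [
    {"point_id": "p1", "cluster_id": 0, "x": 0.10, "y": 0.20},
    {"point_id": "p2", "cluster_id": 0, "x": 0.15, "y": 0.25},
    {"point_id": "p3", "cluster_id": 1, "x": 0.90, "y": 0.80},
]

# Group into one {xs, ys, ids} trace per cluster, ready for plotting.
traces = defaultdict(lambda: {"xs": [], "ys": [], "ids": []})
for r in rows:
    t = traces[r["cluster_id"]]
    t["xs"].append(r["x"])
    t["ys"].append(r["y"])
    t["ids"].append(r["point_id"])
```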