Clustering groups similar documents to power discovery, analytics visuals, and taxonomy bootstrapping. Mixpeek runs clustering as an engine pipeline, stores artifacts as Parquet for scale, and exposes APIs to create cluster definitions, execute runs, list and stream results, and apply enrichments.

Overview

  • Structure Discovery: find groups across one or more collections, with optional hierarchical metadata
  • Artifacts & Scale: results saved as Parquet (centroids, members) for WebGL/Arrow pipelines
  • LLM Labeling: optionally generate cluster names, summaries, and keywords
  • Enrichment: write cluster_id membership back to documents or create derived collections

How it works

  1. Preprocess: optional normalization and dimensionality reduction (UMAP, t-SNE, etc.)
  2. Cluster: algorithms including KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Mean Shift, and OPTICS
  3. Postprocess: compute centroids and stats, with optional LLM labeling and hierarchical metadata
  4. Persist: Parquet artifacts saved per run_id; list and stream via API
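
For intuition, here is a minimal offline sketch of the same preprocess, cluster, and postprocess steps using scikit-learn, umap-learn, and hdbscan. It is illustrative only, not the engine's implementation; the toy data and library choices are assumptions.

import numpy as np
from sklearn.preprocessing import normalize
import umap      # pip install umap-learn
import hdbscan   # pip install hdbscan

# Toy stand-in for extracted feature vectors (e.g., CLIP embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768)).astype(np.float32)

# 1. Preprocess: L2-normalize, then reduce to 2D for visualization.
X = normalize(X)
coords = umap.UMAP(n_components=2, random_state=0).fit_transform(X)

# 2. Cluster: HDBSCAN with the same parameters as the create example below.
labels = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5).fit_predict(coords)

# 3. Postprocess: per-cluster sizes and centroids (-1 marks HDBSCAN noise).
for cid in sorted(set(labels) - {-1}):
    members = coords[labels == cid]
    print(cid, len(members), members.mean(axis=0))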

Create a cluster definition

  • API: Create Cluster
  • Method: POST
  • Path: /v1/clusters
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_ids": ["col_products_v1"],
    "cluster_name": "products_clip_hdbscan",
    "cluster_type": "vector",
    "vector_config": {
      "feature_extractor_name": "clip_vit_l_14",
      "clustering_method": "hdbscan",
      "hdbscan_parameters": {"min_cluster_size": 10, "min_samples": 5}
    },
    "llm_labeling": {"enabled": true, "model_name": "gpt-4"}
  }'
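
The same call from Python, for scripted setups; a sketch assuming the requests library, where the cluster_id response field is an assumption based on the identifiers used elsewhere on this page.

import os
import requests

# Create the cluster definition (same payload as the curl example above).
resp = requests.post(
    "https://api.mixpeek.com/v1/clusters",
    headers={
        "Authorization": f"Bearer {os.environ['API_KEY']}",
        "X-Namespace": "ns_123",
    },
    json={
        "collection_ids": ["col_products_v1"],
        "cluster_name": "products_clip_hdbscan",
        "cluster_type": "vector",
        "vector_config": {
            "feature_extractor_name": "clip_vit_l_14",
            "clustering_method": "hdbscan",
            "hdbscan_parameters": {"min_cluster_size": 10, "min_samples": 5},
        },
        "llm_labeling": {"enabled": True, "model_name": "gpt-4"},
    },
)
resp.raise_for_status()
# Field name assumed; check the API Reference for the exact response schema.
print(resp.json().get("cluster_id"))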

Execute clustering

  • API: Execute Clustering
  • Method: POST
  • Path: /v1/clusters/execute
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_ids": ["col_products_v1"],
    "config": {
      "algorithm": "kmeans",
      "algorithm_params": {"n_clusters": 8, "max_iter": 300},
      "feature_vector": {"feature_address": {"extractor": "clip_vit_l_14", "version": "1.0.0"}},
      "normalize_features": true,
      "dimensionality_reduction": {"method": "umap", "n_components": 2}
    },
    "sample_size": 10000,
    "store_results": true,
    "include_members": false,
    "compute_metrics": true,
    "save_artifacts": true
  }'
The response includes run_id, metrics, and centroid summaries. Artifacts are written under a per-run S3 prefix.
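
To script a run and capture its identifier, the execute call can be issued the same way; a sketch assuming requests, with run_id and metrics read from the response as described above (the exact shape may differ).

import os
import requests

headers = {
    "Authorization": f"Bearer {os.environ['API_KEY']}",
    "X-Namespace": "ns_123",
}

payload = {  # same body as the curl example above
    "collection_ids": ["col_products_v1"],
    "config": {
        "algorithm": "kmeans",
        "algorithm_params": {"n_clusters": 8, "max_iter": 300},
        "feature_vector": {"feature_address": {"extractor": "clip_vit_l_14", "version": "1.0.0"}},
        "normalize_features": True,
        "dimensionality_reduction": {"method": "umap", "n_components": 2},
    },
    "sample_size": 10000,
    "store_results": True,
    "include_members": False,
    "compute_metrics": True,
    "save_artifacts": True,
}

resp = requests.post("https://api.mixpeek.com/v1/clusters/execute", headers=headers, json=payload)
resp.raise_for_status()
run = resp.json()
# Field names follow the description above; adjust to the actual schema.
print(run.get("run_id"), run.get("metrics"))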

Stream artifacts for UI

  • API: Stream Cluster Data
  • Method: POST
  • Path: /v1/clusters/{cluster_id}/data
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/cl_123/data \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_id": "cl_123",
    "include_centroids": true,
    "include_members": true,
    "limit": 1000,
    "offset": 0
  }'
Use this endpoint to load centroids and members for visualizations (2D/3D reduced coordinates, partitioned by cluster_id).
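
A sketch of paging through members with limit/offset and collecting them for plotting; it assumes requests and pandas, and the members key and per-member columns (cluster_id, reduced coordinates) are assumptions about the response schema.

import os
import requests
import pandas as pd

headers = {
    "Authorization": f"Bearer {os.environ['API_KEY']}",
    "X-Namespace": "ns_123",
}
url = "https://api.mixpeek.com/v1/clusters/cl_123/data"

rows, offset, limit = [], 0, 1000
while True:
    resp = requests.post(url, headers=headers, json={
        "cluster_id": "cl_123",
        "include_centroids": offset == 0,  # fetch centroids once, on the first page
        "include_members": True,
        "limit": limit,
        "offset": offset,
    })
    resp.raise_for_status()
    members = resp.json().get("members", [])  # key name assumed
    rows.extend(members)
    if len(members) < limit:
        break
    offset += limit

df = pd.DataFrame(rows)
# Column name assumed; adapt to what your runs actually emit.
print(df.groupby("cluster_id").size())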

Apply cluster enrichment

  • API: Apply Cluster Enrichment
  • Method: POST
  • Path: /v1/clusters/enrich
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "clustering_ids": ["cl_run_abc"],
    "source_collection_id": "col_products_v1",
    "target_collection_id": "col_products_enriched_v1",
    "batch_size": 1000,
    "parallelism": 4
  }'
This writes cluster_id membership (and optional LLM labels) back to documents in the target collection.
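
Enrichment can also be triggered from Python as the last step of a scripted run; a sketch assuming requests, reusing the run identifier returned by the execute call.

import os
import requests

resp = requests.post(
    "https://api.mixpeek.com/v1/clusters/enrich",
    headers={
        "Authorization": f"Bearer {os.environ['API_KEY']}",
        "X-Namespace": "ns_123",
    },
    json={
        "clustering_ids": ["cl_run_abc"],  # run identifier from the execute response
        "source_collection_id": "col_products_v1",
        "target_collection_id": "col_products_enriched_v1",
        "batch_size": 1000,
        "parallelism": 4,
    },
)
resp.raise_for_status()
print(resp.json())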

Manage clusters

Cluster definitions can also be listed via the clusters API; see the API Reference for the full set of management endpoints.

Config building blocks

Supported algorithms: kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics. Select one via clustering_method in vector_config (create) or algorithm in config (execute), with a matching parameter object such as hdbscan_parameters or algorithm_params.

Artifacts (Parquet)

Each run writes Parquet artifacts (centroids and members) under a per-run S3 prefix keyed by run_id; list and stream them via the API.
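
If you read artifacts directly from S3 instead of streaming them through the API, they open like any other Parquet; a sketch assuming pyarrow, where the bucket path, file names, and column layout are assumptions about the artifact schema.

import pyarrow.parquet as pq

# Paths are assumptions: substitute the per-run prefix from your execute
# response and the file names your runs actually produce.
centroids = pq.read_table("s3://YOUR_BUCKET/clusters/RUN_ID/centroids.parquet")
members = pq.read_table("s3://YOUR_BUCKET/clusters/RUN_ID/members.parquet")

print(centroids.num_rows, "centroids")
print(members.schema)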

Best practices

  1. Start with samples: use sample_size for quick exploration before full runs
  2. Label pragmatically: enable LLM labeling after validating cluster separation
  3. Persist visuals: save artifacts and stream reduced coordinates for UI at scale
  4. Close the loop: convert stable clusters into taxonomies to enrich downstream search
