Clusters group similar documents using configurable algorithms that run on the Engine's Ray workers. Each run produces reusable artifacts, can write enrichments back into collections, and can even seed new taxonomies.

Workflow Overview

  1. Define cluster (POST /v1/clusters) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
  2. Execute – run manually (POST /v1/clusters/{id}/execute) or schedule via triggers (/v1/clusters/triggers/...).
  3. Inspect artifacts – fetch centroids, members, or reduced coordinates (/v1/clusters/{id}/artifacts).
  4. Enrich documents – write cluster_id, labels, and keywords back into collections.
  5. Promote to taxonomy (optional) – convert stable clusters into reference nodes.

Configuration Highlights

  • feature_addresses – One or more feature URIs to cluster on (dense, sparse, multi-vector).
  • algorithm – kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics.
  • dimension_reduction – Optional UMAP / PCA for visualization coordinates.
  • llm_labeling – Generate cluster labels, summaries, and keywords using configured LLM providers.
  • hierarchical – Enable to compute parent-child cluster relationships.
  • sample_size – Run on a subset before clustering the full dataset.
Example definition:
curl -sS -X POST "$MP_API_URL/v1/clusters" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_name": "product_topics",
    "source_collection_ids": ["col_products"],
    "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
    "algorithm": "kmeans",
    "algorithm_config": { "num_clusters": 50 },
    "dimension_reduction": {
      "method": "umap",
      "components": 2
    },
    "llm_labeling": {
      "provider": "openai_chat_v1",
      "model": "gpt-4o-mini"
    }
  }'
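If the create call returns the new definition as JSON, you can capture its ID for the execute and enrich calls that follow. A minimal sketch — the `cluster_id` field name and the sample response below are assumptions for illustration, not a documented response shape:

```shell
# Hypothetical create response; a real call would be:
#   RESPONSE=$(curl -sS -X POST "$MP_API_URL/v1/clusters" ... )
RESPONSE='{"cluster_id": "clu_abc123", "cluster_name": "product_topics", "status": "created"}'

# Extract the ID with jq for reuse in later calls.
CLUSTER_ID=$(echo "$RESPONSE" | jq -r '.cluster_id')
echo "$CLUSTER_ID"
```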

Execution & Triggers

  • Manual run: POST /v1/clusters/{id}/execute
  • Submit asynchronous job: POST /v1/clusters/{id}/execute/submit
  • Automated triggers: create cron, interval, or event-based triggers under /v1/clusters/triggers. Execution history is accessible via trigger endpoints.
  • Every run yields a run_id, exposes status via /v1/clusters/{id}/executions, and can be monitored through task polling.
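The task-polling pattern above can be sketched as a plain shell loop. This version uses a mock status function so it runs offline; in a real script you would replace `mock_status` with a curl to `/v1/clusters/{id}/executions` plus a jq extraction of the latest run's status (the `running`/`completed` values here are assumptions for illustration):

```shell
# Mock the status endpoint: reports "running" twice, then "completed".
i=0
mock_status() {
  i=$((i + 1))
  if [ "$i" -ge 3 ]; then status="completed"; else status="running"; fi
}

# Poll until the run finishes.
status="running"
while [ "$status" != "completed" ]; do
  mock_status
  # sleep 5   # back off between real API calls
done
echo "final status: $status"
```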

Artifacts

  • Centroids – /executions/{run_id}/artifacts?include_centroids=true – cluster ID, centroid vectors, counts, labels, summaries, keywords
  • Members – /executions/{run_id}/artifacts?include_members=true – point IDs, reduced coordinates (x, y, z), cluster assignment
  • Streaming data – /executions/{run_id}/data – stream centroids and members (Parquet-backed) for visualization
Artifacts are stored as Parquet in S3 for efficient downstream analytics and visualization.
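Once fetched, centroid artifacts are ordinary JSON you can slice with jq. A sketch over a hypothetical payload — the `centroids`, `count`, and `label` field names are assumptions for illustration:

```shell
# Hypothetical centroids response; a real call would be:
#   curl -sS "$MP_API_URL/v1/clusters/$CLUSTER_ID/executions/$RUN_ID/artifacts?include_centroids=true" ...
ARTIFACTS='{
  "centroids": [
    {"cluster_id": 0, "count": 120, "label": "Footwear"},
    {"cluster_id": 1, "count": 80,  "label": "Outerwear"}
  ]
}'

# Print each cluster with its member count and label.
echo "$ARTIFACTS" | jq -r '.centroids[] | "\(.cluster_id)\t\(.count)\t\(.label)"'
```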

Enrichment

Apply cluster membership back to collections:
curl -sS -X POST "$MP_API_URL/v1/clusters/{cluster_id}/enrich" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "run_id": "run_xyz789",
    "target_collection_id": "col_products_enriched",
    "fields": ["cluster_id", "label", "summary", "keywords"]
  }'
Enrichment writes cluster_id (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.
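Once `cluster_id` lives in document payloads, cluster-based filtering is a straightforward equality match. A local sketch with jq over hypothetical enriched documents (the document shape below is an assumption for illustration):

```shell
# Hypothetical enriched documents after an enrich run.
DOCS='[
  {"document_id": "doc_1", "cluster_id": "clu_5", "label": "Footwear"},
  {"document_id": "doc_2", "cluster_id": "clu_9", "label": "Outerwear"},
  {"document_id": "doc_3", "cluster_id": "clu_5", "label": "Footwear"}
]'

# Facet-style filter: keep only documents assigned to cluster clu_5.
echo "$DOCS" | jq '[.[] | select(.cluster_id == "clu_5")] | length'
```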

Monitoring & Management

  • GET /v1/clusters/{id} – inspect definition, latest run, enrichment status.
  • POST /v1/clusters/list – search and filter cluster definitions.
  • GET /v1/clusters/{id}/executions – view execution history and metrics.
  • DELETE /v1/clusters/{id} – remove obsolete definitions (artifacts remain unless deleted separately).
  • Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
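Execution history can feed simple health checks. A sketch that flags the slowest run from a hypothetical `/executions` response — the `duration_seconds` and `member_count` field names are assumptions for illustration:

```shell
# Hypothetical execution history; a real call would be:
#   curl -sS "$MP_API_URL/v1/clusters/$CLUSTER_ID/executions" ...
EXECUTIONS='[
  {"run_id": "run_1", "duration_seconds": 42, "member_count": 10000},
  {"run_id": "run_2", "duration_seconds": 95, "member_count": 10500}
]'

# Surface the longest run as a cheap drift/parameter signal.
echo "$EXECUTIONS" | jq -r 'max_by(.duration_seconds) | .run_id'
```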

Best Practices

  1. Prototype on samples – tune algorithm parameters using a small sample_size before running at scale.
  2. Automate freshness – use triggers (cron or event-based) to keep clusters aligned with new data.
  3. Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
  4. Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
  5. Watch metrics – use execution statistics (duration, member counts) to detect drift or parameter issues.
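Practice 1 in action: a prototype definition that clusters only a sample. Placing sample_size at the top level of the request body follows the settings list above, but the exact placement is an assumption:

```shell
# Same definition as production, restricted to a 5,000-document sample.
BODY='{
  "cluster_name": "product_topics_proto",
  "source_collection_ids": ["col_products"],
  "feature_addresses": ["mixpeek://text_extractor@v1/text_embedding"],
  "algorithm": "kmeans",
  "algorithm_config": { "num_clusters": 50 },
  "sample_size": 5000
}'

# Validate the JSON and check the sample size before POSTing:
#   curl -sS -X POST "$MP_API_URL/v1/clusters" ... -d "$BODY"
echo "$BODY" | jq '.sample_size'
```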
Clustering is the bridge between raw embeddings and structured understanding. Use it to discover themes, power analytics, and bootstrap taxonomies that feed retrieval.