Workflow Overview
- Define cluster (`POST /v1/clusters`) – choose source collections, feature URIs, algorithm, and optional labeling strategy.
- Execute – run manually (`POST /v1/clusters/{id}/execute`) or schedule via triggers (`/v1/clusters/triggers/...`).
- Inspect artifacts – fetch centroids, members, or reduced coordinates (`/v1/clusters/{id}/artifacts`).
- Enrich documents – write `cluster_id`, labels, and keywords back into collections.
- Promote to taxonomy (optional) – convert stable clusters into reference nodes.
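The define-and-execute steps above can be sketched as follows. The endpoint paths and setting names come from this document; the overall request-body shape, the `name` field, and the response fields shown in comments are assumptions for illustration only.

```python
# Sketch of the define -> execute workflow. Paths follow this document;
# body schema and response fields are assumed, not confirmed.

def define_cluster_body(name, feature_uris, algorithm="hdbscan"):
    """Build a body for POST /v1/clusters.

    Setting names follow the Configuration Highlights table;
    the "name" field and value shapes are hypothetical.
    """
    return {
        "name": name,                       # hypothetical field
        "feature_addresses": feature_uris,  # feature URIs to cluster on
        "algorithm": algorithm,             # one of the supported algorithms
        "llm_labeling": {"enabled": True},  # assumed value shape
    }

body = define_cluster_body("support-tickets", ["feat://tickets/dense"])
# For a real run (schema of the response is assumed):
# import requests
# cluster = requests.post(f"{BASE}/v1/clusters", json=body).json()
# requests.post(f"{BASE}/v1/clusters/{cluster['id']}/execute")
```

The commented-out calls show where the actual HTTP requests would go once a base URL and credentials are configured.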
Configuration Highlights
| Setting | Description |
|---|---|
| feature_addresses | One or more feature URIs to cluster on (dense, sparse, multi-vector). |
| algorithm | kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, or optics. |
| dimension_reduction | Optional UMAP / PCA for visualization coordinates. |
| llm_labeling | Generate cluster labels, summaries, and keywords using configured LLM providers. |
| hierarchical | Enable to compute parent-child cluster relationships. |
| sample_size | Run on a subset before clustering the full dataset. |
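Combined, the settings in the table might form a definition like this sketch. The keys match the table; the nested value shapes are assumptions, since the document does not specify them.

```python
# Illustrative cluster definition combining the documented settings.
# Top-level keys come from the table above; nested shapes are assumed.
config = {
    "feature_addresses": ["feat://docs/embedding"],  # example dense-feature URI
    "algorithm": "kmeans",
    "dimension_reduction": {"method": "umap", "n_components": 2},  # assumed shape
    "llm_labeling": {"enabled": True},               # assumed shape
    "hierarchical": False,                           # no parent-child computation
    "sample_size": 5000,                             # prototype on a subset first
}
```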
Execution & Triggers
- Manual run: `POST /v1/clusters/{id}/execute`
- Submit asynchronous job: `POST /v1/clusters/{id}/execute/submit`
- Automated triggers: create cron, interval, or event-based triggers under `/v1/clusters/triggers`. Execution history is accessible via trigger endpoints.
- Every run yields a `run_id`, exposes status via `/v1/clusters/{id}/executions`, and can be monitored through task polling.
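A submit-then-poll loop over these endpoints might look like the sketch below. The paths are taken from this section; the status field names and terminal states in the comments are assumptions.

```python
# Sketch of async execution with polling. Paths come from this section;
# response field names ("run_id", "state", ...) are assumed.

def execution_paths(cluster_id):
    """Return the execution-related paths for a cluster definition."""
    return {
        "execute": f"/v1/clusters/{cluster_id}/execute",         # manual run
        "submit": f"/v1/clusters/{cluster_id}/execute/submit",   # async job
        "history": f"/v1/clusters/{cluster_id}/executions",      # run history
    }

paths = execution_paths("abc123")
# run = requests.post(BASE + paths["submit"]).json()   # yields a run_id
# while True:                                          # task polling
#     latest = requests.get(BASE + paths["history"]).json()
#     if latest["runs"][0]["state"] in ("succeeded", "failed"):  # assumed
#         break
#     time.sleep(5)
```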
Artifacts
| Artifact | Endpoint | Contents |
|---|---|---|
| Centroids | /executions/{run_id}/artifacts?include_centroids=true | Cluster ID, centroid vectors, counts, labels, summaries, keywords |
| Members | /executions/{run_id}/artifacts?include_members=true | Point IDs, reduced coordinates (x, y, z), cluster assignment |
| Streaming data | /executions/{run_id}/data | Stream centroids and members (Parquet-backed) for visualization |
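A small helper can assemble the artifact URLs from the table. The `include_centroids` / `include_members` flags come from the table itself; prefixing the path with `/v1/clusters/{id}` is an assumption inferred from the rest of this document.

```python
# Build artifact URLs per the table above. The /v1/clusters/{id} prefix
# is assumed; the query flags are documented.

def artifact_url(cluster_id, run_id, centroids=False, members=False):
    url = f"/v1/clusters/{cluster_id}/executions/{run_id}/artifacts"
    flags = []
    if centroids:
        flags.append("include_centroids=true")
    if members:
        flags.append("include_members=true")
    return url + (("?" + "&".join(flags)) if flags else "")
```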
Enrichment
Apply cluster membership back to collections: enrichment writes `cluster_id` (and optionally labels/summaries) into document payloads, enabling cluster-based filters and facets.
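Once `cluster_id` is written into payloads, selecting all documents in a cluster becomes an ordinary payload filter. The filter grammar below is a generic sketch and is not specified by this document.

```python
# Hypothetical payload filter selecting enriched documents from one
# cluster; the filter syntax is an assumption, only the cluster_id
# field name comes from this document.
cluster_filter = {
    "filter": {
        "must": [{"key": "cluster_id", "match": {"value": 7}}]
    }
}
```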
Monitoring & Management
- `GET /v1/clusters/{id}` – inspect definition, latest run, enrichment status.
- `POST /v1/clusters/list` – search and filter cluster definitions.
- `GET /v1/clusters/{id}/executions` – view execution history and metrics.
- `DELETE /v1/clusters/{id}` – remove obsolete definitions (artifacts remain unless deleted separately).
- Webhooks notify you when clustering jobs complete; integrate with alerting or automation.
Best Practices
- Prototype on samples – tune algorithm parameters using a small `sample_size` before running at scale.
- Automate freshness – use triggers (cron or event-based) to keep clusters aligned with new data.
- Label efficiently – enable LLM labeling once clusters look coherent; store labels with confidence scores.
- Close the loop – evaluate clusters as candidate taxonomy nodes or enrichments for retrievers.
- Watch metrics – use execution statistics (duration, member counts) to detect drift or parameter issues.

