Create a new cluster configuration and output collection.
This endpoint does not run the clustering itself. The cluster can then be executed via POST /v1/clusters//execute.
REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.
REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'
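The two required values above can be sketched as a small header builder. This is a minimal sketch, not the official client: the `Authorization` format comes from the description above, while the `X-Namespace` header name and the `build_headers` helper are assumptions, since this reference does not name the header that carries the namespace.

```python
def build_headers(api_key: str, namespace: str) -> dict:
    """Return HTTP headers for a namespaced request (illustrative helper)."""
    if not api_key:
        raise ValueError("an API key is required")
    if not namespace:
        raise ValueError("a namespace name or ID is required")
    return {
        # Format documented above: 'Bearer sk_xxxxxxxxxxxxx'
        "Authorization": f"Bearer {api_key}",
        # Assumption: header name is hypothetical; accepts a name or an ns_ ID.
        "X-Namespace": namespace,
        "Content-Type": "application/json",
    }


headers = build_headers("sk_xxxxxxxxxxxxx", "my-namespace")
```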
Create a clustering job for one or more collections.
Collections to cluster together
Optional human-friendly name for the clustering job
Vector or attribute clustering
vector, attribute
Required when cluster_type is 'vector'
{
  "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
  "clustering_method": "hdbscan",
  "description": "HDBSCAN clustering with multimodal embeddings",
  "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
  "sample_size": 1000
}
Required when cluster_type is 'attribute'
{
  "attributes": ["category"],
  "description": "Simple category clustering",
  "hierarchical_grouping": false
}
Optional filters to pre-filter documents before clustering (same format as list documents). Applied during Qdrant scroll before parquet export. Useful for clustering subsets like: status='active', category='electronics', etc.
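The conditional requirements above (one configuration object per cluster_type) can be checked before sending a request. The following is a rough sketch under stated assumptions: only `cluster_type` and its two values come from this reference, while the body keys `collection_ids`, `vector_config`, `attribute_config`, and `filters`, and the helper `build_cluster_request`, are hypothetical names chosen for illustration.

```python
from typing import Any, Optional


def build_cluster_request(
    collection_ids: list,
    cluster_type: str,
    vector_config: Optional[dict] = None,
    attribute_config: Optional[dict] = None,
    filters: Optional[dict] = None,
) -> dict:
    """Assemble a request body, enforcing the per-type requirements."""
    if not collection_ids:
        raise ValueError("at least one collection is required")
    if cluster_type not in ("vector", "attribute"):
        raise ValueError("cluster_type must be 'vector' or 'attribute'")

    # Pick the configuration that matches the requested cluster type.
    config = vector_config if cluster_type == "vector" else attribute_config
    if config is None:
        raise ValueError(
            f"a {cluster_type} configuration is required "
            f"when cluster_type is '{cluster_type}'"
        )

    body: dict = {
        "collection_ids": list(collection_ids),  # hypothetical key name
        "cluster_type": cluster_type,
        f"{cluster_type}_config": config,        # hypothetical key name
    }
    if filters is not None:
        body["filters"] = filters                # hypothetical key name
    return body


body = build_cluster_request(
    ["col_example"],  # placeholder collection ID
    "attribute",
    attribute_config={"attributes": ["category"], "hierarchical_grouping": False},
)
```

Validating on the client side only saves a round trip; the service remains the authority on which fields it accepts.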
Optional configuration for LLM-based cluster labeling. When provided with enabled=True, clusters receive semantic labels generated by an LLM instead of generic labels like 'Cluster 0'. When not provided, or when enabled=False, fallback labels are used.
{
  "description": "Text-only labeling with multiple fields",
  "enabled": true,
  "include_keywords": true,
  "include_summary": true,
  "labeling_inputs": {
    "input_mappings": [
      {
        "input_key": "title",
        "path": "title",
        "source_type": "payload"
      },
      {
        "input_key": "description",
        "path": "description",
        "source_type": "payload"
      },
      {
        "input_key": "text",
        "path": "text",
        "source_type": "payload"
      }
    ]
  },
  "model_name": "gpt-4o-mini-2024-07-18",
  "provider": "openai"
}
If True, cluster results are written back to source collection(s) in-place instead of creating new output collections. Documents will be enriched with cluster_id, cluster_label, distance_to_centroid, and optionally other metadata. Similar to taxonomy enrichment pattern.
Configuration for source collection enrichment (only used if enrich_source_collection=True). Controls which fields are added to source documents and field naming conventions.
{
  "field_mappings": [
    {
      "source_field": "cluster_id",
      "target_field": "category_id"
    },
    {
      "source_field": "cluster_label",
      "target_field": "category_name"
    },
    {
      "source_field": "distance_to_centroid",
      "target_field": "category_confidence"
    }
  ]
}
Successful Response
Cluster job metadata stored in MongoDB clusters collection.
This is separate from cluster documents themselves. Tracks job-level configuration, status, and summary statistics.
Supports both vector and attribute clustering with appropriate metadata.
Human-readable cluster name
Namespace this cluster belongs to
Organization ID (internal_id)
Source collection IDs that were clustered
Type of clustering: vector (embedding-based) or attribute (metadata-based)
vector, attribute
Unique cluster job identifier
Source bucket IDs that the input collections originated from. Enables bucket lineage tracking.
Optional filters that were applied to pre-filter documents before clustering
Feature URIs that were clustered (mixpeek://{extractor}@{version}/{output}). Only for vector clustering.
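The URI pattern above, mixpeek://{extractor}@{version}/{output}, can be split into its three parts with a regular expression. This is a minimal sketch: the `FeatureURI` type and `parse_feature_uri` helper are hypothetical, and the character rules in the pattern are assumptions, since this reference does not specify which characters each part may contain.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    output: str


# Assumption: extractor has no '@' or '/', version has no '/'.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor, version, and output."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri(
    "mixpeek://multimodal_extractor@v1/multimodal_embedding"
)
# FeatureURI(extractor='multimodal_extractor', version='v1',
#            output='multimodal_embedding')
```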
Strategy used if multiple features (concatenate/independent/weighted). Only for vector clustering.
Automatically learned feature weights (when multi_feature_strategy='weighted'). Keys are feature URIs, values are learned weights. Only populated after clustering execution completes.
Clustering quality score from weight learning (e.g., silhouette score). Only populated when multi_feature_strategy='weighted' and weights were learned.
Method for calculating cluster centroids (mean/median/medoid). Only for vector clustering.
Attribute field names that were clustered. Only for attribute clustering.
Whether hierarchical clustering was used. Only for attribute clustering.
Method for aggregating attributes (most_frequent/first/last). Only for attribute clustering.
Collection IDs where cluster documents are stored. For single output: list with one collection ID. For per-feature output: list with one collection ID per feature.
Names of output collections. Corresponds to output_collection_ids.
Clustering algorithm used (hdbscan, kmeans, attribute_based, etc.)
Algorithm-specific parameters (not used for attribute_based)
Whether source documents were enriched with cluster_id
Configuration for source enrichment (if enrich_source=True)
{
  "field_mappings": [
    {
      "source_field": "cluster_id",
      "target_field": "category_id"
    },
    {
      "source_field": "cluster_label",
      "target_field": "category_name"
    },
    {
      "source_field": "distance_to_centroid",
      "target_field": "category_confidence"
    }
  ]
}
Configuration for LLM-based cluster labeling (applies to all cluster types)
{
  "description": "Text-only labeling with multiple fields",
  "enabled": true,
  "include_keywords": true,
  "include_summary": true,
  "labeling_inputs": {
    "input_mappings": [
      {
        "input_key": "title",
        "path": "title",
        "source_type": "payload"
      },
      {
        "input_key": "description",
        "path": "description",
        "source_type": "payload"
      },
      {
        "input_key": "text",
        "path": "text",
        "source_type": "payload"
      }
    ]
  },
  "model_name": "gpt-4o-mini-2024-07-18",
  "provider": "openai"
}
Number of clusters found (excludes noise/outliers, populated after execution)
Total documents processed
Time taken to complete clustering
Whether implicit hierarchy was detected (multi-feature independent) or created (hierarchical attributes)
For child clusters in hierarchy
For parent clusters
Parent-child relationships detected from cluster membership overlap
Cluster job status (propagated from TaskService)
PENDING, IN_PROGRESS, PROCESSING, COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED, UNKNOWN, SKIPPED, DRAFT, ACTIVE, ARCHIVED, SUSPENDED
Most recent task ID for this cluster
When cluster was created
When cluster was last updated
Last execution timestamp
When clustering completed successfully
List of errors encountered during LLM labeling (if any). Stored in MongoDB cluster metadata only, NOT in Qdrant cluster documents. Used to track LLM failures while allowing fallback labels to work.
Additional user-defined metadata
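The field_mappings entries in the source enrichment configuration can be read as a rename table from cluster output fields to fields on the source document. The sketch below illustrates that reading only; it is not the service's implementation. The helper `apply_field_mappings` and the sample values are hypothetical, and it assumes that fields without a mapping keep their original name, which this reference does not confirm.

```python
def apply_field_mappings(cluster_fields: dict, field_mappings: list) -> dict:
    """Rename cluster output fields according to source/target pairs."""
    renames = {m["source_field"]: m["target_field"] for m in field_mappings}
    # Assumption: unmapped fields are kept under their original name.
    return {renames.get(name, name): value for name, value in cluster_fields.items()}


mappings = [
    {"source_field": "cluster_id", "target_field": "category_id"},
    {"source_field": "cluster_label", "target_field": "category_name"},
    {"source_field": "distance_to_centroid", "target_field": "category_confidence"},
]

# Illustrative values only; real IDs and labels come from the clustering run.
enriched = apply_field_mappings(
    {"cluster_id": "cl_1", "cluster_label": "Shoes", "distance_to_centroid": 0.12},
    mappings,
)
```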