curl --request PATCH \
--url https://api.mixpeek.com/v1/clusters/{cluster_identifier} \
--header 'Authorization: <authorization>' \
--header 'Content-Type: application/json' \
--header 'X-Namespace: <x-namespace>' \
--data '
{
"cluster_name": "<string>",
"description": "<string>",
"metadata": {}
}
'{
"collection_ids": [
"<string>"
],
"cluster_name": "<string>",
"cluster_type": "vector",
"vector_config": {
"algorithm_params": {
"min_cluster_size": 10,
"min_samples": 5
},
"clustering_method": "hdbscan",
"description": "HDBSCAN clustering with multimodal embeddings",
"feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
"sample_size": 1000
},
"attribute_config": {
"attributes": [
"category"
],
"description": "Simple category clustering",
"hierarchical_grouping": false
},
"filters": {
"AND": [
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
],
"OR": [
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
],
"NOT": [
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
],
"case_sensitive": true
},
"llm_labeling": {
"description": "Text-only labeling with multiple fields",
"enabled": true,
"include_keywords": true,
"include_summary": true,
"labeling_inputs": {
"input_mappings": [
{
"input_key": "title",
"path": "title",
"source_type": "payload"
},
{
"input_key": "description",
"path": "description",
"source_type": "payload"
},
{
"input_key": "text",
"path": "text",
"source_type": "payload"
}
]
},
"model_name": "gpt-4o-mini-2024-07-18",
"provider": "openai"
},
"enrich_source_collection": false,
"source_enrichment_config": {
"field_mappings": [
{
"source_field": "cluster_id",
"target_field": "category_id"
},
{
"source_field": "cluster_label",
"target_field": "category_name"
},
{
"source_field": "distance_to_centroid",
"target_field": "category_confidence"
}
]
},
"cluster_id": "<string>",
"parquet_path": "<string>",
"members_key": "<string>",
"num_clusters": 123,
"cluster_stats": {
"num_clusters": 123,
"noise_points": 123,
"silhouette_score": 123,
"extra": {}
},
"status": "PENDING",
"task_id": "<string>",
"last_run_id": "<string>",
"created_at": "2023-11-07T05:31:56Z",
"updated_at": "2023-11-07T05:31:56Z",
"metadata": {}
}This endpoint partially updates a cluster (PATCH operation). Only provided fields will be updated. At minimum, metadata can always be updated. Immutable fields like cluster_id, status, and computed fields cannot be modified.
curl --request PATCH \
--url https://api.mixpeek.com/v1/clusters/{cluster_identifier} \
--header 'Authorization: <authorization>' \
--header 'Content-Type: application/json' \
--header 'X-Namespace: <x-namespace>' \
--data '
{
"cluster_name": "<string>",
"description": "<string>",
"metadata": {}
}
'{
"collection_ids": [
"<string>"
],
"cluster_name": "<string>",
"cluster_type": "vector",
"vector_config": {
"algorithm_params": {
"min_cluster_size": 10,
"min_samples": 5
},
"clustering_method": "hdbscan",
"description": "HDBSCAN clustering with multimodal embeddings",
"feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
"sample_size": 1000
},
"attribute_config": {
"attributes": [
"category"
],
"description": "Simple category clustering",
"hierarchical_grouping": false
},
"filters": {
"AND": [
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
],
"OR": [
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
],
"NOT": [
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
],
"case_sensitive": true
},
"llm_labeling": {
"description": "Text-only labeling with multiple fields",
"enabled": true,
"include_keywords": true,
"include_summary": true,
"labeling_inputs": {
"input_mappings": [
{
"input_key": "title",
"path": "title",
"source_type": "payload"
},
{
"input_key": "description",
"path": "description",
"source_type": "payload"
},
{
"input_key": "text",
"path": "text",
"source_type": "payload"
}
]
},
"model_name": "gpt-4o-mini-2024-07-18",
"provider": "openai"
},
"enrich_source_collection": false,
"source_enrichment_config": {
"field_mappings": [
{
"source_field": "cluster_id",
"target_field": "category_id"
},
{
"source_field": "cluster_label",
"target_field": "category_name"
},
{
"source_field": "distance_to_centroid",
"target_field": "category_confidence"
}
]
},
"cluster_id": "<string>",
"parquet_path": "<string>",
"members_key": "<string>",
"num_clusters": 123,
"cluster_stats": {
"num_clusters": 123,
"noise_points": 123,
"silhouette_score": 123,
"extra": {}
},
"status": "PENDING",
"task_id": "<string>",
"last_run_id": "<string>",
"created_at": "2023-11-07T05:31:56Z",
"updated_at": "2023-11-07T05:31:56Z",
"metadata": {}
}REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.
REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'
Cluster ID or name
Successful Response
Cluster metadata stored in MongoDB.
Collections to cluster together
1Optional human-friendly name for the clustering job
Vector or attribute clustering
vector, attribute Required when cluster_type is 'vector'
Show child attributes
Clustering algorithm to use
kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics, attribute_based DEPRECATED: Use feature_uris instead. Canonical feature URI for the vector embedding to cluster. Format: 'mixpeek://{extractor}@{version}/{output}'. For multi-feature clustering, use feature_uris (plural) instead.
"mixpeek://multimodal_extractor@v1/multimodal_embedding"
RECOMMENDED. List of feature URIs to cluster. Format: 'mixpeek://{extractor}@{version}/{output}'. For single-feature clustering, provide a list with one element. For multi-feature clustering, provide multiple feature URIs. Each feature must exist in all input collections.
1["mixpeek://text_extractor@v1/embedding"]Number of samples to use for clustering
Parameters for K-means clustering (deprecated, use algorithm_params)
Show child attributes
Number of clusters to form
2 <= x <= 1000Maximum number of iterations
1 <= x <= 10000Random seed for reproducibility
Number of times k-means will run with different centroid seeds
x >= 1Tolerance for convergence
Method for initialization ('k-means++' or 'random')
Verbosity mode
x >= 0If True, the original data is not modified
K-means algorithm to use ('lloyd', 'elkan', or 'auto')
Parameters for DBSCAN clustering (deprecated, use algorithm_params)
Show child attributes
Maximum distance between two samples for one to be considered in the neighborhood of the other
Number of samples in a neighborhood for a point to be considered a core point
x >= 1Metric to use for distance computation
Additional keyword arguments for the metric function
Algorithm to compute pointwise distances and find nearest neighbors ('auto', 'ball_tree', 'kd_tree', 'brute')
Leaf size passed to BallTree or KDTree
x >= 1The power of the Minkowski metric to be used to calculate distance between points
The number of parallel jobs to run (-1 means using all processors)
Parameters for HDBSCAN clustering (deprecated, use algorithm_params)
Show child attributes
Minimum number of samples in a cluster
x >= 2Number of samples in a neighborhood for a point to be considered a core point. Defaults to min_cluster_size if None
x >= 1A distance threshold for cluster selection. Clusters below this value will be merged
x >= 0Maximum number of samples in a cluster. Clusters above this size will be split
x >= 1Metric to use for distance computation
A distance scaling parameter
Method to select clusters from the condensed tree ('eom' or 'leaf')
Allow HDBSCAN to find only a single cluster
Whether to generate extra data for predicting cluster membership
Whether to match the reference implementation exactly
Algorithm-specific parameters
Show child attributes
Number of clusters to form
2 <= x <= 1000Maximum number of iterations
1 <= x <= 10000Random seed for reproducibility
Number of times k-means will run with different centroid seeds
x >= 1Tolerance for convergence
Method for initialization ('k-means++' or 'random')
Verbosity mode
x >= 0If True, the original data is not modified
K-means algorithm to use ('lloyd', 'elkan', or 'auto')
Strategy for handling multiple feature vectors:
concatenate, independent, weighted Apply L2 normalization to each feature block before concatenation. Prevents feature dominance when combining different modalities. Only applies when multi_feature_strategy='concatenate'.
Optional per-feature weights (0.0-1.0) for concatenation strategy. Keys are feature URIs, values are weights. Example: {'mixpeek://text@v1/emb': 0.7, 'mixpeek://image@v1/emb': 0.3}. Defaults to equal weights (1.0) if not specified. Only applies when multi_feature_strategy='concatenate'. If multi_feature_strategy='weighted' and this is None, weights are learned automatically using weight_learning_config.
Show child attributes
{
"mixpeek://image_extractor@v1/embedding": 0.3,
"mixpeek://text_extractor@v1/embedding": 0.7
}Configuration for automatic feature weight learning. Only used when multi_feature_strategy='weighted' and feature_weights is None. If feature_weights is provided, manual weights are used instead of learning. If this is None when learning is needed, default WeightLearningConfig is used.
Show child attributes
Weight learning method:
grid_search, bayesian Maximum optimization iterations:
5 <= x <= 100Clustering quality metric to optimize:
silhouette, davies_bouldin, calinski_harabasz Optional: Learn weights on a random sample (speeds up large datasets). If provided and dataset has more documents, weights are learned on sample_size random documents, then applied to full dataset. Recommended: 5000 for datasets >10k documents
x >= 1005000
Random seed for reproducibility of weight learning
{
"max_iterations": 20,
"method": "bayesian",
"metric": "silhouette",
"random_state": 42,
"sample_size": 5000
}Output collection creation strategy:
single, per_feature Method for calculating cluster centroids:
mean, median, medoid Whether to enrich source documents with cluster_id
{
"algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
"clustering_method": "hdbscan",
"description": "HDBSCAN clustering with multimodal embeddings",
"feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
"sample_size": 1000
}Required when cluster_type is 'attribute'
Show child attributes
List of attribute field names to use for clustering. Documents will be grouped by unique combinations of these attribute values. Supports dot-notation for nested fields (e.g., 'metadata.category'). Order matters for hierarchical grouping: first attribute is top-level, subsequent are nested.
1Whether to create hierarchical clusters based on attribute order. When True: Creates parent clusters for each unique value of the first attribute, then child clusters for subsequent attributes within each parent. When False: Creates flat clusters for each unique combination of all attributes. Example with ['category', 'brand']: hierarchical=True → 'Electronics' (parent) → 'Apple', 'Samsung' (children). hierarchical=False → 'Electronics_Apple', 'Electronics_Samsung' (flat).
Method for aggregating attribute values when creating cluster centroids. Options: 'most_frequent' (default), 'first', 'last'. Most use cases should use the default.
"most_frequent"
{
"attributes": ["category"],
"description": "Simple category clustering",
"hierarchical_grouping": false
}Optional filters to pre-filter documents before clustering (same format as list documents). Applied during Qdrant scroll before parquet export. Useful for clustering subsets like: status='active', category='electronics', etc.
Show child attributes
Logical AND operation - all conditions must be true
Represents a single filter condition.
Attributes: field: The field to filter on operator: The comparison operator value: The value to compare against
Show child attributes
Field name to filter on
Comparison operator
eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text [
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]Logical OR operation - at least one condition must be true
Represents a single filter condition.
Attributes: field: The field to filter on operator: The comparison operator value: The value to compare against
Show child attributes
Field name to filter on
Comparison operator
eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text [
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]Logical NOT operation - all conditions must be false
Represents a single filter condition.
Attributes: field: The field to filter on operator: The comparison operator value: The value to compare against
Show child attributes
Field name to filter on
Comparison operator
eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text [
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]Whether to perform case-sensitive matching
true
Optional configuration for LLM-based cluster labeling. When provided with enabled=True, clusters will have semantic labels generated by LLM instead of generic labels like 'Cluster 0'. When not provided or enabled=False, uses fallback labels.
Show child attributes
Whether to generate labels for clusters using LLM. When enabled, clusters will have semantic labels like 'High-Performance Laptops' instead of generic labels like 'Cluster 0'.
Input configuration for LLM labeling. Supports flexible input mappings for multimodal inputs (text, images, videos, audio). Use input_mappings for advanced multimodal labeling with providers like Gemini. If not provided (null/undefined), the full document payload will be serialized as JSON and sent to the LLM, providing complete context for semantic labeling.
Show child attributes
Flexible input mappings for constructing LLM context. Supports multimodal inputs (text, image_url, video_url, audio_url). Each mapping specifies how to extract data from document payloads. At least one input mapping is required.
1Show child attributes
Key used in the constructed inputs payload.
Source of the value (payload, literal, vector).
payload, literal, vector Dot-notation path inside payload/vector when source_type is PAYLOAD or VECTOR.
Static value used when source_type is LITERAL. Overrides any path.
LLM provider to use for labeling. Supported providers:
If not specified, automatically inferred from model_name.
openai, google, anthropic "openai"
REQUIRED when enabled=True. Specific LLM model to use for cluster labeling. All models are defined as enums for type safety.
OpenAI Models (provider='openai'):
Google Models (provider='google'):
Anthropic Models (provider='anthropic'):
Recommendation:
gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18, gpt-4.1-2025-04-14, gpt-4.1-mini-2025-04-14, o3-mini-2025-01-31 "gpt-4o-mini-2024-07-18"
Whether to generate cluster summaries
Whether to extract keywords for clusters
Maximum representative documents to send to LLM per cluster for semantic analysis
1 <= x <= 20Maximum characters per document sample text
50 <= x <= 500Enable embedding-based label deduplication to prevent near-duplicate labels (requires sentence-transformers)
Cosine similarity threshold for duplicate label detection (labels above this are considered duplicates)
0.5 <= x <= 1Time-to-live for cached labels in seconds. Labels for clusters with identical representative documents will be reused within this TTL window, reducing LLM API costs. Default: 604800 (7 days). Set to 0 to disable caching.
0 <= x <= 2592000OPTIONAL. Custom prompt template for LLM labeling. NOT REQUIRED - uses default discriminative prompt if not provided. When provided, completely replaces the default prompt. Your custom prompt receives cluster information but you must format it yourself. Use when: - Need domain-specific labeling (e.g., medical, legal, technical) - Want different label format (e.g., emoji labels, code names) - Require specific output structure - Have custom business logic for categorization Default prompt includes: cluster document samples, forbidden labels for uniqueness, and JSON response format. See engine/clusters/labeling/prompts.py for reference. Example: 'Analyze these product clusters and generate SHORT category names (2-3 words max) focusing on product type and price range. Return JSON: [{"cluster_id": "cl_0", "label": "..."}]'
"Analyze these document clusters and generate technical labels (2-3 words). Focus on programming languages and frameworks mentioned. Return JSON: [{'cluster_id': 'cl_0', 'label': '...', 'keywords': [...]}]"
OPTIONAL. Define custom structured output for LLM labeling. NOT REQUIRED - uses default structure (label, summary, keywords) if not provided. When provided, LLM output will match this structure and be stored in cluster documents.
Two modes supported:
Natural language prompt (string): Describe desired output in plain English
Explicit JSON schema (dict): Provide complete JSON schema for output structure
Use when:
Output fields are automatically added to cluster collection schema and stored in metadata. Default behavior (if not provided): label (string), summary (string), keywords (array of strings)
"Extract cluster category, confidence score between 0 and 1, and top 3 representative keywords"
Provider-specific parameters forwarded to the LLM service. For OpenAI: temperature, max_tokens, top_p, json_output, etc. For Google: temperature, top_k, max_output_tokens, json_output, etc.
{
"description": "Text-only labeling with multiple fields",
"enabled": true,
"include_keywords": true,
"include_summary": true,
"labeling_inputs": {
"input_mappings": [
{
"input_key": "title",
"path": "title",
"source_type": "payload"
},
{
"input_key": "description",
"path": "description",
"source_type": "payload"
},
{
"input_key": "text",
"path": "text",
"source_type": "payload"
}
]
},
"model_name": "gpt-4o-mini-2024-07-18",
"provider": "openai"
}If True, cluster results are written back to source collection(s) in-place instead of creating new output collections. Documents will be enriched with cluster_id, cluster_label, distance_to_centroid, and optionally other metadata. Similar to taxonomy enrichment pattern.
Configuration for source collection enrichment (only used if enrich_source_collection=True). Controls which fields are added to source documents and field naming conventions.
Show child attributes
List of field mappings from cluster results to document fields. Default includes cluster_id and cluster_label. Can include: distance_to_centroid, member_count, keywords, visualization coords (x, y, z), etc.
Show child attributes
Field from cluster results to include. Available fields: cluster_id, cluster_label, distance_to_centroid, member_count, keywords, x, y, z (visualization coords), metadata.*
Target field name in enriched document. Example: 'category_id' for cluster_id, 'product_category' for cluster_label
{
"field_mappings": [
{
"source_field": "cluster_id",
"target_field": "category_id"
},
{
"source_field": "cluster_label",
"target_field": "category_name"
},
{
"source_field": "distance_to_centroid",
"target_field": "category_confidence"
}
]
}Unique cluster identifier
S3 path to parquet files with cluster data
S3 key to members.parquet (if saved)
Number of clusters found
Clustering job status
PENDING, IN_PROGRESS, PROCESSING, COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED, UNKNOWN, SKIPPED, DRAFT, ACTIVE, ARCHIVED, SUSPENDED Associated task ID for clustering job
Run ID of the most recent successful clustering execution. Used to retrieve execution results.
When the cluster was created
When the cluster was last updated
Additional user-defined metadata for the cluster
Was this page helpful?