POST /v1/clusters
Create Cluster
curl --request POST \
  --url https://api.mixpeek.com/v1/clusters \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --header 'X-Namespace: <x-namespace>' \
  --data '
{
  "collection_ids": [
    "<string>"
  ],
  "cluster_name": "<string>",
  "cluster_type": "vector",
  "vector_config": {
    "algorithm_params": {
      "min_cluster_size": 10,
      "min_samples": 5
    },
    "clustering_method": "hdbscan",
    "description": "HDBSCAN clustering with multimodal embeddings",
    "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
    "sample_size": 1000
  },
  "attribute_config": {
    "attributes": [
      "category"
    ],
    "description": "Simple category clustering",
    "hierarchical_grouping": false
  },
  "filters": {
    "AND": [
      {
        "field": "name",
        "operator": "eq",
        "value": "John"
      },
      {
        "field": "age",
        "operator": "gte",
        "value": 30
      }
    ],
    "OR": [
      {
        "field": "status",
        "operator": "eq",
        "value": "active"
      },
      {
        "field": "role",
        "operator": "eq",
        "value": "admin"
      }
    ],
    "NOT": [
      {
        "field": "department",
        "operator": "eq",
        "value": "HR"
      },
      {
        "field": "location",
        "operator": "eq",
        "value": "remote"
      }
    ],
    "case_sensitive": true
  },
  "llm_labeling": {
    "description": "Text-only labeling with multiple fields",
    "enabled": true,
    "include_keywords": true,
    "include_summary": true,
    "labeling_inputs": {
      "input_mappings": [
        {
          "input_key": "title",
          "path": "title",
          "source_type": "payload"
        },
        {
          "input_key": "description",
          "path": "description",
          "source_type": "payload"
        },
        {
          "input_key": "text",
          "path": "text",
          "source_type": "payload"
        }
      ]
    },
    "model_name": "gpt-4o-mini-2024-07-18",
    "provider": "openai"
  },
  "enrich_source_collection": false,
  "source_enrichment_config": {
    "field_mappings": [
      {
        "source_field": "cluster_id",
        "target_field": "category_id"
      },
      {
        "source_field": "cluster_label",
        "target_field": "category_name"
      },
      {
        "source_field": "distance_to_centroid",
        "target_field": "category_confidence"
      }
    ]
  }
}
'

Example response:

{
  "cluster_name": "<string>",
  "namespace_id": "<string>",
  "organization_id": "<string>",
  "input_collections": [
    "<string>"
  ],
  "cluster_type": "vector",
  "cluster_id": "<string>",
  "source_bucket_ids": [
    "<string>"
  ],
  "filters": {},
  "feature_uris": [
    "<string>"
  ],
  "multi_feature_strategy": "<string>",
  "learned_weights": {},
  "learning_quality_score": 123,
  "effective_feature_method": "<string>",
  "clustered_attributes": [
    "<string>"
  ],
  "hierarchical_grouping": true,
  "aggregation_method": "<string>",
  "output_collection_ids": [
    "<string>"
  ],
  "output_collection_names": [
    "<string>"
  ],
  "algorithm": "<string>",
  "algorithm_params": {},
  "enrich_source": false,
  "source_enrichment_config": {
    "field_mappings": [
      {
        "source_field": "cluster_id",
        "target_field": "category_id"
      },
      {
        "source_field": "cluster_label",
        "target_field": "category_name"
      },
      {
        "source_field": "distance_to_centroid",
        "target_field": "category_confidence"
      }
    ]
  },
  "llm_labeling": {
    "description": "Text-only labeling with multiple fields",
    "enabled": true,
    "include_keywords": true,
    "include_summary": true,
    "labeling_inputs": {
      "input_mappings": [
        {
          "input_key": "title",
          "path": "title",
          "source_type": "payload"
        },
        {
          "input_key": "description",
          "path": "description",
          "source_type": "payload"
        },
        {
          "input_key": "text",
          "path": "text",
          "source_type": "payload"
        }
      ]
    },
    "model_name": "gpt-4o-mini-2024-07-18",
    "provider": "openai"
  },
  "num_clusters": 123,
  "num_documents_clustered": 123,
  "execution_time_seconds": 123,
  "hierarchy_detected": false,
  "parent_cluster_id": "<string>",
  "child_cluster_ids": [
    "<string>"
  ],
  "hierarchy_relationships": [
    {}
  ],
  "status": "PENDING",
  "last_execution_task_id": "<string>",
  "created_at": "2023-11-07T05:31:56Z",
  "updated_at": "2023-11-07T05:31:56Z",
  "last_executed_at": "2023-11-07T05:31:56Z",
  "completed_at": "2023-11-07T05:31:56Z",
  "llm_labeling_errors": [
    "<string>"
  ],
  "metadata": {}
}

Headers

Authorization
string
required

Bearer token authentication using your API key, in the format 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.

X-Namespace
string
required

Namespace identifier that scopes this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. Provide either the namespace ID (format: ns_xxxxxxxxxxxxx) or a custom name such as 'my-namespace'.
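
As a rough illustration of the two required headers, the following Python sketch assembles them and constructs the POST request with the standard library only. The helper names `build_headers` and `build_create_cluster_request` are hypothetical, the key and namespace are placeholders, and the request is built but never sent.

```python
import json
import urllib.request

API_URL = "https://api.mixpeek.com/v1/clusters"  # endpoint from the curl example


def build_headers(api_key: str, namespace: str) -> dict:
    """Return the headers this endpoint requires (hypothetical helper)."""
    if not api_key or not namespace:
        raise ValueError("api_key and namespace are both required")
    return {
        "Authorization": f"Bearer {api_key}",  # format: 'Bearer sk_...'
        "X-Namespace": namespace,  # namespace ID (ns_...) or a custom name
        "Content-Type": "application/json",
    }


def build_create_cluster_request(api_key: str, namespace: str, body: dict):
    """Build, but do not send, the POST request (hypothetical helper)."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers=build_headers(api_key, namespace),
        method="POST",
    )


# Placeholder credentials and a minimal body, for illustration only.
request = build_create_cluster_request(
    "sk_xxxxxxxxxxxxx", "my-namespace", {"collection_ids": ["<collection-id>"]}
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here so the sketch has no network side effects.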

Body

application/json

Create a clustering job for one or more collections.

collection_ids
string[]
required

IDs of the collections to cluster together.

Minimum array length: 1

cluster_name
string | null

Optional human-friendly name for the clustering job

cluster_type
enum<string>
default:vector

Type of clustering to run: vector (embedding-based) or attribute (metadata-based).

Available options: vector, attribute

vector_config
VectorBasedConfig · object

Required when cluster_type is 'vector'

Example:
{
  "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
  "clustering_method": "hdbscan",
  "description": "HDBSCAN clustering with multimodal embeddings",
  "feature_uri": "mixpeek://multimodal_extractor@v1/multimodal_embedding",
  "sample_size": 1000
}

attribute_config
AttributeBasedConfig · object

Required when cluster_type is 'attribute'

Example:
{
  "attributes": ["category"],
  "description": "Simple category clustering",
  "hierarchical_grouping": false
}
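
The conditional requirements above (at least one collection ID, and a config object that matches cluster_type) can be checked on the client before sending. This is a minimal Python sketch; `validate_cluster_body` is a hypothetical helper, and it only mirrors the rules stated on this page, not the server's full validation.

```python
def validate_cluster_body(body: dict) -> list:
    """Return a list of problems found in a create-cluster request body."""
    problems = []

    # collection_ids is required and has a minimum array length of 1.
    if not body.get("collection_ids"):
        problems.append("collection_ids must contain at least one collection ID")

    # cluster_type defaults to "vector" when omitted.
    cluster_type = body.get("cluster_type", "vector")
    if cluster_type not in ("vector", "attribute"):
        problems.append(f"unknown cluster_type: {cluster_type!r}")
    elif body.get(f"{cluster_type}_config") is None:
        # vector_config is required for 'vector', attribute_config for 'attribute'.
        problems.append(
            f"{cluster_type}_config is required when cluster_type is {cluster_type!r}"
        )
    return problems


ok_body = {
    "collection_ids": ["<collection-id>"],
    "cluster_type": "attribute",
    "attribute_config": {"attributes": ["category"], "hierarchical_grouping": False},
}
bad_body = {"collection_ids": []}  # no collections, and no vector_config
```

Here `validate_cluster_body(ok_body)` returns an empty list, while `bad_body` yields one problem for the empty collection list and one for the missing vector_config.
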
filters
LogicalOperator · object

Optional filters applied to documents before clustering, in the same format as the list documents endpoint. They are applied during the Qdrant scroll, before the parquet export. Useful for clustering a subset of documents, for example status='active' or category='electronics'.
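
A filter object in the shape used by the request example can be assembled with two small helpers. This Python sketch is illustrative only: `condition` and `build_filters` are hypothetical names, only the `eq` and `gte` operators appear on this page, and how the AND, OR, and NOT groups combine is not specified here.

```python
def condition(field: str, operator: str, value) -> dict:
    """One filter condition, shaped like the entries in the request example."""
    return {"field": field, "operator": operator, "value": value}


def build_filters(all_of=(), any_of=(), none_of=(), case_sensitive=None) -> dict:
    """Assemble a filters object, omitting any group that is empty."""
    filters = {}
    if all_of:
        filters["AND"] = list(all_of)
    if any_of:
        filters["OR"] = list(any_of)
    if none_of:
        filters["NOT"] = list(none_of)
    if case_sensitive is not None:
        filters["case_sensitive"] = case_sensitive
    return filters


# Cluster only active electronics documents (field names are illustrative).
filters = build_filters(
    all_of=[
        condition("status", "eq", "active"),
        condition("category", "eq", "electronics"),
    ]
)
```

The resulting dictionary can be placed under the `filters` key of the request body.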

llm_labeling
LLMLabeling · object

Optional configuration for LLM-based cluster labeling. When provided with enabled set to true, clusters receive semantic labels generated by an LLM instead of generic labels like 'Cluster 0'. When omitted, or when enabled is false, fallback labels are used.

Example:
{
  "description": "Text-only labeling with multiple fields",
  "enabled": true,
  "include_keywords": true,
  "include_summary": true,
  "labeling_inputs": {
    "input_mappings": [
      {
        "input_key": "title",
        "path": "title",
        "source_type": "payload"
      },
      {
        "input_key": "description",
        "path": "description",
        "source_type": "payload"
      },
      {
        "input_key": "text",
        "path": "text",
        "source_type": "payload"
      }
    ]
  },
  "model_name": "gpt-4o-mini-2024-07-18",
  "provider": "openai"
}
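
The repeated input_mappings entries in the example can be generated rather than written by hand. The sketch below assumes every mapping reads a payload field and reuses the field name as its input_key, as the example does; `payload_mappings` is a hypothetical helper, and other source_type values are not covered on this page.

```python
import json


def payload_mappings(*fields: str) -> list:
    """One input mapping per payload field, keyed by the field's own name."""
    return [
        {"input_key": field, "path": field, "source_type": "payload"}
        for field in fields
    ]


llm_labeling = {
    "description": "Text-only labeling with multiple fields",
    "enabled": True,  # serialized as JSON true
    "include_keywords": True,
    "include_summary": True,
    "labeling_inputs": {
        "input_mappings": payload_mappings("title", "description", "text")
    },
    "model_name": "gpt-4o-mini-2024-07-18",
    "provider": "openai",
}

# json.dumps converts Python's True/False to the JSON literals true/false.
serialized = json.dumps(llm_labeling)
```

The dictionary matches the example above field for field and can be placed under the `llm_labeling` key of the request body.
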
enrich_source_collection
boolean
default:false

If true, cluster results are written back to the source collection(s) in place instead of creating new output collections. Documents are enriched with cluster_id, cluster_label, distance_to_centroid, and optionally other metadata. This is similar to the taxonomy enrichment pattern.

source_enrichment_config
SourceEnrichmentConfig · object

Configuration for source collection enrichment (only used if enrich_source_collection is true). Controls which fields are added to source documents and how those fields are named.

Example:
{
  "field_mappings": [
    {
      "source_field": "cluster_id",
      "target_field": "category_id"
    },
    {
      "source_field": "cluster_label",
      "target_field": "category_name"
    },
    {
      "source_field": "distance_to_centroid",
      "target_field": "category_confidence"
    }
  ]
}
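
To make the effect of field_mappings concrete, the sketch below models it locally in Python. It assumes each mapping writes the value of source_field onto the document under the name target_field; that reading is inferred from the field names and is not confirmed by this page. `apply_field_mappings` is a hypothetical helper and the cluster values are made up for illustration.

```python
def apply_field_mappings(cluster_fields: dict, field_mappings: list) -> dict:
    """Rename cluster result fields according to field_mappings.

    Local illustration only: the real enrichment runs server-side.
    Source fields that are absent from cluster_fields are skipped.
    """
    return {
        mapping["target_field"]: cluster_fields[mapping["source_field"]]
        for mapping in field_mappings
        if mapping["source_field"] in cluster_fields
    }


field_mappings = [
    {"source_field": "cluster_id", "target_field": "category_id"},
    {"source_field": "cluster_label", "target_field": "category_name"},
    {"source_field": "distance_to_centroid", "target_field": "category_confidence"},
]

# Hypothetical values for a single clustered document.
cluster_fields = {
    "cluster_id": "cl_1",
    "cluster_label": "Shoes",
    "distance_to_centroid": 0.12,
}

enriched = apply_field_mappings(cluster_fields, field_mappings)
```

Under that assumption, `enriched` holds the same three values under the names category_id, category_name, and category_confidence.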

Response

Successful Response

Cluster job metadata, stored in the MongoDB clusters collection.

This record is separate from the cluster documents themselves. It tracks job-level configuration, status, and summary statistics.

Both vector and attribute clustering are supported, each with its own metadata fields.

cluster_name
string
required

Human-readable cluster name

namespace_id
string
required

Namespace this cluster belongs to

organization_id
string
required

Organization ID (internal_id)

input_collections
string[]
required

Source collection IDs that were clustered

cluster_type
enum<string>
required

Type of clustering: vector (embedding-based) or attribute (metadata-based)

Available options: vector, attribute

cluster_id
string

Unique cluster job identifier

source_bucket_ids
string[] | null

Source bucket IDs that the input collections originated from. Enables bucket lineage tracking.

filters
Filters · object

Optional filters that were applied to pre-filter documents before clustering

feature_uris
string[] | null

Feature URIs that were clustered (mixpeek://{extractor}@{version}/{output}). Only for vector clustering.
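
The URI pattern mixpeek://{extractor}@{version}/{output} can be split into its parts with a regular expression. This Python sketch is deliberately permissive because the allowed characters for each part are not documented here; `FeatureURI` and `parse_feature_uri` are hypothetical names.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    output: str


# Permissive pattern: the character rules for each part are an assumption.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor, version, and output."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


parsed = parse_feature_uri("mixpeek://multimodal_extractor@v1/multimodal_embedding")
```

For the URI used in the request example, `parsed.extractor` is `multimodal_extractor`, `parsed.version` is `v1`, and `parsed.output` is `multimodal_embedding`.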

multi_feature_strategy
string | null

Strategy used if multiple features (concatenate/independent/weighted). Only for vector clustering.

learned_weights
Learned Weights · object

Automatically learned feature weights (when multi_feature_strategy='weighted'). Keys are feature URIs, values are learned weights. Only populated after clustering execution completes.

learning_quality_score
number | null

Clustering quality score from weight learning (e.g., silhouette score). Only populated when multi_feature_strategy='weighted' and weights were learned.

effective_feature_method
string | null

Method for calculating cluster centroids (mean/median/medoid). Only for vector clustering.

clustered_attributes
string[] | null

Attribute field names that were clustered. Only for attribute clustering.

hierarchical_grouping
boolean | null

Whether hierarchical clustering was used. Only for attribute clustering.

aggregation_method
string | null

Method for aggregating attributes (most_frequent/first/last). Only for attribute clustering.

output_collection_ids
string[]

Collection IDs where cluster documents are stored. With a single output, the list holds one collection ID; with per-feature output, it holds one collection ID per feature.

output_collection_names
string[]

Names of output collections. Corresponds to output_collection_ids.
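
Since the two lists correspond, they can be paired into a lookup from ID to name. The sketch assumes the correspondence is positional (the first name belongs to the first ID), which this page implies but does not state outright; `output_collections` is a hypothetical helper and the sample values are invented.

```python
def output_collections(cluster: dict) -> dict:
    """Map each output collection ID to its name, assuming positional alignment."""
    ids = cluster.get("output_collection_ids") or []
    names = cluster.get("output_collection_names") or []
    if len(ids) != len(names):
        raise ValueError(
            "output_collection_ids and output_collection_names differ in length"
        )
    return dict(zip(ids, names))


# Hypothetical response fragment with two per-feature output collections.
cluster = {
    "output_collection_ids": ["col_a", "col_b"],
    "output_collection_names": ["clusters-text", "clusters-image"],
}
by_id = output_collections(cluster)
```

A length mismatch raises an error instead of silently dropping entries, which makes a broken assumption visible early.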

algorithm
string | null

Clustering algorithm used (hdbscan, kmeans, attribute_based, etc.)

algorithm_params
Algorithm Params · object

Algorithm-specific parameters (not used for attribute_based)

enrich_source
boolean
default:false

Whether source documents were enriched with cluster_id

source_enrichment_config
SourceEnrichmentConfig · object

Configuration for source enrichment (if enrich_source=True)

Example:
{
  "field_mappings": [
    {
      "source_field": "cluster_id",
      "target_field": "category_id"
    },
    {
      "source_field": "cluster_label",
      "target_field": "category_name"
    },
    {
      "source_field": "distance_to_centroid",
      "target_field": "category_confidence"
    }
  ]
}

llm_labeling
LLMLabeling · object

Configuration for LLM-based cluster labeling (applies to all cluster types)

Example:
{
  "description": "Text-only labeling with multiple fields",
  "enabled": true,
  "include_keywords": true,
  "include_summary": true,
  "labeling_inputs": {
    "input_mappings": [
      {
        "input_key": "title",
        "path": "title",
        "source_type": "payload"
      },
      {
        "input_key": "description",
        "path": "description",
        "source_type": "payload"
      },
      {
        "input_key": "text",
        "path": "text",
        "source_type": "payload"
      }
    ]
  },
  "model_name": "gpt-4o-mini-2024-07-18",
  "provider": "openai"
}

num_clusters
integer | null

Number of clusters found (excludes noise/outliers, populated after execution)

num_documents_clustered
integer | null

Total documents processed

execution_time_seconds
number | null

Time taken to complete clustering

hierarchy_detected
boolean
default:false

Whether implicit hierarchy was detected (multi-feature independent) or created (hierarchical attributes)

parent_cluster_id
string | null

ID of the parent cluster. Set only for child clusters in a hierarchy.

child_cluster_ids
string[] | null

IDs of the child clusters. Set only for parent clusters in a hierarchy.

hierarchy_relationships
Hierarchy Relationships · object[] | null

Parent-child relationships detected from cluster membership overlap

status
enum<string>
default:PENDING

Cluster job status (propagated from TaskService)

Available options: PENDING, IN_PROGRESS, PROCESSING, COMPLETED, COMPLETED_WITH_ERRORS, FAILED, CANCELED, UNKNOWN, SKIPPED, DRAFT, ACTIVE, ARCHIVED, SUSPENDED
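
Because a new job starts as PENDING, a client will typically poll until the status settles. The Python sketch below treats COMPLETED, COMPLETED_WITH_ERRORS, FAILED, and CANCELED as terminal; that grouping is an assumption, since this page lists the values without describing their lifecycle. `fetch` is a caller-supplied function that returns the latest cluster metadata, because the endpoint for reading a cluster is not documented here.

```python
import time

# Assumed terminal statuses; the page does not say which values are final.
TERMINAL_STATUSES = frozenset(
    {"COMPLETED", "COMPLETED_WITH_ERRORS", "FAILED", "CANCELED"}
)


def is_finished(cluster: dict) -> bool:
    """True once the job status is one of the assumed terminal values."""
    return cluster.get("status", "PENDING") in TERMINAL_STATUSES


def wait_until_finished(fetch, interval_seconds: float = 5.0, max_polls: int = 60):
    """Call fetch() until the job finishes or max_polls is exhausted."""
    for _ in range(max_polls):
        cluster = fetch()
        if is_finished(cluster):
            return cluster
        time.sleep(interval_seconds)
    raise TimeoutError("cluster job did not reach a terminal status")


# Simulated sequence of responses standing in for real API calls.
responses = iter(
    [{"status": "PENDING"}, {"status": "PROCESSING"}, {"status": "COMPLETED"}]
)
final = wait_until_finished(lambda: next(responses), interval_seconds=0)
```

In real use, `fetch` would wrap whatever call returns the cluster metadata, and the interval and poll limit would be tuned to the expected job duration.
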
last_execution_task_id
string | null

Most recent task ID for this cluster

created_at
string<date-time>

When cluster was created

updated_at
string<date-time>

When cluster was last updated

last_executed_at
string<date-time> | null

Last execution timestamp

completed_at
string<date-time> | null

When clustering completed successfully
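
The timestamp fields are date-time strings such as 2023-11-07T05:31:56Z, and several of them may be null. A small Python sketch for parsing them safely follows; `parse_timestamp` and `seconds_from_creation_to_completion` are hypothetical helpers, and the trailing Z is rewritten as an explicit UTC offset so that the code also works on Python versions before 3.11.

```python
from datetime import datetime
from typing import Optional


def parse_timestamp(value: Optional[str]) -> Optional[datetime]:
    """Parse a date-time string from the response; None passes through."""
    if value is None:
        return None
    if value.endswith("Z"):
        # Older datetime.fromisoformat versions reject the 'Z' suffix.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def seconds_from_creation_to_completion(cluster: dict) -> Optional[float]:
    """Elapsed seconds between created_at and completed_at, if both are set."""
    created = parse_timestamp(cluster.get("created_at"))
    completed = parse_timestamp(cluster.get("completed_at"))
    if created is None or completed is None:
        return None
    return (completed - created).total_seconds()


# Illustrative values; completed_at stays null until the job succeeds.
finished_job = {
    "created_at": "2023-11-07T05:31:56Z",
    "completed_at": "2023-11-07T05:32:26Z",
}
pending_job = {"created_at": "2023-11-07T05:31:56Z", "completed_at": None}
```

For `finished_job` the helper returns 30.0, and for `pending_job` it returns None because completed_at is still null.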

llm_labeling_errors
string[] | null

List of errors encountered during LLM labeling (if any). Stored in MongoDB cluster metadata only, NOT in Qdrant cluster documents. Used to track LLM failures while allowing fallback labels to work.

metadata
Metadata · object

Additional user-defined metadata