The Cluster stage groups documents based on embedding similarity, creating semantic clusters of related content. This helps you organize search results into meaningful groups and discover themes within them.
Stage Category: REDUCE (groups documents)
Transformation: N documents → K clusters with documents
When to Use
| Use Case | Description |
|---|---|
| Theme discovery | Find topics within search results |
| Result organization | Group similar items together |
| Deduplication | Find near-duplicate content |
| Exploration | Understand result diversity |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Grouping by field value | `group_by` |
| Removing duplicates | `deduplicate` |
| Pre-defined categories | `taxonomy_enrich` |
| Single representative per group | `sample` |
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `num_clusters` | integer | 5 | Number of clusters to create |
| `embedding_field` | string | auto | Field containing embeddings |
| `algorithm` | string | kmeans | Clustering algorithm |
| `min_cluster_size` | integer | 2 | Minimum documents per cluster |
| `include_outliers` | boolean | true | Include documents that don't fit clusters |
| `label_clusters` | boolean | false | Generate cluster labels with LLM |
Clustering Algorithms
| Algorithm | Description | Best For |
|---|---|---|
| `kmeans` | K-means clustering | Fixed number of clusters |
| `hdbscan` | Density-based clustering | Unknown cluster count |
| `agglomerative` | Hierarchical clustering | Nested clusters |
Configuration Examples
Basic Clustering
{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "num_clusters": 5
  }
}
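Auto-Labeled Clusters

A minimal sketch using only the parameters documented above: the basic configuration with label_clusters enabled so the stage generates human-readable cluster names (at additional LLM cost).

{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "num_clusters": 5,
    "label_clusters": true
  }
}

Density-Based Clustering

A sketch for when the cluster count is unknown: the hdbscan algorithm discovers the number of clusters itself, so num_clusters is omitted and min_cluster_size bounds how small a cluster may be. This mirrors the Theme Discovery pipeline below; the specific values are illustrative.

{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "algorithm": "hdbscan",
    "min_cluster_size": 5
  }
}

Custom Embedding Field

A sketch pointing the stage at an explicit embedding field rather than the auto-detected one. The field name text_extractor_v1_embedding is borrowed from the search examples below and is an assumption; substitute the field your extractor actually writes.

{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "num_clusters": 5,
    "embedding_field": "text_extractor_v1_embedding"
  }
}

Fine-Grained Clustering

A sketch that trades broad themes for tighter groups by requesting more clusters; 15 is an illustrative value in line with the sizing guidance near the end of this page.

{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "num_clusters": 15
  }
}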
How Clustering Works
1. Extract Embeddings: Get embedding vectors from each document
2. Apply Algorithm: Run the clustering algorithm (e.g., k-means)
3. Assign Documents: Each document is assigned to the nearest cluster
4. Compute Centroids: Calculate cluster centers
5. Label (optional): Generate human-readable cluster names
Output Schema
{
  "clusters": [
    {
      "cluster_id": 0,
      "label": "Machine Learning Tutorials",
      "centroid": [0.12, -0.34, ...],
      "size": 12,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Introduction to neural networks...",
          "score": 0.95,
          "cluster": {
            "cluster_id": 0,
            "distance_to_centroid": 0.15
          }
        }
      ]
    },
    {
      "cluster_id": 1,
      "label": "Data Engineering",
      "centroid": [0.45, 0.23, ...],
      "size": 8,
      "documents": [...]
    }
  ],
  "outliers": [
    {
      "document_id": "doc_789",
      "content": "Unrelated content...",
      "outlier_reason": "distance_threshold_exceeded"
    }
  ],
  "metadata": {
    "algorithm": "kmeans",
    "num_clusters": 5,
    "total_documents": 50,
    "clustered_documents": 48,
    "outlier_count": 2
  }
}
Performance
| Metric | Value |
|---|---|
| Latency | 50-200ms |
| Memory | O(N × embedding_dim) |
| Cost | Free (+ LLM cost if labeling) |
| Scalability | Up to ~10K documents |
Clustering large document sets (10K+) can be slow. Consider pre-filtering or sampling before clustering.
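As a sketch of the pre-filtering approach, the pipeline below reuses the semantic search stage from the patterns that follow and keeps top_k modest so the cluster stage never sees more documents than it can comfortably handle; the specific values are illustrative, not recommendations.

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 500
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 10
    }
  }
]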
Common Pipeline Patterns
Search + Cluster
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 5,
      "label_clusters": true
    }
  }
]
Cluster + Sample Representatives
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 10,
      "label_clusters": true
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "cluster.cluster_id",
      "samples_per_group": 2
    }
  }
]
Theme Discovery Pipeline
[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "algorithm": "hdbscan",
      "min_cluster_size": 5,
      "label_clusters": true
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize the main themes found in these clusters",
      "mode": "aggregate"
    }
  }
]
Diverse Results Pipeline
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 5
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "code_execution",
    "parameters": {
      "code": "def transform(doc):\n    # Select the top document from each cluster\n    clusters = doc.get('clusters', [])\n    results = []\n    for c in clusters:\n        if c['documents']:\n            results.append(c['documents'][0])\n    doc['diverse_results'] = results\n    return doc"
    }
  }
]
Cluster Labeling
When `label_clusters` is set to `true`, an LLM generates a descriptive label for each cluster:
| Cluster Documents | Generated Label |
|---|---|
| Docs about Python ML | "Python Machine Learning" |
| Docs about cloud infra | "Cloud Infrastructure" |
| Docs about API design | "REST API Design Patterns" |
Choosing num_clusters
| Result Size | Recommended Clusters |
|---|---|
| < 50 docs | 3-5 clusters |
| 50-200 docs | 5-10 clusters |
| 200-500 docs | 8-15 clusters |
| 500+ docs | 10-20 clusters |
Start with fewer clusters and increase if clusters are too broad. Use HDBSCAN if you don’t know the optimal number.
Error Handling
| Error | Behavior |
|---|---|
| Missing embeddings | Skip document |
| Too few documents | Reduce num_clusters |
| Clustering fails | Return unclustered docs |
| Labeling fails | Use numeric labels |