Figure: Cluster stage showing semantic grouping of search results
The Cluster stage groups documents based on embedding similarity, creating semantic clusters of related content. This helps organize search results into meaningful groups and discover themes within your results.
Stage Category: REDUCE (Groups documents)
Transformation: N documents → K clusters with documents

When to Use

Use Case | Description
Theme discovery | Find topics within search results
Result organization | Group similar items together
Deduplication | Find near-duplicate content
Exploration | Understand result diversity

When NOT to Use

Scenario | Recommended Alternative
Grouping by field value | group_by
Removing duplicates | deduplicate
Pre-defined categories | taxonomy_enrich
Single representative | sample per group

Parameters

Parameter | Type | Default | Description
num_clusters | integer | 5 | Number of clusters to create
embedding_field | string | auto | Field containing embeddings
algorithm | string | kmeans | Clustering algorithm
min_cluster_size | integer | 2 | Minimum documents per cluster
include_outliers | boolean | true | Include documents that don’t fit clusters
label_clusters | boolean | false | Generate cluster labels with LLM

Clustering Algorithms

Algorithm | Description | Best For
kmeans | K-means clustering | Fixed number of clusters
hdbscan | Density-based clustering | Unknown cluster count
agglomerative | Hierarchical clustering | Nested clusters

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "cluster",
  "parameters": {
    "num_clusters": 5
  }
}
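
When stage configurations are generated from code rather than written by hand, a small builder can catch typos in parameter names before the pipeline runs. The sketch below is illustrative only: `build_cluster_stage` is a hypothetical helper, not part of any SDK, and it simply checks overrides against the parameter names and algorithm values listed in the tables above.

```python
import json

# Parameter names and defaults copied from the Parameters table above.
CLUSTER_DEFAULTS = {
    "num_clusters": 5,
    "embedding_field": "auto",
    "algorithm": "kmeans",
    "min_cluster_size": 2,
    "include_outliers": True,
    "label_clusters": False,
}
ALGORITHMS = {"kmeans", "hdbscan", "agglomerative"}


def build_cluster_stage(**overrides):
    """Hypothetical helper: build a cluster stage dict from keyword overrides."""
    unknown = set(overrides) - set(CLUSTER_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    algorithm = overrides.get("algorithm", CLUSTER_DEFAULTS["algorithm"])
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    return {
        "stage_type": "reduce",
        "stage_id": "cluster",
        "parameters": dict(overrides),
    }


stage = build_cluster_stage(algorithm="hdbscan", min_cluster_size=5, label_clusters=True)
print(json.dumps(stage, indent=2))
```

Only the overrides are written into `parameters`; anything omitted is left to the stage's own defaults.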

How Clustering Works

  1. Extract Embeddings: Read the embedding vector from each document
  2. Apply Algorithm: Run the configured clustering algorithm (e.g., k-means)
  3. Assign Documents: Assign each document to its nearest cluster
  4. Compute Centroids: Calculate the center of each cluster
  5. Label (optional): Generate human-readable cluster names
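
Steps 2 to 4 can be sketched with a plain k-means loop. This is a minimal illustration under stated assumptions, not the stage's actual implementation: it uses Euclidean distance, a fixed iteration count, and a deterministic evenly spaced initialization (production k-means usually uses randomized or k-means++ seeding). The function name and the toy embeddings are made up for the example.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means on lists of floats. Returns (assignments, centroids)."""
    # Simplified deterministic init: pick k evenly spaced input vectors.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each vector to its nearest centroid (Euclidean distance).
        assignments = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy 2-D "embeddings": two obvious groups.
embeddings = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
              [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
assignments, centroids = kmeans(embeddings, k=2)
for vec, cluster_id in zip(embeddings, assignments):
    # Per-document distance to its cluster center.
    print(cluster_id, round(math.dist(vec, centroids[cluster_id]), 3))
```

The per-document distance printed at the end corresponds to the kind of value reported as `distance_to_centroid` in the output.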

Output Schema

{
  "clusters": [
    {
      "cluster_id": 0,
      "label": "Machine Learning Tutorials",
      "centroid": [0.12, -0.34, ...],
      "size": 12,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Introduction to neural networks...",
          "score": 0.95,
          "cluster": {
            "cluster_id": 0,
            "distance_to_centroid": 0.15
          }
        }
      ]
    },
    {
      "cluster_id": 1,
      "label": "Data Engineering",
      "centroid": [0.45, 0.23, ...],
      "size": 8,
      "documents": [...]
    }
  ],
  "outliers": [
    {
      "document_id": "doc_789",
      "content": "Unrelated content...",
      "outlier_reason": "distance_threshold_exceeded"
    }
  ],
  "metadata": {
    "algorithm": "kmeans",
    "num_clusters": 5,
    "total_documents": 50,
    "clustered_documents": 48,
    "outlier_count": 2
  }
}
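
A consumer of this output often wants one summary row per cluster. The sketch below walks a result shaped like the schema above and picks the document closest to the centroid as a representative. `summarize_clusters` is a hypothetical helper, and the sample values are illustrative, not real results.

```python
def summarize_clusters(result):
    """Return one summary row per cluster from a Cluster stage result dict."""
    rows = []
    for cluster in result.get("clusters", []):
        docs = cluster.get("documents", [])
        # The document closest to the centroid is a natural representative.
        closest = min(
            docs,
            key=lambda d: d["cluster"]["distance_to_centroid"],
            default=None,
        )
        rows.append({
            "cluster_id": cluster["cluster_id"],
            "label": cluster.get("label"),
            "size": cluster.get("size", len(docs)),
            "representative": closest["document_id"] if closest else None,
        })
    return rows


# Illustrative result with made-up documents and distances.
sample_result = {
    "clusters": [
        {
            "cluster_id": 0,
            "label": "Machine Learning Tutorials",
            "size": 2,
            "documents": [
                {"document_id": "doc_123",
                 "cluster": {"cluster_id": 0, "distance_to_centroid": 0.15}},
                {"document_id": "doc_456",
                 "cluster": {"cluster_id": 0, "distance_to_centroid": 0.08}},
            ],
        },
        {"cluster_id": 1, "label": "Data Engineering", "size": 0, "documents": []},
    ],
    "outliers": [],
}

summary = summarize_clusters(sample_result)
for row in summary:
    print(row)
```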

Performance

Metric | Value
Latency | 50-200ms
Memory | O(N × embedding_dim)
Cost | Free (+ LLM cost if labeling)
Scalability | Up to ~10K documents
Clustering large document sets (10K+) can be slow. Consider pre-filtering or sampling before clustering.
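
To make the memory figure concrete and show one way to sample before clustering, here is a rough sketch. The 4 bytes per value (float32 storage) and the 1536-dimension embedding are assumptions chosen for illustration, and both helper names are hypothetical.

```python
import random


def embedding_memory_bytes(num_docs, embedding_dim, bytes_per_value=4):
    """Rough O(N x embedding_dim) estimate; assumes float32 (4-byte) values."""
    return num_docs * embedding_dim * bytes_per_value


def sample_for_clustering(documents, max_docs=10_000, seed=0):
    """Uniformly sample down to max_docs; return everything if already small."""
    if len(documents) <= max_docs:
        return list(documents)
    return random.Random(seed).sample(documents, max_docs)


# 10K documents with an assumed 1536-dimension embedding.
estimate = embedding_memory_bytes(10_000, 1536)
print(f"{estimate / 1e6:.2f} MB")  # 61.44 MB

docs = [{"document_id": f"doc_{i}"} for i in range(25_000)]
subset = sample_for_clustering(docs)
print(len(subset))  # 10000
```

Uniform sampling keeps the sketch simple; pre-filtering by relevance score or metadata may preserve more of the signal you care about.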

Common Pipeline Patterns

Search + Cluster

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 5,
      "label_clusters": true
    }
  }
]

Cluster + Sample Representatives

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 10,
      "label_clusters": true
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "cluster.cluster_id",
      "samples_per_group": 2
    }
  }
]
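
Conceptually, the final stage groups documents by the dotted path `cluster.cluster_id` and keeps a fixed number per group. The sketch below illustrates that idea on a flat list of documents. How the real `sample` stage chooses documents within a group is not documented here, so ranking by `score` is an assumption, and both function names are hypothetical.

```python
from collections import defaultdict


def get_path(doc, dotted_path):
    """Resolve a dotted path such as 'cluster.cluster_id' in a nested dict."""
    value = doc
    for key in dotted_path.split("."):
        value = value[key]
    return value


def stratified_sample(documents, group_field, samples_per_group):
    """Keep up to samples_per_group documents from each group."""
    groups = defaultdict(list)
    for doc in documents:
        groups[get_path(doc, group_field)].append(doc)
    sampled = []
    for key in sorted(groups):
        # Assumption: prefer the highest-scoring documents in each group.
        ranked = sorted(groups[key], key=lambda d: d.get("score", 0.0), reverse=True)
        sampled.extend(ranked[:samples_per_group])
    return sampled


docs = [
    {"document_id": "a", "score": 0.9, "cluster": {"cluster_id": 0}},
    {"document_id": "b", "score": 0.8, "cluster": {"cluster_id": 0}},
    {"document_id": "c", "score": 0.7, "cluster": {"cluster_id": 0}},
    {"document_id": "d", "score": 0.6, "cluster": {"cluster_id": 1}},
]
picked = stratified_sample(docs, "cluster.cluster_id", samples_per_group=2)
print([d["document_id"] for d in picked])  # ['a', 'b', 'd']
```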

Theme Discovery Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "algorithm": "hdbscan",
      "min_cluster_size": 5,
      "label_clusters": true
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Summarize the main themes found in these clusters",
      "mode": "aggregate"
    }
  }
]

Diverse Results Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 5
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "code_execution",
    "parameters": {
      "code": "def transform(doc):\n    # Select top doc from each cluster\n    clusters = doc.get('clusters', [])\n    results = []\n    for c in clusters:\n        if c['documents']:\n            results.append(c['documents'][0])\n    doc['diverse_results'] = results\n    return doc"
    }
  }
]
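
The escaped `code` string above is hard to read and easy to break by hand. One option is to keep the transform as normal Python source, test it locally, and let `json.dumps` handle the escaping. The sketch below does that with the same function body. Note that taking `documents[0]` only yields the top document if each cluster's documents are already ordered by relevance, which is assumed here rather than confirmed.

```python
import json

# Same transform as in the pipeline above, kept as readable source.
TRANSFORM_SOURCE = '''def transform(doc):
    # Select top doc from each cluster
    clusters = doc.get('clusters', [])
    results = []
    for c in clusters:
        if c['documents']:
            results.append(c['documents'][0])
    doc['diverse_results'] = results
    return doc'''

# Sanity-check the function locally before shipping it in a pipeline.
namespace = {}
exec(TRANSFORM_SOURCE, namespace)
sample = {"clusters": [
    {"documents": [{"document_id": "a"}, {"document_id": "b"}]},
    {"documents": []},  # empty clusters are skipped
    {"documents": [{"document_id": "c"}]},
]}
out = namespace["transform"](sample)
print([d["document_id"] for d in out["diverse_results"]])  # ['a', 'c']

# json.dumps produces the escaped string form used in the stage config.
stage = {
    "stage_type": "apply",
    "stage_id": "code_execution",
    "parameters": {"code": TRANSFORM_SOURCE},
}
print(json.dumps(stage, indent=2))
```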

Cluster Labeling

When label_clusters: true, an LLM generates descriptive labels:
Cluster Documents | Generated Label
Docs about Python ML | "Python Machine Learning"
Docs about cloud infra | "Cloud Infrastructure"
Docs about API design | "REST API Design Patterns"

Choosing num_clusters

Result Size | Recommended Clusters
< 50 docs | 3-5 clusters
50-200 docs | 5-10 clusters
200-500 docs | 8-15 clusters
500+ docs | 10-20 clusters
Start with fewer clusters and increase if clusters are too broad. Use HDBSCAN if you don’t know the optimal number.
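
The table can be encoded as a small lookup when `num_clusters` is set programmatically. This is a sketch with a hypothetical function name. The table's ranges overlap at exactly 200 and 500 documents, so assigning those boundaries to the lower band is an assumption.

```python
def recommended_cluster_range(result_size):
    """Map a result count to the (low, high) cluster range from the table."""
    if result_size < 50:
        return (3, 5)
    if result_size <= 200:   # boundary at 200 assigned to this band (assumption)
        return (5, 10)
    if result_size <= 500:   # boundary at 500 assigned to this band (assumption)
        return (8, 15)
    return (10, 20)


# Following the advice to start with fewer clusters, begin at the low end.
low, high = recommended_cluster_range(120)
print(low, high)  # 5 10
```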

Error Handling

Error | Behavior
Missing embeddings | Skip document
Too few documents | Reduce num_clusters
Clustering fails | Return unclustered docs
Labeling fails | Use numeric labels
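
The first two fallbacks and the labeling fallback can be sketched as a pre-processing step. This is an illustration of the described behavior, not the stage's code: the `embedding` field name, capping `num_clusters` at the number of usable documents, and the `"Cluster N"` label format are all assumptions, since the exact rules are not specified above.

```python
def prepare_for_clustering(documents, num_clusters, embedding_field="embedding"):
    """Drop documents without embeddings and shrink k if too few remain."""
    # Missing embeddings: skip the document.
    usable = [d for d in documents if d.get(embedding_field)]
    skipped = len(documents) - len(usable)
    # Too few documents: reduce num_clusters (assumed cap: one doc per cluster).
    effective_k = min(num_clusters, len(usable))
    return usable, effective_k, skipped


def fallback_label(cluster_id):
    """Numeric label used when LLM labeling fails (format is an assumption)."""
    return f"Cluster {cluster_id}"


docs = [
    {"document_id": "doc_1", "embedding": [0.1, 0.2]},
    {"document_id": "doc_2"},                      # no embedding: skipped
    {"document_id": "doc_3", "embedding": [0.3, 0.4]},
]
usable, effective_k, skipped = prepare_for_clustering(docs, num_clusters=5)
print(len(usable), effective_k, skipped)  # 2 2 1
print(fallback_label(0))  # Cluster 0
```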