(Figure: Deduplicate stage showing duplicate document removal.)
The Deduplicate stage removes duplicate or near-duplicate documents from results. It supports exact field matching, content hashing, and semantic similarity deduplication.
Stage Category: REDUCE (aggregates/reduces the document set)
Transformation: N documents → M documents (M ≤ N; duplicates removed)

When to Use

| Use Case | Description |
| --- | --- |
| Cross-source deduplication | Same content returned from multiple sources |
| Near-duplicate removal | Slightly different versions of the same document |
| Chunked document cleanup | Remove overlapping chunks |
| Result diversity | Ensure varied search results |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Exact ID matching | Pre-filter in the database |
| Large-scale dedup | Deduplicate during indexing |
| Complex similarity logic | Custom api_call stage |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| method | string | content_hash | Deduplication method |
| field | string | null | Field for exact/hash matching |
| similarity_threshold | float | 0.95 | Similarity cutoff for semantic dedup (0.0-1.0) |
| keep | string | first | Which duplicate to keep: first, last, highest_score |
| content_field | string | content | Field used for content comparison |

Deduplication Methods

| Method | Description | Speed | Use Case |
| --- | --- | --- | --- |
| exact_field | Exact field value match | Fast | Matching IDs or hashes |
| content_hash | Hash-based content match | Fast | Exact content duplicates |
| semantic | Embedding similarity | Slow | Near-duplicates |

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "content_hash",
    "content_field": "content",
    "keep": "highest_score"
  }
}

Keep Strategies

| Strategy | Behavior |
| --- | --- |
| first | Keep the first occurrence in result order |
| last | Keep the last occurrence |
| highest_score | Keep the document with the highest relevance score |

Use highest_score when deduplicating search results to retain the most relevant version of duplicate content.
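The keep strategies can be sketched as a small selection function. This is an illustrative helper, not the stage's actual implementation; it assumes duplicates have already been grouped by a dedup key (such as a content hash) and that each document carries a score field.

```python
# Hypothetical sketch of the keep strategies, applied to one duplicate group.
def pick_survivor(group, keep="first"):
    """Select which document in a duplicate group to retain."""
    if keep == "first":
        return group[0]            # first occurrence in result order
    if keep == "last":
        return group[-1]           # last occurrence
    if keep == "highest_score":
        return max(group, key=lambda d: d["score"])
    raise ValueError(f"unknown keep strategy: {keep}")

docs = [
    {"document_id": "a", "score": 0.80},
    {"document_id": "b", "score": 0.95},
    {"document_id": "c", "score": 0.60},
]
pick_survivor(docs, "first")["document_id"]          # "a"
pick_survivor(docs, "highest_score")["document_id"]  # "b"
```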

Output Schema

Documents are returned with duplicates removed:
[
  {
    "document_id": "doc_123",
    "content": "Original content...",
    "score": 0.95,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 3
    }
  },
  {
    "document_id": "doc_789",
    "content": "Different content...",
    "score": 0.88,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 1
    }
  }
]
The cluster_size indicates how many duplicates were found (including the kept document).

Performance

| Method | Latency | Memory |
| --- | --- | --- |
| exact_field | O(n) | Low |
| content_hash | O(n) | Low |
| semantic | O(n²) | High |

| Metric | Value |
| --- | --- |
| exact_field/hash | < 10 ms for 100 docs |
| semantic | 50-200 ms for 100 docs |
| Max practical size | 500 docs for semantic |

Semantic deduplication compares all document pairs. For large result sets, use content_hash or limit the result count first.
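The quadratic cost comes from the number of pairs: n documents require n(n-1)/2 similarity comparisons, which grows rapidly with result-set size.

```python
# Pairwise comparison count for semantic dedup over n documents.
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

pair_count(100)  # 4950 comparisons
pair_count(500)  # 124750 comparisons
```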

Common Pipeline Patterns

Search + Dedup + Rerank

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "content_hash",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]

Multi-Source Search with Dedup

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 20
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.90,
      "keep": "highest_score"
    }
  }
]

Chunk-Level Deduplication

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "exact_field",
      "field": "metadata.parent_document_id",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 10
    }
  }
]

How Each Method Works

exact_field

Groups documents by exact field value match:
doc1.metadata.url = "https://example.com/page1"
doc2.metadata.url = "https://example.com/page1"  <- duplicate
doc3.metadata.url = "https://example.com/page2"
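A minimal sketch of this grouping, using a flat field name for simplicity (the real stage accepts dotted paths like metadata.url, which would need nested lookup). The helper name and dedup logic are illustrative, not the stage's actual code.

```python
# Hypothetical exact_field dedup: keep the first document per field value.
def dedup_exact_field(docs, field):
    seen = {}
    for doc in docs:
        key = doc.get(field)
        if key is None:
            # Missing field: document is treated as unique (see Error Handling).
            seen[id(doc)] = doc
        elif key not in seen:
            seen[key] = doc
    return list(seen.values())

docs = [
    {"document_id": "doc1", "url": "https://example.com/page1"},
    {"document_id": "doc2", "url": "https://example.com/page1"},  # duplicate
    {"document_id": "doc3", "url": "https://example.com/page2"},
]
[d["document_id"] for d in dedup_exact_field(docs, "url")]  # ["doc1", "doc3"]
```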

content_hash

Computes hash of content field:
hash(doc1.content) = "abc123"
hash(doc2.content) = "abc123"  <- duplicate (same hash)
hash(doc3.content) = "def456"
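The same idea as a hedged sketch, using SHA-256 as a stand-in hash function (the stage's actual hash algorithm is not specified here).

```python
import hashlib

# Hypothetical content_hash dedup: hash the content field, keep the first
# document per hash value.
def dedup_content_hash(docs, content_field="content"):
    seen = set()
    kept = []
    for doc in docs:
        text = doc.get(content_field, "")
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    {"document_id": "doc1", "content": "same text"},
    {"document_id": "doc2", "content": "same text"},   # duplicate hash
    {"document_id": "doc3", "content": "other text"},
]
[d["document_id"] for d in dedup_content_hash(docs)]  # ["doc1", "doc3"]
```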

semantic

Computes embedding similarity between all pairs:
similarity(doc1, doc2) = 0.97  <- duplicates (> 0.95 threshold)
similarity(doc1, doc3) = 0.42  <- not duplicates
similarity(doc2, doc3) = 0.45  <- not duplicates
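A simplified sketch of threshold-based semantic dedup over precomputed embedding vectors. The real stage computes embeddings itself; here they are assumed given, and each document is compared against the already-kept set (worst case still quadratic).

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical semantic dedup: drop any document whose similarity to a
# kept document exceeds the threshold.
def dedup_semantic(docs, threshold=0.95):
    kept = []
    for doc in docs:
        if all(cosine(doc["embedding"], k["embedding"]) <= threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    {"document_id": "doc1", "embedding": [1.0, 0.0]},
    {"document_id": "doc2", "embedding": [0.99, 0.14]},  # near-duplicate of doc1
    {"document_id": "doc3", "embedding": [0.0, 1.0]},
]
[d["document_id"] for d in dedup_semantic(docs)]  # ["doc1", "doc3"]
```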

Error Handling

| Error | Behavior |
| --- | --- |
| Missing field | Document treated as unique |
| Empty content | Hash comparison skipped |
| Embedding failure | Falls back to content_hash |