The Deduplicate stage removes duplicate or near-duplicate documents from results. It supports exact field matching, content hashing, and semantic similarity deduplication.
**Stage Category**: REDUCE (aggregates/reduces the document set)
**Transformation**: N documents → M documents, where M ≤ N (duplicates removed)
When to Use
| Use Case | Description |
|----------|-------------|
| Cross-source deduplication | Same content returned from multiple sources |
| Near-duplicate removal | Slightly different versions of the same document |
| Chunked document cleanup | Remove overlapping chunks |
| Result diversity | Ensure varied search results |
When NOT to Use
| Scenario | Recommended Alternative |
|----------|-------------------------|
| Exact ID matching | Pre-filter in the database |
| Large-scale dedup | Process during indexing |
| Complex similarity logic | Custom `api_call` |
Parameters
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `method` | string | `content_hash` | Deduplication method |
| `field` | string | `null` | Field for exact/hash matching |
| `similarity_threshold` | float | `0.95` | Similarity cutoff for semantic dedup (0.0–1.0) |
| `keep` | string | `first` | Which duplicate to keep: `first`, `last`, `highest_score` |
| `content_field` | string | `content` | Field used for content comparison |
Deduplication Methods
| Method | Description | Speed | Use Case |
|--------|-------------|-------|----------|
| `exact_field` | Exact field value match | Fast | Matching IDs or hashes |
| `content_hash` | Hash-based content match | Fast | Exact content duplicates |
| `semantic` | Embedding similarity | Slow | Near-duplicates |
Configuration Examples
Content Hash Deduplication

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "content_hash",
    "content_field": "content",
    "keep": "highest_score"
  }
}
```
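Exact Field Match

A sketch that dedupes on a shared identifier; the `metadata.doc_id` field path is an assumption for illustration, so point `field` at whatever field holds your canonical ID:

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "exact_field",
    "field": "metadata.doc_id",
    "keep": "first"
  }
}
```

Semantic Deduplication

Near-duplicate removal via embedding similarity, shown with the default 0.95 threshold:

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "semantic",
    "similarity_threshold": 0.95,
    "keep": "highest_score"
  }
}
```

Title-Based Deduplication

One plausible setup is exact matching on a `title` field (assumed to exist on your documents); hashing the title via `content_field` would behave similarly:

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "exact_field",
    "field": "title",
    "keep": "highest_score"
  }
}
```

Aggressive Near-Duplicate Removal

Lowering `similarity_threshold` below the default collapses more loosely related documents; 0.85 is an illustrative value to tune against your own data:

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "semantic",
    "similarity_threshold": 0.85,
    "keep": "highest_score"
  }
}
```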
Keep Strategies
| Strategy | Behavior |
|----------|----------|
| `first` | Keep the first occurrence in result order |
| `last` | Keep the last occurrence |
| `highest_score` | Keep the document with the highest relevance score |

Use `highest_score` when deduplicating search results to retain the most relevant version of duplicate content.
Output Schema
Documents are returned with duplicates removed:
```json
[
  {
    "document_id": "doc_123",
    "content": "Original content...",
    "score": 0.95,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 3
    }
  },
  {
    "document_id": "doc_789",
    "content": "Different content...",
    "score": 0.88,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 1
    }
  }
]
```
The `cluster_size` value indicates how many duplicates were found, including the kept document.
Performance

| Method | Latency | Memory |
|--------|---------|--------|
| `exact_field` | O(n) | Low |
| `content_hash` | O(n) | Low |
| `semantic` | O(n²) | High |

| Metric | Value |
|--------|-------|
| `exact_field` / `content_hash` | < 10 ms for 100 docs |
| `semantic` | 50–200 ms for 100 docs |
| Max practical size | 500 docs for `semantic` |
Semantic deduplication compares all document pairs, so its cost grows quadratically. For large result sets, use `content_hash` or limit results first, as sketched below.
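A sketch of the "limit first" approach, placing the limit stage from the Chunk-Level Deduplication pattern below ahead of the semantic dedup; the cap of 200 is an illustrative value under the 500-doc practical ceiling:

```json
[
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.95,
      "keep": "highest_score"
    }
  }
]
```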
Common Pipeline Patterns
Search + Dedup + Rerank
```json
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "content_hash",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]
```
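Running dedup between retrieval and reranking means duplicates never reach the reranker, so the `top_n` slots go to distinct documents and no reranker compute is spent scoring the same content twice.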
Multi-Source Search with Dedup
```json
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 20
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.90,
      "keep": "highest_score"
    }
  }
]
```
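A plain `content_hash` would miss the same content worded differently by different sources, so this pattern accepts the slower `semantic` method and loosens the threshold slightly to 0.90.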
Chunk-Level Deduplication
```json
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "exact_field",
      "field": "metadata.parent_document_id",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 10
    }
  }
]
```
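Grouping on `metadata.parent_document_id` with `keep: "highest_score"` retains only the best-scoring chunk per parent document, so each of the final ten results comes from a distinct document.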
How Each Method Works
exact_field
Groups documents by exact field value match:
```text
doc1.metadata.url = "https://example.com/page1"
doc2.metadata.url = "https://example.com/page1"  <- duplicate
doc3.metadata.url = "https://example.com/page2"
```
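A minimal configuration matching this illustration; the `metadata.url` field path is taken from the example above and should be adjusted to your schema:

```json
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "exact_field",
    "field": "metadata.url",
    "keep": "first"
  }
}
```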
content_hash
Computes hash of content field:
```text
hash(doc1.content) = "abc123"
hash(doc2.content) = "abc123"  <- duplicate (same hash)
hash(doc3.content) = "def456"
```
semantic
Computes embedding similarity between all pairs:
```text
similarity(doc1, doc2) = 0.97  <- duplicates (above the 0.95 threshold)
similarity(doc1, doc3) = 0.42  <- not duplicates
similarity(doc2, doc3) = 0.45  <- not duplicates
```
Error Handling
| Error | Behavior |
|-------|----------|
| Missing field | Document treated as unique |
| Empty content | Hash comparison skipped |
| Embedding failure | Falls back to `content_hash` |