The Deduplicate stage removes duplicate documents from the result set based on exact field matching or content similarity. This is analogous to SQL’s DISTINCT, MongoDB’s $group with $first, and Elasticsearch’s field collapsing.
Stage Category: REDUCE (removes duplicates)
Transformation: N documents → M documents (M ≤ N, duplicates removed)
When to Use
| Use Case | Description |
| --- | --- |
| URL deduplication | One result per source URL after web enrichment |
| Author collapse | Keep one result per author |
| Content dedup | Remove near-identical text chunks |
| Multi-source merge | Remove overlapping results from multiple searches |
| Query expansion cleanup | Remove duplicates from expanded query results |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Grouping with aggregation | group_by stage |
| Sampling unique categories | sample with stratified |
| Limiting result count | limit stage |
| Filtering by criteria | attribute_filter |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | field | Dedup method: field (exact match) or content (similarity) |
| fields | list[string] | required for field strategy | Field paths to compare for deduplication |
| content_field | string | content | Text field for content-based dedup |
| similarity_threshold | float | 0.95 | Similarity threshold for content dedup (0.0-1.0) |
| keep | string | first | Which duplicate to keep: first or last |
| case_sensitive | boolean | true | Whether string comparisons are case-sensitive |
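To make the field-strategy parameters listed above concrete, here is a minimal Python sketch. It is an illustration only, not the stage's actual implementation: the helper names get_path and dedupe_by_fields are hypothetical, and it assumes documents are plain dictionaries, field paths are dot-separated, and field values are hashable. The fields, keep, and case_sensitive arguments mirror the parameters of the same name.

```python
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dot-separated path such as "metadata.source_url"; None if absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedupe_by_fields(docs, fields, keep="first", case_sensitive=True):
    """Keep one document per distinct combination of the given field values."""

    def key_for(doc):
        values = []
        for path in fields:
            value = get_path(doc, path)
            # Assumption: case-insensitive mode only affects string values.
            if isinstance(value, str) and not case_sensitive:
                value = value.casefold()
            values.append(value)
        return tuple(values)  # assumes every value is hashable

    # For keep="last", walk the list backwards and restore the order afterwards.
    ordered = list(docs) if keep == "first" else list(reversed(docs))
    seen = set()  # hash set of keys already emitted: one pass, O(N)
    result = []
    for doc in ordered:
        key = key_for(doc)
        if key not in seen:
            seen.add(key)
            result.append(doc)
    if keep != "first":
        result.reverse()
    return result


docs = [
    {"id": 1, "metadata": {"author": "Ada"}},
    {"id": 2, "metadata": {"author": "ada"}},
    {"id": 3, "metadata": {"author": "Bob"}},
]
first_kept = dedupe_by_fields(docs, ["metadata.author"], case_sensitive=False)
last_kept = dedupe_by_fields(docs, ["metadata.author"], keep="last", case_sensitive=False)
print([d["id"] for d in first_kept], [d["id"] for d in last_kept])
```

Passing several paths in fields makes the key a tuple, so two documents count as duplicates only when every listed field matches.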
Deduplication Strategies
| Strategy | Performance | Best For |
| --- | --- | --- |
| field | O(N) hash-based | Exact field matching (URL, ID, title) |
| content | O(N²) pairwise | Near-duplicate text detection |
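The pairwise cost of the content strategy is easiest to see in code. The sketch below is a rough illustration under stated assumptions: the source does not say which similarity measure the stage uses, so Python's standard-library difflib.SequenceMatcher stands in for it, and the function name dedupe_by_content is hypothetical. Treating a score equal to the threshold as a duplicate is also an assumption.

```python
from difflib import SequenceMatcher


def dedupe_by_content(docs, content_field="content", similarity_threshold=0.95):
    """Drop documents whose text is near-identical to an already-kept document."""
    kept = []
    for doc in docs:
        text = doc.get(content_field) or ""
        # Compare against every document kept so far: O(N^2) comparisons overall.
        is_duplicate = any(
            SequenceMatcher(None, text, other.get(content_field) or "").ratio()
            >= similarity_threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "content": "The quick brown fox jumps over the lazy dog."},
    {"id": 2, "content": "The quick brown fox jumps over the lazy dog!"},
    {"id": 3, "content": "Completely unrelated text about databases."},
]
print([d["id"] for d in dedupe_by_content(docs, similarity_threshold=0.95)])
```

Lowering similarity_threshold merges looser matches and raising it keeps more variants, so the threshold is worth tuning against a sample of real results.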
Configuration Examples
Deduplicate by URL
{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "strategy": "field",
    "fields": ["metadata.source_url"],
    "keep": "first"
  }
}
For best results, place deduplicate after sorting or reranking so that keep: "first" retains the highest-scored, and therefore most relevant, copy of each duplicate.
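The effect of stage ordering can be seen in a small Python sketch. It assumes each document carries a hypothetical score field, and dedupe_keep_first is an illustrative helper rather than part of the stage's API.

```python
def dedupe_keep_first(docs, field):
    """Keep the first document seen for each value of `field` (hypothetical helper)."""
    seen = set()
    result = []
    for doc in docs:
        key = doc.get(field)
        if key not in seen:
            seen.add(key)
            result.append(doc)
    return result


results = [
    {"url": "https://example.com/a", "score": 0.41},
    {"url": "https://example.com/b", "score": 0.77},
    {"url": "https://example.com/a", "score": 0.93},
]

# Deduplicating in arrival order keeps the lower-scored copy of /a.
unsorted_kept = dedupe_keep_first(results, "url")

# Sorting by score first means "first" is also "best" for every URL.
ranked = sorted(results, key=lambda d: d["score"], reverse=True)
sorted_kept = dedupe_keep_first(ranked, "url")

print([d["score"] for d in unsorted_kept])
print([d["score"] for d in sorted_kept])
```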
Performance
| Metric | Value |
| --- | --- |
| Latency | < 5ms (field) / 10-100ms (content) |
| Memory | O(N) hash set (field) / O(N) content cache (content) |
| Cost | Free |
| Complexity | O(N) field / O(N²) content |
Common Pipeline Patterns
Web Search Deduplication
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        {
          "input": {"text": "{{INPUT.query}}"},
          "uri": "mixpeek://text_extractor@v1/embedding"
        }
      ],
      "limit": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "field",
      "fields": ["metadata.source_url"]
    }
  }
]
Cross-Collection Dedup
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        {
          "input": {"text": "{{INPUT.query}}"},
          "uri": "mixpeek://text_extractor@v1/embedding"
        }
      ],
      "limit": 100
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "document_field": "content"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "content",
      "content_field": "content",
      "similarity_threshold": 0.85
    }
  }
]
Error Handling
| Error | Behavior |
| --- | --- |
| Field doesn’t exist | Documents with missing fields have None as key value |
| All unique documents | Returns all documents unchanged |
| Empty input | Returns empty result set |
| Single document | Returned as-is (no duplicates possible) |
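The missing-field behavior has a consequence worth illustrating. If None is treated like any other key value, which the description above suggests but does not state outright, then every document lacking the field shares one key and the group collapses to a single result. The Python sketch below works under that assumption; key_for is a hypothetical helper and the documents are made-up examples.

```python
def key_for(doc, path):
    """Return the value at a dot-separated path, or None when any part is missing."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


docs = [
    {"id": 1, "metadata": {"source_url": "https://example.com/a"}},
    {"id": 2, "metadata": {}},  # field missing
    {"id": 3},                  # parent object missing
    {"id": 4, "metadata": {"source_url": "https://example.com/a"}},
]

keys = [key_for(d, "metadata.source_url") for d in docs]

# Assumed keep="first" pass: ids 2 and 3 share the key None, so only id 2 survives.
seen, kept_ids = set(), []
for doc, key in zip(docs, keys):
    if key not in seen:
        seen.add(key)
        kept_ids.append(doc["id"])

print(kept_ids)
```

If that collapsing is unwanted, one option is to filter out documents that lack the field before the deduplicate stage runs.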
Group By - Group documents with aggregation
Limit - Truncate results after deduplication
Sample - Random sampling (different from dedup)
Unwind - Inverse: expand grouped items