[Figure: Deduplicate stage showing removal of duplicate documents]
The Deduplicate stage removes duplicate documents from the result set based on exact field matching or content similarity. This is analogous to SQL’s DISTINCT, MongoDB’s $group with $first, and Elasticsearch’s field collapsing.
Stage Category: REDUCE (Removes duplicates)
Transformation: N documents → M documents (M ≤ N, duplicates removed)

When to Use

| Use Case | Description |
| --- | --- |
| URL deduplication | One result per source URL after web enrichment |
| Author collapse | Keep one result per author |
| Content dedup | Remove near-identical text chunks |
| Multi-source merge | Remove overlapping results from multiple searches |
| Query expansion cleanup | Remove duplicates from expanded query results |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Grouping with aggregation | group_by stage |
| Sampling unique categories | sample with stratified |
| Limiting result count | limit stage |
| Filtering by criteria | attribute_filter |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| strategy | string | field | Dedup method: field (exact match) or content (similarity) |
| fields | list[string] | required for field | Field paths to compare for deduplication |
| content_field | string | content | Text field for content-based dedup |
| similarity_threshold | float | 0.95 | Similarity threshold for content dedup (0.0-1.0) |
| keep | string | first | Which duplicate to keep: first or last |
| case_sensitive | boolean | true | Whether string comparisons are case-sensitive |
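
How fields, keep, and case_sensitive might interact for the field strategy can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the stage's actual implementation: the helper names get_path and dedup_by_fields are hypothetical, dotted paths are assumed to walk nested dictionaries, and field values are assumed to be hashable scalars.

```python
from typing import Any


def get_path(doc: dict, path: str) -> Any:
    """Resolve a dotted path such as 'metadata.source_url'; None if absent."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedup_by_fields(docs, fields, keep="first", case_sensitive=True):
    """Keep one document per distinct combination of field values."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    def key_of(doc):
        values = []
        for path in fields:
            value = get_path(doc, path)
            # Assumption: case-insensitive mode only affects string values.
            if isinstance(value, str) and not case_sensitive:
                value = value.casefold()
            values.append(value)
        return tuple(values)

    chosen = {}  # dedup key -> index of the document to keep
    for index, doc in enumerate(docs):
        key = key_of(doc)
        if keep == "last" or key not in chosen:
            chosen[key] = index
    # Emit survivors in their original relative order.
    return [docs[i] for i in sorted(chosen.values())]


docs = [
    {"id": 1, "metadata": {"source_url": "https://a.example/x"}},
    {"id": 2, "metadata": {"source_url": "https://A.example/x"}},
    {"id": 3, "metadata": {"source_url": "https://b.example/y"}},
    {"id": 4, "metadata": {"source_url": "https://a.example/x"}},
]
first = dedup_by_fields(docs, ["metadata.source_url"])
last = dedup_by_fields(docs, ["metadata.source_url"], keep="last")
folded = dedup_by_fields(docs, ["metadata.source_url"], case_sensitive=False)
```

With case-sensitive matching, documents 1 and 2 count as distinct because their URLs differ only in letter case. Setting case_sensitive to false folds them into one group.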

Deduplication Strategies

| Strategy | Performance | Best For |
| --- | --- | --- |
| field | O(N) hash-based | Exact field matching (URL, ID, title) |
| content | O(N²) pairwise | Near-duplicate text detection |
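
The pairwise nature of the content strategy can be illustrated with a short Python sketch. The similarity measure the stage uses is not specified here, so this example substitutes the character-level ratio from Python's standard difflib module. The function name dedup_by_content and the choice to treat a score equal to the threshold as a duplicate are also assumptions.

```python
from difflib import SequenceMatcher


def dedup_by_content(docs, content_field="content", similarity_threshold=0.95):
    """Drop a document when it is too similar to one that was already kept."""
    kept = []
    for doc in docs:
        text = doc.get(content_field) or ""
        # Each candidate is compared against every survivor so far,
        # which is what makes the worst case quadratic in N.
        is_duplicate = any(
            SequenceMatcher(None, text, other.get(content_field) or "").ratio()
            >= similarity_threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(doc)
    return kept


chunks = [
    {"id": 1, "content": "The quick brown fox jumps over the lazy dog."},
    {"id": 2, "content": "The quick brown fox jumps over the lazy dog!"},
    {"id": 3, "content": "Completely different text about databases."},
]
near_dupes_removed = dedup_by_content(chunks, similarity_threshold=0.95)
exact_only = dedup_by_content(chunks, similarity_threshold=1.0)
```

Lowering similarity_threshold makes the check more aggressive: more documents are treated as near-duplicates and removed. A threshold of 1.0 only removes text that the chosen measure scores as identical.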

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "strategy": "field",
    "fields": ["metadata.source_url"],
    "keep": "first"
  }
}
For best results, place deduplicate after sorting or reranking. With keep: "first", the copy retained from each group of duplicates is then the highest-scored one, so you keep the most relevant version of each document.
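
The effect of ordering on keep: "first" can be seen in a small Python sketch. The document shape (url and score keys) and the helper dedup_keep_first are hypothetical and only serve the illustration.

```python
def dedup_keep_first(docs, field):
    """Keep the first document seen for each distinct value of `field`."""
    seen = set()
    result = []
    for doc in docs:
        key = doc.get(field)
        if key not in seen:
            seen.add(key)
            result.append(doc)
    return result


results = [
    {"url": "https://example.com/a", "score": 0.62},
    {"url": "https://example.com/b", "score": 0.80},
    {"url": "https://example.com/a", "score": 0.91},
]

# Deduplicating in arrival order keeps the weaker copy of /a (0.62).
unsorted_pick = dedup_keep_first(results, "url")

# Sorting by score first means the strongest copy of /a (0.91) is seen first.
ranked = sorted(results, key=lambda d: d["score"], reverse=True)
sorted_pick = dedup_keep_first(ranked, "url")
```

In the unsorted case the retained copy of the duplicated URL is whichever happened to arrive first. After sorting by score, the retained copy is the best-scored one.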

Performance

| Metric | Value |
| --- | --- |
| Latency | < 5ms (field) / 10-100ms (content) |
| Memory | O(N) hash set (field) / O(N) content cache (content) |
| Cost | Free |
| Complexity | O(N) field / O(N²) content |
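
To make the gap between O(N) and O(N²) concrete, a naive all-pairs comparison of N documents needs at most N(N-1)/2 similarity checks. The short Python sketch below computes that upper bound. The function name is hypothetical, and the figures are a worst-case count for a naive pairwise approach, not measurements of the stage.

```python
def worst_case_comparisons(n: int) -> int:
    """Upper bound on similarity checks for a naive pairwise comparison."""
    return n * (n - 1) // 2


for n in (10, 50, 100):
    # The field strategy needs roughly n hash lookups for the same input.
    print(n, worst_case_comparisons(n))
```

Under this bound, 100 documents can require up to 4950 similarity checks. This is one reason to apply a limit upstream before using the content strategy.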

Common Pipeline Patterns

Web Search Deduplication

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "field",
      "fields": ["metadata.source_url"]
    }
  }
]

Cross-Collection Dedup

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 100
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "document_field": "content"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "strategy": "content",
      "content_field": "content",
      "similarity_threshold": 0.85
    }
  }
]

Error Handling

| Error | Behavior |
| --- | --- |
| Field doesn’t exist | Documents missing the field use None as the key value |
| All unique documents | Returns all documents unchanged |
| Empty input | Returns empty result set |
| Single document | Returned as-is (no duplicates possible) |
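
The edge cases above can be walked through with a small Python sketch. It assumes that a None key is hashed like any other value, in which case every document missing the field falls into the same group. The source only states that the key value is None, so whether the stage actually collapses such documents is worth verifying on a small sample. The helper names are hypothetical.

```python
from typing import Any


def field_key(doc: dict, path: str) -> Any:
    """Resolve a dotted path; a missing field yields None as the key."""
    current: Any = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def dedup(docs, path):
    """Keep the first document per key (assumption: None is a normal key)."""
    seen = set()
    kept = []
    for doc in docs:
        key = field_key(doc, path)
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


docs = [
    {"id": 1, "metadata": {"source_url": "https://example.com/a"}},
    {"id": 2, "metadata": {}},  # field missing, key is None
    {"id": 3},                  # parent object missing, key is also None
]

missing_field_result = dedup(docs, "metadata.source_url")
empty_result = dedup([], "metadata.source_url")
single_result = dedup(docs[:1], "metadata.source_url")
```

If documents without the field should all be preserved, one option under this assumption is to filter or backfill the field in an earlier stage before deduplicating.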

Related Stages

  • Group By - Group documents with aggregation
  • Limit - Truncate results after deduplication
  • Sample - Random sampling (different from dedup)
  • Unwind - Inverse: expand grouped items