(Figure: Deduplicate stage showing duplicate document removal.)
The Deduplicate stage removes duplicate or near-duplicate documents from results. It supports exact field matching, content hashing, and semantic similarity deduplication.
Stage Category: REDUCE (aggregates/reduces the document set)
Transformation: N documents → M documents (M ≤ N; duplicates removed)

When to Use

| Use Case | Description |
| --- | --- |
| Cross-source deduplication | Same content returned from multiple sources |
| Near-duplicate removal | Slightly different versions of the same document |
| Chunked document cleanup | Remove overlapping chunks |
| Result diversity | Ensure varied search results |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Exact ID matching | Pre-filter in the database |
| Large-scale dedup | Deduplicate during indexing |
| Complex similarity logic | Custom api_call stage |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| method | string | content_hash | Deduplication method |
| field | string | null | Field for exact/hash matching |
| similarity_threshold | float | 0.95 | Similarity cutoff for semantic dedup (0.0-1.0) |
| keep | string | first | Which duplicate to keep: first, last, highest_score |
| content_field | string | content | Field used for content comparison |

Deduplication Methods

| Method | Description | Speed | Use Case |
| --- | --- | --- | --- |
| exact_field | Exact field value match | Fast | Matching IDs or hashes |
| content_hash | Hash-based content match | Fast | Exact content duplicates |
| semantic | Embedding similarity | Slow | Near-duplicates |

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "deduplicate",
  "parameters": {
    "method": "content_hash",
    "content_field": "content",
    "keep": "highest_score"
  }
}

Keep Strategies

| Strategy | Behavior |
| --- | --- |
| first | Keep the first occurrence in result order |
| last | Keep the last occurrence |
| highest_score | Keep the document with the highest relevance score |

Use highest_score when deduplicating search results to retain the most relevant version of duplicate content.
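The keep strategies can be sketched as a small selection function. This is an illustrative helper, not the stage's actual implementation; it assumes duplicates have already been grouped by a dedup key (such as a content hash) and that each document carries a score field.

```python
# Hypothetical sketch of the keep strategies, applied to one duplicate group.
def pick_survivor(group, keep="first"):
    """Select which document in a duplicate group to retain."""
    if keep == "first":
        return group[0]            # first occurrence in result order
    if keep == "last":
        return group[-1]           # last occurrence
    if keep == "highest_score":
        return max(group, key=lambda d: d["score"])
    raise ValueError(f"unknown keep strategy: {keep}")

docs = [
    {"document_id": "a", "score": 0.80},
    {"document_id": "b", "score": 0.95},
    {"document_id": "c", "score": 0.60},
]
pick_survivor(docs, "first")["document_id"]          # "a"
pick_survivor(docs, "highest_score")["document_id"]  # "b"
```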

Output Schema

Documents are returned with duplicates removed:
[
  {
    "document_id": "doc_123",
    "content": "Original content...",
    "score": 0.95,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 3
    }
  },
  {
    "document_id": "doc_789",
    "content": "Different content...",
    "score": 0.88,
    "dedup_info": {
      "is_duplicate": false,
      "cluster_size": 1
    }
  }
]
The cluster_size indicates how many duplicates were found (including the kept document).

Performance

| Method | Latency | Memory |
| --- | --- | --- |
| exact_field | O(n) | Low |
| content_hash | O(n) | Low |
| semantic | O(n²) | High |

| Metric | Value |
| --- | --- |
| exact_field/hash | < 10 ms for 100 docs |
| semantic | 50-200 ms for 100 docs |
| Max practical size | 500 docs for semantic |

Semantic deduplication compares all document pairs. For large result sets, use content_hash or limit the result count first.
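The quadratic cost comes from the number of pairs: n documents require n(n-1)/2 similarity comparisons, which grows rapidly with result-set size.

```python
# Pairwise comparison count for semantic dedup over n documents.
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

pair_count(100)  # 4950 comparisons
pair_count(500)  # 124750 comparisons
```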

Common Pipeline Patterns

Search + Dedup + Rerank

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "content_hash",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]

Multi-Source Search with Dedup

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 20
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.90,
      "keep": "highest_score"
    }
  }
]

Chunk-Level Deduplication

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "exact_field",
      "field": "metadata.parent_document_id",
      "keep": "highest_score"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 10
    }
  }
]

How Each Method Works

exact_field

Groups documents by exact field value match:
doc1.metadata.url = "https://example.com/page1"
doc2.metadata.url = "https://example.com/page1"  <- duplicate
doc3.metadata.url = "https://example.com/page2"
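A minimal sketch of this grouping, using a flat field name for simplicity (the real stage accepts dotted paths like metadata.url, which would need nested lookup). The helper name and dedup logic are illustrative, not the stage's actual code.

```python
# Hypothetical exact_field dedup: keep the first document per field value.
def dedup_exact_field(docs, field):
    seen = {}
    for doc in docs:
        key = doc.get(field)
        if key is None:
            # Missing field: document is treated as unique (see Error Handling).
            seen[id(doc)] = doc
        elif key not in seen:
            seen[key] = doc
    return list(seen.values())

docs = [
    {"document_id": "doc1", "url": "https://example.com/page1"},
    {"document_id": "doc2", "url": "https://example.com/page1"},  # duplicate
    {"document_id": "doc3", "url": "https://example.com/page2"},
]
[d["document_id"] for d in dedup_exact_field(docs, "url")]  # ["doc1", "doc3"]
```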

content_hash

Computes hash of content field:
hash(doc1.content) = "abc123"
hash(doc2.content) = "abc123"  <- duplicate (same hash)
hash(doc3.content) = "def456"
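The same idea as a hedged sketch, using SHA-256 as a stand-in hash function (the stage's actual hash algorithm is not specified here).

```python
import hashlib

# Hypothetical content_hash dedup: hash the content field, keep the first
# document per hash value.
def dedup_content_hash(docs, content_field="content"):
    seen = set()
    kept = []
    for doc in docs:
        text = doc.get(content_field, "")
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    {"document_id": "doc1", "content": "same text"},
    {"document_id": "doc2", "content": "same text"},   # duplicate hash
    {"document_id": "doc3", "content": "other text"},
]
[d["document_id"] for d in dedup_content_hash(docs)]  # ["doc1", "doc3"]
```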

semantic

Computes embedding similarity between all pairs:
similarity(doc1, doc2) = 0.97  <- duplicates (> 0.95 threshold)
similarity(doc1, doc3) = 0.42  <- not duplicates
similarity(doc2, doc3) = 0.45  <- not duplicates
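A simplified sketch of threshold-based semantic dedup over precomputed embedding vectors. The real stage computes embeddings itself; here they are assumed given, and each document is compared against the already-kept set (worst case still quadratic).

```python
import math

# Cosine similarity between two embedding vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical semantic dedup: drop any document whose similarity to a
# kept document exceeds the threshold.
def dedup_semantic(docs, threshold=0.95):
    kept = []
    for doc in docs:
        if all(cosine(doc["embedding"], k["embedding"]) <= threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    {"document_id": "doc1", "embedding": [1.0, 0.0]},
    {"document_id": "doc2", "embedding": [0.99, 0.14]},  # near-duplicate of doc1
    {"document_id": "doc3", "embedding": [0.0, 1.0]},
]
[d["document_id"] for d in dedup_semantic(docs)]  # ["doc1", "doc3"]
```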

Error Handling

| Error | Behavior |
| --- | --- |
| Missing field | Document treated as unique |
| Empty content | Hash comparison skipped |
| Embedding failure | Falls back to content_hash |