Retriever Stages

Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components.

Stage Categories

Stages are organized into five categories based on how they transform the document set:

FILTER

Reduce the document set by matching criteria. Outputs a subset of input documents.

Stages: feature_search, attribute_filter, llm_filter, agent_search, query_expand

SORT

Reorder documents by relevance or field values. Same documents, different order.

Stages: sort_relevance, sort_attribute, mmr, rerank, score_normalize

REDUCE

Aggregate or reduce the document count. Combine, group, or sample results.

Stages: aggregate, group_by, cluster, sample, summarize, limit, deduplicate

APPLY

Enrich or transform documents. May add fields, create new documents, or restructure data.

Stages: json_transform, rag_prepare, external_web_search, api_call, sql_lookup, llm_enrich, taxonomy_enrich, document_enrich, cross_compare, web_scrape, unwind

ENRICH

Execute custom code in isolated sandboxes for dynamic enrichments.

Stages: code_execution
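The category lists above can be captured as a small lookup table, which may be useful for checking that a stage's `stage_type` agrees with its `stage_id` before a pipeline is submitted. This is a minimal Python sketch: `STAGE_CATEGORIES`, `category_of`, and `check_stage` are hypothetical helper names, not part of any Mixpeek SDK, and the lowercase `"enrich"` value for `code_execution` is assumed by analogy with the other categories.

```python
# Stage IDs grouped by category, copied from the lists above.
# The lowercase category names mirror the "stage_type" values used in the
# pipeline examples; "enrich" is an assumption made by analogy.
STAGE_CATEGORIES = {
    "filter": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "sort": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "reduce": ["aggregate", "group_by", "cluster", "sample", "summarize",
               "limit", "deduplicate"],
    "apply": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "llm_enrich", "taxonomy_enrich",
              "document_enrich", "cross_compare", "web_scrape", "unwind"],
    "enrich": ["code_execution"],
}


def category_of(stage_id: str) -> str:
    """Return the category a stage ID belongs to, or raise KeyError."""
    for category, stage_ids in STAGE_CATEGORIES.items():
        if stage_id in stage_ids:
            return category
    raise KeyError(f"unknown stage_id: {stage_id}")


def check_stage(stage: dict) -> bool:
    """True when the stage's declared type matches its ID's category."""
    return category_of(stage["stage_id"]) == stage["stage_type"]
```

A check like this catches a mislabeled stage (for example, `rerank` declared as a `filter`) locally rather than at request time.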

All Stages

Filter Stages

Stage	Description
Feature Search	Search by vector similarity using multimodal embeddings
Attribute Filter	Filter by metadata fields with boolean logic (AND/OR/NOT)
LLM Filter	Semantic filtering using LLM-based evaluation
Agent Search	LLM-driven multi-step retrieval with iterative reasoning and tool orchestration
Query Expand	LLM-powered query expansion with RRF result fusion
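The RRF (Reciprocal Rank Fusion) step mentioned for Query Expand can be illustrated with a short sketch. Each query variation produces its own ranked list, and a document's fused score is the sum of `1 / (k + rank)` over the lists it appears in. The function name `rrf_fuse` is hypothetical, and `k = 60` is the constant commonly used in the literature; whether Mixpeek uses the same value is an assumption.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical result lists, one per query variation.
fused = rrf_fuse([
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
])
```

Documents that rank well across several variations rise to the top even when no single list places them first.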

Sort Stages

Stage	Description
Sort Relevance	Reorder by relevance scores
Sort Attribute	Order by any metadata field (dates, price, etc.)
MMR	Diversify results with Maximal Marginal Relevance
Rerank	Re-score with cross-encoder models (e.g., BGE reranker)
Score Normalize	Rescale scores to a common range for consistent comparison
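To make the MMR row concrete, here is a minimal sketch of Maximal Marginal Relevance over plain Python lists. At each step it picks the document that maximizes `lambda * relevance - (1 - lambda) * redundancy`, where redundancy is the highest similarity to anything already selected. The function names, the `lambda_` parameter, and the use of cosine similarity are illustrative assumptions, not the `mmr` stage's actual parameters.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Select up to top_k document IDs, trading relevance against redundancy.

    doc_vecs: dict mapping document ID -> embedding vector.
    lambda_: 1.0 means pure relevance; lower values favor diversity.
    """
    remaining = list(doc_vecs)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# "a" and "b" are near-duplicates; "c" is less relevant but different.
docs = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.2, 1.0]}
diverse = mmr([1.0, 1.0], docs, top_k=2, lambda_=0.5)
relevant_only = mmr([1.0, 1.0], docs, top_k=2, lambda_=1.0)
```

With `lambda_=0.5` the near-duplicate `b` is skipped in favor of `c`; with `lambda_=1.0` the selection falls back to plain relevance order.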

Reduce Stages

Stage	Description
Aggregate	Compute COUNT, SUM, AVG, etc. on results
Group By	Group documents by field value (decompose/recompose)
Cluster	Discover themes via embedding-based clustering
Sample	Random or stratified sampling of results
Summarize	Condense documents into an LLM-generated summary
Limit	Truncate results to a maximum count with optional offset
Deduplicate	Remove duplicate documents by field or content similarity
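The Limit and Deduplicate rows describe simple list operations, sketched below on plain dictionaries. The helper names and the keep-first behavior for duplicates are assumptions made for illustration; the actual stages may expose different options (for example, similarity-based deduplication, which is not shown here).

```python
def deduplicate(docs, field):
    """Keep the first document seen for each distinct value of `field`."""
    seen = set()
    unique = []
    for doc in docs:
        key = doc.get(field)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def limit(docs, max_count, offset=0):
    """Skip `offset` documents, then return at most `max_count` of the rest."""
    return docs[offset:offset + max_count]


results = [
    {"id": 1, "sku": "a"},
    {"id": 2, "sku": "b"},
    {"id": 3, "sku": "a"},  # duplicate sku, dropped by deduplicate
    {"id": 4, "sku": "c"},
]
unique = deduplicate(results, "sku")
page = limit(unique, max_count=2, offset=1)
```

Running deduplication before the limit keeps pages full: limiting first could return fewer than `max_count` distinct items.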

Apply Stages

Stage	Description
JSON Transform	Reshape documents using Jinja2 templates
RAG Prepare	Format for LLM context with token management and citations
External Web Search	Augment results with Exa's AI-native web search
API Call	Enrich with external REST API responses
SQL Lookup	Join with PostgreSQL/Snowflake data
LLM Enrich	Generate new fields with LLM prompts
Taxonomy Enrich	Classify documents against taxonomy nodes
Document Enrich	Cross-collection joins (LEFT JOIN)
Cross Compare	Multi-tier cross-collection matching with classification
Web Scrape	Extract full page content from URLs
Unwind	Decompose array fields into separate documents
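The Unwind row can be pictured as follows: a document whose field holds an array becomes one document per array element. This sketch handles only top-level fields, drops documents whose array is empty, and passes through documents where the field is missing or not a list; all three behaviors are assumptions chosen for illustration rather than documented behavior of the `unwind` stage.

```python
def unwind(docs, field):
    """Emit one copy of each document per element of its `field` array."""
    expanded = []
    for doc in docs:
        values = doc.get(field)
        if not isinstance(values, list):
            # Field missing or not an array: pass the document through.
            expanded.append(doc)
            continue
        for value in values:
            # Copy the document, replacing the array with a single element.
            expanded.append({**doc, field: value})
    return expanded


docs = [
    {"id": 1, "tags": ["x", "y"]},
    {"id": 2, "tags": []},
    {"id": 3},
]
flat = unwind(docs, "tags")
```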

Enrich Stages

Stage	Description
Code Execution	Execute Python/TypeScript/JavaScript in sandboxes

Pipeline Patterns

Basic RAG Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "document_field": "content",
      "top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "rag_prepare",
    "parameters": {
      "max_tokens": 8000,
      "output_mode": "single_context"
    }
  }
]
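If pipelines are assembled in application code rather than written by hand, the JSON above can be generated from a small function. The sketch below reproduces the same three stages with Python dictionaries; `basic_rag_pipeline` and its arguments are hypothetical names, and the sanity check on `top_k` is an addition for illustration, not a documented constraint.

```python
import json

EMBEDDING_URI = "mixpeek://text_extractor@v1/embedding"  # from the example above


def basic_rag_pipeline(candidate_limit=50, top_k=10, max_tokens=8000):
    """Build the Basic RAG stage list shown above as Python dictionaries."""
    if top_k > candidate_limit:
        # Reranking cannot return more documents than the search produced.
        raise ValueError("top_k must not exceed candidate_limit")
    return [
        {
            "stage_type": "filter",
            "stage_id": "feature_search",
            "parameters": {
                "feature_uris": [
                    {"input": {"text": "{{INPUT.query}}"}, "uri": EMBEDDING_URI}
                ],
                "limit": candidate_limit,
            },
        },
        {
            "stage_type": "sort",
            "stage_id": "rerank",
            "parameters": {
                "inference_name": "baai_bge_reranker_v2_m3",
                "query": "{{INPUT.query}}",
                "document_field": "content",
                "top_k": top_k,
            },
        },
        {
            "stage_type": "apply",
            "stage_id": "rag_prepare",
            "parameters": {
                "max_tokens": max_tokens,
                "output_mode": "single_context",
            },
        },
    ]


if __name__ == "__main__":
    print(json.dumps(basic_rag_pipeline(), indent=2))
```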

E-Commerce Search

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 100
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "AND": [
        {"field": "metadata.in_stock", "operator": "eq", "value": true},
        {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
      ]
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_attribute",
    "parameters": {
      "field": "metadata.{{INPUT.sort_by}}",
      "direction": "{{INPUT.sort_order}}"
    }
  }
]
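To clarify what the `AND` block in the `attribute_filter` stage above expresses, the following sketch evaluates the same two conditions against in-memory documents. It is a local illustration only, not how Mixpeek executes filters: the helper names are hypothetical, only the `eq` and `lte` operators from the example are covered, and `max_price` is assumed to arrive as a number after template substitution (how the service casts templated values is worth verifying).

```python
import operator

# Only the two operators used in the example above.
OPERATORS = {"eq": operator.eq, "lte": operator.le}


def get_path(doc, path):
    """Resolve a dotted path such as 'metadata.price'; None if absent."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches_all(doc, conditions):
    """True when every condition holds (AND semantics)."""
    for cond in conditions:
        actual = get_path(doc, cond["field"])
        if actual is None:
            return False
        if not OPERATORS[cond["operator"]](actual, cond["value"]):
            return False
    return True


conditions = [
    {"field": "metadata.in_stock", "operator": "eq", "value": True},
    {"field": "metadata.price", "operator": "lte", "value": 50},
]
products = [
    {"id": "p1", "metadata": {"in_stock": True, "price": 30}},
    {"id": "p2", "metadata": {"in_stock": False, "price": 20}},
    {"id": "p3", "metadata": {"in_stock": True, "price": 80}},
    {"id": "p4", "metadata": {"in_stock": True}},  # no price field
]
kept = [p["id"] for p in products if matches_all(p, conditions)]
```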

Research Assistant

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10,
      "category": "research_paper"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 15
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Synthesize findings on: {{INPUT.query}}"
    }
  }
]

Enriched Document Retrieval

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "target_collection_id": "col_users",
      "source_field": "metadata.author_id",
      "target_field": "user_id",
      "output_field": "author"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "prompt": "Extract key topics and entities from: {{DOC.content}}",
      "output_field": "analysis"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 10
    }
  }
]
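The `document_enrich` stage in this pipeline is described elsewhere on this page as a LEFT JOIN across collections. A rough in-memory equivalent is sketched below, using the same four parameter names. Attaching `None` when no match exists is an assumption about LEFT JOIN behavior made for illustration; the actual stage may omit the field or behave differently.

```python
def get_path(doc, path):
    """Resolve a dotted path such as 'metadata.author_id'; None if absent."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def left_join(docs, targets, source_field, target_field, output_field):
    """Attach the matching target document to each source document.

    Every source document is kept (LEFT JOIN); unmatched ones get None.
    """
    index = {t[target_field]: t for t in targets}
    joined = []
    for doc in docs:
        key = get_path(doc, source_field)
        joined.append({**doc, output_field: index.get(key)})
    return joined


articles = [
    {"id": "d1", "metadata": {"author_id": "u1"}},
    {"id": "d2", "metadata": {"author_id": "u9"}},  # no such user
]
users = [{"user_id": "u1", "name": "Ada"}]

enriched = left_join(
    articles,
    users,
    source_field="metadata.author_id",
    target_field="user_id",
    output_field="author",
)
```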

Stage Selection Guide

Goal	Recommended Stage
Find semantically similar documents	feature_search
Filter by metadata fields	attribute_filter
Filter by content meaning	llm_filter
Improve recall with query variations	query_expand
Get best relevance ranking	rerank
Order by price/date/rating	sort_attribute
Re-sort by relevance scores	sort_relevance
Diversify results	mmr
Normalize scores across sources	score_normalize
Truncate to top-N results	limit
Remove duplicate results	deduplicate
Expand array fields to documents	unwind
Answer questions from docs	summarize
Compute statistics on results	aggregate
Find themes in results	cluster
Group by category/author	group_by
Random/stratified sampling	sample
Add external API data	api_call
Add database data	sql_lookup
Join Mixpeek collections	document_enrich
Classify documents	taxonomy_enrich
Generate new fields with LLM	llm_enrich
Transform document structure	json_transform
Prepare for LLM context	rag_prepare
Custom code transformations	code_execution
Add web search results	external_web_search
Extract URL content	web_scrape

Performance Considerations

Stage	Typical Latency	Cost Driver
feature_search	5-50ms	Index storage
attribute_filter	< 5ms	Free
llm_filter	200-500ms	LLM API
query_expand	300-800ms	LLM API
rerank	50-100ms	Inference
sort_attribute	< 5ms	Free
sort_relevance	< 5ms	Free
mmr	10-50ms	Free
score_normalize	< 1ms	Free
limit	< 1ms	Free
deduplicate	5-50ms	Free
unwind	< 5ms	Free
summarize	500-2000ms	LLM API
aggregate	5-50ms	Free
cluster	50-200ms	Inference
group_by	5-20ms	Free
sample	< 5ms	Free
llm_enrich	300-800ms	LLM API
api_call	50-500ms	External API
sql_lookup	10-100ms	Database
code_execution	5-50ms	Free
rag_prepare	< 10ms	Free
json_transform	< 5ms	Free
external_web_search	100-500ms	Exa API
taxonomy_enrich	20-100ms	Inference
document_enrich	10-50ms	Database
web_scrape	500-5000ms	External

Order stages efficiently: run cheap stages that shrink the document set (attribute_filter, limit, deduplicate) before expensive ones (rerank, llm_filter, llm_enrich), so that costly processing touches fewer documents.
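The latency table can be used for a rough budget estimate by summing the per-stage ranges. The sketch below assumes stages run sequentially, treats "< N ms" as the range 0 to N, and ignores any scaling with document count, none of which the table states explicitly; `LATENCY_MS` and `estimate_latency` are hypothetical names.

```python
# (low_ms, high_ms) per stage, transcribed from the table above.
LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "llm_filter": (200, 500),
    "query_expand": (300, 800),
    "rerank": (50, 100),
    "sort_attribute": (0, 5),
    "sort_relevance": (0, 5),
    "mmr": (10, 50),
    "score_normalize": (0, 1),
    "limit": (0, 1),
    "deduplicate": (5, 50),
    "unwind": (0, 5),
    "summarize": (500, 2000),
    "aggregate": (5, 50),
    "cluster": (50, 200),
    "group_by": (5, 20),
    "sample": (0, 5),
    "llm_enrich": (300, 800),
    "api_call": (50, 500),
    "sql_lookup": (10, 100),
    "code_execution": (5, 50),
    "rag_prepare": (0, 10),
    "json_transform": (0, 5),
    "external_web_search": (100, 500),
    "taxonomy_enrich": (20, 100),
    "document_enrich": (10, 50),
    "web_scrape": (500, 5000),
}


def estimate_latency(stage_ids):
    """Sum typical latency ranges, assuming sequential execution."""
    low = sum(LATENCY_MS[s][0] for s in stage_ids)
    high = sum(LATENCY_MS[s][1] for s in stage_ids)
    return low, high


basic_rag = estimate_latency(["feature_search", "rerank", "rag_prepare"])
research = estimate_latency(
    ["feature_search", "external_web_search", "rerank", "summarize"]
)
```

Under these assumptions the Basic RAG pipeline lands at roughly 55 to 160 ms, while the Research Assistant pipeline is dominated by its `summarize` stage at roughly 655 to 2650 ms.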

Template Variables

All stages support template variables for dynamic configuration:

Variable	Description
`{{INPUT.*}}`	Input parameters from retriever call
`{{DOC.*}}`	Document fields (in APPLY/ENRICH stages)
`{{CONTEXT.*}}`	Pipeline context (index, citations)

{
  "stage_type": "filter",
  "stage_id": "attribute_filter",
  "parameters": {
    "field": "metadata.tenant_id",
    "operator": "eq",
    "value": "{{INPUT.tenant_id}}"
  }
}
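To show how placeholders like these resolve, here is a minimal substitution sketch. It is not Mixpeek's template engine: the `render` function, the regular expression, and the rule that a string consisting of a single placeholder keeps the value's original type are all assumptions made for illustration.

```python
import re

# Matches {{INPUT.x}}, {{DOC.a.b}}, {{CONTEXT.y}} with optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(INPUT|DOC|CONTEXT)\.([A-Za-z0-9_.]+)\s*\}\}")


def lookup(namespaces, namespace, path):
    """Follow a dotted path inside one namespace; raises KeyError if missing."""
    value = namespaces[namespace]
    for key in path.split("."):
        value = value[key]
    return value


def render(value, namespaces):
    """Recursively substitute placeholders in dicts, lists, and strings."""
    if isinstance(value, dict):
        return {k: render(v, namespaces) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, namespaces) for v in value]
    if isinstance(value, str):
        whole = PLACEHOLDER.fullmatch(value)
        if whole:
            # The string is exactly one placeholder: keep the value's type.
            return lookup(namespaces, whole.group(1), whole.group(2))
        # Otherwise interpolate each placeholder as text.
        return PLACEHOLDER.sub(
            lambda m: str(lookup(namespaces, m.group(1), m.group(2))), value
        )
    return value


stage = {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
        "field": "metadata.tenant_id",
        "operator": "eq",
        "value": "{{INPUT.tenant_id}}",
    },
}
rendered = render(stage, {"INPUT": {"tenant_id": "acme"}})
```

In this sketch a missing variable raises `KeyError`, which surfaces a misconfigured call early instead of silently filtering on an empty value.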
