Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components.

Stage Categories

Stages are organized into four categories based on how they transform the document set:

FILTER

Reduce the document set by matching criteria. Outputs a subset of input documents. Examples: semantic_search, hybrid_search, structured_filter, llm_filter, query_expand

SORT

Reorder documents by relevance or field values. Same documents, different order. Examples: rerank, sort_by_field, sort_relevance, mmr, learned_rerank

REDUCE

Aggregate or reduce the document count. Combine, deduplicate, or limit results. Examples: summarize, aggregate, cluster, group_by, sample

APPLY

Enrich documents with additional data. Same count, added fields. Examples: api_call, llm_enrichment, document_enrich, code_execution, rag_prepare
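
The four category contracts above can be illustrated with a small in-memory sketch. This is not Mixpeek's implementation: the function names, the document shape, and the sample fields (`score`, `in_stock`, `label`) are all hypothetical and chosen only to show what each category does to the document set.

```python
# Illustrative model of the four stage categories (hypothetical helpers,
# not the Mixpeek API). Documents are plain dicts.

def filter_stage(docs, predicate):
    """FILTER: the output is a subset of the input documents."""
    return [d for d in docs if predicate(d)]

def sort_stage(docs, key, reverse=True):
    """SORT: same documents, different order."""
    return sorted(docs, key=key, reverse=reverse)

def reduce_stage(docs, limit):
    """REDUCE: fewer documents out than in (here, a simple limit)."""
    return docs[:limit]

def apply_stage(docs, enrich):
    """APPLY: same count, each document gains fields."""
    return [{**d, **enrich(d)} for d in docs]

docs = [
    {"id": "a", "score": 0.2, "in_stock": True},
    {"id": "b", "score": 0.9, "in_stock": True},
    {"id": "c", "score": 0.5, "in_stock": False},
]

kept = filter_stage(docs, lambda d: d["in_stock"])       # drops "c"
ranked = sort_stage(kept, key=lambda d: d["score"])      # "b" before "a"
top = reduce_stage(ranked, limit=1)                      # keeps only "b"
enriched = apply_stage(top, lambda d: {"label": d["id"].upper()})
print(enriched)
```

Chaining the helpers in this order mirrors how a pipeline narrows the set before enriching it.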

All Stages

Search & Filter Stages

| Stage | Category | Description |
| --- | --- | --- |
| Semantic Search | FILTER | Vector similarity search using embeddings |
| Hybrid Search | FILTER | Combined vector + text search with RRF |
| Structured Filter | FILTER | Filter by metadata fields and conditions |
| LLM Filter | FILTER | Content-based filtering using LLMs |
| Query Expand | FILTER | LLM-powered query expansion with result fusion |
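
Hybrid search is described above as combining vector and text results with RRF (Reciprocal Rank Fusion). The sketch below shows the textbook form of RRF over two ranked lists of document IDs. The constant `k=60` is the conventional default from the RRF literature, not a documented Mixpeek setting, and the function name and sample IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs (each ordered best first).

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    k=60 is an assumed, commonly used constant.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d1", "d2", "d3"]   # hypothetical vector search ranking
text_hits = ["d3", "d1", "d4"]     # hypothetical keyword search ranking
fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)
```

Documents that appear in both lists (`d1`, `d3`) rise above those found by only one retriever, which is the behavior that makes rank fusion useful when the two score scales are not comparable.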

Sorting Stages

| Stage | Category | Description |
| --- | --- | --- |
| Rerank | SORT | Neural re-scoring with cross-encoders |
| Sort By Field | SORT | Order by any metadata field |
| Sort Relevance | SORT | Reorder by relevance scores |
| MMR | SORT | Diversify results with Maximal Marginal Relevance |
| Learned Rerank | SORT | Personalized reranking with bandit learning |
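
To make the MMR row concrete, here is a minimal sketch of greedy Maximal Marginal Relevance over embedding vectors. It follows the standard formulation (relevance to the query minus redundancy with already selected documents). The parameter name `lambda_`, the cosine helper, and the toy vectors are assumptions for illustration and may not match the parameters of the `mmr` stage.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, top_n, lambda_=0.5):
    """Greedy MMR: return indices into doc_vecs in selection order.

    lambda_=1.0 ranks purely by relevance; lower values favor diversity.
    """
    remaining = list(range(len(doc_vecs)))
    selected = []
    while remaining and len(selected) < top_n:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy is the highest similarity to anything already picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]  # docs 0 and 1 are near-duplicates
diverse = mmr(query, docs, top_n=2, lambda_=0.5)
relevance_only = mmr(query, docs, top_n=2, lambda_=1.0)
print(diverse, relevance_only)
```

With `lambda_=0.5` the second pick skips the near-duplicate in favor of the less similar document, while `lambda_=1.0` simply returns the two most relevant ones.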

Reduction Stages

| Stage | Category | Description |
| --- | --- | --- |
| Summarize | REDUCE | LLM-powered document summarization |
| Aggregate | REDUCE | Compute statistical aggregations |
| Cluster | REDUCE | Group documents by embedding similarity |
| Group By | REDUCE | Aggregate documents by field values |
| Sample | REDUCE | Random or stratified sampling |

Enrichment Stages

| Stage | Category | Description |
| --- | --- | --- |
| API Call | APPLY | Enrich with external REST APIs |
| SQL Lookup | APPLY | Join with SQL database data |
| Document Enrich | APPLY | Cross-collection joins |
| Taxonomy Enrich | APPLY | Classify against taxonomies |
| LLM Enrichment | APPLY | Extract structured data with LLMs |
| JSON Transform | APPLY | Template-based transformations |
| RAG Prepare | APPLY | Format for LLM context windows |
| Code Execution | APPLY | Execute custom Python code |
| External Web Search | APPLY | Augment with Exa web search |
| Web Scrape | APPLY | Extract content from URLs |

Pipeline Patterns

Basic RAG Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "rag_prepare",
    "parameters": {
      "max_tokens": 8000
    }
  }
]

Filtered Product Search

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100,
      "vector_weight": 0.6,
      "text_weight": 0.4
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "AND": [
          {"field": "metadata.in_stock", "operator": "eq", "value": true},
          {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
        ]
      }
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_by_field",
    "parameters": {
      "sort_field": "metadata.{{INPUT.sort_by}}",
      "order": "{{INPUT.sort_order}}"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 20
    }
  }
]
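
The `conditions` object in the `structured_filter` stage above is a small boolean tree over dotted field paths. The sketch below shows one way such a tree could be evaluated against a document. It covers only the pieces that appear in the example (`AND`, `eq`, `lte`); the helper names and sample documents are hypothetical, and the real stage may support more operators or treat missing fields differently.

```python
import operator

# Only the operators seen in the example above; others are not assumed.
OPERATORS = {"eq": operator.eq, "lte": operator.le}

def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.price'; None if missing."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def matches(doc, cond):
    """Return True if the document satisfies the condition tree."""
    if "AND" in cond:
        return all(matches(doc, child) for child in cond["AND"])
    value = get_path(doc, cond["field"])
    if value is None:
        return False  # assumption: a missing field never matches
    return OPERATORS[cond["operator"]](value, cond["value"])

conditions = {
    "AND": [
        {"field": "metadata.in_stock", "operator": "eq", "value": True},
        {"field": "metadata.price", "operator": "lte", "value": 50},
    ]
}
docs = [
    {"id": 1, "metadata": {"in_stock": True, "price": 20}},
    {"id": 2, "metadata": {"in_stock": False, "price": 10}},
    {"id": 3, "metadata": {"in_stock": True, "price": 80}},
    {"id": 4, "metadata": {"in_stock": True}},  # no price field
]
kept_ids = [d["id"] for d in docs if matches(d, conditions)]
print(kept_ids)
```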

Research Assistant

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10,
      "category": "research_paper"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.9
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "cohere-rerank-v3",
      "top_n": 15
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Synthesize findings on: {{INPUT.query}}"
    }
  }
]

Enriched Document Retrieval

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "collection_id": "users",
      "lookup_field": "user_id",
      "source_field": "metadata.author_id",
      "result_field": "author"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Extract key topics and entities",
      "output_field": "analysis"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]
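
Every pattern above shares one shape: a JSON array of stages, each with `stage_type`, `stage_id`, and `parameters`. A quick client-side shape check before submitting a pipeline might look like the following sketch. The validator is hypothetical and not part of any Mixpeek SDK; it only checks the structure visible in these examples and assumes the four lowercase `stage_type` values are the only valid ones.

```python
import json

ALLOWED_STAGE_TYPES = {"filter", "sort", "reduce", "apply"}
REQUIRED_KEYS = {"stage_type", "stage_id", "parameters"}

def validate_pipeline(stages):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for i, stage in enumerate(stages):
        missing = REQUIRED_KEYS - set(stage)
        if missing:
            problems.append(f"stage {i}: missing keys {sorted(missing)}")
            continue
        if stage["stage_type"] not in ALLOWED_STAGE_TYPES:
            problems.append(f"stage {i}: unknown stage_type {stage['stage_type']!r}")
        if not isinstance(stage["parameters"], dict):
            problems.append(f"stage {i}: parameters must be an object")
    return problems

pipeline = json.loads("""
[
  {"stage_type": "filter", "stage_id": "semantic_search",
   "parameters": {"query": "{{INPUT.query}}", "top_k": 50}},
  {"stage_type": "sort", "stage_id": "rerank",
   "parameters": {"model": "bge-reranker-v2-m3", "top_n": 10}}
]
""")
good = validate_pipeline(pipeline)
bad = validate_pipeline([{"stage_type": "rank", "stage_id": "rerank", "parameters": {}}])
print(good, bad)
```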

Stage Selection Guide

| Goal | Recommended Stage |
| --- | --- |
| Find semantically similar documents | semantic_search |
| Match exact keywords + meaning | hybrid_search |
| Filter by metadata fields | structured_filter |
| Filter by content meaning | llm_filter |
| Improve recall with query variations | query_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort_by_field |
| Re-sort by relevance scores | sort_relevance |
| Diversify results | mmr |
| Personalized ranking | learned_rerank |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Group by category/author | group_by |
| Random/stratified sampling | sample |
| Add external API data | api_call |
| Add database data | sql_lookup |
| Join Mixpeek collections | document_enrich |
| Classify documents | taxonomy_enrich |
| Extract structured data | llm_enrichment |
| Transform document structure | json_transform |
| Prepare for LLM context | rag_prepare |
| Custom transformations | code_execution |
| Add web search results | external_web_search |
| Extract URL content | web_scrape |

Performance Considerations

| Stage | Typical Latency | Cost |
| --- | --- | --- |
| semantic_search | 5-50 ms | Index storage |
| hybrid_search | 20-100 ms | Index storage |
| structured_filter | < 5 ms | Free |
| llm_filter | 200-500 ms | LLM API |
| query_expand | 300-800 ms | LLM API |
| rerank | 50-100 ms | API calls |
| sort_by_field | < 5 ms | Free |
| sort_relevance | < 5 ms | Free |
| mmr | 10-50 ms | Free |
| learned_rerank | 20-50 ms | Free |
| summarize | 500-2000 ms | LLM API |
| aggregate | 5-50 ms | Free |
| cluster | 50-200 ms | Free |
| group_by | 5-20 ms | Free |
| sample | < 5 ms | Free |
| llm_enrichment | 300-800 ms | LLM API |
| api_call | 50-500 ms | External API |
| sql_lookup | 10-100 ms | Database |
| code_execution | 5-50 ms | Free |
| rag_prepare | < 10 ms | Free |

Order stages efficiently: run cheap operations (filters, limits) before expensive ones (rerank, LLM calls), so that fewer documents reach the costly stages.
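
The effect of stage ordering can be seen with some back-of-the-envelope arithmetic. The sketch below uses a deliberately simple cost model in which each stage has a per-document cost and keeps some fraction of its input. The numbers (0.01 ms per document for a metadata filter that keeps 30%, 1 ms per document for a reranker) are invented for illustration and are not measurements of Mixpeek stages.

```python
def pipeline_cost(n_docs, stages):
    """Estimate total latency in ms.

    stages: list of (per_doc_cost_ms, keep_fraction) tuples, run in order.
    This linear model is an assumption made for illustration only.
    """
    total = 0.0
    n = n_docs
    for per_doc_ms, keep_fraction in stages:
        total += n * per_doc_ms          # pay for every document that arrives
        n = round(n * keep_fraction)     # only survivors reach the next stage
    return total

cheap_filter = (0.01, 0.3)   # hypothetical metadata filter, keeps 30%
rerank = (1.0, 1.0)          # hypothetical reranker, keeps everything

filter_first = pipeline_cost(100, [cheap_filter, rerank])
rerank_first = pipeline_cost(100, [rerank, cheap_filter])
print(f"filter first: {filter_first:.1f} ms, rerank first: {rerank_first:.1f} ms")
```

Under these assumed costs, filtering first means the reranker scores 30 documents instead of 100, which accounts for almost all of the difference between the two orderings.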

Template Variables

All stages support template variables for dynamic configuration:

| Variable | Description |
| --- | --- |
| {{INPUT.*}} | Input parameters from retriever call |
| {{DOC.*}} | Document fields (in APPLY stages) |
| {{CONTEXT.*}} | Pipeline context (index, citations) |

{
  "stage_type": "filter",
  "stage_id": "structured_filter",
  "parameters": {
    "conditions": {
      "field": "metadata.tenant_id",
      "operator": "eq",
      "value": "{{INPUT.tenant_id}}"
    }
  }
}
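
To show how `{{INPUT.*}}` placeholders such as the one above might be resolved, here is a minimal sketch of a template renderer. It is not Mixpeek's resolver: the regular expression, the `render` and `lookup` helpers, and the choice to keep the native type when a string is a single placeholder are all assumptions made for illustration.

```python
import re

# Matches {{INPUT.x}}, {{DOC.a.b}}, {{CONTEXT.y}} (assumed syntax).
TEMPLATE = re.compile(r"\{\{\s*(INPUT|DOC|CONTEXT)\.([A-Za-z0-9_.]+)\s*\}\}")

def lookup(namespaces, namespace, path):
    """Follow a dotted path inside the chosen namespace dict."""
    current = namespaces[namespace]
    for part in path.split("."):
        current = current[part]
    return current

def render(value, namespaces):
    """Recursively resolve templates in strings, dicts, and lists."""
    if isinstance(value, dict):
        return {k: render(v, namespaces) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, namespaces) for v in value]
    if not isinstance(value, str):
        return value
    whole = TEMPLATE.fullmatch(value)
    if whole:
        # Assumption: a lone placeholder keeps its native type (e.g. a number).
        return lookup(namespaces, whole.group(1), whole.group(2))
    # Placeholders embedded in longer strings are substituted as text.
    return TEMPLATE.sub(
        lambda m: str(lookup(namespaces, m.group(1), m.group(2))), value
    )

stage = {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
        "conditions": {
            "field": "metadata.tenant_id",
            "operator": "eq",
            "value": "{{INPUT.tenant_id}}",
        }
    },
}
inputs = {"INPUT": {"tenant_id": "acme", "sort_by": "price", "max_price": 50}}
resolved = render(stage, inputs)
print(resolved["parameters"]["conditions"]["value"])
```

Keeping the native type for a lone placeholder matters for comparisons like `lte` on a price, where a stringified number would compare incorrectly.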