Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components.
Stage Categories
Stages are organized into four categories based on how they transform the document set:
FILTER: Reduce the document set by matching criteria. Outputs a subset of the input documents. Examples: semantic_search, hybrid_search, structured_filter, llm_filter, query_expand
SORT: Reorder documents by relevance or field values. Same documents, different order. Examples: rerank, sort_by_field, sort_relevance, mmr, learned_rerank
REDUCE: Aggregate or reduce the document count. Combine, deduplicate, or limit results. Examples: summarize, aggregate, cluster, group_by, sample
APPLY: Enrich documents with additional data. Same count, added fields. Examples: api_call, llm_enrichment, document_enrich, code_execution, rag_prepare
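The four category contracts can be illustrated with a toy model in which a stage is a function over a list of documents. This is a minimal sketch for intuition only: the function names, the document shape, and the in-memory execution are assumptions, not Mixpeek's API.

```python
# Toy model of the four stage categories (hypothetical helpers, not Mixpeek code).

def filter_stage(docs, predicate):
    """FILTER: the output is a subset of the input documents."""
    return [d for d in docs if predicate(d)]

def sort_stage(docs, key, reverse=False):
    """SORT: same documents, different order."""
    return sorted(docs, key=key, reverse=reverse)

def reduce_stage(docs, limit):
    """REDUCE: fewer documents out than in (here, a simple limit)."""
    return docs[:limit]

def apply_stage(docs, enrich):
    """APPLY: same count, each document gains fields."""
    return [{**d, **enrich(d)} for d in docs]

docs = [
    {"id": "a", "score": 0.2, "in_stock": True},
    {"id": "b", "score": 0.9, "in_stock": False},
    {"id": "c", "score": 0.7, "in_stock": True},
]

filtered = filter_stage(docs, lambda d: d["in_stock"])
ranked = sort_stage(filtered, key=lambda d: d["score"], reverse=True)
limited = reduce_stage(ranked, 1)
enriched = apply_stage(limited, lambda d: {"label": d["id"].upper()})
print(enriched)
```

Each call takes the previous stage's output as input, which is the same composition the JSON pipelines express declaratively.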
All Stages
Search & Filter Stages
| Stage | Category | Description |
|---|---|---|
| Semantic Search | FILTER | Vector similarity search using embeddings |
| Hybrid Search | FILTER | Combined vector + text search with RRF |
| Structured Filter | FILTER | Filter by metadata fields and conditions |
| LLM Filter | FILTER | Content-based filtering using LLMs |
| Query Expand | FILTER | LLM-powered query expansion with result fusion |
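The Hybrid Search description mentions RRF (reciprocal rank fusion). As a rough illustration of the general technique, the sketch below fuses two ranked lists by summing 1 / (k + rank) per document. The constant k = 60 is a commonly used default and is an assumption here, as is the function name; how Mixpeek combines RRF with its weighting parameters is not covered by this sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with standard RRF.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

vector_hits = ["d1", "d2", "d3"]  # hypothetical vector-search ranking
text_hits = ["d3", "d1", "d4"]    # hypothetical keyword-search ranking

fused = reciprocal_rank_fusion([vector_hits, text_hits])
print(fused)  # ['d1', 'd3', 'd2', 'd4']
```

Documents that appear in both lists (d1, d3) rise above documents that appear in only one, which is the behavior that makes rank fusion useful when the two scoring scales are not comparable.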
Sorting Stages
| Stage | Category | Description |
|---|---|---|
| Rerank | SORT | Neural re-scoring with cross-encoders |
| Sort By Field | SORT | Order by any metadata field |
| Sort Relevance | SORT | Reorder by relevance scores |
| MMR | SORT | Diversify results with Maximal Marginal Relevance |
| Learned Rerank | SORT | Personalized reranking with bandit learning |
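To make the MMR row concrete, here is a minimal sketch of the textbook Maximal Marginal Relevance selection loop over toy 2-D embeddings. The function names, the `lambda_` parameter, and the use of cosine similarity are assumptions for illustration; the mmr stage's actual parameters and scoring may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, top_n, lambda_=0.5):
    """Greedy MMR: trade relevance to the query against similarity
    to documents that have already been selected."""
    selected = []
    remaining = list(doc_vecs)
    while remaining and len(selected) < top_n:
        def score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.3]
docs = {
    "a": [1.0, 0.0],  # relevant, but nearly identical to "b"
    "b": [0.9, 0.1],  # most relevant
    "c": [0.2, 1.0],  # less relevant, but different
}

print(mmr(query, docs, top_n=2))               # ['b', 'c']
print(mmr(query, docs, top_n=2, lambda_=1.0))  # ['b', 'a']
```

With `lambda_=1.0` the loop degenerates to plain relevance ordering; lowering it lets the dissimilar document "c" displace the near-duplicate "a".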
Reduction Stages
| Stage | Category | Description |
|---|---|---|
| Summarize | REDUCE | LLM-powered document summarization |
| Aggregate | REDUCE | Compute statistical aggregations |
| Cluster | REDUCE | Group documents by embedding similarity |
| Group By | REDUCE | Aggregate documents by field values |
| Sample | REDUCE | Random or stratified sampling |
Enrichment Stages
| Stage | Category | Description |
|---|---|---|
| API Call | APPLY | Enrich with external REST APIs |
| SQL Lookup | APPLY | Join with SQL database data |
| Document Enrich | APPLY | Cross-collection joins |
| Taxonomy Enrich | APPLY | Classify against taxonomies |
| LLM Enrichment | APPLY | Extract structured data with LLMs |
| JSON Transform | APPLY | Template-based transformations |
| RAG Prepare | APPLY | Format for LLM context windows |
| Code Execution | APPLY | Execute custom Python code |
| External Web Search | APPLY | Augment with Exa web search |
| Web Scrape | APPLY | Extract content from URLs |
Pipeline Patterns
Basic RAG Pipeline
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "rag_prepare",
    "parameters": {
      "max_tokens": 8000
    }
  }
]
E-Commerce Search
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100,
      "vector_weight": 0.6,
      "text_weight": 0.4
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "AND": [
          { "field": "metadata.in_stock", "operator": "eq", "value": true },
          { "field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}" }
        ]
      }
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_by_field",
    "parameters": {
      "sort_field": "metadata.{{INPUT.sort_by}}",
      "order": "{{INPUT.sort_order}}"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 20
    }
  }
]
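The structured_filter stage in this pipeline combines two clauses under AND. The sketch below shows one way such conditions could be evaluated against documents, assuming dotted field paths into nested dictionaries and only the eq and lte operators used in the example. The helpers `get_field` and `matches` are hypothetical, the sample documents are made up, and `{{INPUT.max_price}}` is assumed to have already been resolved to 50.

```python
import operator

# Only the two operators that appear in the example pipeline.
OPERATORS = {"eq": operator.eq, "lte": operator.le}

def get_field(doc, path):
    """Walk a dotted path such as 'metadata.price'; None if any key is missing."""
    value = doc
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value

def matches(doc, conditions):
    """True when every clause in the AND list holds for the document."""
    for clause in conditions["AND"]:
        actual = get_field(doc, clause["field"])
        if actual is None:
            return False  # assumption: a missing field never matches
        if not OPERATORS[clause["operator"]](actual, clause["value"]):
            return False
    return True

conditions = {
    "AND": [
        {"field": "metadata.in_stock", "operator": "eq", "value": True},
        {"field": "metadata.price", "operator": "lte", "value": 50},
    ]
}

docs = [
    {"id": "p1", "metadata": {"in_stock": True, "price": 30}},
    {"id": "p2", "metadata": {"in_stock": False, "price": 20}},
    {"id": "p3", "metadata": {"in_stock": True, "price": 80}},
    {"id": "p4", "metadata": {"in_stock": True}},
    {"id": "p5", "metadata": {"in_stock": True, "price": 50}},
]

kept = [d["id"] for d in docs if matches(d, conditions)]
print(kept)  # ['p1', 'p5']
```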
Research Assistant
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10,
      "category": "research_paper"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "deduplicate",
    "parameters": {
      "method": "semantic",
      "similarity_threshold": 0.9
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "cohere-rerank-v3",
      "top_n": 15
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Synthesize findings on: {{INPUT.query}}"
    }
  }
]
Enriched Document Retrieval
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "collection_id": "users",
      "lookup_field": "user_id",
      "source_field": "metadata.author_id",
      "result_field": "author"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Extract key topics and entities",
      "output_field": "analysis"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 10
    }
  }
]
Stage Selection Guide
| Goal | Recommended Stage |
|---|---|
| Find semantically similar documents | semantic_search |
| Match exact keywords + meaning | hybrid_search |
| Filter by metadata fields | structured_filter |
| Filter by content meaning | llm_filter |
| Improve recall with query variations | query_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort_by_field |
| Re-sort by relevance scores | sort_relevance |
| Diversify results | mmr |
| Personalized ranking | learned_rerank |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Group by category/author | group_by |
| Random/stratified sampling | sample |
| Add external API data | api_call |
| Add database data | sql_lookup |
| Join Mixpeek collections | document_enrich |
| Classify documents | taxonomy_enrich |
| Extract structured data | llm_enrichment |
| Transform document structure | json_transform |
| Prepare for LLM context | rag_prepare |
| Custom transformations | code_execution |
| Add web search results | external_web_search |
| Extract URL content | web_scrape |
| Stage Type | Typical Latency | Cost |
|---|---|---|
| semantic_search | 5-50ms | Index storage |
| hybrid_search | 20-100ms | Index storage |
| structured_filter | < 5ms | Free |
| llm_filter | 200-500ms | LLM API |
| query_expand | 300-800ms | LLM API |
| rerank | 50-100ms | API calls |
| sort_by_field | < 5ms | Free |
| sort_relevance | < 5ms | Free |
| mmr | 10-50ms | Free |
| learned_rerank | 20-50ms | Free |
| summarize | 500-2000ms | LLM API |
| aggregate | 5-50ms | Free |
| cluster | 50-200ms | Free |
| group_by | 5-20ms | Free |
| sample | < 5ms | Free |
| llm_enrichment | 300-800ms | LLM API |
| api_call | 50-500ms | External API |
| sql_lookup | 10-100ms | Database |
| code_execution | 5-50ms | Free |
| rag_prepare | < 10ms | Free |
Order stages efficiently: run cheap operations (filters, limits) before expensive ones (rerank, LLM calls), so that fewer documents reach the costly stages.
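For a rough feel of how stage choice adds up, the sketch below sums the typical latency ranges from the table for a pipeline. It is a back-of-the-envelope estimate under stated assumptions: stages run sequentially, "< N ms" is read as 0 to N, only a subset of stages is included, and the helper name `estimate_latency_ms` is hypothetical. Real latency also depends on how many documents reach each stage.

```python
# (low_ms, high_ms) per stage, copied from the typical latency table.
# "< N ms" entries are treated as the range 0..N (an assumption).
LATENCY_MS = {
    "semantic_search": (5, 50),
    "hybrid_search": (20, 100),
    "structured_filter": (0, 5),
    "llm_filter": (200, 500),
    "rerank": (50, 100),
    "sort_by_field": (0, 5),
    "summarize": (500, 2000),
    "llm_enrichment": (300, 800),
    "rag_prepare": (0, 10),
}

def estimate_latency_ms(pipeline):
    """Sum the typical (low, high) latency over sequential stages."""
    low = high = 0
    for stage in pipeline:
        stage_low, stage_high = LATENCY_MS[stage["stage_id"]]
        low += stage_low
        high += stage_high
    return low, high

basic_rag = [
    {"stage_type": "filter", "stage_id": "semantic_search"},
    {"stage_type": "sort", "stage_id": "rerank"},
    {"stage_type": "apply", "stage_id": "rag_prepare"},
]

# Same pipeline with an LLM enrichment step added before rag_prepare.
with_llm = basic_rag[:2] + [
    {"stage_type": "apply", "stage_id": "llm_enrichment"},
    basic_rag[2],
]

print(estimate_latency_ms(basic_rag))  # (55, 160)
print(estimate_latency_ms(with_llm))   # (355, 960)
```

A single LLM-backed stage dominates the estimate, which is why it pays to shrink the document set before reaching it.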
Template Variables
All stages support template variables for dynamic configuration:
| Variable | Description |
|---|---|
| {{INPUT.*}} | Input parameters from retriever call |
| {{DOC.*}} | Document fields (in APPLY stages) |
| {{CONTEXT.*}} | Pipeline context (index, citations) |
{
  "stage_type": "filter",
  "stage_id": "structured_filter",
  "parameters": {
    "conditions": {
      "field": "metadata.tenant_id",
      "operator": "eq",
      "value": "{{INPUT.tenant_id}}"
    }
  }
}
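To show what resolving `{{INPUT.*}}` placeholders amounts to, here is a minimal sketch that substitutes input values into a stage configuration. It is an illustration only: the `render` helper and its rules (a string that is exactly one placeholder keeps the input's native type, and placeholders embedded in longer strings are stringified) are assumptions, not Mixpeek's documented resolution behavior, and `{{DOC.*}}` and `{{CONTEXT.*}}` are not handled.

```python
import re

# Matches {{INPUT.name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*INPUT\.(\w+)\s*\}\}")

def render(value, inputs):
    """Recursively replace {{INPUT.*}} placeholders in a config structure."""
    if isinstance(value, dict):
        return {key: render(item, inputs) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, inputs) for item in value]
    if isinstance(value, str):
        whole = PLACEHOLDER.fullmatch(value)
        if whole:
            # The string is exactly one placeholder: keep the input's type.
            return inputs[whole.group(1)]
        # Placeholder embedded in a longer string: substitute as text.
        return PLACEHOLDER.sub(lambda m: str(inputs[m.group(1)]), value)
    return value

stage = {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
        "conditions": {
            "field": "metadata.tenant_id",
            "operator": "eq",
            "value": "{{INPUT.tenant_id}}",
        }
    },
}

resolved = render(stage, {"tenant_id": "acme"})
print(resolved["parameters"]["conditions"]["value"])  # acme
print(render("metadata.{{INPUT.sort_by}}", {"sort_by": "price"}))  # metadata.price
```

Preserving the native type for whole-string placeholders matters for numeric comparisons such as an lte condition on a price, where a stringified value would not compare correctly.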