Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components.
Stage Categories
Stages are organized into five categories based on how they transform the document set:
FILTER: Reduce the document set by matching criteria. Outputs a subset of the input documents. Stages: feature_search, attribute_filter, llm_filter, agent_search, query_expand
SORT: Reorder documents by relevance or field values. Same documents, different order. Stages: sort_relevance, sort_attribute, mmr, rerank, score_normalize
REDUCE: Aggregate or reduce the document count. Combine, group, or sample results. Stages: aggregate, group_by, cluster, sample, summarize, limit, deduplicate
APPLY: Enrich or transform documents. May add fields, create new documents, or restructure data. Stages: json_transform, rag_prepare, external_web_search, api_call, sql_lookup, llm_enrich, taxonomy_enrich, document_enrich, cross_compare, web_scrape, unwind
ENRICH: Execute custom code in isolated sandboxes for dynamic enrichments. Stages: code_execution
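Because every stage_id belongs to exactly one category, a pipeline definition can be sanity-checked locally before it is submitted. The following is a minimal sketch: the mapping is transcribed from the category list above, while `check_pipeline` is a hypothetical helper and not part of any Mixpeek SDK.

```python
# Stage-to-category mapping transcribed from the category list above.
STAGE_CATEGORIES = {
    "filter": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "sort": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "reduce": ["aggregate", "group_by", "cluster", "sample", "summarize",
               "limit", "deduplicate"],
    "apply": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "llm_enrich", "taxonomy_enrich",
              "document_enrich", "cross_compare", "web_scrape", "unwind"],
    "enrich": ["code_execution"],
}

# Reverse lookup: stage_id -> category name.
CATEGORY_OF = {
    stage: category
    for category, stages in STAGE_CATEGORIES.items()
    for stage in stages
}


def check_pipeline(stages):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    for i, stage in enumerate(stages):
        stage_id = stage.get("stage_id")
        expected = CATEGORY_OF.get(stage_id)
        if expected is None:
            problems.append(f"stage {i}: unknown stage_id {stage_id!r}")
        elif stage.get("stage_type") != expected:
            problems.append(
                f"stage {i}: {stage_id} belongs to {expected!r}, "
                f"not {stage.get('stage_type')!r}"
            )
    return problems
```

A check like this catches a mismatched stage_type and stage_id pair (for example, `limit` declared as a `sort` stage) before any request is made.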
All Stages
Filter Stages
| Stage | Description |
| --- | --- |
| Feature Search | Search by vector similarity using multimodal embeddings |
| Attribute Filter | Filter by metadata fields with boolean logic (AND/OR/NOT) |
| LLM Filter | Semantic filtering using LLM-based evaluation |
| Agent Search | LLM-driven multi-step retrieval with iterative reasoning and tool orchestration |
| Query Expand | LLM-powered query expansion with RRF result fusion |
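Query Expand is described as fusing results with RRF (reciprocal rank fusion). A minimal sketch of the standard formulation follows: each document scores the sum of 1 / (k + rank) over every ranked list it appears in. The function name and the conventional constant k = 60 are assumptions for illustration; the stage's actual parameters are not documented here.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is the commonly used default; an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# One result list per query variation.
fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"], ["b", "c", "a"]])
print(fused)  # ['b', 'a', 'c', 'd']
```

Documents that rank well across several query variations rise to the top, even if no single variation ranked them first.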
Sort Stages
| Stage | Description |
| --- | --- |
| Sort Relevance | Reorder by relevance scores |
| Sort Attribute | Order by any metadata field (dates, price, etc.) |
| MMR | Diversify results with Maximal Marginal Relevance |
| Rerank | Re-score with cross-encoder models (e.g., BGE reranker) |
| Score Normalize | Rescale scores to a common range for consistent comparison |
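To illustrate what the MMR stage does conceptually, here is a minimal sketch of Maximal Marginal Relevance: it greedily picks the document that maximizes `lambda_ * relevance - (1 - lambda_) * redundancy`. Cosine similarity, the `lambda_` name, and its default of 0.5 are assumptions made for the example, not documented stage parameters.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Return indices into doc_vecs, in MMR selection order."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < top_k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1 - lambda_) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]
print(mmr([1.0, 1.0], docs, top_k=2))               # [0, 2]
print(mmr([1.0, 1.0], docs, top_k=2, lambda_=1.0))  # [0, 1]
```

With `lambda_=1.0` the selection reduces to plain relevance ordering; lowering it penalizes the near-duplicate second document in favor of a more distinct one.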
Reduce Stages
| Stage | Description |
| --- | --- |
| Aggregate | Compute COUNT, SUM, AVG, etc. on results |
| Group By | Group documents by field value (decompose/recompose) |
| Cluster | Discover themes via embedding-based clustering |
| Sample | Random or stratified sampling of results |
| Summarize | Condense documents into an LLM-generated summary |
| Limit | Truncate results to a maximum count with optional offset |
| Deduplicate | Remove duplicate documents by field or content similarity |
Apply Stages
| Stage | Description |
| --- | --- |
| JSON Transform | Reshape documents using Jinja2 templates |
| RAG Prepare | Format for LLM context with token management and citations |
| External Web Search | Augment with Exa AI-native web search |
| API Call | Enrich with external REST API responses |
| SQL Lookup | Join with PostgreSQL/Snowflake data |
| LLM Enrich | Generate new fields with LLM prompts |
| Taxonomy Enrich | Classify documents against taxonomy nodes |
| Document Enrich | Cross-collection joins (LEFT JOIN) |
| Cross Compare | Multi-tier cross-collection matching with classification |
| Web Scrape | Extract full page content from URLs |
| Unwind | Decompose array fields into separate documents |
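As an illustration of the Unwind row, the sketch below decomposes an array field so that each element yields its own copy of the document. The handling of edge cases (documents whose field is not a list pass through unchanged; an empty list yields no documents) is an assumption for the example, not documented stage behavior.

```python
import copy


def unwind(docs, field):
    """Emit one document per element of docs[i][field] when it is a list."""
    out = []
    for doc in docs:
        values = doc.get(field)
        if not isinstance(values, list):
            # Assumption: non-array values are passed through untouched.
            out.append(doc)
            continue
        for value in values:
            new_doc = copy.deepcopy(doc)
            new_doc[field] = value
            out.append(new_doc)
    return out


docs = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": "x"}]
print(unwind(docs, "tags"))
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 2, 'tags': 'x'}]
```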
Enrich Stages
| Stage | Description |
| --- | --- |
| Code Execution | Execute Python/TypeScript/JavaScript in sandboxes |
Pipeline Patterns
Basic RAG Pipeline
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        { "input": { "text": "{{INPUT.query}}" }, "uri": "mixpeek://text_extractor@v1/embedding" }
      ],
      "limit": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "document_field": "content",
      "top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "rag_prepare",
    "parameters": {
      "max_tokens": 8000,
      "output_mode": "single_context"
    }
  }
]
E-Commerce Search
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        { "input": { "text": "{{INPUT.query}}" }, "uri": "mixpeek://text_extractor@v1/embedding" }
      ],
      "limit": 100
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "AND": [
        { "field": "metadata.in_stock", "operator": "eq", "value": true },
        { "field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}" }
      ]
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_attribute",
    "parameters": {
      "field": "metadata.{{INPUT.sort_by}}",
      "direction": "{{INPUT.sort_order}}"
    }
  }
]
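To make the semantics of the `AND` condition list in this pipeline concrete, here is a minimal sketch of how such conditions could be evaluated against a document. It covers only the two operators used above (`eq` and `lte`); the helper names and the treatment of missing fields as non-matching are assumptions, not Mixpeek's implementation.

```python
import operator

# Only the operators that appear in the example pipeline.
OPERATORS = {"eq": operator.eq, "lte": operator.le}


def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.price'; None if absent."""
    value = doc
    for key in dotted.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def matches_all(doc, conditions):
    """True when every condition in an AND list holds for the document."""
    for cond in conditions:
        actual = get_path(doc, cond["field"])
        # Assumption: a missing field never matches.
        if actual is None:
            return False
        if not OPERATORS[cond["operator"]](actual, cond["value"]):
            return False
    return True


conditions = [
    {"field": "metadata.in_stock", "operator": "eq", "value": True},
    {"field": "metadata.price", "operator": "lte", "value": 50},
]
docs = [
    {"id": "p1", "metadata": {"in_stock": True, "price": 40}},
    {"id": "p2", "metadata": {"in_stock": True, "price": 60}},
    {"id": "p3", "metadata": {"in_stock": False, "price": 10}},
]
print([d["id"] for d in docs if matches_all(d, conditions)])  # ['p1']
```

Here `max_price` has already been substituted with the number 50; in the pipeline it arrives through `{{INPUT.max_price}}`.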
Research Assistant
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        { "input": { "text": "{{INPUT.query}}" }, "uri": "mixpeek://text_extractor@v1/embedding" }
      ],
      "limit": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10,
      "category": "research_paper"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 15
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Synthesize findings on: {{INPUT.query}}"
    }
  }
]
Enriched Document Retrieval
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [
        { "input": { "text": "{{INPUT.query}}" }, "uri": "mixpeek://text_extractor@v1/embedding" }
      ],
      "limit": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "target_collection_id": "col_users",
      "source_field": "metadata.author_id",
      "target_field": "user_id",
      "output_field": "author"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "prompt": "Extract key topics and entities from: {{DOC.content}}",
      "output_field": "analysis"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 10
    }
  }
]
Stage Selection Guide
| Goal | Recommended Stage |
| --- | --- |
| Find semantically similar documents | feature_search |
| Filter by metadata fields | attribute_filter |
| Filter by content meaning | llm_filter |
| Improve recall with query variations | query_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort_attribute |
| Re-sort by relevance scores | sort_relevance |
| Diversify results | mmr |
| Normalize scores across sources | score_normalize |
| Truncate to top-N results | limit |
| Remove duplicate results | deduplicate |
| Expand array fields to documents | unwind |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Group by category/author | group_by |
| Random/stratified sampling | sample |
| Add external API data | api_call |
| Add database data | sql_lookup |
| Join Mixpeek collections | document_enrich |
| Classify documents | taxonomy_enrich |
| Generate new fields with LLM | llm_enrich |
| Transform document structure | json_transform |
| Prepare for LLM context | rag_prepare |
| Custom code transformations | code_execution |
| Add web search results | external_web_search |
| Extract URL content | web_scrape |
| Stage | Typical Latency | Cost |
| --- | --- | --- |
| feature_search | 5-50 ms | Index storage |
| attribute_filter | < 5 ms | Free |
| llm_filter | 200-500 ms | LLM API |
| query_expand | 300-800 ms | LLM API |
| rerank | 50-100 ms | Inference |
| sort_attribute | < 5 ms | Free |
| sort_relevance | < 5 ms | Free |
| mmr | 10-50 ms | Free |
| score_normalize | < 1 ms | Free |
| limit | < 1 ms | Free |
| deduplicate | 5-50 ms | Free |
| unwind | < 5 ms | Free |
| summarize | 500-2000 ms | LLM API |
| aggregate | 5-50 ms | Free |
| cluster | 50-200 ms | Inference |
| group_by | 5-20 ms | Free |
| sample | < 5 ms | Free |
| llm_enrich | 300-800 ms | LLM API |
| api_call | 50-500 ms | External API |
| sql_lookup | 10-100 ms | Database |
| code_execution | 5-50 ms | Free |
| rag_prepare | < 10 ms | Free |
| json_transform | < 5 ms | Free |
| external_web_search | 100-500 ms | Exa API |
| taxonomy_enrich | 20-100 ms | Inference |
| document_enrich | 10-50 ms | Database |
| web_scrape | 500-5000 ms | External |
Order stages efficiently: run cheap operations (filters, sorts) before expensive ones (rerank, LLM calls). Filtering early shrinks the document set, so the costly stages process fewer documents.
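The typical latencies above can be combined into a rough per-pipeline budget. The sketch below sums the low and high ends of each stage's range, assuming stages run sequentially, reading "< N ms" as 0 to N ms, and ignoring network overhead and any per-document scaling. The dictionary covers only the stages used in the pipeline patterns, and `estimate_latency` is a hypothetical helper.

```python
# (low_ms, high_ms) transcribed from the latency table; "< N ms" read as (0, N).
TYPICAL_LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "sort_attribute": (0, 5),
    "rerank": (50, 100),
    "rag_prepare": (0, 10),
    "external_web_search": (100, 500),
    "summarize": (500, 2000),
    "document_enrich": (10, 50),
    "llm_enrich": (300, 800),
}


def estimate_latency(stage_ids):
    """Return (low_ms, high_ms) for stages executed one after another."""
    low = sum(TYPICAL_LATENCY_MS[s][0] for s in stage_ids)
    high = sum(TYPICAL_LATENCY_MS[s][1] for s in stage_ids)
    return low, high


basic_rag = ["feature_search", "rerank", "rag_prepare"]
research = ["feature_search", "external_web_search", "rerank", "summarize"]

print(estimate_latency(basic_rag))  # (55, 160)
print(estimate_latency(research))   # (655, 2650)
```

The comparison shows why LLM-backed stages dominate the budget: the single `summarize` stage accounts for most of the Research Assistant estimate.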
Template Variables
All stages support template variables for dynamic configuration:
| Variable | Description |
| --- | --- |
| {{INPUT.*}} | Input parameters from the retriever call |
| {{DOC.*}} | Document fields (in APPLY/ENRICH stages) |
| {{CONTEXT.*}} | Pipeline context (index, citations) |
{
  "stage_type": "filter",
  "stage_id": "attribute_filter",
  "parameters": {
    "field": "metadata.tenant_id",
    "operator": "eq",
    "value": "{{INPUT.tenant_id}}"
  }
}
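To show what substitution of `{{INPUT.*}}` variables amounts to, here is a minimal sketch that walks a stage definition and fills in values from the retriever inputs. It handles only `INPUT` variables, and its choice to keep the native type when a value is a single placeholder is an assumption; Mixpeek's actual resolution rules may differ.

```python
import re

# Matches a single {{INPUT.name}} placeholder.
_INPUT_VAR = re.compile(r"\{\{\s*INPUT\.(\w+)\s*\}\}")


def resolve_inputs(value, inputs):
    """Recursively replace {{INPUT.name}} placeholders in dicts, lists, strings."""
    if isinstance(value, dict):
        return {k: resolve_inputs(v, inputs) for k, v in value.items()}
    if isinstance(value, list):
        return [resolve_inputs(v, inputs) for v in value]
    if isinstance(value, str):
        whole = _INPUT_VAR.fullmatch(value)
        if whole:
            # Assumption: a lone placeholder keeps the input's native type.
            return inputs[whole.group(1)]
        # Placeholder embedded in a longer string: substitute as text.
        return _INPUT_VAR.sub(lambda m: str(inputs[m.group(1)]), value)
    return value


stage = {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
        "field": "metadata.tenant_id",
        "operator": "eq",
        "value": "{{INPUT.tenant_id}}",
    },
}
resolved = resolve_inputs(stage, {"tenant_id": "acme"})
print(resolved["parameters"]["value"])  # acme
```

A missing input raises `KeyError` in this sketch, which makes an unset variable fail loudly rather than silently filtering on a literal placeholder string.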