Retriever stages are the building blocks of search pipelines in Mixpeek. Each stage performs a specific operation on the document set, allowing you to compose complex retrieval workflows from simple, reusable components.

Stage Categories

Stages are organized into five categories based on how they transform the document set:

FILTER

Reduce the document set by matching criteria. Outputs a subset of input documents.
Stages: feature_search, attribute_filter, llm_filter, agent_search, query_expand

SORT

Reorder documents by relevance or field values. Same documents, different order.
Stages: sort_relevance, sort_attribute, mmr, rerank, score_normalize

REDUCE

Aggregate or reduce the document count. Combine, group, or sample results.
Stages: aggregate, group_by, cluster, sample, summarize, limit, deduplicate

APPLY

Enrich or transform documents. May add fields, create new documents, or restructure data.
Stages: json_transform, rag_prepare, external_web_search, api_call, sql_lookup, llm_enrich, taxonomy_enrich, document_enrich, cross_compare, web_scrape, unwind

ENRICH

Execute custom code in isolated sandboxes for dynamic enrichments.
Stages: code_execution
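
The category lists above can be captured as a small lookup table. The sketch below is an illustration only, not part of any Mixpeek SDK: `STAGE_CATEGORIES`, `category_of`, and `check_pipeline` are hypothetical names. It assumes stage configs shaped like the pipeline examples later on this page, where each stage carries a `stage_type` and a `stage_id`.

```python
# Stage ids grouped by category, transcribed from the lists above.
STAGE_CATEGORIES = {
    "filter": ["feature_search", "attribute_filter", "llm_filter",
               "agent_search", "query_expand"],
    "sort": ["sort_relevance", "sort_attribute", "mmr", "rerank",
             "score_normalize"],
    "reduce": ["aggregate", "group_by", "cluster", "sample", "summarize",
               "limit", "deduplicate"],
    "apply": ["json_transform", "rag_prepare", "external_web_search",
              "api_call", "sql_lookup", "llm_enrich", "taxonomy_enrich",
              "document_enrich", "cross_compare", "web_scrape", "unwind"],
    "enrich": ["code_execution"],
}


def category_of(stage_id):
    """Return the category name that lists the given stage id."""
    for category, stage_ids in STAGE_CATEGORIES.items():
        if stage_id in stage_ids:
            return category
    raise KeyError(f"unknown stage_id: {stage_id}")


def check_pipeline(stages):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for i, stage in enumerate(stages):
        stage_id = stage.get("stage_id")
        try:
            expected = category_of(stage_id)
        except KeyError:
            problems.append(f"stage {i}: unknown stage_id {stage_id!r}")
            continue
        if stage.get("stage_type") != expected:
            problems.append(
                f"stage {i}: {stage_id} is listed under {expected!r}, "
                f"not {stage.get('stage_type')!r}"
            )
    return problems
```

A local check like this catches a mistyped `stage_id` or a mismatched `stage_type` before a pipeline is submitted. Whether the service itself rejects such mismatches is not stated on this page.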

All Stages

Filter Stages

| Stage | Description |
| --- | --- |
| Feature Search | Search by vector similarity using multimodal embeddings |
| Attribute Filter | Filter by metadata fields with boolean logic (AND/OR/NOT) |
| LLM Filter | Semantic filtering using LLM-based evaluation |
| Agent Search | LLM-driven multi-step retrieval with iterative reasoning and tool orchestration |
| Query Expand | LLM-powered query expansion with RRF result fusion |
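
Query Expand fuses the results of several query variations with RRF (Reciprocal Rank Fusion). The sketch below shows the generic RRF formula, where each document scores the sum of 1 / (k + rank) over every ranking it appears in. It is not Mixpeek's implementation: the function name `rrf_fuse` is hypothetical, and k = 60 is the value commonly used in the literature, assumed here rather than taken from this page.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60):
    """Fuse several rankings of document ids into one ranking.

    ranked_lists: list of lists of ids, each ordered best-first.
    k: smoothing constant (assumed default; larger values flatten
       the advantage of top ranks).
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three rankings, e.g. one per expanded query variation.
fused = rrf_fuse([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "c"],
])
print(fused)  # ['b', 'a', 'c']
```

Because RRF uses only rank positions, it can merge result lists whose raw scores are not comparable.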

Sort Stages

| Stage | Description |
| --- | --- |
| Sort Relevance | Reorder by relevance scores |
| Sort Attribute | Order by any metadata field (dates, price, etc.) |
| MMR | Diversify results with Maximal Marginal Relevance |
| Rerank | Re-score with cross-encoder models (e.g., BGE reranker) |
| Score Normalize | Rescale scores to a common range for consistent comparison |
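
To make the MMR row concrete, here is a minimal sketch of the textbook Maximal Marginal Relevance selection loop. It is a generic illustration, not the `mmr` stage's actual code. The function names, the `lambda_` parameter, and the use of cosine similarity over plain Python lists are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k, lambda_=0.5):
    """Return indices of the selected documents, in selection order.

    lambda_ = 1.0 ranks purely by relevance to the query;
    lower values penalize similarity to already selected documents.
    """
    relevance = [cosine(query_vec, vec) for vec in doc_vecs]
    remaining = list(range(len(doc_vecs)))
    selected = []

    def mmr_score(i):
        # Redundancy is the highest similarity to anything already picked.
        redundancy = max(
            (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
            default=0.0,
        )
        return lambda_ * relevance[i] - (1 - lambda_) * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(mmr(query, docs, top_k=2, lambda_=1.0))  # [0, 1]
print(mmr(query, docs, top_k=2, lambda_=0.3))  # [0, 2]
```

With `lambda_=1.0` the two near-identical documents are returned together. Lowering it to 0.3 swaps the near-duplicate for the less similar third document.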

Reduce Stages

| Stage | Description |
| --- | --- |
| Aggregate | Compute COUNT, SUM, AVG, etc. on results |
| Group By | Group documents by field value (decompose/recompose) |
| Cluster | Discover themes via embedding-based clustering |
| Sample | Random or stratified sampling of results |
| Summarize | Condense documents into an LLM-generated summary |
| Limit | Truncate results to a maximum count with optional offset |
| Deduplicate | Remove duplicate documents by field or content similarity |

Apply Stages

| Stage | Description |
| --- | --- |
| JSON Transform | Reshape documents using Jinja2 templates |
| RAG Prepare | Format for LLM context with token management and citations |
| External Web Search | Augment with Exa AI-native web search |
| API Call | Enrich with external REST API responses |
| SQL Lookup | Join with PostgreSQL/Snowflake data |
| LLM Enrich | Generate new fields with LLM prompts |
| Taxonomy Enrich | Classify documents against taxonomy nodes |
| Document Enrich | Cross-collection joins (LEFT JOIN) |
| Cross Compare | Multi-tier cross-collection matching with classification |
| Web Scrape | Extract full page content from URLs |
| Unwind | Decompose array fields into separate documents |
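
As an illustration of what "decompose array fields into separate documents" means for Unwind, the sketch below expands one top-level array field into one document per element. It is a hypothetical helper, not the stage's implementation. How the real stage treats empty arrays, missing fields, or nested paths is not described here, so the choices in the comments are assumptions.

```python
import copy


def unwind(documents, field):
    """Emit one copy of each document per element of doc[field].

    Assumptions made for this sketch:
    - documents whose field is missing or not a list pass through unchanged
    - documents whose list is empty produce no output
    - only top-level field names are handled (no dotted paths)
    """
    out = []
    for doc in documents:
        values = doc.get(field)
        if not isinstance(values, list):
            out.append(doc)
            continue
        for value in values:
            clone = copy.deepcopy(doc)
            clone[field] = value
            out.append(clone)
    return out


docs = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3},
]
print(unwind(docs, "tags"))
# [{'id': 1, 'tags': 'a'}, {'id': 1, 'tags': 'b'}, {'id': 3}]
```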

Enrich Stages

| Stage | Description |
| --- | --- |
| Code Execution | Execute Python/TypeScript/JavaScript in sandboxes |

Pipeline Patterns

Basic RAG Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 50
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "document_field": "content",
      "top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "rag_prepare",
    "parameters": {
      "max_tokens": 8000,
      "output_mode": "single_context"
    }
  }
]

Filtered and Sorted Search

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 100
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "AND": [
        {"field": "metadata.in_stock", "operator": "eq", "value": true},
        {"field": "metadata.price", "operator": "lte", "value": "{{INPUT.max_price}}"}
      ]
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "sort_attribute",
    "parameters": {
      "field": "metadata.{{INPUT.sort_by}}",
      "direction": "{{INPUT.sort_order}}"
    }
  }
]

Research Assistant

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "external_web_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "num_results": 10,
      "category": "research_paper"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 15
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Synthesize findings on: {{INPUT.query}}"
    }
  }
]

Enriched Document Retrieval

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "feature_uris": [{"input": {"text": "{{INPUT.query}}"}, "uri": "mixpeek://text_extractor@v1/embedding"}],
      "limit": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "target_collection_id": "col_users",
      "source_field": "metadata.author_id",
      "target_field": "user_id",
      "output_field": "author"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "provider": "openai",
      "model_name": "gpt-4o-mini",
      "prompt": "Extract key topics and entities from: {{DOC.content}}",
      "output_field": "analysis"
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "inference_name": "baai_bge_reranker_v2_m3",
      "query": "{{INPUT.query}}",
      "top_k": 10
    }
  }
]

Stage Selection Guide

| Goal | Recommended Stage |
| --- | --- |
| Find semantically similar documents | feature_search |
| Filter by metadata fields | attribute_filter |
| Filter by content meaning | llm_filter |
| Improve recall with query variations | query_expand |
| Get best relevance ranking | rerank |
| Order by price/date/rating | sort_attribute |
| Re-sort by relevance scores | sort_relevance |
| Diversify results | mmr |
| Normalize scores across sources | score_normalize |
| Truncate to top-N results | limit |
| Remove duplicate results | deduplicate |
| Expand array fields to documents | unwind |
| Answer questions from docs | summarize |
| Compute statistics on results | aggregate |
| Find themes in results | cluster |
| Group by category/author | group_by |
| Random/stratified sampling | sample |
| Add external API data | api_call |
| Add database data | sql_lookup |
| Join Mixpeek collections | document_enrich |
| Classify documents | taxonomy_enrich |
| Generate new fields with LLM | llm_enrich |
| Transform document structure | json_transform |
| Prepare for LLM context | rag_prepare |
| Custom code transformations | code_execution |
| Add web search results | external_web_search |
| Extract URL content | web_scrape |

Performance Considerations

| Stage | Typical Latency | Cost |
| --- | --- | --- |
| feature_search | 5-50ms | Index storage |
| attribute_filter | < 5ms | Free |
| llm_filter | 200-500ms | LLM API |
| query_expand | 300-800ms | LLM API |
| rerank | 50-100ms | Inference |
| sort_attribute | < 5ms | Free |
| sort_relevance | < 5ms | Free |
| mmr | 10-50ms | Free |
| score_normalize | < 1ms | Free |
| limit | < 1ms | Free |
| deduplicate | 5-50ms | Free |
| unwind | < 5ms | Free |
| summarize | 500-2000ms | LLM API |
| aggregate | 5-50ms | Free |
| cluster | 50-200ms | Inference |
| group_by | 5-20ms | Free |
| sample | < 5ms | Free |
| llm_enrich | 300-800ms | LLM API |
| api_call | 50-500ms | External API |
| sql_lookup | 10-100ms | Database |
| code_execution | 5-50ms | Free |
| rag_prepare | < 10ms | Free |
| json_transform | < 5ms | Free |
| external_web_search | 100-500ms | Exa API |
| taxonomy_enrich | 20-100ms | Inference |
| document_enrich | 10-50ms | Database |
| web_scrape | 500-5000ms | External |

Order stages efficiently: run cheap operations (such as attribute_filter, sort_attribute, and limit) before expensive ones (rerank and LLM-backed stages). This shrinks the document set before the costly processing runs.
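
For a rough latency budget, the ranges in the table above can be summed per pipeline. This is a back-of-the-envelope sketch only: `TYPICAL_LATENCY_MS` and `estimate_latency_ms` are hypothetical names. It assumes stages run one after another and that each range applies once per stage, which the table does not confirm. Entries given as "< N ms" are recorded as 0 to N.

```python
# Typical latency ranges in milliseconds, copied from the table above.
TYPICAL_LATENCY_MS = {
    "feature_search": (5, 50),
    "attribute_filter": (0, 5),
    "llm_filter": (200, 500),
    "query_expand": (300, 800),
    "rerank": (50, 100),
    "sort_attribute": (0, 5),
    "sort_relevance": (0, 5),
    "mmr": (10, 50),
    "score_normalize": (0, 1),
    "limit": (0, 1),
    "deduplicate": (5, 50),
    "unwind": (0, 5),
    "summarize": (500, 2000),
    "aggregate": (5, 50),
    "cluster": (50, 200),
    "group_by": (5, 20),
    "sample": (0, 5),
    "llm_enrich": (300, 800),
    "api_call": (50, 500),
    "sql_lookup": (10, 100),
    "code_execution": (5, 50),
    "rag_prepare": (0, 10),
    "json_transform": (0, 5),
    "external_web_search": (100, 500),
    "taxonomy_enrich": (20, 100),
    "document_enrich": (10, 50),
    "web_scrape": (500, 5000),
}


def estimate_latency_ms(stages):
    """Sum the typical (low, high) latency over a list of stage configs."""
    low = high = 0
    for stage in stages:
        stage_low, stage_high = TYPICAL_LATENCY_MS[stage["stage_id"]]
        low += stage_low
        high += stage_high
    return low, high


# The Basic RAG Pipeline from this page: search, rerank, prepare context.
basic_rag = [
    {"stage_id": "feature_search"},
    {"stage_id": "rerank"},
    {"stage_id": "rag_prepare"},
]
print(estimate_latency_ms(basic_rag))  # (55, 160)
```

The same estimate for the Research Assistant pattern is dominated by `summarize` and `external_web_search`, so trimming documents before those stages matters most.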

Template Variables

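All stages support template variables for dynamic configuration:

| Variable | Description |
| --- | --- |
| {{INPUT.*}} | Input parameters from the retriever call |
| {{DOC.*}} | Document fields (in APPLY/ENRICH stages) |
| {{CONTEXT.*}} | Pipeline context (index, citations) |

For example, an attribute_filter stage can take its comparison value from the caller's input: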
{
  "stage_type": "filter",
  "stage_id": "attribute_filter",
  "parameters": {
    "field": "metadata.tenant_id",
    "operator": "eq",
    "value": "{{INPUT.tenant_id}}"
  }
}
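
To show how such placeholders might be resolved, here is a minimal sketch of template substitution over a stage config. It illustrates the idea and is not Mixpeek's resolver. The `render` and `lookup` helpers are hypothetical, and the rule that a value consisting of a single placeholder keeps its native type (so a numeric `max_price` stays a number) is an assumption.

```python
import re

# Matches {{INPUT.x}}, {{DOC.a.b}}, {{CONTEXT.y}} with optional inner spaces.
PLACEHOLDER = re.compile(r"\{\{\s*(INPUT|DOC|CONTEXT)\.([A-Za-z0-9_.]+)\s*\}\}")


def lookup(namespace, path):
    """Follow a dotted path such as 'metadata.price' through nested dicts."""
    value = namespace
    for key in path.split("."):
        value = value[key]
    return value


def render(value, scopes):
    """Recursively replace placeholders in dicts, lists, and strings.

    scopes maps a namespace name ("INPUT", "DOC", "CONTEXT") to a dict.
    """
    if isinstance(value, dict):
        return {key: render(item, scopes) for key, item in value.items()}
    if isinstance(value, list):
        return [render(item, scopes) for item in value]
    if not isinstance(value, str):
        return value
    whole = PLACEHOLDER.fullmatch(value)
    if whole:
        # Assumed behavior: a lone placeholder keeps the looked-up type.
        return lookup(scopes[whole.group(1)], whole.group(2))
    # Placeholders embedded in longer text are interpolated as strings.
    return PLACEHOLDER.sub(
        lambda m: str(lookup(scopes[m.group(1)], m.group(2))), value
    )


stage = {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
        "field": "metadata.tenant_id",
        "operator": "eq",
        "value": "{{INPUT.tenant_id}}",
    },
}
resolved = render(stage, {"INPUT": {"tenant_id": "acme"}})
print(resolved["parameters"]["value"])  # acme
```

In this sketch a missing key raises `KeyError`. The page does not say how the service handles a missing input.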