Mixpeek Retrievers
Retrievers combine feature-aware search stages, structured filters, enrichment joins, and optional LLM post-processing into a single executable pipeline. Each retriever has an input schema, a list of target collections, and a deterministic set of stages executed in order.

Anatomy of a Retriever

{
  "retriever_name": "product_search_v2",
  "description": "Product search with enrichment and transformation",
  "collection_ids": ["col_products"],
  "input_schema": {
    "properties": {
      "query_text": { "type": "text", "required": true },
      "max_price": { "type": "number" }
    }
  },
  "stages": [
    {
      "stage_type": "apply",
      "stage_id": "document_enrich",
      "parameters": {
        "target_collection_id": "col_catalog",
        "source_field": "metadata.product_id",
        "target_field": "product_id",
        "fields_to_merge": ["name", "price", "category"],
        "output_field": "catalog_data"
      }
    },
    {
      "stage_type": "apply",
      "stage_id": "api_call",
      "parameters": {
        "url": "https://api.stripe.com/v1/customers/{{DOC.metadata.customer_id}}",
        "method": "GET",
        "allowed_domains": ["api.stripe.com"],
        "auth": {
          "type": "bearer",
          "secret_ref": "stripe_api_key"
        },
        "output_field": "metadata.billing",
        "on_error": "skip"
      }
    },
    {
      "stage_type": "apply",
      "stage_id": "json_transform",
      "parameters": {
        "template": "{\"id\": \"{{DOC.document_id}}\", \"title\": \"{{DOC.metadata.title}}\", \"price\": {{DOC.catalog_data.price}}}",
        "fail_on_error": false
      }
    }
  ],
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300
  }
}
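As a rough illustration of how an input_schema like the one above constrains a request, the sketch below checks a set of inputs against the properties block. The validate_inputs helper and the mapping from schema type names to Python types are assumptions made for this example, not Mixpeek's own validator.

```python
# Hypothetical helper for illustration; not part of the Mixpeek API.
INPUT_SCHEMA = {
    "properties": {
        "query_text": {"type": "text", "required": True},
        "max_price": {"type": "number"},
    }
}

# Assumed mapping from schema type names to Python types.
TYPE_MAP = {"text": str, "number": (int, float)}


def validate_inputs(schema, inputs):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, spec in schema["properties"].items():
        if name not in inputs:
            if spec.get("required"):
                problems.append(f"missing required input: {name}")
            continue
        expected = TYPE_MAP.get(spec["type"])
        value = inputs[name]
        # bool is a subclass of int, so reject it explicitly.
        if expected and (isinstance(value, bool) or not isinstance(value, expected)):
            problems.append(f"{name} should be of type {spec['type']}")
    return problems
```

Calling validate_inputs(INPUT_SCHEMA, {"max_price": 150}) reports the missing query_text, while a complete request returns an empty list.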

Stage Catalog

Stages are the building blocks of retriever pipelines. Each stage belongs to a category that defines its behavior:
| Category | Behavior | Example Use Cases |
| --- | --- | --- |
| filter | Reduce the number of documents while preserving schema | Attribute filters, semantic search, hybrid search |
| sort | Reorder documents without changing the set | Attribute sort, score-based ordering, reranking |
| reduce | Aggregate to a smaller set of documents | Top-k selection, clustering reducers, deduplication |
| apply | Enrich or transform documents without dropping them | Taxonomy joins, API enrichment, LLM enrichers, JSON transforms |
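The four categories can be pictured as four kinds of function over a list of documents. The sketch below is a toy model with plain dicts and made-up field names (price, score, label); it only illustrates the contract of each category and is not how Mixpeek implements stages.

```python
# Toy stand-ins for the four stage categories; documents are plain dicts.
def filter_stage(docs, predicate):
    """filter: fewer documents, same schema."""
    return [d for d in docs if predicate(d)]


def sort_stage(docs, key, reverse=True):
    """sort: same documents, new order."""
    return sorted(docs, key=key, reverse=reverse)


def reduce_stage(docs, k):
    """reduce: aggregate to a smaller set (here, top-k)."""
    return docs[:k]


def apply_stage(docs, fn):
    """apply: one output per input, with fields added or reshaped."""
    return [fn(dict(d)) for d in docs]


docs = [
    {"id": "a", "price": 40, "score": 0.7},
    {"id": "b", "price": 200, "score": 0.9},
    {"id": "c", "price": 90, "score": 0.8},
]
docs = filter_stage(docs, lambda d: d["price"] <= 150)   # drops "b"
docs = sort_stage(docs, key=lambda d: d["score"])        # "c" before "a"
docs = reduce_stage(docs, k=1)                           # keeps "c"
docs = apply_stage(docs, lambda d: {**d, "label": "top pick"})
```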
Retrieve the live registry with GET /v1/retrievers/stages. Each entry includes stage_id, category, icon, and parameter schema so you can dynamically build configuration UIs or validations.
Live stages: https://api.mixpeek.com/v1/retrievers/stages
curl -s --request GET \
  --url "$MP_API_URL/v1/retrievers/stages" \
  --header "Authorization: Bearer $MP_API_KEY" \
  --header "X-Namespace: $MP_NAMESPACE"

Filter Stages

Filter stages reduce the document set while preserving the document schema. Use these at the start of your pipeline to narrow down candidates.
Use GET /v1/retrievers/stages?category=filter to retrieve the current list of filter stages and their parameter schemas.

Sort Stages

Sort stages reorder documents without changing the result set. Place these after filters to control ranking.
Use GET /v1/retrievers/stages?category=sort to retrieve the current list of sort stages and their parameter schemas.

Reduce Stages

Reduce stages aggregate documents to a smaller set. Use these for deduplication or clustering.
Use GET /v1/retrievers/stages?category=reduce to retrieve the current list of reduce stages and their parameter schemas.

Apply Stages

Apply stages enrich or transform documents. Use these to add context, join data, or reshape output.
| Stage ID | Description | Transformation |
| --- | --- | --- |
| document_enrich | Join documents with data from another collection | N → N (LEFT JOIN) |
| api_call | Enrich documents with external API calls | N → N |
| json_transform | Transform document structure using Jinja2 templates | N → N |
| external_web_search | Search the web using Exa AI-native search | 0 → M (creates documents) |

Apply Stage Details

document_enrich

Joins documents with data from another collection, similar to a SQL LEFT JOIN. Each input document produces exactly one output document with added fields from the target collection.
When to use:
  • Combine data from multiple collections (e.g., products + catalog info)
  • Attach user profiles, metadata, or related entities
  • Denormalize data at query time
Parameters:
| Parameter | Required | Description |
| --- | --- | --- |
| target_collection_id | Yes | Collection to join with |
| source_field | Yes* | Field in current documents to match |
| target_field | Yes* | Field in target collection to match against |
| fields_to_merge | No | Specific fields to merge (or entire document if omitted) |
| output_field | No | Where to place enrichment (root or nested path) |
| retriever_id | No | Use an existing retriever for lookup instead of direct field matching |
| retriever_config | No | Anonymous retriever definition for complex lookups |
| retriever_inputs | No | Template inputs when using retriever-based enrichment |
| strategy | No | enrich (merge fields) or append (add as nested object) |
| allow_missing | No | Keep documents without matches (default: true) |
| when | No | Conditional filter for selective enrichment |
| cache_behavior | No | auto, disabled, or aggressive |
| cache_ttl_seconds | No | Cache TTL in seconds |

*Required for direct joins; not needed when using retriever_id or retriever_config.
Examples:
{
  "stage_type": "apply",
  "stage_id": "document_enrich",
  "parameters": {
    "target_collection_id": "col_products",
    "source_field": "metadata.product_id",
    "target_field": "product_id",
    "fields_to_merge": ["name", "price", "category"],
    "output_field": "product_data"
  }
}
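The LEFT JOIN behavior of a direct join can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: left_join and get_path are hypothetical helpers, output_field is treated as a top-level key only, and unmatched documents pass through unchanged (corresponding to the default allow_missing of true).

```python
def get_path(doc, path):
    """Read a dot-path such as 'metadata.product_id'; None if absent."""
    cur = doc
    for part in path.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return None
        cur = cur[part]
    return cur


def left_join(docs, target_docs, source_field, target_field,
              fields_to_merge, output_field):
    """Toy LEFT JOIN: every input document yields exactly one output."""
    index = {}
    for target in target_docs:
        key = get_path(target, target_field)
        if key is not None:
            index[key] = target
    out = []
    for doc in docs:
        enriched = dict(doc)
        match = index.get(get_path(doc, source_field))
        if match is not None:
            enriched[output_field] = {f: match.get(f) for f in fields_to_merge}
        out.append(enriched)  # unmatched documents are kept as-is
    return out


docs = [
    {"document_id": "d1", "metadata": {"product_id": "p1"}},
    {"document_id": "d2", "metadata": {"product_id": "p9"}},
]
catalog = [{"product_id": "p1", "name": "Earbuds", "price": 99, "category": "audio"}]
joined = left_join(docs, catalog, "metadata.product_id", "product_id",
                   ["name", "price", "category"], "product_data")
```

Here d1 gains a product_data object, while d2 has no catalog match and is returned without one, so the output still contains two documents.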

api_call

Enriches documents by calling external HTTP APIs. Enables integration with third-party services (Stripe, GitHub, weather APIs, etc.) to augment documents with real-time data.
Security: This stage makes external HTTP requests. Always use allowed_domains to prevent SSRF attacks. Never store credentials directly—use auth.secret_ref to reference vault-stored secrets.
Parameters:
| Parameter | Required | Description |
| --- | --- | --- |
| url | Yes | API endpoint URL (supports {DOC.field} and {INPUT.field} templates) |
| allowed_domains | Yes | Domain allowlist for SSRF protection (never use *) |
| output_field | Yes | Dot-path where API response should be stored |
| method | No | HTTP method: GET, POST, PUT, PATCH, DELETE (default: GET) |
| auth | No | Authentication configuration (see below) |
| headers | No | Additional HTTP headers |
| body | No | Request body for POST/PUT/PATCH (JSON, supports templates) |
| timeout | No | Request timeout in seconds (1-60, default: 10) |
| max_response_size | No | Maximum response size in bytes (default: 10MB) |
| response_path | No | JSONPath to extract specific field from response |
| rate_limit | No | Rate limiting config (requests_per_minute, requests_per_hour) |
| when | No | Conditional filter for selective enrichment |
| on_error | No | Error handling: skip, remove, or raise (default: skip) |
Authentication Types:
| Type | Description | Required Fields |
| --- | --- | --- |
| none | No authentication (public APIs) | - |
| bearer | Bearer token (OAuth 2.0, JWT) | secret_ref |
| api_key | API key in header or query param | secret_ref, key, location (header/query) |
| basic | HTTP Basic Auth (username:password in secret) | secret_ref |
| custom_header | Custom header with arbitrary name | secret_ref, key |
Examples:
{
  "stage_type": "apply",
  "stage_id": "api_call",
  "parameters": {
    "url": "https://api.stripe.com/v1/customers/{DOC.metadata.stripe_id}",
    "method": "GET",
    "allowed_domains": ["api.stripe.com"],
    "auth": {
      "type": "bearer",
      "secret_ref": "stripe_api_key"
    },
    "output_field": "metadata.stripe_data",
    "timeout": 10,
    "on_error": "skip"
  }
}
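To show why allowed_domains matters, here is a minimal sketch of URL templating followed by an allowlist check. The render_url and is_allowed helpers are hypothetical, the sketch handles only the single-brace {DOC.field} and {INPUT.field} form used in this section, and the exact-hostname comparison is an assumption about how an allowlist could be enforced, not a description of Mixpeek's implementation.

```python
import re
from urllib.parse import quote, urlparse

PLACEHOLDER = re.compile(r"\{(DOC|INPUT)\.([A-Za-z0-9_.]+)\}")


def render_url(template, doc, inputs):
    """Fill {DOC.path} / {INPUT.path} placeholders with URL-quoted values."""
    scopes = {"DOC": doc, "INPUT": inputs}

    def lookup(match):
        value = scopes[match.group(1)]
        for part in match.group(2).split("."):
            value = value[part]  # raises KeyError if the field is missing
        # Quote everything so a field value cannot inject path segments.
        return quote(str(value), safe="")

    return PLACEHOLDER.sub(lookup, template)


def is_allowed(url, allowed_domains):
    """Exact hostname match against the allowlist."""
    return urlparse(url).hostname in allowed_domains


url = render_url(
    "https://api.stripe.com/v1/customers/{DOC.metadata.stripe_id}",
    doc={"metadata": {"stripe_id": "cus_123"}},
    inputs={},
)
```

Comparing the parsed hostname exactly, rather than testing for a substring, means a lookalike host such as api.stripe.com.evil.example is rejected.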

json_transform

Applies a Jinja2 template to each document, rendering the template with full document context and replacing the document with the parsed JSON output. Use this to reformat documents for external APIs or reshape data for downstream consumers.
Parameters:
| Parameter | Required | Description |
| --- | --- | --- |
| template | Yes | Jinja2 template string that must render to valid JSON |
| fail_on_error | No | Fail entire pipeline on transformation error (default: false) |
Template Context:
| Namespace | Description |
| --- | --- |
| DOC / doc | Current document fields and metadata |
| INPUT / inputs | Original query inputs from the search request |
| CONTEXT / context | Execution context (namespace_id, internal_id, etc.) |
| STAGE / stage | Current stage execution data |
Examples:
{
  "stage_type": "apply",
  "stage_id": "json_transform",
  "parameters": {
    "template": "{\"id\": \"{{ DOC.document_id }}\", \"content\": \"{{ DOC.text }}\", \"score\": {{ DOC.score }}}"
  }
}
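Because the template is itself JSON embedded in a JSON string, the backslash escaping is easy to get wrong by hand. One way to avoid that, sketched below, is to write the template as ordinary text and let json.dumps produce the escaped stage config. The plain string replacement at the end is only a stand-in for Jinja2 rendering, used to check that the rendered template parses as JSON; the sample values are made up.

```python
import json

# Write the template as ordinary text first...
template = (
    '{"id": "{{ DOC.document_id }}", '
    '"content": "{{ DOC.text }}", '
    '"score": {{ DOC.score }}}'
)

stage = {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {"template": template},
}

# ...and let json.dumps add the escaping needed inside the stage config.
stage_json = json.dumps(stage, indent=2)

# Sanity check: plain replacement stands in for Jinja2 rendering here.
rendered = (
    template.replace("{{ DOC.document_id }}", "doc_1")
    .replace("{{ DOC.text }}", "hello")
    .replace("{{ DOC.score }}", "0.92")
)
parsed = json.loads(rendered)
```

Note that a document field containing a double quote or newline would break the rendered JSON unless the template escapes it, so a check like the last step is worth running against representative documents.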

external_web_search

Performs AI-native web search using Exa’s neural ranking system. Creates new documents from web search results, enabling retriever pipelines to incorporate real-time internet content.
This stage creates new documents (0 → M transformation) rather than enriching existing ones. Use it at the start of a pipeline or to augment internal results with external web sources.
Parameters:
| Parameter | Required | Description |
| --- | --- | --- |
| query | Yes | Search query (supports {INPUT.field} and {DOC.field} templates) |
| num_results | No | Number of results (1-100, default: 10) |
| use_autoprompt | No | Enable Exa’s query enhancement (default: true) |
| start_published_date | No | Filter by publication date (YYYY-MM-DD format) |
| category | No | Content category: research paper, news, github, tweet, blog, company, pdf |
| include_text | No | Include text snippets in results (default: true) |
Output Schema: Each result becomes a document with:
  • metadata.url – Web page URL
  • metadata.title – Page title
  • metadata.text – Text snippet (if include_text=true)
  • metadata.published_date – Publication date (if available)
  • metadata.author – Author name (if available)
  • metadata.search_query – Original query used
  • metadata.search_position – 0-indexed position in results
  • score – Exa relevance score
Examples:
{
  "stage_type": "apply",
  "stage_id": "external_web_search",
  "parameters": {
    "query": "{INPUT.query}",
    "num_results": 10,
    "include_text": true,
    "use_autoprompt": true
  }
}
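The parameter constraints listed above (result count range, date format, category values) can be checked client-side before a retriever is created. The following is a small sketch of such a pre-flight check; check_web_search_params is a hypothetical helper, and the server remains the authority on what is accepted.

```python
import re
from datetime import date

CATEGORIES = {"research paper", "news", "github", "tweet", "blog", "company", "pdf"}
DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_web_search_params(params):
    """Return a list of problems found in external_web_search parameters."""
    problems = []
    if not params.get("query"):
        problems.append("query is required")
    n = params.get("num_results", 10)
    if not isinstance(n, int) or not 1 <= n <= 100:
        problems.append("num_results must be an integer from 1 to 100")
    start = params.get("start_published_date")
    if start is not None:
        try:
            if not DATE_SHAPE.fullmatch(start):
                raise ValueError(start)
            date.fromisoformat(start)  # rejects impossible dates such as 2024-02-30
        except (TypeError, ValueError):
            problems.append("start_published_date must use YYYY-MM-DD")
    category = params.get("category")
    if category is not None and category not in CATEGORIES:
        problems.append(f"unknown category: {category}")
    return problems
```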

Call GET /v1/retrievers/stages to retrieve the latest stage metadata and parameter schemas.

Execution Lifecycle

  1. Validate Inputs – Mixpeek enforces the retriever’s input_schema.
  2. Walk Stages – Each stage receives the current working set, runs, and outputs a new set.
  3. Apply Pagination – limit, offset, cursor, or keyset pagination is handled after the final stage.
  4. Return Telemetry – Responses include stage_statistics, budget, and optional presigned URLs.
Response headers include:
  • ETag – cache validator; pair with If-None-Match for 304 responses.
  • Cache-Control – TTL derived from cache_config.
  • X-Cache – HIT or MISS for query-level caching.
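The four lifecycle steps can be sketched as a single loop. This is a toy model under stated assumptions: stages are plain Python callables, input validation is reduced to a required-field check, only offset pagination is shown, and the documents_out statistic is an illustrative field rather than one the API is known to return.

```python
import time


def execute(stages, docs, inputs, required, limit=10, offset=0):
    """Toy lifecycle: validate inputs, walk stages, paginate, report stats."""
    # 1. Validate inputs (required-field check only in this sketch).
    missing = [name for name in required if name not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")

    # 2. Walk stages: each receives the working set and returns a new one.
    stats = {}
    for name, stage in stages:
        started = time.perf_counter()
        docs = stage(docs, inputs)
        stats[name] = {
            "duration_ms": (time.perf_counter() - started) * 1000,
            "documents_out": len(docs),  # illustrative field
        }

    # 3. Pagination is applied after the final stage.
    page = docs[offset:offset + limit]

    # 4. Return results together with per-stage telemetry.
    return {"documents": page, "stage_statistics": stats}


stages = [
    ("price_filter", lambda docs, inputs: [d for d in docs if d["price"] <= inputs["max_price"]]),
    ("by_score", lambda docs, inputs: sorted(docs, key=lambda d: d["score"], reverse=True)),
]
corpus = [
    {"id": "a", "price": 40, "score": 0.7},
    {"id": "b", "price": 200, "score": 0.9},
    {"id": "c", "price": 90, "score": 0.8},
]
result = execute(stages, corpus, {"query_text": "earbuds", "max_price": 150},
                 required=["query_text"], limit=1)
```

Because pagination runs last, limit=1 trims the already filtered and sorted set rather than the raw corpus.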

Filters & Templates

Structured filters support comparison operators (eq, gt, lte, in, etc.) and logical composition (AND, OR, NOT).
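A minimal sketch of how such a filter tree could be evaluated against one document is shown below. The leaf shape (field, operator, value) follows the execute example in this document; the nesting of AND, OR, and NOT as dictionary keys is an assumption made for illustration, and only the operators named above are covered.

```python
import operator

# Comparison operators named in the text; "in" tests membership in a list.
OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "lte": operator.le,
    "in": lambda value, options: value in options,
}


def get_path(doc, path):
    """Read a dot-path; None if any segment is missing."""
    for part in path.split("."):
        doc = doc.get(part) if isinstance(doc, dict) else None
    return doc


def matches(doc, flt):
    """Evaluate a filter tree against one document."""
    if "AND" in flt:
        return all(matches(doc, f) for f in flt["AND"])
    if "OR" in flt:
        return any(matches(doc, f) for f in flt["OR"])
    if "NOT" in flt:
        return not matches(doc, flt["NOT"])
    value = get_path(doc, flt["field"])
    return value is not None and OPS[flt["operator"]](value, flt["value"])


doc = {"metadata": {"category": "audio", "price": 120}}
audio_under_150 = {
    "AND": [
        {"field": "metadata.category", "operator": "eq", "value": "audio"},
        {"NOT": {"field": "metadata.price", "operator": "gt", "value": 150}},
    ]
}
```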

Template Namespaces

Stages support dynamic configuration through template expressions using Jinja2 syntax. Both uppercase and lowercase namespace formats are supported and work identically:
| Namespace | Description | Examples |
| --- | --- | --- |
| INPUT / inputs | User-provided query parameters and inputs | {{INPUT.query_text}}, {{inputs.max_price}} |
| DOC / doc | Current document fields (for per-document logic) | {{DOC.metadata.category}}, {{doc.content_type}} |
| CONTEXT / context | Execution state (budget, timing, retriever metadata) | {{CONTEXT.budget_remaining}}, {{context.time_elapsed_ms}} |
| STAGE / stage | Previous stage outputs (for cascading logic) | {{STAGE.hybrid_search.top_score}}, {{stage.filter.count}} |
Mixed usage within the same stage is supported. For example, you can use {{INPUT.query}} alongside {{context.budget_remaining}} in the same configuration.
Conditional expressions:
{
  "batch_size": "{{CONTEXT.budget_remaining > 50 ? 200 : 50}}",
  "field": "{{DOC.media_type == 'image' ? 'image_url' : 'video_url'}}"
}
Templated batch size:
{
  "batch_size": "{{20 * inputs.page_size}}"
}
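The equivalence of uppercase and lowercase namespaces can be pictured as an alias table applied before lookup. The sketch below is a simplified stand-in, not the template engine itself: resolve is a hypothetical helper that handles only plain {{NAMESPACE.path}} references and does not evaluate expressions such as the conditionals above.

```python
import re

# Uppercase and lowercase namespace names refer to the same data.
ALIASES = {
    "INPUT": "inputs", "inputs": "inputs",
    "DOC": "doc", "doc": "doc",
    "CONTEXT": "context", "context": "context",
    "STAGE": "stage", "stage": "stage",
}
REFERENCE = re.compile(r"\{\{\s*(\w+)\.([\w.]+)\s*\}\}")


def resolve(template, scopes):
    """Replace simple {{NAMESPACE.path}} references with their values."""
    def lookup(match):
        value = scopes[ALIASES[match.group(1)]]
        for part in match.group(2).split("."):
            value = value[part]
        return str(value)

    return REFERENCE.sub(lookup, template)


scopes = {
    "inputs": {"query": "earbuds"},
    "doc": {},
    "context": {"budget_remaining": 80},
    "stage": {},
}
mixed = resolve("{{INPUT.query}} / {{context.budget_remaining}}", scopes)
```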

Retrievers & Caching

  • Query cache – caches entire responses keyed by inputs, filters, pagination, and collection index signatures.
  • Stage cache – reuse outputs of expensive stages by listing them under cache_stage_names.
  • Inference cache – Engine deduplicates identical model calls.
Use GET /v1/analytics/retrievers/{id}/cache-performance to monitor hit rates and latency improvements.
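To make the query cache key concrete, here is one way such a key could be derived from the components listed above. The hashing scheme (canonical JSON plus SHA-256) and the query_cache_key name are assumptions for illustration; the source does not describe how Mixpeek computes its keys.

```python
import hashlib
import json


def query_cache_key(inputs, filters, pagination, index_signatures):
    """Derive a stable key from everything that can change the response."""
    payload = {
        "inputs": inputs,
        "filters": filters,
        "pagination": pagination,
        "index_signatures": sorted(index_signatures),
    }
    # sort_keys makes the key independent of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


k1 = query_cache_key({"a": 1, "b": 2}, None, {"limit": 10}, ["sig1"])
k2 = query_cache_key({"b": 2, "a": 1}, None, {"limit": 10}, ["sig1"])
k3 = query_cache_key({"a": 1, "b": 2}, None, {"limit": 10}, ["sig2"])
```

Including the collection index signatures in the key means that a change to the underlying index (k3) yields a different key, so stale responses are not served after reindexing.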

Pagination Options

| Method | Use Case |
| --- | --- |
| offset | Simple pagination, supports limit + offset |
| cursor | Stable iteration over large result sets |
| scroll | Deep pagination for analytics workloads |
| keyset | High-performance paginated browsing (requires sort key) |
Specify the method in pagination.method when executing a retriever.
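The difference between offset and keyset pagination is easiest to see side by side. The sketch below is a toy in-memory model with a hypothetical rank field as the sort key; a real system would apply the keyset condition inside the index rather than sorting in Python, and the key is assumed to be unique.

```python
def offset_page(rows, limit, offset):
    """Offset pagination: skip `offset` rows, then take `limit`."""
    return rows[offset:offset + limit]


def keyset_page(rows, limit, after=None, key="rank"):
    """Keyset pagination: resume strictly after the last seen sort key."""
    ordered = sorted(rows, key=lambda r: r[key])
    if after is not None:
        ordered = [r for r in ordered if r[key] > after]
    page = ordered[:limit]
    next_after = page[-1][key] if page else None
    return page, next_after


rows = [{"id": i, "rank": i} for i in range(1, 6)]
first, token = keyset_page(rows, limit=2)
second, token2 = keyset_page(rows, limit=2, after=token)
```

Offset pagination must skip over every earlier row on each request, whereas keyset pagination carries forward the last sort key, which is why it needs a sort key and scales better for deep browsing.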

Execute a Retriever

curl -sS -X POST "$MP_API_URL/v1/retrievers/<retriever_id>/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": {
      "query_text": "wireless earbuds",
      "max_price": 150
    },
    "filters": {
      "field": "metadata.category",
      "operator": "eq",
      "value": "audio"
    },
    "limit": 10,
    "return_urls": true,
    "return_vectors": false,
    "session_id": "sess_123"
  }'
Response snippet:
{
  "execution_id": "exec_b8f31e0c",
  "documents": [...],
  "stage_statistics": {
    "hybrid_search": { "duration_ms": 180, "cache_hit": true },
    "filter": { "duration_ms": 8 },
    "rerank": { "duration_ms": 120 }
  },
  "budget": {
    "credits_used": 12.4,
    "credits_limit": 100,
    "time_elapsed_ms": 310
  }
}
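For callers working in Python rather than curl, the same request can be assembled with the standard library. This is a sketch under stated assumptions: build_execute_request and slowest_stage are hypothetical helpers, the environment variable names mirror the curl example, the fallback base URL is the host shown in the live stages link, and the request is built but not sent.

```python
import json
import os
import urllib.request


def build_execute_request(retriever_id, inputs, filters=None, limit=10):
    """Build (but do not send) the POST request shown in the curl example."""
    base = os.environ.get("MP_API_URL", "https://api.mixpeek.com")
    body = {"inputs": inputs, "limit": limit}
    if filters is not None:
        body["filters"] = filters
    return urllib.request.Request(
        url=f"{base}/v1/retrievers/{retriever_id}/execute",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ.get('MP_API_KEY', '')}",
            "X-Namespace": os.environ.get("MP_NAMESPACE", ""),
            "Content-Type": "application/json",
        },
    )


def slowest_stage(response):
    """Name of the stage with the largest duration_ms in stage_statistics."""
    stats = response["stage_statistics"]
    return max(stats, key=lambda name: stats[name]["duration_ms"])


# "ret_123" is a placeholder retriever ID.
req = build_execute_request(
    "ret_123",
    inputs={"query_text": "wireless earbuds", "max_price": 150},
    filters={"field": "metadata.category", "operator": "eq", "value": "audio"},
)
```

Passing req to urllib.request.urlopen would send it; applied to the response snippet above, slowest_stage returns hybrid_search.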

Maintenance & Versioning

  • Use PATCH /v1/retrievers/{id} to rename retrievers or adjust cache settings (stages and schema are immutable; create a new retriever for breaking changes).
  • List retrievers with filters, search, and sort: POST /v1/retrievers/list.
  • Retrieve execution history: GET /v1/retrievers/{id}/executions.
  • Diagnose pipelines without executing: POST /v1/retrievers/{id}/explain.

Interaction Feedback

Capture user feedback with /v1/retrievers/interactions to power downstream analytics, learning-to-rank, or personalized retrieval:
{
  "feature_id": "doc_abc123",
  "interaction_type": ["click", "long_view"],
  "position": 2,
  "metadata": { "duration_ms": 12000 },
  "user_id": "user_456",
  "session_id": "sess_xyz789"
}

Best Practices

  1. Start narrow – run a single search stage before adding rerankers or joins.
  2. Push filters early – stage-level filters shrink the candidate set before expensive operations.
  3. Use JOIN strategies wisely – direct for key-based joins, retriever for similarity joins; set join_strategy to control merge behavior.
  4. Enable caching – stage caching plus query caching dramatically reduces latency for repeat queries.
  5. Monitor analytics – use retriever analytics endpoints to optimize parameters, detect slow stages, and understand cache ROI.
Retrievers turn Mixpeek’s primitives—features, taxonomies, clusters, and models—into end-user search experiences. Configure once, execute anywhere, and evolve the pipeline with confidence.