Effective caching reduces latency, lowers inference costs, and improves user experience. Mixpeek supports retriever-level and stage-level caching with configurable TTLs and invalidation strategies.

Cache Architecture

Cache keys are computed from:
  • Retriever ID
  • Input parameters (normalized)
  • Filters and sort criteria
  • Stage configurations (if stage-level caching)
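To illustrate, a key of this shape can be built by serializing the normalized components and hashing them. This is a hypothetical sketch of the general approach, not Mixpeek's exact algorithm:
import hashlib
import json

def compute_cache_key(retriever_id: str, inputs: dict, filters: dict, stage_config: dict = None) -> str:
    # Hypothetical: normalize string inputs, serialize deterministically, then hash
    # so identical requests always map to the same key.
    payload = {
        "retriever_id": retriever_id,
        "inputs": {k: v.strip().lower() if isinstance(v, str) else v
                   for k, v in sorted(inputs.items())},
        "filters": filters,
        "stage_config": stage_config,
    }
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()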

Cache Levels

Retriever-Level Caching

Caches entire execution results:
{
  "retriever_name": "product-search",
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 300,
    "cache_key_fields": ["inputs.query", "filters"]
  }
}
When to use:
  • Identical queries are common (e.g., trending searches, popular categories)
  • Expensive pipelines with multiple stages
  • Results don’t change frequently
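A quick sanity check is to run the same query twice and compare latencies; the repeat call should return much faster on a cache hit. A sketch, assuming an initialized mixpeek SDK client as in the later examples:
import time

def timed_execute(query: str) -> float:
    start = time.perf_counter()
    mixpeek.retrievers.execute(
        retriever_id="product-search",
        inputs={"query": query}
    )
    return (time.perf_counter() - start) * 1000

cold_ms = timed_execute("wireless headphones")  # cache miss: full pipeline runs
warm_ms = timed_execute("wireless headphones")  # cache hit: served from cache
print(f"cold: {cold_ms:.0f} ms, warm: {warm_ms:.0f} ms")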

Stage-Level Caching

Caches specific stage outputs within a pipeline:
{
  "cache_config": {
    "enabled": true,
    "cache_stage_names": ["web_search", "llm_generation"]
  }
}
When to use:
  • Some stages are expensive (LLM, web search) but others must be fresh (filters, sorts)
  • Partial cache hits provide value
  • Different queries share intermediate results

Feature Store Caching

Embeddings are cached implicitly in Qdrant vectors. Re-embedding identical text is avoided automatically.
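Conceptually this is content-addressed reuse: the text is hashed and an existing vector is looked up before any model call. A simplified sketch of the general pattern (not Mixpeek's internal implementation; embed_text is a hypothetical embedding call):
import hashlib

embedding_cache = {}

def embed_with_reuse(text: str):
    # Identical strings hash to the same key, so the stored vector is reused
    # instead of invoking the embedding model again.
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in embedding_cache:
        embedding_cache[key] = embed_text(text)  # hypothetical embedding call
    return embedding_cache[key]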

TTL Selection Guide

Use Case | Recommended TTL | Rationale
--- | --- | ---
Real-time dashboards | 10-30 seconds | Balance freshness vs load
Product search | 5-15 minutes | Inventory changes slowly
Customer support KB | 30-60 minutes | Documentation is stable
News/trending content | 1-2 minutes | Content updates frequently
LLM-generated summaries | 1-24 hours | Expensive to regenerate
Exploratory research | 5-10 minutes | Users iterate rapidly

Cache Key Normalization

Automatic Normalization

Mixpeek normalizes cache keys to maximize hit rate:
# These produce the same cache key:
query1 = "laptop computers"
query2 = "Laptop Computers"
query3 = "  laptop computers  "

# Normalized to: "laptop computers"

Custom Normalization

Control which inputs affect cache keys:
{
  "cache_config": {
    "cache_key_fields": ["inputs.query"],
    "ignore_fields": ["inputs.session_id", "return_urls"]
  }
}
Example: Two queries with different session_id but identical query share the same cache.
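For instance (a sketch assuming the SDK client used elsewhere on this page, with session_id passed in inputs):
# Both calls compute the same cache key because inputs.session_id is ignored
result_a = mixpeek.retrievers.execute(
    retriever_id="product-search",
    inputs={"query": "laptop computers", "session_id": "sess_123"}
)
result_b = mixpeek.retrievers.execute(
    retriever_id="product-search",
    inputs={"query": "laptop computers", "session_id": "sess_456"}
)
# result_b is served from the cache entry created by result_a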

Invalidation Strategies

Time-Based (TTL)

Default strategy. Cache entries expire after TTL:
{
  "cache_config": {
    "ttl_seconds": 600
  }
}

Manual Invalidation

Invalidate specific retriever caches:
DELETE /v1/retrievers/{retriever_id}/cache
Triggers:
  • After reindexing documents
  • After collection schema updates
  • After taxonomy/cluster enrichment
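For example, as part of a reindexing script (a sketch; the base URL and auth header are assumptions, so substitute your account's credentials, or use the SDK invalidate_cache call shown in the webhook handler below):
import requests

resp = requests.delete(
    f"https://api.mixpeek.com/v1/retrievers/{retriever_id}/cache",
    headers={"Authorization": f"Bearer {MIXPEEK_API_KEY}"}  # hypothetical auth scheme
)
resp.raise_for_status()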

Event-Driven Invalidation

Use webhooks to invalidate on document updates:
POST /v1/organizations/webhooks
{
  "event_types": ["document.created", "document.updated", "document.deleted"],
  "url": "https://your-app.com/webhooks/cache-invalidate",
  "filters": {
    "collection_ids": ["col_products"]
  }
}
In your webhook handler:
from fastapi import FastAPI

app = FastAPI()

@app.post("/webhooks/cache-invalidate")
def invalidate_cache(event: dict):  # FastAPI parses the JSON webhook payload into `event`
    if event["type"] == "document.updated":
        retriever_id = event["metadata"]["retriever_id"]
        mixpeek.retrievers.invalidate_cache(retriever_id)  # assumes an initialized SDK client

Optimizing Cache Hit Rate

1. Pre-Warm Cache

For predictable queries, warm cache before peak traffic:
popular_queries = ["wireless headphones", "laptop stand", "USB-C cable"]

for query in popular_queries:
    mixpeek.retrievers.execute(
        retriever_id="product-search",
        inputs={"query": query}
    )
Schedule as a cron job during off-peak hours.

2. Query Canonicalization

Normalize queries before submission:
def canonicalize_query(query: str) -> str:
    # Lowercase, trim, and collapse any run of whitespace to a single space
    return " ".join(query.lower().split())

query = canonicalize_query(user_input)
result = mixpeek.retrievers.execute(retriever_id, inputs={"query": query})

3. Aggregate Similar Queries

Use query clustering to map variants to canonical forms:
# Canonical form -> known variants
query_map = {
    "laptop": ["laptop", "notebook", "laptop computer"],
    "headphones": ["headphones", "headphone", "earphones"]
}

# Invert to variant -> canonical so every variant shares one cache entry
variant_to_canonical = {v: c for c, variants in query_map.items() for v in variants}

canonical = variant_to_canonical.get(user_query, user_query)

4. Cache Partial Results

For multi-stage pipelines, cache expensive stages even if final results differ:
{
  "stages": [
    {
      "stage_name": "web_search",
      "cache_ttl_seconds": 3600  // Cache web results for 1 hour
    },
    {
      "stage_name": "filter",
      "cache_ttl_seconds": 0  // Don't cache filters (change per query)
    },
    {
      "stage_name": "llm_generation",
      "cache_ttl_seconds": 86400  // Cache summaries for 24 hours
    }
  ]
}

Monitoring Cache Performance

Track Hit Rate

GET /v1/analytics/cache/performance
Response:
{
  "hit_rate": 0.73,
  "total_requests": 10000,
  "cache_hits": 7300,
  "cache_misses": 2700,
  "avg_hit_latency_ms": 12,
  "avg_miss_latency_ms": 450
}
Target: >70% hit rate for stable workloads.
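A small monitoring sketch against that target, reusing the assumed base URL and auth header from the invalidation example above:
import requests

resp = requests.get(
    "https://api.mixpeek.com/v1/analytics/cache/performance",
    headers={"Authorization": f"Bearer {MIXPEEK_API_KEY}"}  # hypothetical auth scheme
)
metrics = resp.json()

# Flag workloads that fall below the ~70% target so TTLs and key fields can be tuned
if metrics["hit_rate"] < 0.70:
    print(f"Hit rate {metrics['hit_rate']:.0%} below target; review TTLs and cache_key_fields")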

Per-Retriever Metrics

GET /v1/analytics/retrievers/{retriever_id}/performance
{
  "cache_hit_rate": 0.81,
  "avg_cache_hit_latency_ms": 8,
  "cache_size_mb": 45
}

Optimize Based on Metrics

Metric | Observation | Action
--- | --- | ---
Hit rate <50% | Queries too diverse | Increase TTL, add query canonicalization
Hit rate >90% | Over-caching | Reduce TTL to save memory
High miss latency | Expensive execution | Enable stage-level caching
Large cache size | Memory pressure | Reduce TTL or limit cached stages

Cost-Performance Trade-Offs

Scenario 1: High-Volume, Repetitive Queries

Example: E-commerce product search with trending queries
Strategy:
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 900,  // 15 minutes
    "cache_stage_names": ["knn_search", "rerank"]
  }
}
Result: 80% hit rate → 5x reduction in inference costs

Scenario 2: Unique, Expensive Queries

Example: Research assistant with LLM summarization
Strategy:
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 3600,  // 1 hour
    "cache_stage_names": ["llm_generation"]  // Only cache LLM outputs
  }
}
Result: Even 20% hit rate saves significant LLM costs

Scenario 3: Real-Time Dashboards

Example: Live analytics with frequent updates
Strategy:
{
  "cache_config": {
    "enabled": true,
    "ttl_seconds": 30,  // 30 seconds
    "cache_key_fields": ["filters.date_range"]
  }
}
Result: Smooth user experience without excessive staleness

Advanced Patterns

Conditional Caching

Vary the cache TTL based on query characteristics — longer for common or expensive queries, shorter for one-offs:
def select_cache_ttl(query: str) -> int:
    # Cache common queries longer
    if query in popular_queries:
        return 900  # 15 minutes
    # Cache expensive LLM-style queries much longer
    elif "summarize" in query or "explain" in query:
        return 3600  # 1 hour
    # Short cache for unique queries
    else:
        return 60  # 1 minute

ttl = select_cache_ttl(user_query)
result = mixpeek.retrievers.execute(
    retriever_id,
    inputs={"query": user_query},
    override_cache_ttl=ttl
)

Cache Warming Pipeline

Pre-populate cache with analytics-driven queries:
# Get top queries from last 24 hours
top_queries = mixpeek.analytics.get_top_queries(
    retriever_id=retriever_id,
    time_range="24h",
    limit=100
)

# Warm cache
for query_data in top_queries:
    mixpeek.retrievers.execute(
        retriever_id,
        inputs={"query": query_data["query"]}
    )

Multi-Tier Caching

Combine local (client-side) and remote (Mixpeek) caches:
import hashlib
import json

import redis

redis_client = redis.Redis()
query_hash = hashlib.sha256(query.encode()).hexdigest()
cache_key = f"mixpeek:{retriever_id}:{query_hash}"

# Client-side (Redis): check the local tier first
cached = redis_client.get(cache_key)
if cached:
    result = json.loads(cached)
else:
    # Mixpeek-side (automatic)
    result = mixpeek.retrievers.execute(retriever_id, inputs={"query": query})
    # Store locally with a shorter TTL than the Mixpeek-side cache
    redis_client.setex(cache_key, 60, json.dumps(result))  # 1 minute

Cache Expiration Strategies

Passive Expiration

Default behavior. Entries expire when TTL reached:
{ "ttl_seconds": 300 }

Active Refresh

Update cache before expiration (background job):
from apscheduler.schedulers.background import BackgroundScheduler

def refresh_cache():
    for query in critical_queries:
        mixpeek.retrievers.execute(retriever_id, inputs={"query": query})

scheduler = BackgroundScheduler()
scheduler.add_job(refresh_cache, 'interval', minutes=5)
scheduler.start()

Lazy Invalidation

Invalidate on write, but serve stale data during recompute:
# On document update
mixpeek.retrievers.invalidate_cache(retriever_id, allow_stale=True)

# Next request serves stale cache while recomputing in background

Best Practices

  • Begin with short TTLs (60-300s) and increase based on observed freshness requirements and hit rate.
  • Don’t cache cheap operations (filters, sorts). Focus on LLM, web search, and reranking stages.
  • Large caches impact performance. Tune TTLs and cache key fields to balance hit rate vs memory.
  • When reindexing or updating taxonomy, manually invalidate affected retriever caches.
  • Lowercase, trim, remove stop words, and canonicalize synonyms to maximize cache hits.
  • Use separate retrievers for different freshness requirements, e.g., product-search-fresh (TTL=30s) vs product-search-stable (TTL=900s) for different UX contexts.

Troubleshooting

Low Hit Rate

Symptoms: <50% hit rate despite repetitive queries
Diagnosis:
GET /v1/analytics/cache/performance?group_by=cache_key
Inspect cache key distribution. High cardinality indicates:
  • Unnecessary fields in cache_key_fields
  • Lack of query normalization
  • Session IDs or timestamps affecting keys
Fix: Add normalization, reduce key fields, or increase TTL.

Stale Results

Symptoms: Users report outdated search results
Diagnosis: Check TTL vs update frequency
Fix:
  1. Reduce TTL
  2. Implement event-driven invalidation
  3. Use allow_stale=false for critical queries

Cache Memory Pressure

Symptoms: High Redis memory usage, evictions
Diagnosis:
GET /v1/analytics/cache/memory
Fix:
  1. Reduce TTL globally
  2. Limit cache_stage_names to expensive stages only
  3. Implement LRU eviction policy

Next Steps