Mixpeek uses per-stage caching: each stage of your retriever pipeline (inference, search, reranking) caches its results independently. This means you get partial cache hits even when only part of your pipeline changes, dramatically reducing compute costs.

Overview

  • Architecture: Per-stage caching (each stage caches independently)
  • Key Benefit: Partial cache hits save compute even when pipeline changes
  • Memory Management: LRU eviction (no TTL tuning needed)
  • Performance: Sub-millisecond cached responses, 80%+ cost reduction
  • Backend: Redis with automatic eviction

How It Works

The Problem with Traditional Caching

Traditional retriever caching is all-or-nothing:
Query: "dogs on skateboards"
Pipeline: Embed Query → Vector Search → Rerank → Return

Cache Key: hash(entire_pipeline)

❌ Change rerank model?
   → Full cache miss
   → Re-embed query (expensive GPU call)
   → Re-search vectors (expensive DB query)
   → Re-rerank with new model

Total waste: Embedding + Vector search compute

The Per-Stage Solution

With per-stage caching, each stage manages its own cache:
Query: "dogs on skateboards"

Stage 1: Inference (Embedding)
  Cache Key: query_text + model_config
  ✅ HIT → Reuse cached embedding

Stage 2: Vector Search
  Cache Key: embedding + filters + collection
  ✅ HIT → Reuse cached search results

Stage 3: Reranking
  Cache Key: doc_ids + rerank_model
  ❌ MISS → Only rerank (model changed!)

Result: Saved 95% of compute by reusing stages 1 & 2!
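
In pseudocode terms, each stage derives its key from only the inputs that affect its own output, checks its cache, and computes on a miss. The sketch below is a minimal Python illustration (the key format, run_stage helper, and in-memory dict cache are illustrative, not Mixpeek internals):

import hashlib, json

def stage_key(stage: str, namespace: str, payload: dict) -> str:
    # Hash only the fields that influence this stage's output
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
    return f"stage:{stage}:{namespace}:hash_{digest}"

def run_stage(cache: dict, stage: str, namespace: str, payload: dict, compute):
    key = stage_key(stage, namespace, payload)
    if key in cache:          # HIT: reuse the cached result
        return cache[key]
    result = compute()        # MISS: run the (expensive) stage
    cache[key] = result
    return result

# Changing the rerank model only changes the rerank payload, so the
# inference and search keys, and their cached values, stay valid.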

Automatic Operation

Stage caching happens automatically — you don’t need to configure anything. Each stage checks its cache before executing and stores results after execution.
# Execute a retriever (caching happens transparently)
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_acme" \
  -d '{
    "inputs": {"text": "machine learning"},
    "limit": 10
  }'

# First request: All stages MISS
# - Inference: Generate embedding (~100ms)
# - KNN: Search vectors (~1500ms)
# - Rerank: Rerank results (~200ms)
# Total: ~1800ms

# Second request: All stages HIT
# - Inference: Cached embedding (~0.4ms)
# - KNN: Cached results (~0.4ms)
# - Rerank: Cached ranking (~0.4ms)
# Total: ~1.2ms (1500x faster!)

Stage-Specific Caching

Inference Stage (Embeddings)

What gets cached: Text, image, and video embeddings
Cache key: input_data + model_config
When to invalidate: Rarely (only when the embedding model changes)
Embeddings are deterministic — the same input with the same model always produces the same output. This makes inference caching extremely effective with near-zero invalidation.
// Cache entry examples
{
  "key": "stage:inference:ns_acme:hash_abc123",
  "value": [0.123, 0.456, 0.789, ...],  // 1536-dim embedding
  "inputs": {
    "text": "dogs on skateboards",
    "modality": "text"
  },
  "config": {
    "model": "text-embedding-3-small",
    "dimensions": 1536
  }
}
Cache hit scenarios:
  • ✅ Same query text
  • ✅ Same image URL or file
  • ✅ Same video URL or file
  • ✅ Same model configuration
Cache miss scenarios:
  • ❌ Different query text
  • ❌ Different model name or version
  • ❌ Different embedding dimensions
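
To make these hit/miss rules concrete, here is a hypothetical key derivation in Python (the exact hashing scheme is internal to Mixpeek; the point is that the key covers the input text and the model configuration, and nothing else):

import hashlib, json

def inference_cache_key(namespace: str, text: str, model: str, dimensions: int) -> str:
    payload = {"text": text, "modality": "text", "model": model, "dimensions": dimensions}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
    return f"stage:inference:{namespace}:hash_{digest}"

a = inference_cache_key("ns_acme", "dogs on skateboards", "text-embedding-3-small", 1536)
b = inference_cache_key("ns_acme", "dogs on skateboards", "text-embedding-3-small", 1536)
c = inference_cache_key("ns_acme", "Dogs on skateboards", "text-embedding-3-small", 1536)
assert a == b   # identical text + config: HIT
assert a != c   # case differs: different key, MISS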

Vector Search Stage (KNN)

What gets cached: Document IDs and similarity scores from vector search
Cache key: embedding + filters + collections + limit
When to invalidate: When documents are added, updated, or deleted
// Cache entry example
{
  "key": "stage:knn_search:ns_acme:hash_def456",
  "value": [
    {"document_id": "doc_123", "score": 0.95},
    {"document_id": "doc_456", "score": 0.87},
    {"document_id": "doc_789", "score": 0.82}
  ],
  "inputs": {
    "embedding": [0.123, 0.456, ...],
    "collection_ids": ["col_articles"],
    "filters": {"category": "tech"},
    "limit": 10
  }
}
Cache hit scenarios:
  • ✅ Same embedding vector
  • ✅ Same filters
  • ✅ Same collection set
  • ✅ Same limit/offset
Cache miss scenarios:
  • ❌ Different embedding
  • ❌ Documents added/updated/deleted in collection
  • ❌ Different filters
  • ❌ Different limit
The vector search cache is automatically invalidated via webhook events when documents change, so you never serve stale results.
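
A sketch of the write path under these rules (the helper names and per-collection tracking scheme are hypothetical; the redis-py calls are real): the key hashes every input that affects the result set, and each stored entry is also indexed under its collection so that document changes can invalidate exactly the affected searches.

import hashlib, json
import redis

r = redis.Redis()

def knn_cache_key(namespace, embedding, collection_ids, filters, limit):
    payload = {"embedding": embedding, "collection_ids": collection_ids,
               "filters": filters, "limit": limit}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
    return f"stage:knn_search:{namespace}:hash_{digest}"

def cache_search_results(key: str, collection_ids: list, results: list) -> None:
    r.set(key, json.dumps(results))
    for cid in collection_ids:
        # Track the key per collection so a document change can clear it later
        r.sadd(f"collection_keys:knn_search:{cid}", key)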

Reranking Stage

What gets cached: Reranked document IDs and scores
Cache key: document_ids + query + rerank_model + config
When to invalidate: Rarely (only when the rerank model/config changes)
// Cache entry example
{
  "key": "stage:rerank:ns_acme:hash_ghi789",
  "value": [
    {"document_id": "doc_789", "score": 0.99},
    {"document_id": "doc_123", "score": 0.94},
    {"document_id": "doc_456", "score": 0.88}
  ],
  "inputs": {
    "document_ids": ["doc_123", "doc_456", "doc_789"],
    "query": "dogs on skateboards",
    "strategy": "cross_encoder"
  },
  "config": {
    "model": "ms-marco-MiniLM-L-12-v2",
    "normalize": true
  }
}
Cache hit scenarios:
  • ✅ Same document set (order-independent)
  • ✅ Same query text
  • ✅ Same rerank strategy and model
  • ✅ Same configuration
Cache miss scenarios:
  • ❌ Different documents
  • ❌ Different query
  • ❌ Different rerank model
  • ❌ Different strategy (e.g., cross-encoder → RRF)
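
The "same document set (order-independent)" behavior can be pictured as canonicalizing the doc IDs (for example, sorting them) before hashing. A hypothetical sketch:

import hashlib, json

def rerank_cache_key(namespace, document_ids, query, model, strategy):
    payload = {"document_ids": sorted(document_ids),   # canonical order
               "query": query, "model": model, "strategy": strategy}
    digest = hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()[:12]
    return f"stage:rerank:{namespace}:hash_{digest}"

a = rerank_cache_key("ns_acme", ["doc_123", "doc_456", "doc_789"],
                     "dogs on skateboards", "ms-marco-MiniLM-L-12-v2", "cross_encoder")
b = rerank_cache_key("ns_acme", ["doc_789", "doc_123", "doc_456"],
                     "dogs on skateboards", "ms-marco-MiniLM-L-12-v2", "cross_encoder")
assert a == b   # same document set in a different order: still a HIT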

Why Per-Stage Caching Matters

Scenario 1: Changing Rerank Models

You want to experiment with different reranking models to improve relevance:
Initial Pipeline:
  Embed → Search → Rerank (ms-marco)

User changes to:
  Embed → Search → Rerank (bge-reranker)

With Traditional Caching:
  ❌ Full pipeline cache miss
  → Re-embed query ($$$)
  → Re-search vectors ($$)
  → Rerank with new model ($)
  Total: 100% compute used

With Per-Stage Caching:
  ✅ Inference cache HIT (embedding unchanged)
  ✅ Vector search cache HIT (results unchanged)
  ❌ Rerank cache MISS (model changed)
  Total: Only 5% compute used!

Scenario 2: Adding Documents

You ingest new documents into your collection:
Event: User adds 1,000 new documents

With Traditional Caching:
  ❌ Invalidate ALL cached queries
  → Every subsequent query is a full miss
  → Re-embed + re-search + re-rerank
  Total: 100% compute for every query

With Per-Stage Caching:
  ✅ Inference cache unchanged (embeddings are deterministic)
  ❌ Vector search cache invalidated (index changed)
  ❌ Rerank cache invalidated (document set changed)
  
  Next query:
  ✅ Inference: HIT (saved 60% of cost)
  ❌ Search: MISS
  ❌ Rerank: MISS
  Total: Only 40% compute used

Scenario 3: Cross-Pipeline Reuse

You have multiple retrievers using the same embedding model:
Pipeline A: Embed (e5-small) → Search → Rerank
Pipeline B: Embed (e5-small) → Search → LLM Generate

Query: "artificial intelligence" via Pipeline A
  → Inference: MISS → cache embedding
  → Search: MISS → cache
  → Rerank: MISS → cache

Same query via Pipeline B:
  → Inference: ✅ HIT (shared with A!)
  → Search: ✅ HIT (shared with A!)
  → LLM Generate: MISS (different stage)

Result: Reused 2 of 3 stages (inference + search) from Pipeline A's cache; only the LLM stage ran

Cache Invalidation

Automatic Invalidation

Mixpeek automatically invalidates stage caches based on data changes:
Event                      Stages Invalidated       Reason
Document added             Vector Search, Rerank    Index contents changed
Document updated           Vector Search, Rerank    Document set changed
Document deleted           Vector Search, Rerank    Index contents changed
Embedding model changed    Inference                Different model = different embeddings
Rerank model changed       Rerank                   Different model = different scores
Inference cache is almost never invalidated because embeddings are deterministic. The same input always produces the same output for a given model.

Webhook-Driven Invalidation

When you configure webhooks, cache invalidation happens automatically:
// Document update webhook
{
  "event": "document.updated",
  "payload": {
    "namespace_id": "ns_acme",
    "collection_id": "col_articles",
    "document_id": "doc_123"
  }
}

// Automatic cache invalidation:
// ❌ Vector Search cache: Cleared for col_articles
// ❌ Rerank cache: Cleared for col_articles
// ✅ Inference cache: Unchanged (embeddings still valid)
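
On the receiving side, a hedged sketch of a handler (the event shape comes from the payload above; the per-collection key index is the hypothetical scheme from the search-stage sketch, applied here to both the search and rerank stages): entries for the affected collection are cleared, inference entries are left alone.

import redis

r = redis.Redis()

def handle_webhook(event: dict) -> None:
    if not event["event"].startswith("document."):
        return
    collection_id = event["payload"]["collection_id"]
    # Clear search and rerank entries tracked for this collection;
    # inference entries are untouched because the embeddings stay valid.
    for stage in ("knn_search", "rerank"):
        index = f"collection_keys:{stage}:{collection_id}"
        keys = r.smembers(index)
        if keys:
            r.delete(*keys)
        r.delete(index)

handle_webhook({"event": "document.updated",
                "payload": {"namespace_id": "ns_acme",
                            "collection_id": "col_articles",
                            "document_id": "doc_123"}})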

Invalidation by Stage

Each stage has its own invalidation logic:
Inference Stage:
  • Never invalidated (unless the model changes)
  • Embeddings are deterministic for a given model
Vector Search Stage:
  • Invalidated on document changes
  • When: add/update/delete documents
  • Scope: All searches for that collection
Rerank Stage:
  • Rarely invalidated
  • When: Rerank model/config changes
  • Scope: All rerank operations with that model

Memory Management (LRU)

Why LRU Instead of TTL?

Traditional caching uses TTL (time-to-live) where entries expire after a fixed duration. Per-stage caching uses LRU (Least Recently Used) eviction instead:
Approach    Behavior                              Outcome
TTL         Popular queries expire arbitrarily    ❌ Wasted cache space
TTL         Unpopular queries waste memory        ❌ No automatic cleanup
TTL         Hard to tune (1hr? 1day?)             ❌ Guesswork
LRU         Most-used stays cached                ✅ Automatic optimization
LRU         Least-used auto-evicted               ✅ Self-cleaning
LRU         Bounded memory usage                  ✅ Predictable costs

How LRU Works

Redis Memory: 1GB (maxmemory limit - see Redis eviction policies)

Cache fills up:
  Inference: 500MB (83K embeddings)
  Search: 300MB (60K result sets)
  Rerank: 200MB (200K rankings)
  Total: 1GB (at limit)

New cache entry needs space:
  1. Redis identifies least recently used key
  2. Evicts that key automatically
  3. Stores new entry
  4. No manual intervention needed!

Result: Most popular queries stay cached naturally
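
maxmemory and maxmemory-policy are standard Redis settings; allkeys-lru tells Redis to evict the least recently used key across the whole keyspace once the limit is reached. A minimal sketch with redis-py (in production these are usually set in redis.conf rather than at runtime):

import redis

r = redis.Redis()

# Bound the cache and enable LRU eviction
r.config_set("maxmemory", "1gb")
r.config_set("maxmemory-policy", "allkeys-lru")

print(r.config_get("maxmemory-policy"))   # {'maxmemory-policy': 'allkeys-lru'}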

Memory Distribution

Typical allocation for a retriever pipeline:
Total Redis: 1GB

Inference cache:   500MB  (50%)  ← Largest (embeddings are big)
Search cache:      300MB  (30%)  ← Medium (doc IDs + scores)  
Rerank cache:      200MB  (20%)  ← Smallest (reordered IDs)
Adjust memory allocation based on your workload:
  • Heavy embedding usage → allocate more to inference
  • Complex filters → allocate more to search
  • Multiple rerank models → allocate more to rerank

Performance

Benchmarks

Per-stage cache performance compared to uncached operations:
Stage                       Cache HIT    Uncached    Speedup
Inference (embedding)       ~0.4ms       ~100ms      250x faster
Vector Search (KNN)         ~0.4ms       ~1500ms     3750x faster
Reranking                   ~0.4ms       ~200ms      500x faster
Full Pipeline (all HITs)    ~1.2ms       ~1800ms     1500x faster

Partial Cache Hit Benefits

Even when some stages miss, you still save compute:
Scenario                     Stages    Time        Savings
All cached                   ✅✅✅     ~1.2ms      99.9%
Inference + Search cached    ✅✅❌     ~201ms      89%
Only Inference cached        ✅❌❌     ~1701ms     6%
Full miss                    ❌❌❌     ~1800ms     0% (baseline)
Even a single stage cache hit saves significant compute. Inference caching alone saves ~100ms per query!

Cost Savings

Example for 1M queries/day:
Without caching:
  1M queries × $0.002/query = $2,000/day

With 80% hit rate (all stages cached):
  800K cached queries ≈ $0 (just Redis overhead)
  200K uncached queries × $0.002 = $400/day
  
  Savings: $1,600/day = $48K/month = $576K/year

With partial cache hits (20% full miss, 60% partial hit):
  200K full hits: ~$0
  600K partial hits (inference cached, ~$0.001/query): $600/day
  200K full miss: $400/day
  
  Savings: $1,000/day = $30K/month = $360K/year
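
Spelling out the partial-hit arithmetic (the $0.002 full-pipeline cost and the ~$0.001 cost with only inference cached are the illustrative figures used above, not measured prices):

QUERIES_PER_DAY = 1_000_000
FULL_COST = 0.002       # $/query when nothing is cached
PARTIAL_COST = 0.001    # $/query when inference is cached but search + rerank miss

baseline = QUERIES_PER_DAY * FULL_COST                   # $2,000/day

# 20% full hit (~$0), 60% partial hit, 20% full miss
spend = (0.60 * QUERIES_PER_DAY * PARTIAL_COST           # $600/day
         + 0.20 * QUERIES_PER_DAY * FULL_COST)           # $400/day

savings = baseline - spend                               # $1,000/day
print(f"${savings:,.0f}/day, ${savings * 30:,.0f}/month, ${savings * 30 * 12:,.0f}/year")
# -> $1,000/day, $30,000/month, $360,000/year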

Memory Efficiency

Stage            Entry Size                    1GB Capacity
Inference        ~6KB (1536-dim embedding)     ~170K embeddings
Vector Search    ~5KB (10 results)             ~200K result sets
Reranking        ~1KB (10 reranked IDs)        ~1M result sets

Best Practices

Optimizing Cache Hit Rates

1. Use consistent query formatting
# These are DIFFERENT cache keys:
{"text": "dogs"}           # One key
{"text": "Dogs"}           # Different key (case-sensitive)
{"text": " dogs "}         # Different key (whitespace)

# Normalize queries client-side for better hit rates
query = text.lower().strip()
2. Reuse embeddings across pipelines
If multiple retrievers use the same embedding model,
they automatically share inference cache!

Pipeline A: embed(text-embedding-3-small) → search → rerank
Pipeline B: embed(text-embedding-3-small) → search → generate

Both pipelines share the inference cache ✅
3. Monitor memory usage
# Check Redis memory and hit-rate stats
redis-cli INFO memory   # used_memory, maxmemory
redis-cli INFO stats    # evicted_keys, keyspace_hits, keyspace_misses

# Key metrics:
# - used_memory: Current usage
# - maxmemory: Configured limit
# - evicted_keys: How many keys were evicted (LRU)
# - keyspace_hits / keyspace_misses: Hit rate
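
The same counters are available programmatically; a small redis-py sketch (used_memory and maxmemory come from the memory section of INFO, the hit/miss and eviction counters from the stats section):

import redis

r = redis.Redis()

mem = r.info("memory")
stats = r.info("stats")

hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0

print(f"memory: {mem['used_memory_human']} used, maxmemory {mem['maxmemory_human']}")
print(f"evicted_keys: {stats['evicted_keys']}, hit rate: {hit_rate:.1%}")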

When Per-Stage Caching Helps Most

✅ High query repetition - Same queries asked frequently
✅ Model experimentation - Testing different rerank/generation models
✅ Frequent document updates - Inference cache remains valid
✅ Cross-pipeline workloads - Shared embeddings across retrievers
✅ Expensive inference - GPU-based embedding models

When It Helps Less

⚠️ Unique queries every time - No repeated patterns to cache
⚠️ Real-time data - Documents change every second
⚠️ Simple pipelines - Single-stage retrievers (less benefit)

Security & Compliance

Namespace Isolation
All cache keys include the namespace ID for multi-tenancy security:
stage:inference:ns_acme:hash_abc123
stage:knn_search:ns_acme:hash_def456
Each tenant’s cache is completely isolated.
TLS Support
Use TLS for encrypted Redis connections:
# Use rediss:// protocol (note the extra 's')
export REDIS_URL="rediss://user:pass@prod-redis:6379"
GDPR Compliance
LRU eviction supports the right to be forgotten:
  • Cached data expires automatically when evicted
  • No need for manual cleanup
  • Bounded retention (determined by memory limit)

Real-World Examples

Example 1: Simple Query with Full Cache Hit

# First request - all stages miss
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_acme" \
  -d '{"inputs": {"text": "machine learning tutorials"}}'

# Internally:
# Stage 1 (Inference): MISS → Generate embedding (100ms)
# Stage 2 (Search): MISS → Query vectors (1500ms)
# Stage 3 (Rerank): MISS → Rerank results (200ms)
# Total: ~1800ms

# Second request (same query)
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_acme" \
  -d '{"inputs": {"text": "machine learning tutorials"}}'

# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms)
# Stage 2 (Search): HIT → Cached results (0.4ms)
# Stage 3 (Rerank): HIT → Cached ranking (0.4ms)
# Total: ~1.2ms (1500x faster!)

Example 2: Partial Cache Hit (Model Change)

# User changes rerank model in retriever configuration
# Same query as before

curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_acme" \
  -d '{"inputs": {"text": "machine learning tutorials"}}'

# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms) ✅
# Stage 2 (Search): HIT → Cached results (0.4ms) ✅
# Stage 3 (Rerank): MISS → New model, must rerank (200ms) ❌
# Total: ~201ms (9x faster than full miss!)

Example 3: Partial Cache Hit (Document Update)

# User adds 100 new documents to collection
# Webhook triggers cache invalidation for search & rerank stages

curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_acme" \
  -d '{"inputs": {"text": "machine learning tutorials"}}'

# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms) ✅
# Stage 2 (Search): MISS → Index changed, must search (1500ms) ❌
# Stage 3 (Rerank): MISS → Doc set changed (200ms) ❌
# Total: ~1701ms (but saved 100ms from inference cache!)

Example 4: Cross-Pipeline Cache Sharing

# Pipeline A: embed → search → rerank
# Pipeline B: embed → search → LLM generate

# Execute Pipeline A
curl https://api.mixpeek.com/v1/retrievers/ret_pipeline_a/execute \
  -d '{"inputs": {"text": "AI safety"}}'

# Pipeline A:
# Inference: MISS → cache
# Search: MISS → cache
# Rerank: MISS → cache

# Execute Pipeline B (same query, different last stage)
curl https://api.mixpeek.com/v1/retrievers/ret_pipeline_b/execute \
  -d '{"inputs": {"text": "AI safety"}}'

# Pipeline B:
# Inference: HIT → shared with Pipeline A! ✅
# Search: HIT → shared with Pipeline A! ✅
# LLM Generate: MISS → different stage ❌

# Result: Saved 89% of compute by sharing 2/3 stages

Summary

Key Takeaways

  • Per-stage caching enables partial cache hits
  • LRU eviction eliminates TTL tuning
  • Automatic operation - no configuration needed
  • Cross-pipeline sharing - embeddings reused across retrievers
  • Smart invalidation - only affected stages are cleared

When to Expect Big Wins

  • 🔥 High query repetition (80%+ hit rate possible)
  • 🔥 Model experimentation (inference cache persists)
  • 🔥 Frequent updates (inference cache unaffected)
  • 🔥 Multiple pipelines (shared embedding cache)