Mixpeek uses per-stage caching: each stage of your retriever pipeline (inference, search, reranking) caches independently. This means you can get partial cache hits even when only part of your pipeline changes, dramatically reducing compute costs.
Overview
- Architecture: Per-stage caching (each stage caches independently)
- Key Benefit: Partial cache hits save compute even when pipeline changes
- Memory Management: LRU eviction (no TTL tuning needed)
- Performance: Sub-millisecond per-stage cache hits (~1ms for a fully cached pipeline), 80%+ cost reduction
- Backend: Redis with automatic eviction
How It Works
The Problem with Traditional Caching
Traditional retriever caching is all-or-nothing:
Query: "dogs on skateboards"
Pipeline: Embed Query → Vector Search → Rerank → Return
Cache Key: hash(entire_pipeline)
❌ Change rerank model?
→ Full cache miss
→ Re-embed query (expensive GPU call)
→ Re-search vectors (expensive DB query)
→ Re-rerank with new model
Total waste: Embedding + Vector search compute
The Per-Stage Solution
With per-stage caching, each stage manages its own cache:
Query: "dogs on skateboards"
Stage 1: Inference (Embedding)
Cache Key: query_text + model_config
✅ HIT → Reuse cached embedding
Stage 2: Vector Search
Cache Key: embedding + filters + collection
✅ HIT → Reuse cached search results
Stage 3: Reranking
Cache Key: doc_ids + rerank_model
❌ MISS → Only rerank (model changed!)
Result: Saved 95% of compute by reusing stages 1 & 2!
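For intuition, here is a minimal Python sketch of the idea (hypothetical helper names, not the Mixpeek SDK): each stage derives a cache key from only the inputs it depends on, checks its own cache, and stores its own result.
# Minimal sketch of a per-stage lookup (illustrative only)
import hashlib
import json

def stage_cache_key(stage_name: str, namespace: str, inputs: dict) -> str:
    """Hash only the inputs this stage depends on, scoped to the namespace."""
    payload = json.dumps(inputs, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return f"stage:{stage_name}:{namespace}:hash_{digest}"

def run_stage(cache: dict, stage_name: str, namespace: str, inputs: dict, compute):
    """Check this stage's cache before executing; store the result afterwards."""
    key = stage_cache_key(stage_name, namespace, inputs)
    if key in cache:            # HIT: reuse the previous result
        return cache[key]
    result = compute(inputs)    # MISS: run only this stage
    cache[key] = result
    return result
Because the rerank stage's key does not include the embedding inputs (and vice versa), changing one stage's configuration leaves the other stages' cached entries untouched.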
Automatic Operation
Stage caching happens automatically — you don’t need to configure anything. Each stage checks its cache before executing and stores results after execution.
# Execute a retriever (caching happens transparently)
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_acme" \
-d '{
"inputs": {"text": "machine learning"},
"limit": 10
}'
# First request: All stages MISS
# - Inference: Generate embedding (~100ms)
# - KNN: Search vectors (~1500ms)
# - Rerank: Rerank results (~200ms)
# Total: ~1800ms
# Second request: All stages HIT
# - Inference: Cached embedding (~0.4ms)
# - KNN: Cached results (~0.4ms)
# - Rerank: Cached ranking (~0.4ms)
# Total: ~1.2ms (1500x faster!)
Stage-Specific Caching
Inference Stage (Embeddings)
What gets cached: Text, image, and video embeddings
Cache key: input_data + model_config
When to invalidate: Rarely (only when embedding model changes)
Embeddings are deterministic — the same input with the same model always produces the same output. This makes inference caching extremely effective with near-zero invalidation.
// Cache entry example
{
"key": "stage:inference:ns_acme:hash_abc123",
"value": [0.123, 0.456, 0.789, ...], // 1536-dim embedding
"inputs": {
"text": "dogs on skateboards",
"modality": "text"
},
"config": {
"model": "text-embedding-3-small",
"dimensions": 1536
}
}
Cache hit scenarios:
- ✅ Same query text
- ✅ Same image URL or file
- ✅ Same video URL or file
- ✅ Same model configuration
Cache miss scenarios:
- ❌ Different query text
- ❌ Different model name or version
- ❌ Different embedding dimensions
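A compact sketch of these hit/miss rules, assuming the key is a hash of the input plus the model configuration (the real key format is shown in the cache entry above):
# Sketch: the inference key depends only on the input and the embedding model config
import hashlib
import json

def inference_key(namespace: str, text: str, model: str, dimensions: int) -> str:
    payload = json.dumps(
        {"text": text, "model": model, "dimensions": dimensions}, sort_keys=True
    )
    return f"stage:inference:{namespace}:hash_{hashlib.sha256(payload.encode()).hexdigest()[:12]}"

k1 = inference_key("ns_acme", "dogs on skateboards", "text-embedding-3-small", 1536)
k2 = inference_key("ns_acme", "dogs on skateboards", "text-embedding-3-small", 1536)
k3 = inference_key("ns_acme", "dogs on skateboards", "text-embedding-3-large", 3072)

assert k1 == k2   # same text + same model config: HIT
assert k1 != k3   # different model or dimensions: MISS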
Vector Search Stage (KNN)
What gets cached: Document IDs and similarity scores from vector search
Cache key: embedding + filters + collections + limit
When to invalidate: When documents are added, updated, or deleted
// Cache entry example
{
"key": "stage:knn_search:ns_acme:hash_def456",
"value": [
{"document_id": "doc_123", "score": 0.95},
{"document_id": "doc_456", "score": 0.87},
{"document_id": "doc_789", "score": 0.82}
],
"inputs": {
"embedding": [0.123, 0.456, ...],
"collection_ids": ["col_articles"],
"filters": {"category": "tech"},
"limit": 10
}
}
Cache hit scenarios:
- ✅ Same embedding vector
- ✅ Same filters
- ✅ Same collection set
- ✅ Same limit/offset
Cache miss scenarios:
- ❌ Different embedding
- ❌ Documents added/updated/deleted in collection
- ❌ Different filters
- ❌ Different limit
Vector search cache is automatically invalidated when documents change via webhook events. This ensures you never serve stale results.
Reranking Stage
What gets cached: Reranked document IDs and scores
Cache key: document_ids + query + rerank_model + config
When to invalidate: Rarely (only when rerank model/config changes)
// Cache entry example
{
"key": "stage:rerank:ns_acme:hash_ghi789",
"value": [
{"document_id": "doc_789", "score": 0.99},
{"document_id": "doc_123", "score": 0.94},
{"document_id": "doc_456", "score": 0.88}
],
"inputs": {
"document_ids": ["doc_123", "doc_456", "doc_789"],
"query": "dogs on skateboards",
"strategy": "cross_encoder"
},
"config": {
"model": "ms-marco-MiniLM-L-12-v2",
"normalize": true
}
}
Cache hit scenarios:
- ✅ Same document set (order-independent; see the sketch after these lists)
- ✅ Same query text
- ✅ Same rerank strategy and model
- ✅ Same configuration
Cache miss scenarios:
- ❌ Different documents
- ❌ Different query
- ❌ Different rerank model
- ❌ Different strategy (e.g., cross-encoder → RRF)
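The order-independence above can be realized by canonicalizing the document set before hashing. A sketch of that approach (our assumption for illustration, not necessarily the exact internal implementation):
# Sketch: sorting document IDs makes the rerank key order-independent
import hashlib
import json

def rerank_key(namespace: str, document_ids: list, query: str, model: str) -> str:
    payload = json.dumps(
        {"document_ids": sorted(document_ids), "query": query, "model": model},
        sort_keys=True,
    )
    return f"stage:rerank:{namespace}:hash_{hashlib.sha256(payload.encode()).hexdigest()[:12]}"

a = rerank_key("ns_acme", ["doc_123", "doc_456", "doc_789"],
               "dogs on skateboards", "ms-marco-MiniLM-L-12-v2")
b = rerank_key("ns_acme", ["doc_789", "doc_123", "doc_456"],
               "dogs on skateboards", "ms-marco-MiniLM-L-12-v2")
assert a == b   # same document set, different order: same cache entry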
Why Per-Stage Caching Matters
Scenario 1: Changing Rerank Models
You want to experiment with different reranking models to improve relevance:
Initial Pipeline:
Embed → Search → Rerank (ms-marco)
User changes to:
Embed → Search → Rerank (bge-reranker)
With Traditional Caching:
❌ Full pipeline cache miss
→ Re-embed query ($$$)
→ Re-search vectors ($$)
→ Rerank with new model ($)
Total: 100% compute used
With Per-Stage Caching:
✅ Inference cache HIT (embedding unchanged)
✅ Vector search cache HIT (results unchanged)
❌ Rerank cache MISS (model changed)
Total: Only 5% compute used!
Scenario 2: Adding Documents
You ingest new documents into your collection:
Event: User adds 1,000 new documents
With Traditional Caching:
❌ Invalidate ALL cached queries
→ Every subsequent query is a full miss
→ Re-embed + re-search + re-rerank
Total: 100% compute for every query
With Per-Stage Caching:
✅ Inference cache unchanged (embeddings are deterministic)
❌ Vector search cache invalidated (index changed)
❌ Rerank cache invalidated (document set changed)
Next query:
✅ Inference: HIT (saved 60% of cost)
❌ Search: MISS
❌ Rerank: MISS
Total: Only 40% compute used
Scenario 3: Cross-Pipeline Reuse
You have multiple retrievers using the same embedding model:
Pipeline A: Embed (e5-small) → Search → Rerank
Pipeline B: Embed (e5-small) → Search → LLM Generate
Query: "artificial intelligence" via Pipeline A
→ Inference: MISS → cache embedding
→ Search: MISS → cache
→ Rerank: MISS → cache
Same query via Pipeline B:
→ Inference: ✅ HIT (shared with A!)
→ Search: ✅ HIT (shared with A!)
→ LLM Generate: MISS (different stage)
Result: Saved 66% of compute by sharing inference + search cache
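This sharing falls out naturally if the inference key is derived from the query text and model only, with no pipeline or retriever ID in it. A small, self-contained sketch of that assumption:
# Sketch: two pipelines embedding with the same model hit the same cache entry
import hashlib
import json

cache = {}
gpu_calls = 0

def cached_embedding(text: str, model: str):
    global gpu_calls
    key = "stage:inference:ns_acme:hash_" + hashlib.sha256(
        json.dumps({"text": text, "model": model}, sort_keys=True).encode()
    ).hexdigest()[:12]
    if key not in cache:
        gpu_calls += 1                 # the expensive call happens only on a MISS
        cache[key] = [0.1, 0.2, 0.3]   # placeholder embedding
    return cache[key]

cached_embedding("artificial intelligence", "e5-small")   # Pipeline A: MISS, computes
cached_embedding("artificial intelligence", "e5-small")   # Pipeline B: HIT, reuses
assert gpu_calls == 1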
Cache Invalidation
Automatic Invalidation
Mixpeek automatically invalidates stage caches based on data changes:
| Event | Stages Invalidated | Reason |
|---|---|---|
| Document added | Vector Search, Rerank | Index contents changed |
| Document updated | Vector Search, Rerank | Document set changed |
| Document deleted | Vector Search, Rerank | Index contents changed |
| Embedding model changed | Inference | Different model = different embeddings |
| Rerank model changed | Rerank | Different model = different scores |
Inference cache is almost never invalidated because embeddings are deterministic. The same input always produces the same output for a given model.
Webhook-Driven Invalidation
When you configure webhooks, cache invalidation happens automatically:
// Document update webhook
{
"event": "document.updated",
"payload": {
"namespace_id": "ns_acme",
"collection_id": "col_articles",
"document_id": "doc_123"
}
}
// Automatic cache invalidation:
// ❌ Vector Search cache: Cleared for col_articles
// ❌ Rerank cache: Cleared for col_articles
// ✅ Inference cache: Unchanged (embeddings still valid)
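The handler behind this is conceptually simple. A hedged sketch in Python, assuming Redis key patterns like the ones shown earlier (the real service scopes invalidation to the affected collection; this sketch clears at namespace scope for brevity):
# Hypothetical webhook handler (not the Mixpeek platform's internal code)
import redis  # pip install redis

r = redis.Redis()

def on_document_updated(payload: dict) -> None:
    ns = payload["namespace_id"]
    # Search results and rankings may now be stale...
    for pattern in (f"stage:knn_search:{ns}:*", f"stage:rerank:{ns}:*"):
        for key in r.scan_iter(match=pattern):
            r.delete(key)
    # ...but cached embeddings are still valid, so the inference cache is left alone.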
Invalidation by Stage
Each stage has its own invalidation logic:
Inference Stage:
- ✅ Almost never invalidated (only when the embedding model changes)
- Embeddings are deterministic for a given model
Vector Search Stage:
- ❌ Invalidated on document changes
- When: add/update/delete documents
- Scope: All searches for that collection
Rerank Stage:
- ❌ Rarely invalidated
- When: Rerank model/config changes
- Scope: All rerank operations with that model
Memory Management (LRU)
Why LRU Instead of TTL?
Traditional caching uses TTL (time-to-live) where entries expire after a fixed duration. Per-stage caching uses LRU (Least Recently Used) eviction instead:
| Approach | Behavior | Result |
|---|---|---|
| TTL | Popular queries expire arbitrarily | ❌ Wasted cache space |
| TTL | Unpopular queries waste memory | ❌ No automatic cleanup |
| TTL | Hard to tune (1hr? 1day?) | ❌ Guesswork |
| LRU | Most-used stays cached | ✅ Automatic optimization |
| LRU | Least-used auto-evicted | ✅ Self-cleaning |
| LRU | Bounded memory usage | ✅ Predictable costs |
How LRU Works
Redis Memory: 1GB (maxmemory limit - see Redis eviction policies)
Cache fills up:
Inference: 500MB (83K embeddings)
Search: 300MB (60K result sets)
Rerank: 200MB (200K rankings)
Total: 1GB (at limit)
New cache entry needs space:
1. Redis identifies least recently used key
2. Evicts that key automatically
3. Stores new entry
4. No manual intervention needed!
Result: Most popular queries stay cached naturally
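If you run your own Redis, the two settings that produce this behavior are maxmemory and maxmemory-policy allkeys-lru. A minimal sketch using redis-py (the 1GB value is illustrative):
# Sketch: bounding memory and enabling LRU eviction via redis-py
# (equivalent redis.conf directives: maxmemory 1gb / maxmemory-policy allkeys-lru)
import redis

r = redis.Redis()
r.config_set("maxmemory", "1gb")                 # hard ceiling on cache memory
r.config_set("maxmemory-policy", "allkeys-lru")  # evict least-recently-used keys first
print(r.config_get("maxmemory-policy"))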
Memory Distribution
Typical allocation for a retriever pipeline:
Total Redis: 1GB
Inference cache: 500MB (50%) ← Largest (embeddings are big)
Search cache: 300MB (30%) ← Medium (doc IDs + scores)
Rerank cache: 200MB (20%) ← Smallest (reordered IDs)
Adjust memory allocation based on your workload:
- Heavy embedding usage → allocate more to inference
- Complex filters → allocate more to search
- Multiple rerank models → allocate more to rerank
Benchmarks
Per-stage cache performance compared to uncached operations:
| Stage | Cache HIT | Uncached | Speedup |
|---|---|---|---|
| Inference (embedding) | ~0.4ms | ~100ms | 250x faster |
| Vector Search (KNN) | ~0.4ms | ~1500ms | 3750x faster |
| Reranking | ~0.4ms | ~200ms | 500x faster |
| Full Pipeline (all HITs) | ~1.2ms | ~1800ms | 1500x faster |
Partial Cache Hit Benefits
Even when some stages miss, you still save compute:
| Scenario | Stages | Time | Savings |
|---|---|---|---|
| All cached | ✅✅✅ | ~1.2ms | 99.9% |
| Inference + Search cached | ✅✅❌ | ~201ms | 89% |
| Only Inference cached | ✅❌❌ | ~1701ms | 6% |
| Full miss | ❌❌❌ | ~1800ms | 0% (baseline) |
Even a single stage cache hit saves significant compute. Inference caching alone saves ~100ms per query!
Cost Savings
Example for 1M queries/day:
Without caching:
1M queries × $0.002/query = $2,000/day
With 80% hit rate (all stages cached):
800K cached queries ≈ $0 (just Redis overhead)
200K uncached queries × $0.002 = $400/day
Savings: $1,600/day = $48K/month = $576K/year
With partial cache hits (20% full hit, 60% partial hit, 20% full miss):
200K full hits: ~$0
600K partial hits (inference cached): $600/day
200K full miss: $400/day
Savings: $1,000/day = $30K/month = $360K/year
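The arithmetic behind these figures as a quick sketch; the per-query rate and the assumption that a cached embedding covers roughly half the per-query cost come from the example above:
# Sketch of the savings arithmetic above
queries_per_day = 1_000_000
cost_per_query = 0.002                                   # $ per fully uncached query
baseline = queries_per_day * cost_per_query              # $2,000/day

# 80% of queries fully cached:
spend = 0.20 * queries_per_day * cost_per_query          # $400/day
print(baseline - spend)                                  # $1,600/day saved

# 20% full hit, 60% partial hit (inference cached, about half the cost), 20% full miss:
spend = 0.60 * queries_per_day * cost_per_query * 0.5 \
      + 0.20 * queries_per_day * cost_per_query          # $600 + $400 = $1,000/day
print(baseline - spend)                                  # $1,000/day saved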
Memory Efficiency
| Stage | Entry Size | 1GB Capacity |
|---|---|---|
| Inference | ~6KB (1536-dim embedding) | ~170K embeddings |
| Vector Search | ~5KB (10 results) | ~200K result sets |
| Reranking | ~1KB (10 reranked IDs) | ~1M result sets |
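The capacity column follows directly from the entry sizes; for example, for embeddings (assuming float32 storage):
# Sketch of the capacity math for the inference cache (float32 assumed)
dims, bytes_per_float = 1536, 4
entry_size = dims * bytes_per_float        # 6144 bytes, roughly 6KB per embedding
print((1 * 1024**3) // entry_size)         # roughly 170K embeddings per 1GB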
Best Practices
Optimizing Cache Hit Rates
1. Use consistent query formatting
# These are DIFFERENT cache keys:
{"text": "dogs"} # One key
{"text": "Dogs"} # Different key (case-sensitive)
{"text": " dogs "} # Different key (whitespace)
# Normalize queries client-side for better hit rates
query = text.lower().strip()
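A slightly fuller version of that normalization, as a hypothetical client-side helper applied before calling the API:
def normalize_query(text: str) -> str:
    # lowercase and collapse all runs of whitespace so trivial variants share a cache key
    return " ".join(text.lower().split())

assert normalize_query("Dogs") == normalize_query("  dogs  ") == "dogs"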
2. Reuse embeddings across pipelines
If multiple retrievers use the same embedding model,
they automatically share inference cache!
Pipeline A: embed(text-embedding-3-small) → search → rerank
Pipeline B: embed(text-embedding-3-small) → search → generate
Both pipelines share the inference cache ✅
3. Monitor memory usage
# Check Redis memory stats
redis-cli INFO memory
# Key metrics:
# - used_memory: Current usage
# - maxmemory: Configured limit
# - evicted_keys: How many keys were evicted (LRU)
# - keyspace_hits/keyspace_misses: Hit rate
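The same counters are available programmatically; a sketch using redis-py:
# Sketch: computing the overall hit rate from Redis' own counters
import redis

r = redis.Redis()
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
print(f"hit rate: {hits / max(hits + misses, 1):.1%}")
print("evicted keys (LRU):", stats["evicted_keys"])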
When Per-Stage Caching Helps Most
✅ High query repetition - Same queries asked frequently
✅ Model experimentation - Testing different rerank/generation models
✅ Frequent document updates - Inference cache remains valid
✅ Cross-pipeline workloads - Shared embeddings across retrievers
✅ Expensive inference - GPU-based embedding models
When It Helps Less
⚠️ Unique queries every time - No repeated patterns to cache
⚠️ Real-time data - Documents change every second
⚠️ Simple pipelines - Single-stage retrievers (less benefit)
Security & Compliance
Namespace Isolation
All cache keys include namespace ID for multi-tenancy security:
stage:inference:ns_acme:hash_abc123
stage:knn_search:ns_acme:hash_def456
Each tenant’s cache is completely isolated.
TLS Support
Use TLS for encrypted Redis connections:
# Use rediss:// protocol (note the extra 's')
export REDIS_URL="rediss://user:pass@prod-redis:6379"
GDPR Compliance
LRU eviction supports right-to-be-forgotten:
- Cached data is removed automatically once evicted
- No need for manual cleanup
- Bounded retention (determined by memory limit)
Real-World Examples
Example 1: Simple Query with Full Cache Hit
# First request - all stages miss
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_acme" \
-d '{"inputs": {"text": "machine learning tutorials"}}'
# Internally:
# Stage 1 (Inference): MISS → Generate embedding (100ms)
# Stage 2 (Search): MISS → Query vectors (1500ms)
# Stage 3 (Rerank): MISS → Rerank results (200ms)
# Total: ~1800ms
# Second request (same query)
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_acme" \
-d '{"inputs": {"text": "machine learning tutorials"}}'
# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms)
# Stage 2 (Search): HIT → Cached results (0.4ms)
# Stage 3 (Rerank): HIT → Cached ranking (0.4ms)
# Total: ~1.2ms (1500x faster!)
Example 2: Partial Cache Hit (Model Change)
# User changes rerank model in retriever configuration
# Same query as before
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_acme" \
-d '{"inputs": {"text": "machine learning tutorials"}}'
# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms) ✅
# Stage 2 (Search): HIT → Cached results (0.4ms) ✅
# Stage 3 (Rerank): MISS → New model, must rerank (200ms) ❌
# Total: ~201ms (9x faster than full miss!)
Example 3: Partial Cache Hit (Document Update)
# User adds 100 new documents to collection
# Webhook triggers cache invalidation for search & rerank stages
curl https://api.mixpeek.com/v1/retrievers/ret_semantic/execute \
-H "Authorization: Bearer $API_KEY" \
-H "X-Namespace: ns_acme" \
-d '{"inputs": {"text": "machine learning tutorials"}}'
# Internally:
# Stage 1 (Inference): HIT → Cached embedding (0.4ms) ✅
# Stage 2 (Search): MISS → Index changed, must search (1500ms) ❌
# Stage 3 (Rerank): MISS → Doc set changed (200ms) ❌
# Total: ~1701ms (but saved 100ms from inference cache!)
Example 4: Cross-Pipeline Cache Sharing
# Pipeline A: embed → search → rerank
# Pipeline B: embed → search → LLM generate
# Execute Pipeline A
curl https://api.mixpeek.com/v1/retrievers/ret_pipeline_a/execute \
-d '{"inputs": {"text": "AI safety"}}'
# Pipeline A:
# Inference: MISS → cache
# Search: MISS → cache
# Rerank: MISS → cache
# Execute Pipeline B (same query, different last stage)
curl https://api.mixpeek.com/v1/retrievers/ret_pipeline_b/execute \
-d '{"inputs": {"text": "AI safety"}}'
# Pipeline B:
# Inference: HIT → shared with Pipeline A! ✅
# Search: HIT → shared with Pipeline A! ✅
# LLM Generate: MISS → different stage ❌
# Result: Saved ~66% of compute by sharing 2 of 3 stages with Pipeline A
Summary
Key Takeaways
✅ Per-stage caching enables partial cache hits
✅ LRU eviction eliminates TTL tuning
✅ Automatic operation - no configuration needed
✅ Cross-pipeline sharing - embeddings reused across retrievers
✅ Smart invalidation - only affected stages are cleared
When to Expect Big Wins
- 🔥 High query repetition (80%+ hit rate possible)
- 🔥 Model experimentation (inference cache persists)
- 🔥 Frequent updates (inference cache unaffected)
- 🔥 Multiple pipelines (shared embedding cache)