## Credit Consumption Model

- **Ingestion Costs:** Feature extraction (embeddings, OCR, transcription) and document writes
- **Retrieval Costs:** Vector searches, hybrid fusion, reranking, and LLM generation stages
- **Storage Costs:** Document payloads, vectors, and cached results in Qdrant/Redis
- **External Costs:** Web search API calls and third-party model inference (OpenAI, Cohere)
## Cost Breakdown by Operation
| Operation | Credit Cost | Optimization Leverage |
|---|---|---|
| Document creation | 1 credit | Low (required) |
| Text embedding (base) | 1 credit | Medium (model choice) |
| Text embedding (large) | 5 credits | High (model choice) |
| LLM generation (small) | 10-50 credits | High (prompt optimization) |
| LLM generation (large) | 50-500 credits | Very High (model, tokens) |
| KNN vector search | 0.1 credits | Low (efficient) |
| Hybrid search (RRF) | 0.2 credits | Low (efficient) |
| Reranking (cross-encoder) | 2-5 credits per doc | High (limit top-K) |
| Web search | 10 credits per query | High (cache aggressively) |
| OCR (per page) | 2-5 credits | Medium (resolution, model) |
| Video transcription (per min) | 5-10 credits | Medium (model choice) |
| Storage (per GB/month) | 100 credits | Medium (retention policies) |
## Ingestion Optimization

### 1. Choose Efficient Models
**Embeddings:**

| Model | Credits | Use Case |
|---|---|---|
| multilingual-e5-base | 1 | High-volume, cost-sensitive |
| multilingual-e5-large | 5 | Balanced accuracy/cost |
| openai/text-embedding-3-large | 10 | Premium quality only |
**Transcription:**

| Model | Credits/min | Use Case |
|---|---|---|
| whisper-base | 3 | Fast, moderate accuracy |
| whisper-large-v3 | 10 | High accuracy, slower |
### 2. Deduplicate Before Ingestion

Avoid processing identical content; a hash-based skip is sketched below.
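A minimal sketch of content-hash deduplication; `ingest` is a placeholder for the platform's actual ingestion call, which is assumed here.

```python
import hashlib

seen_hashes: set[str] = set()

def content_hash(text: str) -> str:
    # Normalize whitespace and case so trivially reformatted copies hash identically
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def ingest(doc: dict) -> None:
    ...  # placeholder for the real ingestion call (embeddings, OCR, writes)

def ingest_if_new(doc: dict) -> bool:
    h = content_hash(doc["text"])
    if h in seen_hashes:
        return False  # duplicate: no extraction credits spent
    seen_hashes.add(h)
    ingest(doc)
    return True
```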
### 3. Optimize Chunking

Fewer chunks = lower cost; the arithmetic is sketched below.
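A sketch of the chunk-count arithmetic, approximating tokens as list items; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk(tokens: list[str], size: int) -> list[list[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = ["tok"] * 100_000          # a ~100K-token corpus
print(len(chunk(tokens, 128)))      # 782 chunks -> 782 embedding calls
print(len(chunk(tokens, 512)))      # 196 chunks -> ~4x fewer calls
```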
### 4. Selective Feature Extraction

Only extract features you'll query; see the sketch below.
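An illustrative extractor configuration; the keys are hypothetical, not the platform's actual schema. The point is that a disabled feature never incurs extraction credits.

```python
# Hypothetical configuration shape, for illustration only
extractor_config = {
    "text_embedding": {"enabled": True, "model": "multilingual-e5-base"},
    "ocr": {"enabled": True},                 # needed: scanned PDFs are queried
    "image_captioning": {"enabled": False},   # never queried -> skip the credits
    "video_transcription": {"enabled": False},
}
```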
### 5. Batch Efficiently

Larger batches amortize per-request overhead; see the sketch below.
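A sketch of batched submission; `submit_batch` stands in for an assumed bulk-ingest endpoint.

```python
from typing import Iterator

def batches(docs: list[dict], size: int = 1000) -> Iterator[list[dict]]:
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def submit_batch(batch: list[dict]) -> None:
    ...  # placeholder for a bulk-ingest request (assumed API)

all_docs = [{"id": i} for i in range(5000)]
for batch in batches(all_docs, size=1000):  # 5 requests instead of 5000
    submit_batch(batch)
```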
### 6. Incremental Updates

Re-extract only changed content; per-field change detection is sketched below.
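A sketch of change detection via per-field hashes; where the stored hashes live is left to your document store.

```python
import hashlib

def field_hash(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

def changed_fields(new_doc: dict[str, str], stored: dict[str, str]) -> list[str]:
    # `stored` maps field name -> hash recorded at the last extraction
    return [f for f, v in new_doc.items() if field_hash(v) != stored.get(f)]

stored = {"title": field_hash("Blue Shoe"), "description": field_hash("old text")}
doc = {"title": "Blue Shoe", "description": "new text"}
print(changed_fields(doc, stored))  # ['description'] -> re-embed only this field
```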
## Retrieval Optimization

### 1. Cache Aggressively

Cache expensive stages to avoid re-execution; see the sketch below.
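A minimal in-process TTL cache to show the mechanic; in this stack the production equivalent is Redis, which already holds cached results.

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self.store.get(key)
        if entry and time.monotonic() - entry[0] < self.ttl:
            return entry[1]  # hit: zero retrieval credits spent
        return None

    def put(self, key: str, value: object) -> None:
        self.store[key] = (time.monotonic(), value)

cache = TTLCache(ttl_seconds=300)
```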
### 2. Limit LLM Token Usage

Generation is the most expensive stage (10-500 credits per call); unbounded prompts and outputs are the costly pattern, and a cheaper configuration is contrasted below.
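Illustrative request parameters (the names are hypothetical) contrasting the expensive and the economical configuration; the credit figures come from the cost table above.

```python
# Hypothetical parameter names, for illustration only
expensive = {
    "model": "gpt-4o",        # large-model generation: 50-500 credits per call
    "context_docs": 50,        # entire result set stuffed into the prompt
    "max_tokens": 4096,        # unbounded-feeling output budget
}
efficient = {
    "model": "gpt-4o-mini",   # ~20 credits per call
    "context_docs": 5,         # only the top-ranked documents
    "max_tokens": 512,         # capped output
}
```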
### 3. Rerank Only Top Candidates

Cross-encoder reranking costs 2-5 credits per document, so reranking the full candidate set is expensive; cap it at a small top-K, as the arithmetic below shows.
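The cost arithmetic, using the table's upper bound of 5 credits per reranked document:

```python
RERANK_COST_PER_DOC = 5  # upper bound from the cost table

def rerank_cost(candidates: int) -> int:
    return candidates * RERANK_COST_PER_DOC

print(rerank_cost(100))  # 500 credits: reranking everything
print(rerank_cost(10))   # 50 credits: rerank only the top 10
```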
### 4. Filter Before Search

Apply cheap metadata filters before expensive vector operations; see the sketch below.
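A sketch using the qdrant-client library, since Qdrant backs the vector store here; the collection name, URL, and query vector are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
hits = client.search(
    collection_name="products",          # assumed collection name
    query_vector=[0.1] * 768,            # placeholder for the query embedding
    query_filter=Filter(                 # cheap filter narrows candidates first
        must=[FieldCondition(key="category", match=MatchValue(value="shoes"))]
    ),
    limit=50,
)
```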
### 5. Use Budget Limits

Prevent runaway costs by capping what a single query may spend; see the sketch below.
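A minimal guard; the stage names and the idea of pre-estimating credits are illustrative, not a documented API.

```python
MAX_CREDITS_PER_QUERY = 100  # assumed per-query budget

def check_budget(stage_estimates: dict[str, float]) -> None:
    # Reject a stage plan whose worst-case credit estimate exceeds the budget
    total = sum(stage_estimates.values())
    if total > MAX_CREDITS_PER_QUERY:
        raise RuntimeError(f"estimated {total} credits exceeds budget")

check_budget({"knn_search": 0.1, "rerank_top10": 50, "llm_summary": 20})
```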
### 6. Avoid Web Search for Common Queries

At 10 credits per query, repeated web searches are expensive; serve common queries from a cache, as sketched below.
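A sketch of query-normalized caching; `web_search` is a placeholder for the external API call.

```python
def web_search(query: str) -> list[dict]:
    ...  # placeholder for the external call (10 credits per query)
    return []

cache: dict[str, list[dict]] = {}

def cached_web_search(query: str) -> list[dict]:
    key = " ".join(query.lower().split())  # normalize to raise the hit rate
    if key not in cache:
        cache[key] = web_search(key)       # only novel queries spend credits
    return cache[key]
```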
## Storage Optimization

### 1. Enable Payload Selection

Don't store full text in Qdrant if only metadata is needed for filtering; see the sketch below.
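A sketch with qdrant-client: only filterable metadata goes into the point payload, while full text stays in the primary document store. Names and values are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
client.upsert(
    collection_name="products",  # assumed collection name
    points=[PointStruct(
        id=1,
        vector=[0.1] * 768,      # placeholder embedding
        payload={"category": "shoes", "price": 59.0},  # no full text stored here
    )],
)
```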
### 2. Set Retention Policies

Auto-delete documents past their retention window; see the sketch below.
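An illustrative retention sweep; the delete call is commented out because the actual deletion API is assumed, not documented here.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # assumed retention window
cutoff = datetime.now(timezone.utc) - timedelta(days=RETENTION_DAYS)
# delete_documents(filter={"ingested_before": cutoff.isoformat()})  # hypothetical call
print(f"would delete documents ingested before {cutoff:%Y-%m-%d}")
```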
### 3. Compress Metadata

Store compact representations instead of verbose payloads; see the sketch below.
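A small, runnable comparison of verbose versus coded payloads; storage is billed at 100 credits per GB/month, so per-document bytes compound.

```python
import json

verbose = {"document_category_name": "electronics", "inventory_status": "in_stock"}
compact = {"cat": 3, "stk": 1}  # integer codes resolved via a small lookup table

print(len(json.dumps(verbose)), "->", len(json.dumps(compact)), "bytes")
```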
### 4. Use Sparse Vectors Selectively

Hybrid search requires both dense and sparse vectors, so enable sparse vectors only on collections that actually run hybrid queries; see the sketch below.
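A sketch with qdrant-client: the hybrid collection declares a sparse vector space, while dense-only collections would simply omit `sparse_vectors_config` and skip the extra storage. Names are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, SparseVectorParams, VectorParams

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
client.create_collection(
    collection_name="support_docs",  # hybrid search: needs both vector types
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    sparse_vectors_config={"bm25": SparseVectorParams()},
)
```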
## Monitoring & Optimization

### 1. Track Credit Consumption

Break spend down by operation type (ingestion, retrieval, storage, external) so trends are visible before they become overruns.

### 2. Identify High-Cost Retrievers

Rank retrievers by credits per query and focus optimization on the top consumers.

### 3. Audit Extractor Performance

Compare each extractor's credit cost against how often its output is actually queried.

### 4. Set Cost Alerts

Configure webhooks to alert at 80% of budget; an illustrative rule is sketched below.
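An illustrative alert rule; the schema is hypothetical and stands in for whatever the actual webhook configuration looks like.

```python
# Hypothetical alert-rule shape, for illustration only
alert_rule = {
    "metric": "credits_consumed",
    "threshold_pct": 80,       # fire at 80% of the monthly budget
    "window": "monthly",
    "webhook_url": "https://example.com/hooks/credit-alert",
}
```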
## Cost-Performance Trade-Offs

### Scenario 1: High-Volume Product Search
**Goal:** 1M searches/month, <500ms p95 latency, $1000 budget

**Strategy:**
- Use `multilingual-e5-base` embeddings (1 credit vs 5)
- Enable aggressive caching (TTL=900s)
- Limit to 50 results, no reranking
- Pre-filter by category to reduce search scope
### Scenario 2: Research Assistant

**Goal:** Best accuracy, 10K queries/month, $2000 budget

**Strategy:**
- Use `openai/text-embedding-3-large` embeddings (10 credits)
- Rerank top 20 with cross-encoder (5 credits each)
- Generate summaries with `gpt-4o-mini` (20 credits)
- Cache LLM outputs (TTL=3600s)
### Scenario 3: Document Ingestion (1TB)

**Goal:** Index 1TB of PDFs, $5000 budget

**Strategy:**
- Deduplicate by content hash (reduces volume by 30%)
- Use a base-tier OCR model for scanned pages
- Chunk at 512 tokens (paragraph-level)
- Process in batches of 1000 documents
- Disable image extraction if not needed
## ROI Analysis

Track cost against business value; see the sketch below.
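A sketch of the calculation; the credit-to-dollar rate is an assumed placeholder, not published pricing.

```python
CREDIT_PRICE_USD = 0.001  # assumed conversion rate, for illustration only

def roi(credits_spent: float, revenue_usd: float) -> float:
    cost = credits_spent * CREDIT_PRICE_USD
    return (revenue_usd - cost) / cost

# 1M product searches at ~0.3 credits each driving $1,500 in attributed sales:
print(f"{roi(300_000, 1_500):.1f}x")  # 4.0x return on retrieval spend
```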
## Quick Wins Checklist

1. **Enable caching on all retrievers.** Start with TTL=300s and adjust based on hit rate.
2. **Switch to base models for non-critical workloads.** Use `multilingual-e5-base` instead of large where the accuracy delta is <5%.
3. **Set budget limits on exploratory retrievers.** Prevent runaway LLM costs from research and debugging queries.
4. **Deduplicate objects before ingestion.** Hash content and skip processing of identical documents.
5. **Optimize LLM prompts.** Reduce `max_tokens`, truncate inputs, and use smaller models.
6. **Filter before search.** Apply metadata filters before vector operations.
7. **Review top credit consumers monthly.** Use analytics to identify and optimize high-cost operations.
## Next Steps
- Monitor with Analytics Overview
- Understand limits via Rate Limits & Quotas
- Optimize caching with Caching Strategies
- Review Feature Extractors model options

