Rate Limiting Model
Mixpeek uses token bucket rate limiting with per-minute and per-second windows:
Request Rate Limits
Maximum requests per second (RPS) or per minute (RPM) per API key.
Credit Quotas
Monthly credit allocation for inference, storage, and compute operations.
Concurrent Requests
Maximum simultaneous in-flight requests per organization.
Resource Quotas
Limits on collections, documents, feature extractors, and batch sizes.
Rate Limit Tiers
| Tier | Requests/Min (RPM) | Requests/Sec (RPS) | Concurrent | Burst Allowance | 
|---|---|---|---|---|
| Free | 60 | 10 | 5 | 20 requests | 
| Pro | 600 | 100 | 50 | 200 requests | 
| Enterprise | Custom | Custom | Custom | Custom | 
Rate Limit Headers
Every API response includes rate limit metadata.
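The exact header names aren't preserved in this extract, so the sketch below assumes the conventional `X-RateLimit-*` naming and a hypothetical endpoint; check a live response for the actual fields:

```python
import requests

# Hypothetical endpoint and header names -- confirm against the API reference.
resp = requests.get(
    "https://api.mixpeek.com/v1/collections",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)

# Conventional X-RateLimit-* headers; actual names may differ.
print(resp.headers.get("X-RateLimit-Limit"))      # requests allowed in the window
print(resp.headers.get("X-RateLimit-Remaining"))  # requests left in the window
print(resp.headers.get("X-RateLimit-Reset"))      # when the window resets
```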
Credit Quotas
Credits are consumed by:
| Operation | Credit Cost |
|---|---|
| Document creation | 1 credit per document | 
| Inference call (embedding) | 1-5 credits depending on model | 
| Inference call (LLM generation) | 10-100 credits based on tokens | 
| Vector search (KNN) | 0.1 credits per query | 
| Hybrid search (RRF) | 0.2 credits per query (multiple vectors) | 
| Web search (external API) | 10 credits per query | 
| Clustering execution | 50-500 credits based on dataset size | 
| Storage (per GB/month) | 100 credits | 
Monitoring Credit Usage
Check remaining credits via the Usage API (see the sketch after this list). Usage alerts trigger at:
- 80% usage → warning
- 95% usage → critical
- 100% usage → operations blocked until reset or upgrade
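A polling sketch against a hypothetical `GET /v1/organizations/usage` endpoint (modeled on the per-key usage endpoint shown later in this page); the response field names are assumptions for illustration:

```python
import requests

# Hypothetical org-level usage endpoint -- confirm the exact path.
resp = requests.get(
    "https://api.mixpeek.com/v1/organizations/usage",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
usage = resp.json()

# Assumed response fields for illustration only.
used = usage["credits_used"]
quota = usage["credits_quota"]
pct = used / quota * 100
if pct >= 95:
    print(f"CRITICAL: {pct:.0f}% of monthly credits consumed")
elif pct >= 80:
    print(f"WARNING: {pct:.0f}% of monthly credits consumed")
```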
Resource Quotas
Per-Organization Limits
| Resource | Free Tier | Pro Tier | Enterprise | 
|---|---|---|---|
| Namespaces | 1 | 10 | Unlimited | 
| Collections | 5 | 50 | Unlimited | 
| Buckets | 5 | 50 | Unlimited | 
| Documents | 10,000 | 1,000,000 | Unlimited | 
| Retrievers | 3 | 50 | Unlimited | 
| Taxonomies | 2 | 20 | Unlimited | 
| Clusters | 1 | 10 | Unlimited | 
| API Keys | 2 | 10 | Unlimited | 
| Batch Size | 100 objects | 10,000 objects | Custom | 
Enforcement
When a quota is exceeded, the blocked operation fails with a `QuotaExceededError`; existing data remains queryable. Free up resources by deleting unused objects, or upgrade your tier to restore access.
Scaling Strategies
1. Optimize Request Patterns
Problem: Hitting RPM limits during peak traffic
Solutions:
- Batch operations – use `/batch` endpoints to group objects/documents
- Cache aggressively – enable `cache_config` on retrievers to reduce redundant searches
- Async processing – submit batches asynchronously, poll task status instead of blocking
- Load shedding – deprioritize non-critical operations during peak hours
2. Distribute Load Across API Keys
Problem: Single API key hitting concurrency limit
Solutions:
- Issue separate API keys per service/team
- Use key rotation for different application environments (staging, prod)
- Monitor per-key usage: `GET /v1/organizations/usage/api-keys/{key_id}`
3. Reduce Credit Consumption
Problem: Exceeding monthly credit quota
Solutions:
| High-Cost Operation | Optimization |
|---|---|
| LLM generation stages | Use smaller models (GPT-3.5 Turbo vs GPT-4), reduce max_tokens | 
| Frequent reprocessing | Implement incremental updates instead of full reindexing | 
| Large batch ingestion | Deduplicate objects before processing, filter out low-value content | 
| Exploratory searches | Apply pre-filters to reduce search scope, lower `limit` values |
| Web search stages | Cache results with long TTL, fallback to internal collections | 
4. Upgrade Tier
When to upgrade:
- Consistently hitting rate limits (more than three 429 errors per hour)
- Credit usage >90% with 10+ days left in billing cycle
- Need for higher concurrency or batch sizes
- Require custom SLAs or dedicated infrastructure
Handling Rate Limit Errors
Exponential Backoff
Implement retry logic with exponential backoff.
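A minimal sketch using Python's `requests`; honoring a `Retry-After` header is an assumption about how the API signals 429s:

```python
import random
import time

import requests

def request_with_backoff(method: str, url: str, max_retries: int = 5, **kwargs):
    """Retry 429 responses with exponential backoff and jitter."""
    for attempt in range(max_retries):
        resp = requests.request(method, url, **kwargs)
        if resp.status_code != 429:
            return resp
        # Honor Retry-After if present (assumed); otherwise back off 2^attempt seconds.
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.random()
        time.sleep(delay)
    raise RuntimeError(f"Still rate limited after {max_retries} retries: {url}")
```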
Circuit Breaker Pattern
Prevent cascading failures when rate limits are sustained.
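A minimal in-process sketch: after a run of rate-limit failures the breaker opens and short-circuits calls until a cooldown elapses. The thresholds are illustrative, not prescribed values:

```python
import time

class CircuitBreaker:
    """Open the circuit after repeated 429s; probe again after a cooldown."""

    def __init__(self, failure_threshold: int = 5, cooldown_seconds: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the circuit opened

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: allow a probe request once the cooldown has elapsed.
        return time.monotonic() - self.opened_at >= self.cooldown_seconds

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_rate_limit(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```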
Graceful Degradation
When rate limited, fall back to cached or reduced-quality results.
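A sketch of the fallback chain, with `do_search` and `RateLimitError` as hypothetical stand-ins for your real client wrapper:

```python
class RateLimitError(Exception):
    """Raised by the client wrapper on a 429 response (hypothetical)."""

_cache: dict = {}

def search_with_fallback(query: str, do_search) -> list:
    """Serve cached or reduced-quality results when the live path is rate limited.

    `do_search(query, limit)` is a caller-supplied function that raises
    RateLimitError on 429 -- a stand-in for your real search call.
    """
    try:
        results = do_search(query, limit=25)
        _cache[query] = results  # refresh cache on success
        return results
    except RateLimitError:
        if query in _cache:
            return _cache[query]  # possibly stale, but better than failing
        try:
            # Last resort: a cheaper, reduced-scope request.
            return do_search(query, limit=5)
        except RateLimitError:
            return []  # degrade to empty results rather than erroring
```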
Endpoint-Specific Limits
Some endpoints have additional constraints:
| Endpoint | Special Limit | Reason |
|---|---|---|
| Batch Submit | 1 submission per batch every 60s | Prevents duplicate processing | 
| Cluster Execution | 1 concurrent execution per cluster | Resource-intensive operation | 
| Web Search Stages | 10 queries per minute (external API limit) | Third-party rate limit passthrough | 
| LLM Generation | 100K tokens per minute | Model provider constraint | 
| Document List | Max 10,000 results per query | Pagination required for large collections | 
Monitoring & Alerting
Proactive Monitoring
Track these metrics to avoid surprises:
- Rate limit utilization – alert at 80% of RPM limit
- Credit burn rate – project end-of-month usage based on current trend
- Concurrent request count – alert when approaching tier limit
- 429 error frequency – spike indicates need for optimization or upgrade
Recommended Alerts
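The original alert definitions aren't included in this extract; a minimal sketch mirroring the thresholds listed above (the exact cutoffs are illustrative) might look like:

```python
# Hypothetical alert rules matching the monitoring guidance above; wire these
# into your own monitoring stack (Datadog, Prometheus, etc.).
RECOMMENDED_ALERTS = [
    {"metric": "rpm_utilization",     "threshold": 0.80, "severity": "warning"},
    {"metric": "credit_usage_pct",    "threshold": 0.80, "severity": "warning"},
    {"metric": "credit_usage_pct",    "threshold": 0.95, "severity": "critical"},
    {"metric": "concurrent_requests", "threshold": 0.90, "severity": "warning"},
    {"metric": "http_429_count",      "threshold": 3,    "severity": "warning"},  # per hour
]
```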
Best Practices
Implement client-side rate limiting
Don’t rely solely on server enforcement. Implement token bucket or leaky bucket algorithms in your client to smooth request distribution and avoid bursts.
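A minimal token bucket sketch; the rate and capacity values below simply mirror the Pro tier's 100 RPS / 200-request burst from the table above:

```python
import time

class TokenBucket:
    """Client-side token bucket: refill at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def acquire(self, tokens: float = 1.0) -> None:
        """Block until enough tokens are available, then consume them."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            if self.tokens >= tokens:
                self.tokens -= tokens
                return
            time.sleep((tokens - self.tokens) / self.rate)  # wait for refill

# Example: stay under the Pro tier's 100 RPS with a 200-request burst allowance.
bucket = TokenBucket(rate=100, capacity=200)
bucket.acquire()  # call before each API request
```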
Cache responses aggressively
Enable retriever-level caching with appropriate TTLs. For exploratory queries, cache for 5-15 minutes. For stable queries (e.g., product search), cache for hours.
Use batch operations
Single-object operations consume rate limit budget faster. Batch 10-100 operations per request when possible.
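A chunking sketch against a hypothetical `/batch` endpoint; the base URL and payload shape are assumptions, so check the API reference:

```python
import requests

API = "https://api.mixpeek.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def ingest_in_batches(objects: list, batch_size: int = 100) -> None:
    """Group objects into chunks so each request covers many operations."""
    for i in range(0, len(objects), batch_size):
        chunk = objects[i : i + batch_size]
        # Hypothetical batch endpoint/payload shape -- confirm before use.
        resp = requests.post(f"{API}/batch", headers=HEADERS, json={"objects": chunk})
        resp.raise_for_status()
```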
Monitor per-API-key usage
Isolate noisy services by assigning separate API keys. Throttle or upgrade only the high-volume keys instead of affecting the entire org.
Set budget limits on retrievers
Configure `budget_limits` to prevent runaway costs from exploratory or LLM-heavy pipelines.
Paginate large result sets
Use `offset` and `limit` parameters instead of requesting thousands of documents at once. This reduces latency and credit consumption; a minimal pagination loop is sketched below.
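The list endpoint path and response shape here are assumptions for illustration:

```python
import requests

API = "https://api.mixpeek.com/v1"  # hypothetical base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def iter_documents(collection_id: str, page_size: int = 100):
    """Yield documents page by page instead of one huge request."""
    offset = 0
    while True:
        # Hypothetical list endpoint and response shape -- confirm before use.
        resp = requests.get(
            f"{API}/collections/{collection_id}/documents",
            headers=HEADERS,
            params={"offset": offset, "limit": page_size},
        )
        resp.raise_for_status()
        docs = resp.json().get("results", [])
        if not docs:
            return
        yield from docs
        offset += page_size
```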
Enterprise Options
For organizations with sustained high volume:
- Custom rate limits – negotiate RPM/RPS based on traffic patterns
- Reserved capacity – pre-allocate Engine workers and inference quota
- Dedicated infrastructure – isolated Qdrant cluster, Redis, and Ray head nodes
- Credit pooling – share quota across multiple sub-organizations
- SLA guarantees – contractual uptime and p99 latency commitments
FAQ
Do rate limits apply per API key or per organization?
Rate limits are enforced at the API key level, but concurrent request limits apply at the organization level. This allows you to distribute load across multiple keys while respecting org-wide concurrency caps.
Are retries counted against rate limits?
Yes. Every request, including retries, counts toward your RPM/RPS limits. Implement exponential backoff to avoid wasting quota on rapid retries.
Can I request a temporary quota increase?
Yes. Contact support with your use case (e.g., annual reindexing, event-driven spike). We can provision temporary credit boosts or rate limit exemptions.
What happens if I exceed storage quota?
New document creation fails with a `QuotaExceededError`. Existing documents remain queryable. Delete unused documents or upgrade tier to restore write access.
Do cached retriever responses consume credits?
No. Cache hits are free and don’t count toward inference or search credits. Maximize cache hit rate to reduce costs.
Next Steps
- Monitor usage via Organization Usage API
- Review Analytics Overview for cost optimization strategies
- Configure Webhooks to alert on quota thresholds
- Optimize retriever performance with Caching Strategies

