Model Categories
| Category | Examples | Served From | Primary Use | 
|---|---|---|---|
| Embeddings | multilingual-e5-large-instruct, gte-modernbert-base, OpenAI text-embedding-3-*, clip_vit_l_14 | Ray Serve (GPU) or external API | Semantic search, multimodal similarity |
| Sparse | splade_v1 | Ray Serve (CPU) | Hybrid and lexical search |
| Multi-Vector | colbertv2 | Ray Serve (GPU) | Late interaction retrieval |
| Rerankers | bge-reranker-v2-m3, cross-encoder | Ray Serve (GPU/CPU) | Reordering search results |
| Generation | gpt-4, gpt-4-turbo, claude-3-opus, gemini-pro | External APIs | Summaries, transformations |
| Audio | whisper_large_v3_turbo, pyannote-segmentation | Ray Serve (GPU) | Transcription, diarization |
Call /v1/feature-extractors and /v1/retrievers/stages to discover supported models and parameters programmatically.
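As a rough illustration, both discovery endpoints can be queried with plain GET requests. The sketch below assumes a hosted base URL and a Bearer-token Authorization header, and the response fields it prints (feature_extractor_name, supported_models, stage_name, parameters) are illustrative assumptions rather than a documented schema.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed base URL; adjust to your deployment
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}   # assumed auth scheme

# List feature extractors and whatever model options each advertises
# (responses assumed to be JSON arrays).
extractors = requests.get(f"{BASE_URL}/v1/feature-extractors", headers=HEADERS).json()
for extractor in extractors:
    print(extractor.get("feature_extractor_name"), extractor.get("supported_models"))

# List retriever stages and their configurable parameters.
stages = requests.get(f"{BASE_URL}/v1/retrievers/stages", headers=HEADERS).json()
for stage in stages:
    print(stage.get("stage_name"), stage.get("parameters"))
```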
Choosing a Model
Feature Extractors
- parameters.model selects the embedding model (see the payload sketch after this list).
- Collections compute an output_schema reflecting the chosen model’s vector dimensions.
- Vector field names include the extractor version (e.g., text_extractor_v1_embedding).
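To make the extractor-to-model relationship concrete, here is a minimal, hypothetical collection payload. The endpoint path and the payload field names (collection_name, feature_extractors) are assumptions for illustration; only parameters.model and the versioned vector field naming come from the points above.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Hypothetical collection payload: the extractor's parameters.model chooses the
# embedding model, and the returned output_schema reflects that model's dimensions.
payload = {
    "collection_name": "articles",
    "feature_extractors": [
        {
            "feature_extractor_name": "text_extractor",
            "version": "v1",
            "parameters": {"model": "multilingual-e5-large-instruct"},
        }
    ],
}

resp = requests.post(f"{BASE_URL}/v1/collections", headers=HEADERS, json=payload)
collection = resp.json()

# Vector field names embed the extractor version, e.g. "text_extractor_v1_embedding".
print(collection.get("output_schema"))
```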
Retriever Stages
- Most stages expose a model parameter or use the feature URI to infer the model (a hypothetical definition follows this list).
- Stage-level caching (cache_stage_names) combined with inference caching speeds up repeated requests.
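A hypothetical retriever definition tying these two points together might look like the sketch below. The stage names (knn_search, rerank), the parameter names other than model, and the cache_config wrapper around cache_stage_names are illustrative assumptions.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

retriever = {
    "retriever_name": "articles-search",
    "stages": [
        {
            # Dense retrieval stage with an explicit model choice.
            "stage_name": "knn_search",
            "parameters": {
                "model": "multilingual-e5-large-instruct",
                "feature_uri": "text_extractor_v1_embedding",
                "limit": 50,
            },
        },
        {
            # Reranking stage that reorders the candidates above.
            "stage_name": "rerank",
            "parameters": {"model": "bge-reranker-v2-m3", "top_k": 10},
        },
    ],
    # Cache the expensive first stage; combined with inference caching this
    # avoids recomputing results for repeated queries.
    "cache_config": {"cache_stage_names": ["knn_search"]},
}

requests.post(f"{BASE_URL}/v1/retrievers", headers=HEADERS, json=retriever)
```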
Versioning Strategy
Models are versioned independently of extractors:
- Deploy the new model version alongside the old one.
- Update a staging collection or retriever to reference the new model.
- Use Analytics (/v1/analytics/retrievers/...) to compare latency and relevance; a comparison sketch follows this list.
- Shift traffic gradually (e.g., update a percentage of retrievers).
- Deprecate the previous version once validated.
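One way to run the analytics comparison is to pull latency statistics for the old and new retrievers side by side. In the sketch below, the path segment after the retriever ID and the response fields (p50_ms, p95_ms) are assumptions; only the /v1/analytics/retrievers/... prefix comes from the list above.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def latency_summary(retriever_id: str) -> dict:
    # The "/performance" suffix and the response shape are illustrative assumptions.
    resp = requests.get(
        f"{BASE_URL}/v1/analytics/retrievers/{retriever_id}/performance",
        headers=HEADERS,
    )
    return resp.json()

# Compare the retriever on the old model against the staging retriever on the
# new model before shifting any production traffic.
old = latency_summary("ret_old_model")
new = latency_summary("ret_new_model")
print("old p95:", old.get("p95_ms"), "new p95:", new.get("p95_ms"))
```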
Performance & Scaling
- Ray Serve deployments define autoscaling policies (min_replicas, max_replicas, target concurrency); see the sketch after this list.
- GPU workers host embedding and reranking models; CPU workers can serve lighter workloads.
- Inference caching hashes (model_name, inputs, parameters) to skip repeated calls.
- Stage telemetry (stage_statistics) exposes per-stage latency, cache hits, and token usage.
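For reference, a generic Ray Serve deployment exposing these knobs looks roughly like the following. This is standard Ray Serve configuration, not Mixpeek's actual deployment code; depending on your Ray version the target-concurrency key may be target_ongoing_requests or target_num_ongoing_requests_per_replica.

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},      # GPU worker for embedding/reranking models
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_ongoing_requests": 8,       # target concurrency per replica
    },
)
class EmbeddingDeployment:
    def __init__(self):
        # Load the model onto the GPU worker here (omitted).
        self.model = None

    async def __call__(self, request):
        # Embed the request payload and return vectors (omitted).
        ...

app = EmbeddingDeployment.bind()
# serve.run(app, name="embedding-model")
```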
Cost Controls
- Use sparse or hybrid strategies when domain-specific vocabulary matters—dense-only pipelines can miss rare terms.
- Cache responses for high-volume retrievers to minimize repeated inference.
- For LLM-based stages, set budget limits via retriever budget_limits to cap token spend and execution time (see the sketch after this list).
- Monitor cache hit rates with GET /v1/analytics/retrievers/{id}/cache-performance.
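Putting the last two points together, a budget update plus a cache check might look like this. The PATCH route, the field names inside budget_limits, and the cache_hit_rate response key are assumptions; the cache-performance path comes from the bullet above.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
RETRIEVER_ID = "ret_123"                             # placeholder

# Cap token spend and execution time for LLM-based stages
# (field names inside budget_limits are illustrative).
requests.patch(
    f"{BASE_URL}/v1/retrievers/{RETRIEVER_ID}",
    headers=HEADERS,
    json={"budget_limits": {"max_tokens": 20_000, "max_execution_time_ms": 5_000}},
)

# Check how often the retriever is served from cache.
perf = requests.get(
    f"{BASE_URL}/v1/analytics/retrievers/{RETRIEVER_ID}/cache-performance",
    headers=HEADERS,
).json()
print("cache hit rate:", perf.get("cache_hit_rate"))
```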
When You Need Fine-Tuning
Mixpeek does not yet offer in-platform fine-tuning. If you need domain-specific embeddings or rerankers:
- Train or fine-tune externally (e.g., Hugging Face, OpenAI, Vertex AI).
- Deploy the model to Ray Serve using a custom provider, or call the external API directly from stages (a deployment sketch follows this list).
- Register the new model name/version in your extractor or retriever configuration.
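As a sketch of the Ray Serve route, an externally fine-tuned embedding model could be wrapped in a small deployment like the one below and then referenced by name from your extractor or retriever configuration. The model ID, request/response shape, and deployment name are assumptions.

```python
from ray import serve
from sentence_transformers import SentenceTransformer

@serve.deployment(ray_actor_options={"num_gpus": 1})
class FineTunedEmbedder:
    def __init__(self):
        # Hypothetical fine-tuned model published to the Hugging Face Hub.
        self.model = SentenceTransformer("your-org/domain-embedder-v1")

    async def __call__(self, request):
        # Expects a JSON body like {"inputs": ["text one", "text two"]}.
        body = await request.json()
        vectors = self.model.encode(body["inputs"]).tolist()
        return {"embeddings": vectors}

app = FineTunedEmbedder.bind()
# serve.run(app, name="domain-embedder-v1")
```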
Checklist Before Switching Models
- Update lower environments (dev/staging) first; ensure collections reprocess with the new model.
- Validate that vector dimensions match expectations; retrievers must reference the correct feature URI (a dimension check is sketched after this list).
- Monitor latency and cache metrics after rollout.
- Communicate index signature changes to teams relying on cached responses.
- Plan reindex time for large collections if the model upgrade changes vector dimensionality.
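A quick dimensionality check before cutover can catch mismatches early. In the sketch below, the collection endpoint and the output_schema shape (a per-field dimensions value) are assumptions; adjust the expected size to the model you are rolling out.

```python
import requests

BASE_URL = "https://api.mixpeek.com"                 # assumed
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

def vector_dimensions(collection_id: str, field_name: str):
    # output_schema shape ({field: {"dimensions": N, ...}}) is an assumption.
    collection = requests.get(
        f"{BASE_URL}/v1/collections/{collection_id}",
        headers=HEADERS,
    ).json()
    return collection.get("output_schema", {}).get(field_name, {}).get("dimensions")

# Confirm the field the retriever's feature URI points at has the expected size.
dims = vector_dimensions("col_staging", "text_extractor_v1_embedding")
assert dims == 1024, f"unexpected vector dimensionality: {dims}"
```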

