Model Categories
| Category | Examples | Served From | Primary Use |
|---|---|---|---|
| Embeddings | multilingual-e5-large-instruct, gte-modernbert-base, OpenAI text-embedding-3-*, clip_vit_l_14 | Ray Serve (GPU) or external API | Semantic search, multimodal similarity |
| Sparse | splade_v1 | Ray Serve (CPU) | Hybrid and lexical search |
| Multi-Vector | colbertv2 | Ray Serve (GPU) | Late interaction retrieval |
| Rerankers | bge-reranker-v2-m3, cross-encoder | Ray Serve (GPU/CPU) | Reordering search results |
| Generation | gpt-4, gpt-4-turbo, claude-3-opus, gemini-pro | External APIs | Summaries, transformations |
| Audio | whisper_large_v3_turbo, pyannote-segmentation | Ray Serve (GPU) | Transcription, diarization |
Use `/v1/feature-extractors` and `/v1/retrievers/stages` to discover supported models and parameters programmatically.
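For example, both discovery endpoints can be queried directly. A minimal Python sketch; the base URL, bearer-token header, and response handling are assumptions, only the two paths come from this page:

```python
import requests

BASE_URL = "https://api.example.com"            # assumption: your API host
HEADERS = {"Authorization": "Bearer <token>"}   # assumption: bearer-token auth

# List feature extractors and the models/parameters each supports.
extractors = requests.get(f"{BASE_URL}/v1/feature-extractors", headers=HEADERS).json()

# List retriever stage types and their accepted parameters.
stages = requests.get(f"{BASE_URL}/v1/retrievers/stages", headers=HEADERS).json()

print(extractors)
print(stages)
```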
Choosing a Model
Feature Extractors
- `parameters.model` selects the embedding model (see the sketch after this list).
- Collections compute an `output_schema` reflecting the chosen model's vector dimensions.
- Vector field names include the extractor version (e.g., `text_extractor_v1_embedding`).
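As a hedged illustration of how these pieces fit together, a collection payload might look like the following; only `parameters.model` and the version-stamped vector field name are grounded in the points above, the other field names are assumptions:

```python
# Illustrative collection payload; field names other than parameters.model
# and the versioned vector field are assumptions.
collection = {
    "collection_name": "articles",
    "feature_extractors": [
        {
            "feature_extractor_name": "text_extractor",
            "version": "v1",
            "parameters": {
                "model": "multilingual-e5-large-instruct",  # selects the embedding model
            },
        }
    ],
}

# The computed output_schema reflects this model's vector dimensions,
# exposed under a version-stamped field name:
vector_field = "text_extractor_v1_embedding"
```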
Retriever Stages
- Most stages expose a `model` parameter or use the feature URI to infer the model (see the sketch after this list).
- Stage-level caching (`cache_stage_names`) combined with inference caching shortens repeated requests.
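A sketch of a retriever definition that pins stages to models and enables stage-level caching; apart from `model` and `cache_stage_names`, the stage names and payload structure are assumptions:

```python
# Illustrative retriever definition; field names other than model and
# cache_stage_names are assumptions.
retriever = {
    "retriever_name": "hybrid_search",
    "stages": [
        {
            "stage_name": "knn_search",
            "parameters": {
                "model": "gte-modernbert-base",                 # explicit model choice
                "feature_uri": "text_extractor_v1_embedding",   # or infer the model from the URI
            },
        },
        {
            "stage_name": "rerank",
            "parameters": {"model": "bge-reranker-v2-m3"},
        },
    ],
    "cache_config": {
        "cache_stage_names": ["knn_search", "rerank"],          # stage-level caching
    },
}
```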
Versioning Strategy
Models are versioned independently of extractors:
- Deploy the new model version alongside the old one.
- Update a staging collection or retriever to reference the new model.
- Use Analytics (`/v1/analytics/retrievers/...`) to compare latency and relevance (a comparison sketch follows this list).
- Shift traffic gradually (e.g., update a percentage of retrievers).
- Deprecate the previous version once validated.
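For the Analytics step, a small script can pull metrics for the old and new retrievers and compare them. This is a sketch: the `/performance` suffix and the `latency_p95_ms` field are assumptions beyond the `/v1/analytics/retrievers/...` prefix shown above:

```python
import requests

BASE_URL = "https://api.example.com"            # assumption
HEADERS = {"Authorization": "Bearer <token>"}   # assumption

def p95_latency_ms(retriever_id: str) -> float:
    """Fetch p95 latency for one retriever; the path suffix and field name are illustrative."""
    resp = requests.get(
        f"{BASE_URL}/v1/analytics/retrievers/{retriever_id}/performance",
        headers=HEADERS,
    )
    resp.raise_for_status()
    return resp.json()["latency_p95_ms"]

old, new = p95_latency_ms("ret_old_model"), p95_latency_ms("ret_new_model")
print(f"p95 latency old={old:.0f}ms new={new:.0f}ms delta={new - old:+.0f}ms")
```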
Performance & Scaling
- Ray Serve deployments define autoscaling policies (`min_replicas`, `max_replicas`, target concurrency); a deployment sketch follows this list.
- GPU workers host embedding and reranking models; CPU workers can serve lighter workloads.
- Inference caching hashes `(model_name, inputs, parameters)` to skip repeated calls.
- Stage telemetry (`stage_statistics`) exposes per-stage latency, cache hits, and token usage.
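A minimal Ray Serve sketch of the autoscaling knobs named above; the deployment body is a placeholder, and `target_ongoing_requests` is how recent Ray Serve versions express target concurrency (older versions use `target_num_ongoing_requests_per_replica`):

```python
from ray import serve

@serve.deployment(
    ray_actor_options={"num_gpus": 1},            # GPU worker for an embedding model
    autoscaling_config={
        "min_replicas": 1,
        "max_replicas": 4,
        "target_ongoing_requests": 2,             # target concurrency per replica
    },
)
class EmbeddingModel:
    def __init__(self):
        # Placeholder: load the embedding model onto the GPU here.
        self.model_name = "multilingual-e5-large-instruct"

    async def __call__(self, request):
        # Placeholder inference; a real deployment would encode the request text.
        return {"model": self.model_name, "embedding": [0.0] * 1024}

app = EmbeddingModel.bind()
# serve.run(app)  # start the deployment on a running Ray cluster
```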
Cost Controls
- Use sparse or hybrid strategies when domain-specific vocabulary matters—dense-only pipelines can miss rare terms.
- Cache responses for high-volume retrievers to minimize repeated inference.
- For LLM-based stages, set budget limits via retriever `budget_limits` to cap token spend and execution time (see the sketch after this list).
- Monitor cache hit rates with `GET /v1/analytics/retrievers/{id}/cache-performance`.
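As a sketch, the budget and cache-monitoring controls might be used like this; `budget_limits` and the cache-performance path appear above, while the individual limit fields, retriever ID, and base URL are assumptions:

```python
import requests

BASE_URL = "https://api.example.com"            # assumption
HEADERS = {"Authorization": "Bearer <token>"}   # assumption

# Illustrative budget_limits block for an LLM-backed retriever;
# the individual limit fields are assumptions.
retriever_update = {
    "budget_limits": {
        "max_tokens": 50_000,       # cap token spend
        "max_execution_ms": 5_000,  # cap execution time
    }
}

# Check cache hit rates for a high-volume retriever.
perf = requests.get(
    f"{BASE_URL}/v1/analytics/retrievers/ret_123/cache-performance",
    headers=HEADERS,
).json()
print(perf)
```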
When You Need Fine-Tuning
Mixpeek does not yet offer in-platform fine-tuning. If you need domain-specific embeddings or rerankers:
- Train or fine-tune externally (e.g., Hugging Face, OpenAI, Vertex AI).
- Deploy the model to Ray Serve using a custom provider or call the external API directly from stages (a deployment sketch follows this list).
- Register the new model name/version in your extractor or retriever configuration.
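A hedged sketch of the deploy-and-register steps: wrap an externally fine-tuned model (here via sentence-transformers) in a Ray Serve deployment, then point the extractor at the new model identifier. The Hugging Face model ID and the update payload shape are illustrative:

```python
from ray import serve
from sentence_transformers import SentenceTransformer

@serve.deployment(ray_actor_options={"num_gpus": 1})
class CustomEmbedder:
    def __init__(self):
        # Assumption: a fine-tuned model pushed to the Hugging Face Hub under this ID.
        self.model = SentenceTransformer("my-org/e5-large-finetuned-legal")

    async def __call__(self, request):
        # Assumption: requests arrive as JSON of the form {"inputs": ["text", ...]}.
        payload = await request.json()
        embeddings = self.model.encode(payload["inputs"])
        return {"embeddings": embeddings.tolist()}

app = CustomEmbedder.bind()

# Once the deployment is live, reference the new model name/version in the
# extractor configuration, e.g. an update payload like (fields illustrative):
extractor_update = {"parameters": {"model": "e5-large-finetuned-legal_v1"}}
```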
Checklist Before Switching Models
- Update lower environments (dev/staging) first; ensure collections reprocess with the new model.
- Validate vector dimensions match expectations; retrievers must reference the correct feature URI (a quick check follows this list).
- Monitor latency and cache metrics after rollout.
- Communicate index signature changes to teams relying on cached responses.
- Plan reindex time for large collections if the model upgrade changes vector dimensionality.
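For the dimension check above, a quick validation sketch; the collection endpoint path and `output_schema` shape are assumptions consistent with the extractor behavior described earlier:

```python
import requests

BASE_URL = "https://api.example.com"            # assumption
HEADERS = {"Authorization": "Bearer <token>"}   # assumption
EXPECTED_DIMS = 1024                            # what the new model should produce

# Assumed path for collection metadata; the output_schema shape is illustrative.
collection = requests.get(f"{BASE_URL}/v1/collections/articles", headers=HEADERS).json()
field = collection.get("output_schema", {}).get("text_extractor_v1_embedding", {})

if field.get("dimensions") != EXPECTED_DIMS:
    raise SystemExit(
        f"Dimension mismatch: got {field.get('dimensions')}, expected {EXPECTED_DIMS}; "
        "plan a reindex before pointing retrievers at this feature URI."
    )
```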

