Mixpeek provides two ways to run inference: built-in models served on Ray Serve, and custom models that you upload and deploy yourself. This page covers both approaches, with a focus on the custom model workflow for Enterprise users.
Built-in Models
Mixpeek ships with a curated set of models optimized for common tasks:
| Category | Examples | Primary Use |
|---|---|---|
| Embeddings | multilingual-e5-large-instruct, gte-modernbert-base, clip_vit_l_14 | Semantic search, multimodal similarity |
| Sparse | splade_v1 | Hybrid and lexical search |
| Multi-Vector | colbertv2 | Late interaction retrieval |
| Rerankers | bge-reranker-v2-m3, cross-encoder | Reordering search results |
| Generation | gpt-4, claude-3-opus, gemini-pro | Summaries, transformations |
| Audio | whisper_large_v3_turbo, pyannote-segmentation | Transcription, diarization |
Call `/v1/feature-extractors` and `/v1/retrievers/stages` to discover supported models programmatically.
Custom Models (Enterprise)
Custom models require an Enterprise tier subscription. Contact sales to upgrade.
Custom models let you bring your own trained weights—fine-tuned embeddings, domain-specific classifiers, or proprietary architectures—and run them on Mixpeek’s infrastructure with zero-copy sharing across workers.
How It Works
- Upload your model archive (`.tar.gz`) containing weights
- Deploy to the Ray object store for fast access
- Use in custom plugins via the model loader API
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Upload    │────▶│   Deploy    │────▶│    Use      │
│ (S3 store)  │     │ (Ray cache) │     │ (Inference) │
└─────────────┘     └─────────────┘     └─────────────┘
```
| Format | Extension | Description |
|---|---|---|
| pytorch | `.pt`, `.pth` | PyTorch state_dict or TorchScript |
| safetensors | `.safetensors` | SafeTensors format (recommended) |
| onnx | `.onnx` | ONNX Runtime format |
| huggingface | directory | HuggingFace model directory |
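If you want a client-side guess of the right `model_format` value before uploading, the table above maps cleanly from file extensions. A minimal sketch, with a hypothetical helper that is not part of any Mixpeek SDK (the API takes `model_format` explicitly):

```python
import os

# Extension-to-format mapping taken from the table above
FORMAT_BY_EXT = {
    ".pt": "pytorch",
    ".pth": "pytorch",
    ".safetensors": "safetensors",
    ".onnx": "onnx",
}

def guess_format(filename: str) -> str:
    _, ext = os.path.splitext(filename)
    # HuggingFace models ship as a directory, so a missing extension falls through
    return FORMAT_BY_EXT.get(ext, "huggingface")

print(guess_format("model.safetensors"))  # safetensors
print(guess_format("weights.pth"))        # pytorch
```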
Quickstart
1. Create a Model Archive
Package your model weights into a `.tar.gz` archive:
```python
import torch
import tarfile
import io

# Train or load your model
model = torch.nn.Linear(768, 256)

# Save weights
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# Create archive
with tarfile.open("my_model.tar.gz", "w:gz") as tar:
    info = tarfile.TarInfo(name="model.pt")
    info.size = len(buffer.getvalue())
    tar.addfile(info, buffer)
```
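Before uploading a multi-gigabyte archive, it can be worth sanity-checking that the member layout round-trips. A stdlib-only sketch using dummy bytes in place of real weights:

```python
import io
import tarfile

# Dummy payload standing in for serialized model weights
payload = b"\x00" * 1024

# Pack the payload as model.pt inside an in-memory .tar.gz, mirroring the steps above
archive = io.BytesIO()
with tarfile.open(fileobj=archive, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="model.pt")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Reopen the archive and verify the expected member exists and round-trips
archive.seek(0)
with tarfile.open(fileobj=archive, mode="r:gz") as tar:
    names = tar.getnames()
    restored = tar.extractfile("model.pt").read()

print(names)                # ['model.pt']
print(restored == payload)  # True
```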
2. Upload the Model
```bash
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -F "file=@my_model.tar.gz" \
  -F "name=my-embedding-model" \
  -F "version=1.0.0" \
  -F "model_format=pytorch" \
  -F "framework=pytorch" \
  -F "task_type=embedding" \
  -F "num_gpus=0" \
  -F "memory_gb=4.0"
```
Response:
```json
{
  "success": true,
  "model_id": "my-embedding-model_1_0_0",
  "deployment_status": "pending",
  "endpoint": "/models/ns_abc123/my-embedding-model_1_0_0",
  "model_archive_url": "s3://mixpeek/..."
}
```
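The sample responses suggest that `model_id` is derived from the name and version, with dots in the version replaced by underscores. If you need to construct IDs client-side, a sketch of that apparent convention (inferred from the examples, not a documented guarantee):

```python
def model_id(name: str, version: str) -> str:
    # Dots in the version appear to become underscores in the model_id
    return f"{name}_{version.replace('.', '_')}"

print(model_id("my-embedding-model", "1.0.0"))  # my-embedding-model_1_0_0
```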
3. Deploy to Ray Object Store
Pre-load your model into the distributed cache for fast inference:
```bash
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models/my-embedding-model_1_0_0/deploy" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY"
```
Response:
```json
{
  "success": true,
  "model_id": "my-embedding-model_1_0_0",
  "namespace_id": "ns_abc123",
  "deployment_status": "deployed",
  "cached": true,
  "message": "Model my-embedding-model_1_0_0 loaded into Ray object store"
}
```
4. Use in Custom Plugins
Load your model in a custom plugin with zero-copy access:
```python
from engine.models.loader import load_namespace_model
import torch

class MyCustomProcessor:
    def __init__(self):
        # Load pre-uploaded weights (cached in Ray object store)
        weights = load_namespace_model("my-embedding-model_1_0_0")

        # Initialize model architecture
        self.model = torch.nn.Linear(768, 256)
        self.model.load_state_dict(weights)
        self.model.eval()

    def process(self, text_embedding):
        with torch.no_grad():
            return self.model(text_embedding)
```
Examples
Upload a HuggingFace Model
```bash
# Package a HuggingFace model directory
tar -czvf my-bert.tar.gz ./my-bert-model/

# Upload
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -F "[email protected]" \
  -F "name=my-fine-tuned-bert" \
  -F "version=2.0.0" \
  -F "model_format=huggingface" \
  -F "framework=sentence-transformers" \
  -F "task_type=embedding" \
  -F "num_gpus=1" \
  -F "memory_gb=8.0"
```
Upload an ONNX Model
```bash
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -F "[email protected]" \
  -F "name=product-classifier" \
  -F "version=1.0.0" \
  -F "model_format=onnx" \
  -F "task_type=classification" \
  -F "num_gpus=0" \
  -F "memory_gb=2.0"
```
List All Models
```bash
curl "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY"
```
Response:
```json
{
  "success": true,
  "models": [
    {
      "model_id": "my-embedding-model_1_0_0",
      "name": "my-embedding-model",
      "version": "1.0.0",
      "model_format": "pytorch",
      "deployed": true,
      "created_at": "2025-01-10T12:00:00Z"
    }
  ],
  "total": 1
}
```
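On the client side, you will often want only the models that are actually loaded into the Ray object store. A sketch that filters the sample listing response above by its `deployed` flag:

```python
# Sample listing response, as shown in the docs above
response = {
    "success": True,
    "models": [
        {
            "model_id": "my-embedding-model_1_0_0",
            "name": "my-embedding-model",
            "version": "1.0.0",
            "model_format": "pytorch",
            "deployed": True,
            "created_at": "2025-01-10T12:00:00Z",
        }
    ],
    "total": 1,
}

# Keep only models already deployed (not merely uploaded)
deployed_ids = [m["model_id"] for m in response["models"] if m["deployed"]]
print(deployed_ids)  # ['my-embedding-model_1_0_0']
```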
Python SDK Example
```python
from mixpeek import Mixpeek

client = Mixpeek(api_key="sk_...")

# Upload model
with open("my_model.tar.gz", "rb") as f:
    result = client.models.upload(
        namespace_id="ns_abc123",
        file=f,
        name="my-reranker",
        version="1.0.0",
        model_format="pytorch",
        task_type="reranking",
    )

# Deploy model
client.models.deploy(
    namespace_id="ns_abc123",
    model_id=result.model_id,
)

# List models
models = client.models.list(namespace_id="ns_abc123")
```
Model Versioning
Models are versioned independently, allowing safe rollouts:
```
my-embedding-model_1_0_0   (production)
my-embedding-model_1_1_0   (staging)
my-embedding-model_2_0_0   (development)
```
Recommended rollout process:
1. Upload the new version alongside the existing one
2. Deploy it to the Ray object store
3. Update staging plugins to use the new version
4. Monitor performance via the Analytics API
5. Gradually shift production traffic
6. Delete old versions once validated
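When several versions coexist, you may want to pick the newest programmatically. A sketch with a hypothetical helper that assumes the `<name>_<major>_<minor>_<patch>` ID convention seen in the examples above:

```python
def parse_version(model_id: str) -> tuple:
    # Assumes the "<name>_<major>_<minor>_<patch>" ID convention from the
    # examples above; this helper is illustrative, not part of the SDK
    name_and_parts = model_id.rsplit("_", 3)
    return tuple(int(p) for p in name_and_parts[1:])

ids = [
    "my-embedding-model_1_0_0",
    "my-embedding-model_1_1_0",
    "my-embedding-model_2_0_0",
]
latest = max(ids, key=parse_version)
print(latest)  # my-embedding-model_2_0_0
```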
Resource Requirements
When uploading, specify resource requirements for optimal scheduling:
| Parameter | Description | Default |
|---|---|---|
| `num_cpus` | CPU cores required | 1.0 |
| `num_gpus` | GPU devices required | 0 |
| `memory_gb` | Memory allocation in GB | 4.0 |
```bash
# GPU model with high memory
-F "num_gpus=1" \
-F "memory_gb=16.0"

# CPU-only lightweight model
-F "num_cpus=0.5" \
-F "num_gpus=0" \
-F "memory_gb=2.0"
```
Limits & Quotas
| Limit | Value |
|---|---|
| Max models per namespace | 50 |
| Max archive size | 10 GB |
| Supported formats | pytorch, safetensors, onnx, huggingface |
| Required tier | Enterprise |
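Large archives fail late, so it can help to check the limits from the table above before starting an upload. A client-side sketch (the quota values are from the table; the helper itself is hypothetical):

```python
MAX_ARCHIVE_BYTES = 10 * 1024**3  # 10 GB archive limit from the table above
SUPPORTED_FORMATS = {"pytorch", "safetensors", "onnx", "huggingface"}

def validate_upload(archive_bytes: int, model_format: str) -> list:
    """Return a list of problems; an empty list means these checks pass."""
    problems = []
    if archive_bytes > MAX_ARCHIVE_BYTES:
        problems.append("archive exceeds 10 GB limit")
    if model_format not in SUPPORTED_FORMATS:
        problems.append(f"unsupported model_format: {model_format}")
    return problems

print(validate_upload(4 * 1024**2, "pytorch"))  # []
print(validate_upload(11 * 1024**3, "tensorflow"))
```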
Troubleshooting
"Custom models require Enterprise tier"
Your organization must be on the Enterprise plan to use custom models. Contact sales to upgrade.
Model deployment fails
- Verify the archive is a valid `.tar.gz`
- Check that the weights match the declared `model_format`
- Ensure sufficient memory is allocated
- Check engine health via `/v1/health`
Model not found in plugin
- Verify the model was deployed (not just uploaded)
- Check for `deployed: true` in the model details
- Ensure `namespace_id` matches in the plugin configuration
Slow inference
- Pre-deploy models before inference to warm the cache
- Check for `cached: true` in the deployment response
- Consider increasing `num_gpus` for large models
Custom models unlock the full power of Mixpeek’s distributed inference infrastructure with your own trained weights. Upload once, deploy to the Ray object store, and access with zero-copy sharing across all your custom plugins.