Mixpeek provides multiple ways to run inference with ML models:
- Built-in Models - Pre-configured models served on Ray Serve
- HuggingFace Models - Any HF model with automatic cluster-wide caching
- Custom Models (Enterprise) - Your own uploaded weights via S3
This page covers all approaches, including the LazyModelMixin for efficient model loading in plugins.
Built-in Models
Mixpeek ships with curated models optimized for common tasks:
| Category | Examples | Primary Use |
|---|---|---|
| Embeddings | multilingual-e5-large-instruct, gte-modernbert-base, clip_vit_l_14 | Semantic search, multimodal similarity |
| Sparse | splade_v1 | Hybrid and lexical search |
| Multi-Vector | colbertv2 | Late interaction retrieval |
| Rerankers | bge-reranker-v2-m3, cross-encoder | Reordering search results |
| Generation | gpt-4, claude-3-opus, gemini-pro | Summaries, transformations |
| Audio | whisper_large_v3_turbo, pyannote-segmentation | Transcription, diarization |
Call /v1/feature-extractors and /v1/retrievers/stages to discover supported models programmatically.
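For example, a minimal discovery call might look like the sketch below. It reuses the same MIXPEEK_API_URL and MIXPEEK_API_KEY environment variables as the curl examples later on this page and makes no assumptions about the response schema beyond it being JSON:

import os
import requests

BASE_URL = os.environ["MIXPEEK_API_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['MIXPEEK_API_KEY']}"}

# List feature extractors and retriever stages to see which models they expose
for path in ("/v1/feature-extractors", "/v1/retrievers/stages"):
    resp = requests.get(f"{BASE_URL}{path}", headers=HEADERS)
    resp.raise_for_status()
    print(path, resp.json())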
HuggingFace Models with Cluster Caching
Load any HuggingFace model with automatic cluster-wide caching. The ModelRegistry downloads the model once and shares it across all actors via Ray’s object store.
Direct Loading
from engine.models.loader import load_hf_model

class MyProcessor:
    def __init__(self, config, **kwargs):
        self._model = None
        self._tokenizer = None
        self.config = config

    def _ensure_model_loaded(self):
        if self._model is not None:
            return

        # Loads from HF (first time) or Ray cache (subsequent calls)
        cached = load_hf_model(
            hf_model_id="intfloat/multilingual-e5-large-instruct",
            model_class="AutoModel",
            tokenizer_class="AutoTokenizer",
            torch_dtype="float16",
        )

        # Instantiate from cached state dict
        from transformers import AutoModel, AutoConfig, AutoTokenizer

        config = AutoConfig.from_dict(cached["config"])
        self._model = AutoModel.from_config(config)
        self._model.load_state_dict(cached["state_dict"])
        self._model.to("cuda")
        self._model.eval()

        # Load tokenizer
        self._tokenizer = AutoTokenizer.from_pretrained(
            cached["tokenizer_config"]["tokenizer_dir"]
        )

    def __call__(self, batch):
        self._ensure_model_loaded()
        # Use self._model and self._tokenizer...
Return Value
load_hf_model() returns a dictionary containing:
{
    "state_dict": dict,        # Model weights (load with model.load_state_dict())
    "config": dict,            # Model config (use with AutoConfig.from_dict())
    "tokenizer_config": {      # Tokenizer info (if tokenizer_class provided)
        "tokenizer_dir": str,  # Path to tokenizer files
        "class": str,          # Tokenizer class name
    },
    "model_class": str,        # The model class used
    "hf_model_id": str,        # The HuggingFace model ID
}
Async Loading for Parallel Models
Load multiple models in parallel with the async API:
from engine.models.loader import load_hf_model_async
import ray

# Start loading both models in parallel
refs = [
    load_hf_model_async("intfloat/multilingual-e5-large-instruct"),
    load_hf_model_async("openai/clip-vit-large-patch14"),
]

# Wait for both to complete
e5_data, clip_data = ray.get(refs)
Check Cache Status
from engine.models.loader import is_hf_model_cached
if is_hf_model_cached("intfloat/multilingual-e5-large-instruct"):
    print("Model already cached in cluster")
LazyModelMixin (Recommended)
The LazyModelMixin provides automatic lazy loading with cluster-wide caching. Models only load when first needed, not at actor creation time.
Benefits
- Resource Efficiency: Models load on first batch, not at actor creation
- Cluster Caching: One download shared across all actors
- Zero-Copy Sharing: Ray object store provides zero-copy access
- Automatic Device Detection: Finds CUDA, MPS, or CPU automatically
Usage
import torch

from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class MyEmbeddingProcessor(LazyModelMixin, BaseBatchInferenceService):
    # Configure the model via class attributes
    model_id = "intfloat/multilingual-e5-large-instruct"
    model_class = "AutoModel"
    tokenizer_class = "AutoTokenizer"
    torch_dtype = "float16"
    model_source = "huggingface"  # or "namespace" for S3 models

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config
        # Don't load the model here - LazyModelMixin handles it

    def _process_batch(self, batch):
        # Model is automatically loaded on first call
        model, tokenizer = self.get_model()

        # Process the batch
        inputs = tokenizer(
            batch["text"].tolist(),
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
        with torch.no_grad():
            outputs = model(**inputs)

        batch["embedding"] = outputs.last_hidden_state.mean(dim=1).tolist()
        return batch
Class Attributes
| Attribute | Type | Default | Description |
|---|---|---|---|
| model_id | str | "" | HuggingFace model ID or namespace model ID |
| model_class | str | "AutoModel" | Transformers model class name |
| tokenizer_class | str \| None | "AutoTokenizer" | Tokenizer class, or None to skip |
| torch_dtype | str | "float32" | Torch dtype: "float16", "float32", or "bfloat16" |
| model_source | str | "huggingface" | "huggingface" or "namespace" |
Methods
| Method | Description |
|---|---|
| get_model() | Returns a (model, tokenizer) tuple, loading if needed |
| ensure_model_loaded() | Explicitly triggers model loading |
| model | Property access to the loaded model |
| tokenizer | Property access to the loaded tokenizer |
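As a brief illustration of these methods, an actor can warm the model explicitly at startup and then use the properties afterward. This sketch assumes the MyEmbeddingProcessor class from the usage example above and that BaseBatchInferenceService needs no extra constructor arguments:

processor = MyEmbeddingProcessor(config={})

# Optional warm-up: trigger the cluster-cached load before the first batch arrives
processor.ensure_model_loaded()

# get_model() and the properties expose the loaded objects
model, tokenizer = processor.get_model()
print(type(processor.model), type(processor.tokenizer))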
Custom Model Instantiation
Override _instantiate_model() for custom loading logic:
class MyCustomProcessor(LazyModelMixin, BaseBatchInferenceService):
    model_id = "my-custom-model"
    model_source = "namespace"  # Load from S3

    def _instantiate_model(self, cached_data):
        """Custom instantiation for non-standard models."""
        import torch

        # Create your model architecture
        model = torch.nn.Sequential(
            torch.nn.Linear(768, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 256),
        )

        # Load weights from cached data
        model.load_state_dict(cached_data)
        model.to(self._detect_device())
        model.eval()

        return model, None  # No tokenizer
@lazy_model Decorator
For simpler cases, use the @lazy_model decorator instead of the mixin:
import torch

from engine.models.lazy import lazy_model

class MyProcessor:
    def __init__(self, config, **kwargs):
        self.config = config
        self._model = None
        self._tokenizer = None

    @lazy_model(
        model_id="intfloat/multilingual-e5-large-instruct",
        model_class="AutoModel",
        tokenizer_class="AutoTokenizer",
        torch_dtype="float16",
    )
    def __call__(self, batch):
        # self._model and self._tokenizer are automatically available
        inputs = self._tokenizer(
            batch["text"].tolist(),
            return_tensors="pt",
            padding=True,
        )
        with torch.no_grad():
            outputs = self._model(**inputs)
        batch["embedding"] = outputs.last_hidden_state.mean(dim=1).tolist()
        return batch
Decorator Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| model_id | str | required | HuggingFace or namespace model ID |
| model_class | str | "AutoModel" | Transformers model class |
| tokenizer_class | str \| None | "AutoTokenizer" | Tokenizer class |
| torch_dtype | str | "float32" | Torch dtype |
| model_source | str | "huggingface" | "huggingface" or "namespace" |
Available Attributes After Decoration
The decorator sets these attributes on self:
- self._model - The loaded model
- self._tokenizer - The loaded tokenizer (or None)
- self._device - The device the model is on ("cuda", "mps", or "cpu"); see the sketch below
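For instance, self._device can be used to keep inputs on the same device the decorator selected. The following is a minimal sketch under the same decorator behavior as the example above; the class name DeviceAwareProcessor is illustrative:

import torch
from engine.models.lazy import lazy_model

class DeviceAwareProcessor:
    @lazy_model(
        model_id="intfloat/multilingual-e5-large-instruct",
        tokenizer_class="AutoTokenizer",
        torch_dtype="float16",
    )
    def __call__(self, batch):
        # Move tokenized inputs to the device the decorator selected
        inputs = self._tokenizer(
            batch["text"].tolist(),
            return_tensors="pt",
            padding=True,
        ).to(self._device)
        with torch.no_grad():
            outputs = self._model(**inputs)
        batch["embedding"] = outputs.last_hidden_state.mean(dim=1).cpu().tolist()
        return batch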
Custom Models (Enterprise)
Custom models require an Enterprise tier subscription. Contact sales to upgrade.
Upload your own trained weights—fine-tuned embeddings, domain-specific classifiers, or proprietary architectures—and run them on Mixpeek’s infrastructure.
How It Works
- Upload your model archive (.tar.gz) containing weights
- Deploy to the Ray object store for fast access
- Use in custom plugins via load_namespace_model()
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Upload │────▶│ Deploy │────▶│ Use │
│ (S3 store) │ │ (Ray cache) │ │ (Inference) │
└─────────────┘ └─────────────┘ └─────────────┘
| Format | Extension | Description |
|---|---|---|
| pytorch | .pt, .pth | PyTorch state_dict or TorchScript |
| safetensors | .safetensors | SafeTensors format (recommended) |
| onnx | .onnx | ONNX Runtime format |
| huggingface | directory | HuggingFace model directory |
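For example, weights can be exported in the recommended safetensors format and packaged into the expected .tar.gz archive entirely from Python. This is a sketch; the model and directory names are placeholders for your own trained weights:

import os
import tarfile
import torch
from safetensors.torch import save_file

os.makedirs("model_weights", exist_ok=True)

# Export weights in the recommended safetensors format
model = torch.nn.Linear(768, 256)  # placeholder for your trained model
save_file(model.state_dict(), "model_weights/model.safetensors")

# Package the directory as the .tar.gz archive expected by the upload endpoint
with tarfile.open("my_model.tar.gz", "w:gz") as tar:
    tar.add("model_weights")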
Upload a Custom Model
# Create archive
tar -czvf my_model.tar.gz ./model_weights/
# Upload
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
-H "Authorization: Bearer $MIXPEEK_API_KEY" \
-F "file=@my_model.tar.gz" \
-F "name=my-embedding-model" \
-F "version=1.0.0" \
-F "model_format=pytorch" \
-F "framework=pytorch" \
-F "task_type=embedding" \
-F "num_gpus=0" \
-F "memory_gb=4.0"
Deploy to Ray Object Store
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models/my-embedding-model_1_0_0/deploy" \
-H "Authorization: Bearer $MIXPEEK_API_KEY"
Use in Plugins
With LazyModelMixin
from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class MyCustomProcessor(LazyModelMixin, BaseBatchInferenceService):
    model_id = "my-embedding-model_1_0_0"
    model_source = "namespace"  # Load from S3

    def _instantiate_model(self, weights):
        """Custom instantiation for S3 models."""
        import torch

        model = torch.nn.Linear(768, 256)
        model.load_state_dict(weights)
        model.to(self._detect_device())
        model.eval()
        return model, None

    def _process_batch(self, batch):
        model, _ = self.get_model()
        # Use model...
With Direct Loading
import torch

from engine.models.loader import load_namespace_model

class MyCustomProcessor:
    def __init__(self, config):
        self._model = None
        self.config = config

    def _ensure_model_loaded(self):
        if self._model is not None:
            return

        # Load from S3 (cached in Ray object store)
        weights = load_namespace_model("my-embedding-model_1_0_0")
        self._model = torch.nn.Linear(768, 256)
        self._model.load_state_dict(weights)
        self._model.eval()

    def __call__(self, batch):
        self._ensure_model_loaded()
        # Use self._model...
Choosing the Right Approach
| Approach | Use Case | Caching |
|---|---|---|
| Built-in Models | Common tasks, no custom training | Pre-deployed |
| HuggingFace + LazyModelMixin | Standard HF models, easy setup | Cluster-wide |
| HuggingFace + load_hf_model() | Custom instantiation logic needed | Cluster-wide |
| Namespace Models | Fine-tuned or proprietary weights | Cluster-wide |
Decision Tree
Need a model?
├── Is it a common task (embeddings, transcription, etc.)?
│ └── Use built-in models
├── Is it a standard HuggingFace model?
│ ├── Simple usage → LazyModelMixin or @lazy_model
│ └── Custom logic → load_hf_model() + manual instantiation
└── Is it custom/fine-tuned weights?
└── Upload to namespace → LazyModelMixin with model_source="namespace"
Model Versioning
Models are versioned independently for safe rollouts:
my-embedding-model_1_0_0 (production)
my-embedding-model_1_1_0 (staging)
my-embedding-model_2_0_0 (development)
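Because the version is part of the model ID, a staging plugin can pin a candidate version simply by changing model_id. The sketch below uses the LazyModelMixin pattern from earlier on this page; the class name and version are illustrative:

from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class StagingEmbeddingProcessor(LazyModelMixin, BaseBatchInferenceService):
    # Staging pins the candidate version; production stays on 1_0_0 until validated
    model_id = "my-embedding-model_1_1_0"
    model_source = "namespace"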
Recommended rollout process:
- Upload new version alongside existing
- Deploy to Ray object store
- Update staging plugins to use new version
- Monitor performance via Analytics API
- Gradually shift production traffic
- Delete old versions once the new one is validated
Resource Requirements
When uploading custom models, specify resource requirements:
| Parameter | Description | Default |
|---|---|---|
| num_cpus | CPU cores required | 1.0 |
| num_gpus | GPU devices required | 0 |
| memory_gb | Memory allocation in GB | 4.0 |
# GPU model with high memory
-F "num_gpus=1" \
-F "memory_gb=16.0"
# CPU-only lightweight model
-F "num_cpus=0.5" \
-F "num_gpus=0" \
-F "memory_gb=2.0"
Python SDK
from mixpeek import Mixpeek

client = Mixpeek(api_key="sk_...")

# Upload model
with open("my_model.tar.gz", "rb") as f:
    result = client.models.upload(
        namespace_id="ns_abc123",
        file=f,
        name="my-reranker",
        version="1.0.0",
        model_format="pytorch",
        task_type="reranking",
    )

# Deploy model
client.models.deploy(
    namespace_id="ns_abc123",
    model_id=result.model_id,
)

# List models
models = client.models.list(namespace_id="ns_abc123")
Limits & Quotas
| Limit | Value |
|---|---|
| Max models per namespace | 50 |
| Max archive size | 10 GB |
| Supported formats | pytorch, safetensors, onnx, huggingface |
| Custom models tier | Enterprise |
Troubleshooting
"Model not loading on first batch"
Ensure you’re using lazy loading correctly:
# Wrong - loads at __init__
def __init__(self):
    self._model = load_hf_model(...)  # Don't do this

# Right - loads on first use
def _process_batch(self, batch):
    self._ensure_model_loaded()  # Loads here
"Custom models require Enterprise tier"
Your organization must be on the Enterprise plan to use custom models. Contact sales to upgrade.
Model deployment fails
- Verify the archive is a valid .tar.gz
- Check that the weights match the declared model_format
- Ensure sufficient memory is allocated
- Check engine health via /v1/health
Model not found in plugin
- Verify the model was deployed (not just uploaded)
- Check deployed: true in the model details (see the sketch below)
- Ensure namespace_id matches in the plugin configuration
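One way to confirm deployment is to list the namespace models with the SDK and inspect their status. This is a sketch; the exact fields on the returned objects (such as deployed) are assumptions based on the model details described above, so adjust to the actual response schema:

from mixpeek import Mixpeek

client = Mixpeek(api_key="sk_...")

for m in client.models.list(namespace_id="ns_abc123"):
    # 'deployed' is the flag referenced above
    print(m.model_id, getattr(m, "deployed", None))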
Slow inference
- Pre-deploy models before inference to warm the cache
- Check cached: true in the deployment response
- Consider increasing num_gpus for large models
- Use load_hf_model_async() to parallelize model loading
HuggingFace model not caching
- Ensure you're using load_hf_model(), not direct HF loading (see the diagnostic sketch below)
- Check that the ModelRegistry actor is running
- Verify cluster connectivity between workers
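A quick diagnostic is to check the cluster cache directly from a worker. The sketch below uses only the documented load_hf_model() and is_hf_model_cached() helpers; the model ID is an example:

from engine.models.loader import is_hf_model_cached, load_hf_model

model_id = "intfloat/multilingual-e5-large-instruct"

if not is_hf_model_cached(model_id):
    # First load populates the cluster-wide cache via the ModelRegistry
    load_hf_model(hf_model_id=model_id)

print("cached:", is_hf_model_cached(model_id))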