Mixpeek provides multiple ways to run inference with ML models:
  1. Built-in Models - Pre-configured models served on Ray Serve
  2. HuggingFace Models - Any HF model with automatic cluster-wide caching
  3. Custom Models (Enterprise) - Your own uploaded weights via S3
This page covers all approaches, including the LazyModelMixin for efficient model loading in plugins.

Built-in Models

Mixpeek ships with curated models optimized for common tasks:
| Category | Examples | Primary Use |
| --- | --- | --- |
| Embeddings | multilingual-e5-large-instruct, gte-modernbert-base, clip_vit_l_14 | Semantic search, multimodal similarity |
| Sparse | splade_v1 | Hybrid and lexical search |
| Multi-Vector | colbertv2 | Late interaction retrieval |
| Rerankers | bge-reranker-v2-m3, cross-encoder | Reordering search results |
| Generation | gpt-4, claude-3-opus, gemini-pro | Summaries, transformations |
| Audio | whisper_large_v3_turbo, pyannote-segmentation | Transcription, diarization |
Call /v1/feature-extractors and /v1/retrievers/stages to discover supported models programmatically.
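For example, a minimal discovery call with the requests library, assuming the same bearer-token auth and environment variables used in the REST examples later on this page:

import os

import requests

base_url = os.environ["MIXPEEK_API_URL"]
headers = {"Authorization": f"Bearer {os.environ['MIXPEEK_API_KEY']}"}

# Feature extractors and the models they accept
extractors = requests.get(f"{base_url}/v1/feature-extractors", headers=headers).json()

# Retriever stages (rerankers, generators, etc.) and their supported models
stages = requests.get(f"{base_url}/v1/retrievers/stages", headers=headers).json()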

HuggingFace Models with Cluster Caching

Load any HuggingFace model with automatic cluster-wide caching. The ModelRegistry downloads the model once and shares it across all actors via Ray’s object store.

Direct Loading

from engine.models.loader import load_hf_model

class MyProcessor:
    def __init__(self, config, **kwargs):
        self._model = None
        self._tokenizer = None
        self.config = config

    def _ensure_model_loaded(self):
        if self._model is not None:
            return

        # Loads from HF (first time) or Ray cache (subsequent calls)
        cached = load_hf_model(
            hf_model_id="intfloat/multilingual-e5-large-instruct",
            model_class="AutoModel",
            tokenizer_class="AutoTokenizer",
            torch_dtype="float16",
        )

        # Instantiate from cached state dict
        from transformers import AutoModel, AutoConfig, AutoTokenizer

        config = AutoConfig.from_dict(cached["config"])
        self._model = AutoModel.from_config(config)
        self._model.load_state_dict(cached["state_dict"])
        self._model.to("cuda")
        self._model.eval()

        # Load tokenizer
        self._tokenizer = AutoTokenizer.from_pretrained(
            cached["tokenizer_config"]["tokenizer_dir"]
        )

    def __call__(self, batch):
        self._ensure_model_loaded()
        # Use self._model and self._tokenizer...

Return Value

load_hf_model() returns a dictionary containing:
{
    "state_dict": dict,          # Model weights (load with model.load_state_dict())
    "config": dict,              # Model config (use with AutoConfig.from_dict())
    "tokenizer_config": {        # Tokenizer info (if tokenizer_class provided)
        "tokenizer_dir": str,    # Path to tokenizer files
        "class": str,            # Tokenizer class name
    },
    "model_class": str,          # The model class used
    "hf_model_id": str,          # The HuggingFace model ID
}

Async Loading for Parallel Models

Load multiple models in parallel with the async API:
from engine.models.loader import load_hf_model_async
import ray

# Start loading both models in parallel
refs = [
    load_hf_model_async("intfloat/multilingual-e5-large-instruct"),
    load_hf_model_async("openai/clip-vit-large-patch14"),
]

# Wait for both to complete
e5_data, clip_data = ray.get(refs)

Check Cache Status

from engine.models.loader import is_hf_model_cached

if is_hf_model_cached("intfloat/multilingual-e5-large-instruct"):
    print("Model already cached in cluster")


LazyModelMixin

The LazyModelMixin provides automatic lazy loading with cluster-wide caching. Models only load when first needed, not at actor creation time.

Benefits

  • Resource Efficiency: Models load on first batch, not at actor creation
  • Cluster Caching: One download shared across all actors
  • Zero-Copy Sharing: Ray object store provides zero-copy access
  • Automatic Device Detection: Finds CUDA, MPS, or CPU automatically

Usage

import torch

from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class MyEmbeddingProcessor(LazyModelMixin, BaseBatchInferenceService):
    # Configure model via class attributes
    model_id = "intfloat/multilingual-e5-large-instruct"
    model_class = "AutoModel"
    tokenizer_class = "AutoTokenizer"
    torch_dtype = "float16"
    model_source = "huggingface"  # or "namespace" for S3 models

    def __init__(self, config, **kwargs):
        super().__init__(**kwargs)
        self.config = config
        # Don't load model here - LazyModelMixin handles it

    def _process_batch(self, batch):
        # Model is automatically loaded on first call
        model, tokenizer = self.get_model()

        # Process batch
        inputs = tokenizer(
            batch["text"].tolist(),
            padding=True,
            truncation=True,
            return_tensors="pt",
        )

        # Move inputs to the same device as the model before the forward pass
        inputs = inputs.to(model.device)

        with torch.no_grad():
            outputs = model(**inputs)

        batch["embedding"] = outputs.last_hidden_state.mean(dim=1).tolist()
        return batch

Class Attributes

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | "" | HuggingFace model ID or namespace model ID |
| model_class | str | "AutoModel" | Transformers model class name |
| tokenizer_class | str or None | "AutoTokenizer" | Tokenizer class, or None to skip |
| torch_dtype | str | "float32" | Torch dtype: "float16", "float32", "bfloat16" |
| model_source | str | "huggingface" | "huggingface" or "namespace" |

Methods

| Method | Description |
| --- | --- |
| get_model() | Returns (model, tokenizer) tuple, loading if needed |
| ensure_model_loaded() | Explicitly trigger model loading |
| model | Property access to model |
| tokenizer | Property access to tokenizer |
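
For example, you can warm the model ahead of the first batch by calling ensure_model_loaded() explicitly. A minimal sketch reusing the MyEmbeddingProcessor class from the Usage example, assuming the base service needs no extra constructor arguments:

processor = MyEmbeddingProcessor(config={})

# Trigger the download / cluster-cache lookup up front instead of on the first batch
processor.ensure_model_loaded()

# Subsequent calls return the already-loaded model immediately
model, tokenizer = processor.get_model()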

Custom Model Instantiation

Override _instantiate_model() for custom loading logic:
from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class MyCustomProcessor(LazyModelMixin, BaseBatchInferenceService):
    model_id = "my-custom-model"
    model_source = "namespace"  # Load from S3

    def _instantiate_model(self, cached_data):
        """Custom instantiation for non-standard models."""
        import torch

        # Create your model architecture
        model = torch.nn.Sequential(
            torch.nn.Linear(768, 512),
            torch.nn.ReLU(),
            torch.nn.Linear(512, 256),
        )

        # Load weights from cached data
        model.load_state_dict(cached_data)
        model.to(self._detect_device())
        model.eval()

        return model, None  # No tokenizer

@lazy_model Decorator

For simpler cases, use the @lazy_model decorator instead of the mixin:
import torch

from engine.models.lazy import lazy_model

class MyProcessor:
    def __init__(self, config, **kwargs):
        self.config = config
        self._model = None
        self._tokenizer = None

    @lazy_model(
        model_id="intfloat/multilingual-e5-large-instruct",
        model_class="AutoModel",
        tokenizer_class="AutoTokenizer",
        torch_dtype="float16",
    )
    def __call__(self, batch):
        # self._model and self._tokenizer are automatically available
        inputs = self._tokenizer(
            batch["text"].tolist(),
            return_tensors="pt",
            padding=True,
        )

        # Move inputs to the device the model was loaded on
        inputs = inputs.to(self._device)

        with torch.no_grad():
            outputs = self._model(**inputs)

        batch["embedding"] = outputs.last_hidden_state.mean(dim=1).tolist()
        return batch

Decorator Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | str | required | HuggingFace or namespace model ID |
| model_class | str | "AutoModel" | Transformers model class |
| tokenizer_class | str or None | "AutoTokenizer" | Tokenizer class |
| torch_dtype | str | "float32" | Torch dtype |
| model_source | str | "huggingface" | "huggingface" or "namespace" |

Available Attributes After Decoration

The decorator sets these attributes on self:
  • self._model - The loaded model
  • self._tokenizer - The loaded tokenizer (or None)
  • self._device - The device the model is on ("cuda", "mps", or "cpu")

Custom Models (Enterprise)

Custom models require an Enterprise tier subscription. Contact sales to upgrade.
Upload your own trained weights—fine-tuned embeddings, domain-specific classifiers, or proprietary architectures—and run them on Mixpeek’s infrastructure.

How It Works

  1. Upload your model archive (.tar.gz) containing weights
  2. Deploy to the Ray object store for fast access
  3. Use in custom plugins via load_namespace_model()
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Upload    │────▶│   Deploy    │────▶│    Use      │
│  (S3 store) │     │ (Ray cache) │     │ (Inference) │
└─────────────┘     └─────────────┘     └─────────────┘

Supported Formats

| Format | Extension | Description |
| --- | --- | --- |
| pytorch | .pt, .pth | PyTorch state_dict or TorchScript |
| safetensors | .safetensors | SafeTensors format (recommended) |
| onnx | .onnx | ONNX Runtime format |
| huggingface | directory | HuggingFace model directory |
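
For example, a minimal sketch of exporting a PyTorch state dict in the recommended SafeTensors format before archiving (the model and paths here are illustrative):

import os

import torch
from safetensors.torch import save_file

os.makedirs("model_weights", exist_ok=True)

model = torch.nn.Linear(768, 256)  # stand-in for your trained model
save_file(model.state_dict(), "model_weights/model.safetensors")

# Archive the directory for upload: tar -czvf my_model.tar.gz ./model_weights/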

Upload a Custom Model

# Create archive
tar -czvf my_model.tar.gz ./model_weights/

# Upload
curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY" \
  -F "file=@my_model.tar.gz" \
  -F "name=my-embedding-model" \
  -F "version=1.0.0" \
  -F "model_format=pytorch" \
  -F "framework=pytorch" \
  -F "task_type=embedding" \
  -F "num_gpus=0" \
  -F "memory_gb=4.0"

Deploy to Ray Object Store

curl -X POST "$MIXPEEK_API_URL/v1/namespaces/$NAMESPACE_ID/models/my-embedding-model_1_0_0/deploy" \
  -H "Authorization: Bearer $MIXPEEK_API_KEY"

Use in Plugins

With LazyModelMixin

from engine.models.lazy import LazyModelMixin
from engine.inference.services import BaseBatchInferenceService

class MyCustomProcessor(LazyModelMixin, BaseBatchInferenceService):
    model_id = "my-embedding-model_1_0_0"
    model_source = "namespace"  # Load from S3

    def _instantiate_model(self, weights):
        """Custom instantiation for S3 models."""
        import torch

        model = torch.nn.Linear(768, 256)
        model.load_state_dict(weights)
        model.to(self._detect_device())
        model.eval()
        return model, None

    def _process_batch(self, batch):
        model, _ = self.get_model()
        # Use model...

With Direct Loading

import torch

from engine.models.loader import load_namespace_model

class MyCustomProcessor:
    def __init__(self, config):
        self._model = None

    def _ensure_model_loaded(self):
        if self._model is not None:
            return

        # Load from S3 (cached in Ray object store)
        weights = load_namespace_model("my-embedding-model_1_0_0")

        self._model = torch.nn.Linear(768, 256)
        self._model.load_state_dict(weights)
        self._model.eval()

    def __call__(self, batch):
        self._ensure_model_loaded()
        # Use self._model...

Choosing the Right Approach

| Approach | Use Case | Caching |
| --- | --- | --- |
| Built-in Models | Common tasks, no custom training | Pre-deployed |
| HuggingFace + LazyModelMixin | Standard HF models, easy setup | Cluster-wide |
| HuggingFace + load_hf_model() | Custom instantiation logic needed | Cluster-wide |
| Namespace Models | Fine-tuned or proprietary weights | Cluster-wide |

Decision Tree

Need a model?
├── Is it a common task (embeddings, transcription, etc.)?
│   └── Use built-in models
├── Is it a standard HuggingFace model?
│   ├── Simple usage → LazyModelMixin or @lazy_model
│   └── Custom logic → load_hf_model() + manual instantiation
└── Is it custom/fine-tuned weights?
    └── Upload to namespace → LazyModelMixin with model_source="namespace"

Model Versioning

Models are versioned independently for safe rollouts:
my-embedding-model_1_0_0  (production)
my-embedding-model_1_1_0  (staging)
my-embedding-model_2_0_0  (development)
Recommended rollout process:
  1. Upload new version alongside existing
  2. Deploy to Ray object store
  3. Update staging plugins to use the new version (see the sketch below)
  4. Monitor performance via Analytics API
  5. Gradually shift production traffic
  6. Delete old versions when validated
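
For step 3, pointing a staging plugin at the new version is usually just a change to its model_id (a sketch based on the LazyModelMixin processor shown earlier):

class MyCustomProcessor(LazyModelMixin, BaseBatchInferenceService):
    model_id = "my-embedding-model_1_1_0"  # staging pin; production stays on 1_0_0
    model_source = "namespace"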

Resource Requirements

When uploading custom models, specify resource requirements:
| Parameter | Description | Default |
| --- | --- | --- |
| num_cpus | CPU cores required | 1.0 |
| num_gpus | GPU devices required | 0 |
| memory_gb | Memory allocation in GB | 4.0 |
# GPU model with high memory
-F "num_gpus=1" \
-F "memory_gb=16.0"

# CPU-only lightweight model
-F "num_cpus=0.5" \
-F "num_gpus=0" \
-F "memory_gb=2.0"

Python SDK

from mixpeek import Mixpeek

client = Mixpeek(api_key="sk_...")

# Upload model
with open("my_model.tar.gz", "rb") as f:
    result = client.models.upload(
        namespace_id="ns_abc123",
        file=f,
        name="my-reranker",
        version="1.0.0",
        model_format="pytorch",
        task_type="reranking",
    )

# Deploy model
client.models.deploy(
    namespace_id="ns_abc123",
    model_id=result.model_id,
)

# List models
models = client.models.list(namespace_id="ns_abc123")

Limits & Quotas

| Limit | Value |
| --- | --- |
| Max models per namespace | 50 |
| Max archive size | 10 GB |
| Supported formats | pytorch, safetensors, onnx, huggingface |
| Custom models tier | Enterprise |


Troubleshooting

“Model not loading on first batch”

Ensure you’re using lazy loading correctly:
# Wrong - loads at __init__
def __init__(self):
    self._model = load_hf_model(...)  # Don't do this

# Right - loads on first use
def _process_batch(self, batch):
    self._ensure_model_loaded()  # Loads here

“Custom models require Enterprise tier”

Your organization must be on the Enterprise plan to use custom models. Contact sales to upgrade.

Model deployment fails

  1. Verify the archive format is valid .tar.gz
  2. Check that weights match the declared model_format
  3. Ensure sufficient memory is allocated
  4. Check engine health via /v1/health

Model not found in plugin

  1. Verify the model was deployed (not just uploaded)
  2. Check deployed: true in model details (see the snippet below)
  3. Ensure namespace_id matches in plugin configuration
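
A quick check, continuing from the Python SDK example above (the deployed attribute name on the returned objects is an assumption mirroring the deployed: true flag in model details):

models = client.models.list(namespace_id="ns_abc123")
for m in models:
    # Uploaded-but-not-deployed models should show deployed=False here (assumed field name)
    print(m.model_id, getattr(m, "deployed", None))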

Slow inference

  1. Pre-deploy models before inference to warm the cache
  2. Check cached: true in deployment response
  3. Consider increasing num_gpus for large models
  4. Use load_hf_model_async() to parallelize model loading

HuggingFace model not caching

  1. Ensure you’re using load_hf_model() not direct HF loading
  2. Check that ModelRegistry actor is running
  3. Verify cluster connectivity between workers