The Mixpeek platform is intentionally split across an API layer, an Engine layer, and a shared storage layer. This separation keeps ingestion, enrichment, and retrieval isolated while still sharing a common data model.

System Overview

  • API Layer (FastAPI + Celery) handles HTTP requests, validation, authorization, transactional metadata updates, task creation, and webhook dispatch.
  • Engine Layer (Ray) performs distributed feature extraction, inference, taxonomy materialization, clustering, and long-running compute.
  • Shared Storage (MongoDB, Qdrant, Redis, S3) acts as the contract between layers. No direct imports cross the API ↔ Engine boundary.

Service Architecture

  • Ray pollers read batches and clustering_jobs from MongoDB, submit jobs, and update status (a minimal poller loop is sketched after this list).
  • Ray Serve hosts embedding, reranking, audio, and video models.
  • Celery Beat scans MongoDB for webhook events and dispatches handlers.
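
As a rough illustration of the poller pattern in the first bullet, the loop below claims a submitted batch from MongoDB and hands it to the Ray Jobs API. It is a minimal sketch: the collection name, status values, and entrypoint module are assumptions rather than Mixpeek's actual schema.

```python
# Illustrative poller loop; field, collection, and module names are assumptions.
import time

from pymongo import MongoClient
from ray.job_submission import JobSubmissionClient

mongo = MongoClient("mongodb://mongo:27017")
batches = mongo["mixpeek"]["batches"]                 # hypothetical collection
ray_jobs = JobSubmissionClient("http://ray-head:8265")

def poll_once() -> None:
    # Claim one submitted batch atomically so concurrent pollers don't race.
    batch = batches.find_one_and_update(
        {"status": "SUBMITTED"},
        {"$set": {"status": "PROCESSING"}},
    )
    if batch is None:
        return
    job_id = ray_jobs.submit_job(
        entrypoint=f"python -m engine.run_batch --batch-id {batch['_id']}",
        entrypoint_resources={"batch": 1},            # keep batch work off inference nodes
    )
    batches.update_one({"_id": batch["_id"]}, {"$set": {"ray_job_id": job_id}})

while True:
    poll_once()
    time.sleep(5)
```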

Layer Separation

Golden Rule: API and Engine code never import each other. Shared models sit in a neutral library that both layers consume.
  • API defines Pydantic schemas, orchestrates workflows, and updates MongoDB + Redis.
  • Engine consumes manifests, runs ML workloads, and writes vectors/documents to Qdrant.
  • A common library provides enums, models, constants, and utilities that both layers consume.
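
A minimal sketch of that neutral shared library, assuming Pydantic models and string enums; the names below are illustrative rather than Mixpeek's actual definitions.

```python
# shared_lib/models.py: a neutral module that both layers import.
# All names here are illustrative, not the real Mixpeek schema.
from enum import Enum

from pydantic import BaseModel

class BatchStatus(str, Enum):
    SUBMITTED = "SUBMITTED"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

class BatchDescriptor(BaseModel):
    batch_id: str
    namespace_id: str
    manifest_uri: str                      # S3 location of the flattened manifest
    status: BatchStatus = BatchStatus.SUBMITTED

# The API writes BatchDescriptor documents to MongoDB; the Engine reads them
# back. Neither layer imports the other, only this shared module.
```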

Data Flow

Ingestion (Object → Document)

  1. Client uploads objects (/v1/buckets/{bucket}/objects) — metadata lands in MongoDB, blobs in S3.
  2. Client submits batch (/v1/buckets/{bucket}/batches/{batch}/submit) — API flattens manifests into per-extractor artifacts in S3 and creates tasks.
  3. Ray poller picks up the batch, runs feature extractors tier-by-tier, writes documents to Qdrant, and emits webhook events.
  4. Celery Beat processes webhook events: updates collection schemas, invalidates caches, notifies external systems.
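
From the client's side, steps 1 and 2 reduce to two HTTP calls, sketched below against the endpoint paths named above; the host, auth header, payload fields, and batch identifier are placeholders.

```python
# Illustrative client calls for steps 1-2; only the endpoint paths come from
# the flow above, everything else is a placeholder.
import requests

BASE = "https://api.mixpeek.example"               # placeholder host
HEADERS = {"Authorization": "Bearer <api-key>"}    # assumed auth scheme

# 1. Register an object in a bucket: metadata lands in MongoDB, the blob in S3.
requests.post(
    f"{BASE}/v1/buckets/bkt_demo/objects",
    headers=HEADERS,
    json={"key": "videos/demo.mp4", "metadata": {"source": "upload"}},
).raise_for_status()

# 2. Submit the batch; the API flattens manifests into per-extractor artifacts
#    and creates tasks, after which steps 3-4 run asynchronously on the Engine.
batch_id = "bat_123"                               # placeholder batch identifier
requests.post(
    f"{BASE}/v1/buckets/bkt_demo/batches/{batch_id}/submit",
    headers=HEADERS,
).raise_for_status()
```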

Retrieval (Query → Results)

  1. Client executes retriever (/v1/retrievers/{id}/execute) with structured inputs and optional filters.
  2. FastAPI loads the retriever definition, validates inputs, and walks the stage pipeline.
  3. Stages call Ray Serve (for inference), Qdrant (for search), MongoDB (for joins), and Redis (for cache signatures).
  4. The response includes results, stage metrics, cache hints (ETag), and budget usage.
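
A hedged example of executing a retriever over HTTP: only the endpoint path, the structured inputs/filters, and the ETag cache hint come from the flow above; the host, auth header, and response field names are assumptions.

```python
# Illustrative retriever execution; response fields beyond those listed above
# are assumptions.
import requests

resp = requests.post(
    "https://api.mixpeek.example/v1/retrievers/ret_123/execute",  # placeholder host/id
    headers={"Authorization": "Bearer <api-key>"},
    json={
        "inputs": {"query": "red sneakers on a track"},   # structured inputs
        "filters": {"category": "footwear"},              # optional filters
    },
)
resp.raise_for_status()
body = resp.json()
results = body.get("results", [])
etag = resp.headers.get("ETag")    # cache hint for conditional re-execution
```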

Storage Strategy

Store   | Purpose                                       | Access Pattern
MongoDB | Metadata, configs, tasks, webhook events      | CRUD + aggregation
Qdrant  | Namespace-scoped vector stores + payloads     | ANN search + payload filters
Redis   | Task queue (Celery) + cache namespaces        | Get/Set + TTL-managed cache
S3      | Raw blobs, manifests, per-extractor artifacts | Large object IO
  • MongoDB indexes include (internal_id, namespace_id) to enforce isolation.
  • Qdrant names collections ns_<namespace_id> and stores collection/document identifiers in payload indexes.
  • Redis namespaces keys per feature (collections, retrievers, etc.) with index signatures baked into cache keys.
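
The sketch below shows how these conventions might look in code: the (internal_id, namespace_id) compound index and the ns_<namespace_id> collection name come from the bullets above, while the database name, vector size, and payload field names are assumptions.

```python
# Illustrative storage conventions; identifiers and dimensions are placeholders.
from pymongo import ASCENDING, MongoClient
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

# Compound index that scopes every metadata query to a namespace.
mongo = MongoClient("mongodb://mongo:27017")
mongo["mixpeek"]["collections"].create_index(
    [("internal_id", ASCENDING), ("namespace_id", ASCENDING)]
)

# Vector search stays inside the namespace's own Qdrant collection and is
# narrowed further by payload filters on collection/document identifiers.
qdrant = QdrantClient(url="http://qdrant:6333")
hits = qdrant.search(
    collection_name="ns_acme",                 # ns_<namespace_id>
    query_vector=[0.1] * 768,                  # embedding produced by Ray Serve
    query_filter=Filter(must=[
        FieldCondition(key="collection_id", match=MatchValue(value="col_products")),
    ]),
    limit=10,
)
```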

Ray Cluster

  • Head node hosts the Ray dashboard, job submission API, and pollers.
  • CPU worker pool handles text extraction, manifest parsing, and Qdrant writes.
  • GPU worker pool runs embeddings, rerankers, video processing, and Ray Serve replicas.
  • Autoscaling targets: CPU utilization ~70%, GPU utilization ~80%, with configurable min/max replica counts.
  • Custom Ray resources ({"batch": 1}, {"serve": 1}) isolate heavy batch jobs from inference traffic.
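
As an example of that last bullet, custom resources can be attached at both the task and the Serve deployment level; the function and class below are placeholders that only demonstrate the scheduling isolation.

```python
# Sketch of resource-based isolation; task and deployment bodies are placeholders.
import ray
from ray import serve

# Batch extraction only schedules on workers started with
# --resources='{"batch": 1}', keeping it off inference capacity.
@ray.remote(resources={"batch": 1})
def run_extractor_tier(batch_id: str) -> str:
    return f"extracted {batch_id}"

# Serve replicas are pinned to nodes that advertise the "serve" resource.
@serve.deployment(ray_actor_options={"resources": {"serve": 1}, "num_gpus": 1})
class Embedder:
    def __call__(self, text: str) -> list[float]:
        return [0.0] * 768                     # placeholder embedding
```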

Communication Patterns

  1. Task Queue (API → Engine)
    MongoDB stores batch descriptors; pollers submit Ray jobs and track state.
  2. Webhook Events (Engine → API)
    Engine writes webhook events to MongoDB. Celery Beat dispatches cache invalidation, schema updates, and external notifications.
  3. Real-time Inference (API → Engine)
    API calls Ray Serve HTTP endpoints for embeddings, reranking, generation, and audio transcription.
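
A minimal sketch of pattern 3, assuming an httpx client inside the FastAPI layer; the route and response shape are illustrative, only the API → Ray Serve HTTP pattern comes from the list above.

```python
# Illustrative real-time inference call from the API layer to Ray Serve.
import httpx

async def embed_texts(texts: list[str]) -> list[list[float]]:
    # The API calls the Serve HTTP endpoint instead of importing Engine code.
    async with httpx.AsyncClient(base_url="http://ray-serve:8000") as client:
        resp = await client.post("/embeddings", json={"texts": texts}, timeout=30.0)
        resp.raise_for_status()
        return resp.json()["embeddings"]        # assumed response field
```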

Scaling & Deployment

  • API: scale FastAPI replicas horizontally via Kubernetes HPA; Celery workers scale with queue depth.
  • Engine: autoscale Ray worker pools (CPU/GPU). Ray Serve replicates per-model deployments with independent scaling policies.
  • Storage: MongoDB uses replica sets or sharding; Qdrant supports distributed clusters; Redis can run in cluster mode; S3 scales automatically.
  • Deployments: local development relies on Docker Compose + ./start.sh; production runs on Kubernetes (or Anyscale for Ray) with dedicated node groups for API, CPU workers, and GPU workers.

Key Design Decisions

  • Strict layer boundaries keep API deployments lightweight and Engine deployments GPU-optimized.
  • MongoDB + Qdrant pairing combines flexible metadata with high-performance vector search.
  • Webhook-based events replace direct imports between Engine and API, yielding durable, observable communication.
  • Prefork Celery workers support task termination, concurrency, and graceful shutdowns—critical for webhook dispatch and maintenance jobs.
  • Index signatures prevent stale caches without manual invalidation.
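
As a rough sketch of the last point, a cache key can embed a signature derived from the index configuration, so any reindex changes the key instead of requiring an explicit invalidation pass; the signature inputs and key layout below are assumptions.

```python
# Illustrative signature-based cache keys; layout and inputs are placeholders.
import hashlib
import json

import redis

r = redis.Redis(host="redis", port=6379)

def index_signature(index_config: dict) -> str:
    # Any schema or index change yields a new signature, so stale entries are
    # simply never read again.
    blob = json.dumps(index_config, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:16]

sig = index_signature({"fields": ["title", "image_embedding"]})
cache_key = f"retrievers:ns_acme:ret_123:{sig}:query_ab12"
r.set(cache_key, b"<serialized response>", ex=300)
```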

Further Reading