The Mixpeek platform is intentionally split across an API layer, an Engine layer, and a shared storage layer. This separation keeps ingestion, enrichment, and retrieval isolated while still sharing a common data model.

System Overview

API Layer

FastAPI and Celery handle HTTP requests, validation, authorization, metadata updates, task creation, and webhook dispatch.

Engine Layer

Ray orchestrates distributed feature extraction, inference, taxonomy materialization, clustering, and other long-running compute.

Shared Storage

MongoDB, Qdrant, Redis, and S3 provide the contract between layers—no direct imports cross the API ↔ Engine boundary.

Service Architecture

  • Ray pollers read batches and clustering_jobs from MongoDB, submit jobs, and update status.
  • Ray Serve hosts embedding, reranking, audio, and video models.
  • Celery Beat scans MongoDB for webhook events and dispatches handlers.
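
A minimal sketch of the poller loop described above, assuming hypothetical collection names and status fields; the production pollers also handle batching, retries, and clustering jobs:

```python
import time

import ray
from pymongo import MongoClient

ray.init(ignore_reinit_error=True)

# Hypothetical database, collection, and status-field names for illustration only.
client = MongoClient("mongodb://localhost:27017")
batches = client["mixpeek"]["batches"]


@ray.remote
def process_batch(batch_id: str) -> str:
    # Placeholder for feature extraction, Qdrant writes, and webhook-event emission.
    return batch_id


def poll_once() -> None:
    # Atomically claim one pending batch so concurrent pollers never double-process it.
    batch = batches.find_one_and_update(
        {"status": "pending"},
        {"$set": {"status": "processing"}},
    )
    if batch is None:
        return
    ray.get(process_batch.remote(str(batch["_id"])))
    batches.update_one({"_id": batch["_id"]}, {"$set": {"status": "completed"}})


if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(5)
```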

Layer Separation

Golden Rule: API and Engine code never import each other. Shared models sit in a neutral library that both layers consume.
  • API defines Pydantic schemas, orchestrates workflows, and updates MongoDB + Redis.
  • Engine consumes manifests, runs ML workloads, and writes vectors/documents to Qdrant.
  • A common library provides enums, models, constants, and utilities that both layers consume.
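
For illustration, the shared contract might look like a small Pydantic module that both layers import; the module, model, and field names below are assumptions, not the actual shared library:

```python
# shared/models.py (illustrative module; the real shared library's names may differ)
from enum import Enum

from pydantic import BaseModel, Field


class TaskStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"


class BatchDescriptor(BaseModel):
    """Contract for a batch handed from the API layer to the Engine through MongoDB."""

    batch_id: str
    namespace_id: str
    bucket_id: str
    status: TaskStatus = TaskStatus.PENDING
    object_ids: list[str] = Field(default_factory=list)
```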

Data Flow

Ingestion (Object → Document)

1. Upload objects: the client uploads objects (/v1/buckets/{bucket}/objects); metadata lands in MongoDB, blobs in S3.
2. Submit batch: the client submits a batch (/v1/buckets/{bucket}/batches/{batch}/submit); the API flattens manifests into per-extractor artifacts in S3 and creates tasks.
3. Process in Ray: Ray pollers pick up the batch, execute feature extractors tier by tier, write documents to Qdrant, and emit webhook events.
4. Dispatch events: Celery Beat processes the webhook events, updates collection schemas, invalidates caches, and notifies external systems.
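
A hedged sketch of steps 1 and 2 from the client side using httpx; only the endpoint paths come from the flow above, while the host and request payloads are illustrative:

```python
import httpx

BASE_URL = "https://api.mixpeek.com"  # assumption: actual host may differ
HEADERS = {"Authorization": "Bearer <api-key>"}

with httpx.Client(base_url=BASE_URL, headers=HEADERS, timeout=30.0) as client:
    # Step 1: register an object; metadata lands in MongoDB, the blob itself in S3.
    resp = client.post(
        "/v1/buckets/my-bucket/objects",
        json={"metadata": {"title": "demo clip"}},  # illustrative payload only
    )
    resp.raise_for_status()

    # Step 2: submit the batch; the API flattens manifests and creates Engine tasks.
    # Steps 3 and 4 (Ray processing, webhook dispatch) then run server-side.
    client.post("/v1/buckets/my-bucket/batches/my-batch/submit").raise_for_status()
```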

Retrieval (Query → Results)

1. Execute retriever: the client calls /v1/retrievers/{id}/execute with structured inputs and optional filters.
2. Validate & plan: FastAPI loads the retriever definition, validates the inputs, and walks the configured stage pipeline.
3. Invoke services: stages call Ray Serve for inference, Qdrant for search, MongoDB for joins, and Redis for cache signatures.
4. Return results: the response includes results, stage metrics, cache hints (ETag), and budget usage.
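
A hedged sketch of executing a retriever over HTTP; the retriever ID, input names, and filter fields are placeholders that depend on the retriever's configured schema:

```python
import httpx

with httpx.Client(base_url="https://api.mixpeek.com", timeout=30.0) as client:
    resp = client.post(
        "/v1/retrievers/ret_123/execute",
        headers={"Authorization": "Bearer <api-key>"},
        json={
            "inputs": {"query_text": "sunset over the ocean"},  # hypothetical input name
            "filters": {"category": "travel"},                  # optional metadata filter
        },
    )
    resp.raise_for_status()
    body = resp.json()
    # The body carries results and stage metrics; cache hints surface as an ETag header.
    print(len(body.get("results", [])), resp.headers.get("etag"))
```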

Storage Strategy

| Store | Purpose | Access Pattern |
| --- | --- | --- |
| MongoDB | Metadata, configs, tasks, webhook events | CRUD + aggregation |
| Qdrant | Namespace-scoped vector stores + payloads | ANN search + payload filters |
| Redis | Task queue (Celery) + cache namespaces | Get/Set + TTL-managed cache |
| S3 | Raw blobs, manifests, per-extractor artifacts | Large object IO |
  • MongoDB indexes include (internal_id, namespace_id) to enforce isolation.
  • Qdrant names collections ns_<namespace_id> and stores collection/document identifiers in payload indexes.
  • Redis namespaces keys per feature (collections, retrievers, etc.) with index signatures baked into cache keys.
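
A sketch of how that isolation might be provisioned, assuming local MongoDB and Qdrant instances, an illustrative namespace ID, and an illustrative vector size:

```python
from pymongo import ASCENDING, MongoClient
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

NAMESPACE_ID = "acme"  # illustrative namespace

# MongoDB: compound index that keeps every lookup scoped to a namespace.
mongo = MongoClient("mongodb://localhost:27017")
mongo["mixpeek"]["documents"].create_index(
    [("internal_id", ASCENDING), ("namespace_id", ASCENDING)]
)

# Qdrant: one collection per namespace, plus payload indexes for collection/document ids.
qdrant = QdrantClient(url="http://localhost:6333")
qdrant.create_collection(
    collection_name=f"ns_{NAMESPACE_ID}",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # size is illustrative
)
qdrant.create_payload_index(f"ns_{NAMESPACE_ID}", field_name="collection_id", field_schema="keyword")
qdrant.create_payload_index(f"ns_{NAMESPACE_ID}", field_name="document_id", field_schema="keyword")
```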

Ray Cluster

  • Head node hosts the Ray dashboard, job submission API, and pollers.
  • CPU worker pool handles text extraction, manifest parsing, and Qdrant writes.
  • GPU worker pool runs embeddings, rerankers, video processing, and Ray Serve replicas.
  • Autoscaling targets: CPU utilization ~70%, GPU utilization ~80%, with configurable min/max replica counts.
  • Custom Ray resources ({"batch": 1}, {"serve": 1}) isolate heavy batch jobs from inference traffic.
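
A minimal sketch of scheduling against one of those custom Ray resources; the resources passed to the local `ray.init` are only there to make the example runnable standalone:

```python
import ray

# Locally advertise a "batch" resource so the example runs on its own; in the real
# cluster, worker pools expose {"batch": N} or {"serve": N} via node configuration.
ray.init(ignore_reinit_error=True, resources={"batch": 1})


@ray.remote(num_cpus=1, resources={"batch": 1})
def run_batch_extraction(batch_id: str) -> str:
    # Scheduled only onto nodes exposing the "batch" resource, which keeps heavy
    # batch jobs off the nodes reserved for Ray Serve inference replicas.
    return f"processed {batch_id}"


if __name__ == "__main__":
    print(ray.get(run_batch_extraction.remote("batch_123")))
```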

Communication Patterns

  • Task Queue (API → Engine): MongoDB stores batch descriptors; Ray pollers submit jobs and track state on behalf of the Engine.
  • Webhook Events (Engine → API): the Engine writes webhook events to MongoDB, and Celery Beat picks them up for dispatch.
  • Real-time Inference (API → Engine): retriever stages call Ray Serve deployments directly for model inference.
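
A minimal sketch of the webhook-event path, assuming a Redis broker, hypothetical collection and task names, and a simplified handler:

```python
from celery import Celery
from pymongo import MongoClient

# Illustrative Celery app; broker URL, schedule, and collection names are assumptions.
app = Celery("webhooks", broker="redis://localhost:6379/0")
events = MongoClient("mongodb://localhost:27017")["mixpeek"]["webhook_events"]


@app.task(name="webhooks.dispatch_pending_events")
def dispatch_pending_events() -> int:
    """Scan MongoDB for undispatched webhook events and hand each one to a handler task."""
    count = 0
    for event in events.find({"status": "pending"}):
        handle_event.delay(str(event["_id"]))
        events.update_one({"_id": event["_id"]}, {"$set": {"status": "dispatched"}})
        count += 1
    return count


@app.task(name="webhooks.handle_event")
def handle_event(event_id: str) -> None:
    # Placeholder: update collection schemas, invalidate caches, notify external systems.
    pass


# Celery Beat triggers the scan periodically (the 10-second interval is illustrative).
app.conf.beat_schedule = {
    "dispatch-webhook-events": {"task": "webhooks.dispatch_pending_events", "schedule": 10.0},
}
```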

Scaling & Deployment

  • API: scale FastAPI replicas horizontally via Kubernetes HPA; Celery workers scale with queue depth.
  • Engine: autoscale Ray worker pools (CPU/GPU). Ray Serve replicates per-model deployments with independent scaling policies.
  • Storage: MongoDB uses replica sets or sharding, Qdrant supports distributed clusters, Redis can run in cluster mode, S3 scales automatically.
  • Deployments: local development relies on Docker Compose + ./start.sh; production runs on Kubernetes (or Anyscale for Ray) with dedicated node groups for API, CPU workers, and GPU workers.

Key Design Decisions

  • Keep API deployments lightweight and Engine deployments GPU-optimized by isolating codebases and deployment concerns.
  • Combine flexible metadata storage (MongoDB) with high-performance vector search (Qdrant) to serve semantic retrieval and filtering.
  • Replace direct imports between Engine and API with durable, observable communication channels that can be retried.
  • Prefork Celery workers support task termination, concurrency, and graceful shutdowns, which is critical for webhook dispatch and maintenance jobs.
  • Index signatures prevent stale caches and keep retrievers consistent without manual invalidation, as sketched below.
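
A sketch of the index-signature idea: fold the signature into the cache key so that a changed index simply stops addressing old entries. The key layout and names below are illustrative:

```python
import hashlib
import json

import redis

r = redis.Redis()  # assumes a local Redis instance


def cache_key(namespace_id: str, retriever_id: str, query: dict, index_signature: str) -> str:
    """Build a cache key that embeds the current index signature.

    When the underlying index changes, its signature changes too, so stale entries
    are never read again and expire via TTL; no explicit invalidation pass is needed.
    """
    payload = json.dumps(query, sort_keys=True)
    digest = hashlib.sha256(f"{index_signature}:{payload}".encode()).hexdigest()
    return f"retrievers:{namespace_id}:{retriever_id}:{digest}"


key = cache_key("acme", "ret_123", {"query_text": "sunset"}, index_signature="v42")
r.set(key, json.dumps({"results": []}), ex=300)  # TTL-managed cache entry
```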

Further Reading