## System Overview
- API Layer (FastAPI + Celery) handles HTTP requests, validation, authorization, transactional metadata updates, task creation, and webhook dispatch.
- Engine Layer (Ray) performs distributed feature extraction, inference, taxonomy materialization, clustering, and long-running compute.
- Shared Storage (MongoDB, Qdrant, Redis, S3) acts as the contract between layers. No direct imports cross the API ↔ Engine boundary.
## Service Architecture
- Ray pollers read `batches` and `clustering_jobs` from MongoDB, submit jobs, and update status (see the poller sketch after this list).
- Ray Serve hosts embedding, reranking, audio, and video models.
- Celery Beat scans MongoDB for webhook events and dispatches handlers.
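A minimal poller sketch, assuming a `batches` collection with a `status` field and the standard Ray job-submission API on the head node; the collection schema and the `engine.run_batch` entrypoint are illustrative, not the actual names:

```python
# Minimal poller sketch. Collection/field names and the entrypoint module
# are illustrative assumptions, not the exact production schema.
import time

from pymongo import MongoClient
from ray.job_submission import JobSubmissionClient

mongo = MongoClient("mongodb://localhost:27017")
batches = mongo["app"]["batches"]
ray_jobs = JobSubmissionClient("http://ray-head:8265")


def poll_once() -> None:
    # Atomically claim one pending batch so concurrent pollers never double-submit.
    batch = batches.find_one_and_update(
        {"status": "pending"},
        {"$set": {"status": "submitted"}},
    )
    if batch is None:
        return
    # Submit a Ray job; the hypothetical entrypoint loads the manifest by batch id.
    job_id = ray_jobs.submit_job(
        entrypoint=f"python -m engine.run_batch --batch-id {batch['_id']}"
    )
    batches.update_one({"_id": batch["_id"]}, {"$set": {"ray_job_id": job_id}})


while True:
    poll_once()
    time.sleep(5)
```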
## Layer Separation
Golden Rule: API and Engine code never import each other. Shared models sit in a neutral library that both layers consume.
- API defines Pydantic schemas, orchestrates workflows, and updates MongoDB + Redis.
- Engine consumes manifests, runs ML workloads, and writes vectors/documents to Qdrant.
- A common library provides the enums, models, constants, and utilities that both layers import (sketched below).
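As a sketch of that neutral library, a shared Pydantic model both layers can import; the field names and enum values here are illustrative, only the import pattern comes from the architecture:

```python
# common/models.py: sketch of the neutral shared library. Fields and enum
# values are illustrative; the rule is that both layers import from here
# and never from each other.
from enum import Enum

from pydantic import BaseModel


class BatchStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class BatchDescriptor(BaseModel):
    internal_id: str
    namespace_id: str
    status: BatchStatus = BatchStatus.PENDING
    manifest_uri: str  # S3 location of the flattened per-extractor manifest


# API layer:    from common.models import BatchDescriptor   # writes it to MongoDB
# Engine layer: from common.models import BatchDescriptor   # reads the same document
```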
## Data Flow
### Ingestion (Object → Document)
- Client uploads objects (/v1/buckets/{bucket}/objects) — metadata lands in MongoDB, blobs in S3.
- Client submits batch (/v1/buckets/{bucket}/batches/{batch}/submit) — API flattens manifests into per-extractor artifacts in S3 and creates tasks.
- Ray poller picks up the batch, runs feature extractors tier-by-tier, writes documents to Qdrant, and emits webhook events.
- Celery Beat processes webhook events: updates collection schemas, invalidates caches, notifies external systems (dispatch loop sketched below).
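A sketch of that Beat-driven dispatch loop, assuming a `webhook_events` collection with a `status` field; the collection name, fields, and handler routing are assumptions for illustration:

```python
# Sketch of the Celery Beat scanner. The `webhook_events` collection, its
# fields, and the handler routing are illustrative assumptions.
from celery import Celery
from pymongo import MongoClient

app = Celery("api", broker="redis://localhost:6379/0")
app.conf.beat_schedule = {
    # Beat enqueues the scan every 5 s; a prefork worker executes it.
    "dispatch-webhook-events": {
        "task": "tasks.dispatch_webhook_events",
        "schedule": 5.0,
    },
}

events = MongoClient("mongodb://localhost:27017")["app"]["webhook_events"]


def handle_event(event: dict) -> None:
    """Route by event type: schema update, cache invalidation, or external POST."""


@app.task(name="tasks.dispatch_webhook_events")
def dispatch_webhook_events() -> int:
    dispatched = 0
    while True:
        # Claim one pending event atomically so concurrent workers don't race.
        event = events.find_one_and_update(
            {"status": "pending"}, {"$set": {"status": "processing"}}
        )
        if event is None:
            return dispatched
        handle_event(event)
        events.update_one({"_id": event["_id"]}, {"$set": {"status": "done"}})
        dispatched += 1
```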
### Retrieval (Query → Results)
- Client executes retriever (/v1/retrievers/{id}/execute) with structured inputs and optional filters.
- FastAPI loads the retriever definition, validates inputs, and walks the stage pipeline (sketched after this list).
- Stages call Ray Serve (for inference), Qdrant (for search), MongoDB (for joins), and Redis (for cache signatures).
- Response returns results, stage metrics, cache hints (ETag), and budget usage.
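A compact sketch of the stage walk; the stage registry, stage names, and metric shape are hypothetical stand-ins for the real pipeline:

```python
# Sketch of the stage walk. The registry and stage names are hypothetical;
# each stage receives the validated inputs plus the previous stage's output.
import time
from typing import Any, Callable

STAGES: dict[str, Callable[[dict, Any], Any]] = {
    "embed": lambda inputs, prev: {"vector": [0.0] * 768},  # would call Ray Serve
    "search": lambda inputs, prev: {"hits": []},            # would query Qdrant
    "rerank": lambda inputs, prev: prev,                    # would call the reranker
}


def execute_retriever(definition: dict, inputs: dict) -> dict:
    results: Any = None
    metrics = []
    for stage in definition["stages"]:  # e.g. ["embed", "search", "rerank"]
        start = time.perf_counter()
        results = STAGES[stage](inputs, results)
        metrics.append(
            {"stage": stage, "ms": round((time.perf_counter() - start) * 1000, 2)}
        )
    return {"results": results, "stage_metrics": metrics}


print(execute_retriever({"stages": ["embed", "search", "rerank"]}, {"query": "hello"}))
```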
## Storage Strategy
| Store | Purpose | Access Pattern | 
|---|---|---|
| MongoDB | Metadata, configs, tasks, webhook events | CRUD + aggregation | 
| Qdrant | Namespace-scoped vector stores + payloads | ANN search + payload filters | 
| Redis | Task queue (Celery) + cache namespaces | Get/Set + TTL-managed cache | 
| S3 | Raw blobs, manifests, per-extractor artifacts | Large object IO | 
- MongoDB indexes include `(internal_id, namespace_id)` to enforce isolation.
- Qdrant collections are named `ns_<namespace_id>` and store collection/document identifiers in payload indexes.
- Redis namespaces keys per feature (`collections`, `retrievers`, etc.) with index signatures baked into cache keys; both naming conventions are sketched below.
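A sketch of both conventions; the signature inputs (index version, query hash) are assumptions about what feeds the cache key:

```python
# Sketch of the naming conventions above; the signature inputs are assumptions.
import hashlib


def qdrant_collection(namespace_id: str) -> str:
    # One collection per namespace gives hard isolation at the store level.
    return f"ns_{namespace_id}"


def cache_key(feature: str, namespace_id: str, index_version: int, query_hash: str) -> str:
    # Baking the index signature into the key means a reindex changes the key,
    # so stale entries are simply never read again; no explicit invalidation pass.
    signature = hashlib.sha256(f"{index_version}:{query_hash}".encode()).hexdigest()[:16]
    return f"{feature}:{namespace_id}:{signature}"


print(qdrant_collection("acme"))                  # ns_acme
print(cache_key("retrievers", "acme", 7, "q42"))  # retrievers:acme:<16-hex-signature>
```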
## Ray Cluster
- Head node hosts the Ray dashboard, job submission API, and pollers.
- CPU worker pool handles text extraction, manifest parsing, Qdrant writes.
- GPU worker pool runs embeddings, rerankers, video processing, and Ray Serve replicas.
- Autoscaling targets: CPU utilization ~70%, GPU utilization ~80%, with configurable min/max replica counts.
- Custom Ray resources (`{"batch": 1}`, `{"serve": 1}`) isolate heavy batch jobs from inference traffic, as in the example below.
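A minimal example of the custom-resource pattern, assuming batch workers are started with `ray start --resources='{"batch": 4}'` so they advertise `batch` capacity while Serve nodes advertise `serve`:

```python
# Custom-resource isolation, assuming workers started with
#   ray start --resources='{"batch": 4}'
# so they advertise "batch" capacity; Serve nodes advertise "serve" instead.
import ray

ray.init(address="auto")  # connect to the running cluster


@ray.remote(resources={"batch": 1})
def run_extractor_tier(batch_id: str) -> str:
    # Schedules only onto nodes that advertise the "batch" resource,
    # keeping heavy batch work off inference (Serve) nodes.
    return f"processed {batch_id}"


print(ray.get(run_extractor_tier.remote("batch-123")))
```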
## Communication Patterns
- Task Queue (API → Engine): MongoDB stores batch descriptors; pollers submit Ray jobs and track state.
- Webhook Events (Engine → API): the Engine writes webhook events to MongoDB; Celery Beat dispatches cache invalidation, schema updates, and external notifications.
- Real-time Inference (API → Engine): the API calls Ray Serve HTTP endpoints for embeddings, reranking, generation, and audio transcription (client sketch below).
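A sketch of the real-time inference call from the API side; the `/embed` route and payload shape are assumptions about how the Serve deployment is exposed:

```python
# Sketch of a synchronous embedding call from the API layer. The /embed route
# and payload shape are assumptions; Ray Serve exposes whatever HTTP interface
# the deployment defines.
import httpx

SERVE_URL = "http://ray-serve:8000"


def embed(texts: list[str]) -> list[list[float]]:
    resp = httpx.post(f"{SERVE_URL}/embed", json={"texts": texts}, timeout=30.0)
    resp.raise_for_status()
    return resp.json()["vectors"]


# Usage inside a FastAPI handler: vectors = embed(["query text"])
```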
## Scaling & Deployment
- API: scale FastAPI replicas horizontally via Kubernetes HPA; Celery workers scale with queue depth.
- Engine: autoscale Ray worker pools (CPU/GPU). Ray Serve replicates per-model deployments with independent scaling policies.
- Storage: MongoDB uses replica sets or sharding, Qdrant supports distributed clusters, Redis can run in cluster mode, S3 scales automatically.
- Deployments: local development relies on Docker Compose + ./start.sh; production runs on Kubernetes (or Anyscale for Ray) with dedicated node groups for API, CPU workers, and GPU workers.
## Key Design Decisions
- Strict layer boundaries keep API deployments lightweight and Engine deployments GPU-optimized.
- MongoDB + Qdrant pairing combines flexible metadata with high-performance vector search.
- Webhook-based events replace direct imports between Engine and API, yielding durable, observable communication.
- Prefork Celery workers support task termination, concurrency, and graceful shutdowns—critical for webhook dispatch and maintenance jobs.
- Index signatures prevent stale caches without manual invalidation.
## Further Reading
- Processing → Feature Extractors for extractor catalogs and configuration patterns
- Retrieval → Retrievers for stage pipelines, caching, and execution telemetry
- Operations → Deployment for local and production deployment topologies
- Operations → Observability for metrics, logging, and tracing integrations

