Entities & Relationships
| Layer | Entity | What it Represents | Related APIs |
|---|---|---|---|
| Isolation | Organization / API Key | Authentication boundary (Authorization: Bearer …) | Authentications |
| Isolation | Namespace | Tenant or environment boundary (X-Namespace) | Namespaces |
| Storage | Bucket | Schema-validated container for objects | Buckets |
| Storage | Object | Logical record referencing blobs (files/JSON) | Objects |
| Processing | Batch | Submission that feeds objects into extractors | Batches |
| Processing | Collection | Document store + feature extraction recipe | Collections |
| Processing | Feature Extractor | Reusable pipeline component that emits features | Feature Extractors |
| Retrieval | Retriever | Stage-based search pipeline | Retrievers |
| Enrichment | Taxonomy | Retrieval-backed enrichment recipe (flat or hierarchical) | Taxonomies |
| Enrichment | Cluster | Vector-based grouping and enrichment artifacts | Clusters |
| Operations | Task | Status wrapper for asynchronous jobs | Tasks |
| Operations | Webhook | Event notification subscription | Webhooks |
Dual-ID Multi-Tenancy
Mixpeek separates authentication from authorization by using two IDs per organization:
- `organization_id` – Short, user-facing identifier returned in API responses
- `internal_id` – 24-character key used inside services, task payloads, and database documents
Requests must include the `X-Namespace` header unless your organization has a single shared namespace. Enforced rules:
- All MongoDB collections index on `namespace_id`
- Each namespace maps to a dedicated Qdrant collection (`ns_<namespace_id>`)
- Redis keys and Ray jobs include namespace prefixes
- Cross-namespace queries are not permitted by design
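
As a rough client-side illustration of these rules, the helpers below build the two isolation headers and derive the per-namespace Qdrant collection name. The function names and example values are hypothetical. Only the `Authorization: Bearer …` and `X-Namespace` headers and the `ns_<namespace_id>` pattern come from the text above.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Build request headers (hypothetical helper, not part of any SDK)."""
    headers = {"Authorization": f"Bearer {api_key}"}
    # X-Namespace is left out only for organizations with a single shared namespace.
    if namespace is not None:
        headers["X-Namespace"] = namespace
    return headers


def qdrant_collection_name(namespace_id: str) -> str:
    """Derive the dedicated Qdrant collection name using the ns_<namespace_id> pattern."""
    return f"ns_{namespace_id}"
```

Keeping the namespace in one helper makes it harder to issue a request, or address a vector collection, without an explicit tenant boundary.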
Object → Document Lineage
Ingestion separates raw objects from processed documents so you can run multiple extraction tiers without duplicating data. Every document tracks:
- Tier 0 – Raw object in the bucket
- Tier N – Document produced by another collection (`source_type = "collection"`)
- The `lineage_path` is a denormalized materialized path for fast queries
- Collections respect dependency tiers during extraction so downstream collections only execute when inputs are ready
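
To make the materialized-path idea concrete, here is a minimal sketch. It assumes a `lineage_path` is a `/`-joined chain of IDs starting at the root object. The separator, the ID formats, and all function names are assumptions for illustration, since the actual encoding is not described here.

```python
from typing import List

SEPARATOR = "/"  # assumed delimiter; the real lineage_path encoding may differ


def child_path(parent_path: str, node_id: str) -> str:
    """Append a derived document's ID to its parent's lineage path."""
    return f"{parent_path}{SEPARATOR}{node_id}"


def tier(lineage_path: str) -> int:
    """Tier 0 is the raw object; each derivation step adds one tier."""
    return lineage_path.count(SEPARATOR)


def root_object_id(lineage_path: str) -> str:
    """The first path segment identifies the original object."""
    return lineage_path.split(SEPARATOR, 1)[0]


def descendants(paths: List[str], ancestor_path: str) -> List[str]:
    """Prefix match: everything derived, directly or indirectly, from ancestor_path."""
    prefix = ancestor_path + SEPARATOR
    return [p for p in paths if p.startswith(prefix)]
```

Because the full ancestry is stored on every document, finding all descendants is a single prefix scan instead of a recursive walk, which is the usual reason for denormalizing a path this way.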
Feature URIs
Every feature emitted by an extractor is addressed with a URI:
- `mixpeek://text_extractor@v1/text_embedding`
- `mixpeek://clip_vit_l_14@v1/image_embedding`
- `mixpeek://video_extractor@v1/scene_embeddings`
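Feature URIs are referenced by retrievers (`feature_address`), taxonomies, clustering jobs, and analytics. Because each URI names the extractor and its version, they guarantee that the model used at query time is compatible with the ingestion pipeline.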
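
The URI shape in the examples above (`mixpeek://<extractor>@<version>/<feature>`) can be split into its parts with a small parser. This is a sketch inferred from those examples only. `FeatureURI`, `parse_feature_uri`, and `same_model` are hypothetical names, and the pattern may not cover every URI the platform accepts.

```python
import re
from typing import NamedTuple


class FeatureURI(NamedTuple):
    extractor: str
    version: str
    feature: str


# Pattern inferred from the documented examples; treat it as an assumption.
_FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$"
)


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor, version, and feature name."""
    match = _FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(**match.groupdict())


def same_model(uri_a: str, uri_b: str) -> bool:
    """True when both URIs point at the same extractor and version."""
    a, b = parse_feature_uri(uri_a), parse_feature_uri(uri_b)
    return (a.extractor, a.version) == (b.extractor, b.version)
```

A check like `same_model` is one way a client could confirm, before querying, that a retriever stage and an index refer to the same extractor version.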
TaskStatusEnum Standard
All asynchronous operations (batches, clustering jobs, taxonomy materialization, namespace migrations) report status using the shared `TaskStatusEnum`, whose values include `IN_PROGRESS`, `CANCELED`, `SKIPPED`, `UNKNOWN`, `DRAFT`, `ACTIVE`, `ARCHIVED`, and `SUSPENDED`.
Use the Tasks API for short-term polling and fall back to the resource (e.g., batch or cluster) for long-running workflows.
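
The polling advice can be sketched as a bounded loop. The example assumes a caller-supplied `fetch_status` callable that returns a status string, since the Tasks API request itself is not shown here. Treating every status other than `IN_PROGRESS` as a reason to stop polling is also an assumption.

```python
import time
from typing import Callable


def wait_for_task(
    fetch_status: Callable[[], str],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the task leaves IN_PROGRESS or the timeout elapses.

    fetch_status is a hypothetical callable wrapping a Tasks API lookup.
    Returns the last status seen; if it is still "IN_PROGRESS", the caller
    should fall back to checking the resource (batch, cluster, ...) directly.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status != "IN_PROGRESS":
            return status
        if time.monotonic() >= deadline:
            return status
        sleep(interval_s)
```

Bounding the loop keeps short-lived jobs responsive while leaving long-running workflows to be tracked on the resource itself.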
Caching Signatures
Mixpeek uses deterministic signatures to avoid stale results:
- Collection index signatures hash document count, vector dimensions, and schema state
- Retriever caches incorporate the collection signature to invalidate automatically
- Stage-level caches speed up pipelines that reuse expensive stages (KNN → rerank)
- Inference cache shortcuts repeated embedding requests for identical inputs
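
The first two bullets can be illustrated with a small sketch: hash the collection's state into a signature, then fold that signature into the retriever cache key so that any change to the collection produces a different key. SHA-256, canonical JSON, and the key layout are assumptions chosen for the example, not a description of Mixpeek's internals.

```python
import hashlib
import json


def collection_signature(document_count: int, vector_dimensions: int, schema: dict) -> str:
    """Deterministic hash of the collection state named in the list above."""
    payload = json.dumps(
        {
            "document_count": document_count,
            "vector_dimensions": vector_dimensions,
            "schema": schema,
        },
        sort_keys=True,  # stable key order, so equal states hash equally
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def retriever_cache_key(retriever_id: str, query: dict, signature: str) -> str:
    """Cache key that changes whenever the collection signature changes."""
    query_hash = hashlib.sha256(
        json.dumps(query, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return f"{retriever_id}:{signature}:{query_hash}"
```

With this layout, stale entries never need explicit deletion: once a document is added or the schema changes, lookups use a new key and the old entries simply stop being hit.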
Putting It Together
- Documents retain lineage to the original object (`root_object_id`)
- Enrichment layers (taxonomies, clustering) augment documents in place
- Retrievers run on namespace-scoped data, returning results with presigned URLs, metrics, and cache hints

