Mixpeek exposes a small set of typed resources that work together to ingest, process, enrich, and retrieve multimodal data. This page introduces the key abstractions and the rules that keep them isolated, observable, and composable.

Entities & Relationships

| Layer | Entity | What it Represents | Related APIs |
| --- | --- | --- | --- |
| Isolation | Organization / API Key | Authentication boundary (Authorization: Bearer …) | Authentications |
| Isolation | Namespace | Tenant or environment boundary (X-Namespace) | Namespaces |
| Storage | Bucket | Schema-validated container for objects | Buckets |
| Storage | Object | Logical record referencing blobs (files/JSON) | Objects |
| Processing | Batch | Submission that feeds objects into extractors | Batches |
| Processing | Collection | Document store + feature extraction recipe | Collections |
| Processing | Feature Extractor | Reusable pipeline component that emits features | Feature Extractors |
| Retrieval | Retriever | Stage-based search pipeline | Retrievers |
| Enrichment | Taxonomy | Retrieval-backed enrichment recipe (flat or hierarchical) | Taxonomies |
| Enrichment | Cluster | Vector-based grouping and enrichment artifacts | Clusters |
| Operations | Task | Status wrapper for asynchronous jobs | Tasks |
| Operations | Webhook | Event notification subscription | Webhooks |

Dual-ID Multi-Tenancy

Mixpeek separates authentication from authorization by using two IDs per organization:
  • organization_id – Short, user-facing identifier returned in API responses
  • internal_id – 24-character key used inside services, task payloads, and database documents
Namespaces are the primary isolation boundary. Every API request must include the X-Namespace header unless your organization has a single shared namespace. Enforced rules:
  • All MongoDB collections index on namespace_id
  • Each namespace maps to a dedicated Qdrant collection (ns_<namespace_id>)
  • Redis keys and Ray jobs include namespace prefixes
  • Cross-namespace queries are not permitted by design
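As a minimal sketch of the two isolation boundaries, the helpers below assemble the request headers and derive the per-namespace Qdrant collection name. Only the `Authorization: Bearer …` and `X-Namespace` headers and the `ns_<namespace_id>` pattern come from this page; the function names and sample values are hypothetical.

```python
from typing import Dict, Optional


def build_headers(api_key: str, namespace: Optional[str] = None) -> Dict[str, str]:
    """Return request headers carrying the auth and namespace boundaries."""
    if not api_key:
        raise ValueError("api_key is required")
    headers = {"Authorization": f"Bearer {api_key}"}
    # X-Namespace may be omitted only when the organization has a
    # single shared namespace.
    if namespace is not None:
        headers["X-Namespace"] = namespace
    return headers


def qdrant_collection_name(namespace_id: str) -> str:
    """Apply the ns_<namespace_id> naming pattern (hypothetical helper)."""
    return f"ns_{namespace_id}"
```

Keeping header construction in one place makes it harder to send a request without its namespace by accident.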

Object → Document Lineage

Ingestion separates raw objects from processed documents so you can run multiple extraction tiers without duplicating data. Every document tracks:
{
  "root_object_id": "obj_video_123",
  "root_bucket_id": "bkt_marketing",
  "source_type": "collection",
  "source_collection_id": "col_frames",
  "source_document_id": "doc_frame_050",
  "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
  "processing_tier": 3
}
  • Tier 0 – Raw object in the bucket
  • Tier N – Document produced by another collection (source_type = "collection")
  • The lineage_path is a denormalized materialized path for fast queries
  • Collections respect dependency tiers during extraction so downstream collections only execute when inputs are ready
Use the Object Decomposition Tree endpoint to inspect the entire lineage for a given object.
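The materialized path can be split client-side to recover the chain of collections a document passed through. The sketch below assumes the path is `/`-separated with the bucket first, and that each collection hop corresponds to one tier; both are inferences from the sample document above, not documented guarantees. `Lineage` and `parse_lineage_path` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Lineage:
    bucket_id: str
    collection_ids: Tuple[str, ...]

    @property
    def depth(self) -> int:
        # Assumption: one collection hop per tier above the raw object.
        return len(self.collection_ids)


def parse_lineage_path(path: str) -> Lineage:
    """Split 'bucket/col_a/col_b/...' into its bucket and collection chain."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("lineage_path is empty")
    return Lineage(bucket_id=parts[0], collection_ids=tuple(parts[1:]))


doc = {
    "root_bucket_id": "bkt_marketing",
    "lineage_path": "bkt_marketing/col_frames/col_scenes/col_highlights",
    "processing_tier": 3,
}
lineage = parse_lineage_path(doc["lineage_path"])
```

For the sample document, `lineage.depth` matches `processing_tier`, which is consistent with the one-hop-per-tier assumption.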

Feature URIs

Every feature emitted by an extractor is addressed with a URI:
mixpeek://{extractor_name}@{version}/{output_name}
Examples:
  • mixpeek://text_extractor@v1/text_embedding
  • mixpeek://clip_vit_l_14@v1/image_embedding
  • mixpeek://video_extractor@v1/scene_embeddings
Feature URIs are referenced by collections (output schemas), retriever stages (feature_address), taxonomies, clustering jobs, and analytics. Because each URI names both the extractor and its version, a query-time reference resolves to the same model output that the ingestion pipeline produced.
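A feature URI can be split into its three parts with a regular expression. This is an illustrative sketch: the pattern is derived only from the template and examples above, and `FeatureURI`, `parse_feature_uri`, and `same_model` are hypothetical helpers rather than part of any Mixpeek SDK.

```python
import re
from typing import NamedTuple

# Derived from the template mixpeek://{extractor_name}@{version}/{output_name}
FEATURE_URI = re.compile(
    r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<output>.+)$"
)


class FeatureURI(NamedTuple):
    extractor_name: str
    version: str
    output_name: str


def parse_feature_uri(uri: str) -> FeatureURI:
    """Split a feature URI into extractor name, version, and output name."""
    match = FEATURE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a feature URI: {uri!r}")
    return FeatureURI(match["extractor"], match["version"], match["output"])


def same_model(a: str, b: str) -> bool:
    """True when two URIs point at the same extractor and version."""
    pa, pb = parse_feature_uri(a), parse_feature_uri(b)
    return (pa.extractor_name, pa.version) == (pb.extractor_name, pb.version)
```

A check like `same_model` is one way a client could confirm that a retriever stage and a collection refer to the same extractor version before issuing a query.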

TaskStatusEnum Standard

All asynchronous operations—batches, clustering jobs, taxonomy materialization, namespace migrations—report status using the shared TaskStatusEnum:
PENDING → PROCESSING → COMPLETED
              ↓
            FAILED
Additional lifecycle values include IN_PROGRESS, CANCELED, SKIPPED, UNKNOWN, DRAFT, ACTIVE, ARCHIVED, and SUSPENDED. Use the Tasks API for short-term polling and fall back to the resource (e.g., batch or cluster) for long-running workflows.
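Short-term polling can be sketched as a loop that stops on a terminal state. The enum below mirrors the values named on this page, but the exact string casing returned by the API and the choice of terminal states (COMPLETED, FAILED, CANCELED) are assumptions. `fetch_status` is a hypothetical callable standing in for a Tasks API request.

```python
import time
from enum import Enum
from typing import Callable


class TaskStatus(str, Enum):
    PENDING = "PENDING"
    PROCESSING = "PROCESSING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    IN_PROGRESS = "IN_PROGRESS"
    CANCELED = "CANCELED"
    SKIPPED = "SKIPPED"
    UNKNOWN = "UNKNOWN"
    DRAFT = "DRAFT"
    ACTIVE = "ACTIVE"
    ARCHIVED = "ARCHIVED"
    SUSPENDED = "SUSPENDED"


# Assumption: these are the states after which a task stops changing.
TERMINAL = {TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.CANCELED}


def poll_until_terminal(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> TaskStatus:
    """Call fetch_status until it reports a terminal state or attempts run out."""
    for _ in range(max_attempts):
        status = TaskStatus(fetch_status())
        if status in TERMINAL:
            return status
        sleep(interval_s)
    raise TimeoutError("task did not reach a terminal state")
```

The `max_attempts` cap is what makes this suitable only for short-term polling; when it raises `TimeoutError`, the caller can switch to checking the batch or cluster resource itself.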

Caching Signatures

Mixpeek uses deterministic signatures to avoid stale results:
  • Collection index signatures hash document count, vector dimensions, and schema state
  • Retriever caches incorporate the collection signature to invalidate automatically
  • Stage-level caches speed up pipelines that reuse expensive stages (KNN → rerank)
  • The inference cache short-circuits repeated embedding requests for identical inputs
Learn more in Caching.
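To illustrate how a deterministic signature can drive cache invalidation, the sketch below hashes the three inputs named above and folds the result into a retriever cache key. The hash function (SHA-256), the JSON canonicalization, and the key layout are assumptions made for this example; the page does not specify how Mixpeek computes its signatures.

```python
import hashlib
import json
from typing import Any, Dict


def _digest(value: Any) -> str:
    # Canonical JSON (sorted keys, fixed separators) so that equal
    # inputs always produce the same bytes, and therefore the same hash.
    payload = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def collection_signature(doc_count: int, vector_dims: int, schema: Dict[str, Any]) -> str:
    """Hash document count, vector dimensions, and schema state."""
    return _digest(
        {"doc_count": doc_count, "vector_dims": vector_dims, "schema": schema}
    )


def retriever_cache_key(retriever_id: str, query: Dict[str, Any], signature: str) -> str:
    """Embed the collection signature so a changed collection misses the cache."""
    return f"{retriever_id}:{signature}:{_digest(query)}"
```

Because the signature is part of the key, adding a document changes the key for every cached query against that collection, so stale entries are never read and no explicit purge is needed.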

Putting It Together

Namespace
 └── Bucket
      ├── Object (Tier 0)
      └── Batch → Collection (Tier 1)
               └── Collection (Tier 2)
                    └── ...
  • Documents retain lineage to the original object (root_object_id)
  • Enrichment layers (taxonomies, clustering) augment documents in place
  • Retrievers run on namespace-scoped data, returning results with presigned URLs, metrics, and cache hints
With these concepts in mind you can navigate deeper sections of the docs—whether you’re planning ingestion schemas, designing retriever pipelines, or wiring observability for production deployments.