Understanding these concepts will help you use the Mixpeek Multimodal Warehouse effectively.

Mixpeek organizes data in a structured hierarchy designed for flexible, performant processing and retrieval of multimodal content.

| Mixpeek Term | Description | Data Warehouse Analogy |
| --- | --- | --- |
| Namespace | Query boundaries that isolate environments | Database/Schema |
| Bucket | Storage containers for raw objects and files | Raw Data Lake/Storage Layer |
| Object | Collections of related input files | Raw Data Files/Source Documents |
| Blob | Individual raw files within Objects | Binary Data/Single File |
| Collection | Groups of processed documents with consistent schema | Table |
| Document | Structured outputs from feature extractors | Row |
| Feature Extractor | Specialized components that process inputs to extract specific features | ETL Process/Transformation |
| Feature | Extracted data elements stored in feature stores | Column/Field |
| Feature Store | Specialized storage for extracted features optimized for efficient retrieval | Indexed Columns/Materialized Views |
| Retriever | Query engines that search feature stores to find relevant documents | SQL Query Engine |
| Retriever Stage | Components of search pipelines that perform specific operations in the retrieval process | Query Execution Plan Step |
| Taxonomy | Multimodal equivalent of SQL JOIN operations | JOIN Operation |
| Clustering | Multimodal equivalent of SQL GROUP BY operations | GROUP BY Operation |
| Research | Multi-step process that explores topics through iterative searches, generates structured reports with sections, and combines retrieved information into cohesive content | Business Intelligence Report |
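To make the hierarchy among the terms above concrete, here is a minimal Python sketch that models it with plain dataclasses. It is only an illustration of how the concepts nest: the class layout, field names, and sample values are assumptions made for this example and are not Mixpeek's SDK or API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Blob:
    """One raw file inside an object (hypothetical model)."""
    name: str
    content_type: str


@dataclass
class Object:
    """A group of related input files."""
    object_id: str
    blobs: List[Blob] = field(default_factory=list)


@dataclass
class Bucket:
    """Storage container for raw objects, keyed by object ID."""
    name: str
    objects: Dict[str, Object] = field(default_factory=dict)


@dataclass
class Document:
    """Structured extractor output, comparable to a row."""
    document_id: str
    source_object_id: str
    features: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Collection:
    """Processed documents sharing one schema, comparable to a table."""
    name: str
    documents: List[Document] = field(default_factory=list)


@dataclass
class Namespace:
    """Isolation boundary that holds buckets and collections."""
    name: str
    buckets: Dict[str, Bucket] = field(default_factory=dict)
    collections: Dict[str, Collection] = field(default_factory=dict)


# Illustrative data: one campaign object with two blobs.
campaign = Object(
    "obj_123abc",
    [Blob("ad.mp4", "video/mp4"), Blob("script.pdf", "application/pdf")],
)
ns = Namespace("dev")
ns.buckets["raw"] = Bucket("raw", {campaign.object_id: campaign})
ns.collections["scenes"] = Collection(
    "scenes",
    [Document("doc_1", campaign.object_id, {"transcript": "..."})],
)
```

Raw inputs (buckets, objects, blobs) and processed outputs (collections, documents) sit side by side inside a namespace, and a document points back to its object only through `source_object_id`.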

Component Relationships

The different components in Mixpeek relate to each other in specific ways:

Understanding the Relationships

Processing Components

Feature Extractors

Specialized components that process inputs to extract specific features like embeddings, detected objects, or transcriptions

Retrievers

Query engines that search feature stores to find relevant documents

Multimodal Analogs to SQL Operations

Mixpeek provides specialized components that function as multimodal analogs to traditional SQL operations:

Taxonomies

Taxonomies in Mixpeek serve as the multimodal equivalent of SQL JOIN operations. They allow you to enrich documents with metadata from other collections based on feature similarity rather than exact key matches.
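The idea of joining on feature similarity instead of exact keys can be sketched in a few lines of Python. This is a conceptual illustration only: the function names, the dictionary layout, the cosine-similarity metric, and the 0.8 threshold are all assumptions for the example, not Mixpeek's actual implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_join(documents, reference_rows, threshold=0.8):
    """Attach metadata from the most similar reference row to each document.

    Documents with no match at or above the threshold are kept unchanged,
    which mirrors a SQL LEFT JOIN.
    """
    enriched = []
    for doc in documents:
        best_row, best_score = None, threshold
        for row in reference_rows:
            score = cosine_similarity(doc["embedding"], row["embedding"])
            if score >= best_score:
                best_row, best_score = row, score
        merged = dict(doc)
        if best_row is not None:
            merged["taxonomy"] = best_row["metadata"]
        enriched.append(merged)
    return enriched


# Toy 2-D embeddings; real embeddings have far more dimensions.
products = [
    {"embedding": [1.0, 0.0], "metadata": {"label": "sneaker"}},
    {"embedding": [0.0, 1.0], "metadata": {"label": "handbag"}},
]
frames = [
    {"document_id": "doc_1", "embedding": [0.9, 0.1]},
    {"document_id": "doc_2", "embedding": [0.5, 0.5]},
]
result = similarity_join(frames, products)
```

Here `doc_1` is close enough to the sneaker row to pick up its metadata, while `doc_2` sits halfway between both rows, falls below the threshold, and passes through unenriched.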

Data Flow Architecture

1. Storage Layer (Buckets)

Raw objects and their associated files are stored in buckets. Objects represent collections of related files (e.g., a marketing campaign with video, script, and legal documents).

2. Processing Flow (Feature Extractors)

Objects from buckets are processed through feature extractors. Feature extractors extract various features from the object’s files, which are then organized into documents stored in collections.

3. Feature Storage

Extracted features are stored in specialized feature stores. Each feature maintains a reference to its parent document, and each document maintains a reference to its source object.

4. Retrieval Flow (Retrieval Pipelines)

Queries are processed through retrieval pipelines that search feature stores to find relevant features. Features are used to locate their parent documents in collections.
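The four steps above can be traced end to end with a small in-memory sketch. Everything here is a stand-in chosen for illustration: the word-count "extractor", the one-document-per-file layout, the dictionary-based stores, and the `retrieve` function are assumptions, not Mixpeek components.

```python
def word_count_extractor(blob_text):
    """Stand-in feature extractor: counts lowercase words in one file."""
    counts = {}
    for word in blob_text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


# 1. Storage layer: a bucket maps object IDs to their raw files.
bucket = {
    "obj_1": {
        "script.txt": "launch the new running shoe",
        "legal.txt": "terms apply",
    },
    "obj_2": {"script.txt": "holiday sale on handbags"},
}

collection = {}     # 2. Documents produced by processing, keyed by document ID
feature_store = []  # 3. Features, each referencing its parent document

for object_id, blobs in bucket.items():
    for blob_name, text in blobs.items():
        document_id = f"{object_id}/{blob_name}"
        # Each document keeps a reference to its source object.
        collection[document_id] = {"source_object_id": object_id, "blob": blob_name}
        feature_store.append(
            {"document_id": document_id, "word_counts": word_count_extractor(text)}
        )


def retrieve(query):
    """4. Retrieval flow: score features, then resolve their parent documents."""
    terms = query.lower().split()
    scored = []
    for feature in feature_store:
        score = sum(feature["word_counts"].get(term, 0) for term in terms)
        if score > 0:
            scored.append((score, feature["document_id"]))
    scored.sort(reverse=True)  # highest score first
    return [collection[document_id] for _, document_id in scored]


hits = retrieve("running shoe")
```

The search touches only the feature store. The collection is consulted afterwards, through the `document_id` reference, to turn matching features back into documents.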

Metadata and Document Properties

All documents in Mixpeek collections include standard metadata properties:

{
  "__fully_enriched": true,           // Indicates if all expected features have been extracted
  "__missing_features": [],           // Lists any features that failed to extract
  "__pipeline_version": 1,            // Version of the pipeline that processed this document
  "source_object_id": "obj_123abc"    // Reference to the source object in a bucket
  // Additional document-specific fields...
}

Fields prefixed with double underscores (__) are reserved for system metadata. Do not use this prefix for your custom fields.
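As a hedged illustration of how client code might respect these conventions, the sketch below rejects custom field names that use the reserved prefix and flags documents whose system metadata reports incomplete extraction. The helper names `validate_custom_fields` and `needs_reprocessing` are hypothetical and not part of any Mixpeek SDK, and the reprocessing rule is one possible reading of the metadata fields shown above.

```python
RESERVED_PREFIX = "__"


def validate_custom_fields(custom_fields):
    """Return the fields unchanged, or raise ValueError on a reserved name."""
    reserved = sorted(
        name for name in custom_fields if name.startswith(RESERVED_PREFIX)
    )
    if reserved:
        raise ValueError(f"Reserved prefix used in custom fields: {reserved}")
    return custom_fields


def needs_reprocessing(document):
    """True when the system metadata reports incomplete feature extraction."""
    fully_enriched = document.get("__fully_enriched", False)
    missing = document.get("__missing_features") or []
    return (not fully_enriched) or len(missing) > 0


# Example document whose transcription feature failed to extract.
document = {
    "__fully_enriched": False,
    "__missing_features": ["transcription"],
    "__pipeline_version": 1,
    "source_object_id": "obj_123abc",
}

if needs_reprocessing(document):
    print("Missing features:", document["__missing_features"])
```

Calling `validate_custom_fields({"campaign": "spring"})` returns the dictionary unchanged, while a key such as `"__campaign"` raises `ValueError` before the field ever reaches a document.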

Next Steps

Now that you understand the core concepts of Mixpeek, you’re ready to start building with the platform: