## Core Entities

### Objects

Raw inputs registered in buckets. Objects hold blobs (video, image, text, audio) and metadata but are not processed until they are added to a batch.

### Documents

Processed representations created by feature extractors. Documents live in collections and include vectors, metadata, and lineage references.

### Features

Extracted representations (embeddings, classifications, segments) stored as Qdrant vectors and referenced by feature URIs.
## Transformation Flow

### 1. Object Registration

Objects are created via the Buckets API and validated against the bucket's JSON schema:

- `object_id` – unique identifier
- `bucket_id` – parent bucket reference
- `key_prefix` – logical path/grouping
- `blobs[]` – array of file references or inline data
- `metadata` – custom JSON validated against the bucket schema
- `created_at` / `updated_at` – audit timestamps
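To make registration concrete, here is a minimal sketch assuming a REST route of the form `POST /buckets/{bucket_id}/objects` and the `requests` library; the base URL, route, and payload shape are illustrative assumptions, not taken from this page:

```python
import requests

BASE_URL = "https://api.mixpeek.com/v1"  # illustrative base URL

def register_object(bucket_id: str, key_prefix: str, blobs: list, metadata: dict) -> dict:
    """Register a raw object in a bucket; it is not processed until batched."""
    resp = requests.post(
        f"{BASE_URL}/buckets/{bucket_id}/objects",  # hypothetical route
        json={
            "key_prefix": key_prefix,  # logical grouping, e.g. "/catalog/electronics"
            "blobs": blobs,            # file references or inline data
            "metadata": metadata,      # validated against the bucket's JSON schema
        },
        headers={
            "Authorization": "Bearer <API_KEY>",  # resolved to internal_id server-side
            "X-Namespace": "my-namespace",        # resolved to namespace_id
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # includes object_id, created_at, updated_at
```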
### 2. Batch Processing

Batches group objects for efficient parallel processing. Submitting a batch:

- Resolves all collections that consume the bucket
- Generates per-extractor artifacts (manifests) and uploads them to S3
- Dispatches Ray tasks to the Engine with the S3 artifact URIs
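Client-side, submission could look like the sketch below (again a hypothetical route; the resolution, manifest, and Ray dispatch steps above happen server-side after the call returns):

```python
import requests

def submit_batch(bucket_id: str, object_ids: list[str]) -> str:
    """Submit registered objects for processing; returns a batch ID to poll."""
    resp = requests.post(
        "https://api.mixpeek.com/v1/batches",  # hypothetical route
        json={"bucket_id": bucket_id, "object_ids": object_ids},
        headers={"Authorization": "Bearer <API_KEY>", "X-Namespace": "my-namespace"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["batch_id"]
```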
### 3. Document Creation

The Engine runs feature extractors in parallel. For each object and each collection:

- Download manifest – fetch the extractor config and input mappings
- Execute extractor – run model inference (embeddings, classifications, etc.)
- Write to Qdrant – upsert vectors and payload with the `internal_id` tenant filter
- Update metadata – set `__fully_enriched`, `__pipeline_version`, `source_object_id`
Each resulting document carries:

- `document_id` – Qdrant point ID
- `collection_id` – parent collection
- `source_object_id` – lineage back to the originating object
- `root_object_id` – if the object was derived, traces back to the original input
- `feature_refs[]` – array of feature URIs (e.g., `mixpeek://text_extractor@v1/text_embedding`)
- `metadata` – passthrough fields plus enrichments from taxonomies/clusters
- `__fully_enriched` – boolean flag indicating all extractors succeeded
- `__missing_features` – array of feature addresses that failed
- `__pipeline_version` – integer tracking the collection schema version
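The write step can be pictured with the real `qdrant-client` library; this sketch assumes a collection configured with a named `text_embedding` vector, and all IDs, names, and vector values are illustrative:

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(url="http://localhost:6333")  # illustrative Qdrant instance

point = PointStruct(
    id=str(uuid.uuid4()),                            # document_id (Qdrant point ID)
    vector={"text_embedding": [0.12, -0.08, 0.33]},  # named vector per feature
    payload={
        "internal_id": "org_123",  # tenant scope, filtered on every query
        "collection_id": "col_products",
        "source_object_id": "obj_456",
        "root_object_id": "obj_456",
        "feature_refs": ["mixpeek://text_extractor@v1/text_embedding"],
        "__fully_enriched": True,
        "__missing_features": [],
        "__pipeline_version": 3,
    },
)
client.upsert(collection_name="col_products", points=[point])
```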
### 4. Feature Storage

Features are stored as Qdrant vectors with named indexes.

Feature URIs take the form `mixpeek://{extractor}@{version}/{feature_name}` (e.g., `mixpeek://text_extractor@v1/text_embedding`) and are referenced in:

- Retriever stage configurations (`feature_address`)
- Taxonomy input mappings
- Join enrichment strategies
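As an illustration of what a feature URI encodes, a small parser (the pattern is inferred from the example above, not from a published spec):

```python
import re

FEATURE_URI = re.compile(r"^mixpeek://(?P<extractor>[^@/]+)@(?P<version>[^/]+)/(?P<feature>.+)$")

def parse_feature_uri(uri: str) -> dict:
    """Split a feature URI into extractor, version, and feature name."""
    match = FEATURE_URI.match(uri)
    if not match:
        raise ValueError(f"not a feature URI: {uri}")
    return match.groupdict()

print(parse_feature_uri("mixpeek://text_extractor@v1/text_embedding"))
# {'extractor': 'text_extractor', 'version': 'v1', 'feature': 'text_embedding'}
```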
Lineage Tracking
Every document maintains lineage metadata for auditability and debugging:| Field | Purpose | 
|---|---|
| source_object_id | The immediate parent object | 
| root_object_id | The original root object (for derived documents) | 
| collection_id | Which collection produced this document | 
| __pipeline_version | Collection schema version at processing time | 
### Querying Lineage

Use the Document Lineage API to trace provenance.
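A sketch of such a call, assuming a hypothetical `GET /documents/{document_id}/lineage` route:

```python
import requests

def get_lineage(document_id: str) -> dict:
    """Trace a document back to its source and root objects."""
    resp = requests.get(
        f"https://api.mixpeek.com/v1/documents/{document_id}/lineage",  # hypothetical route
        headers={"Authorization": "Bearer <API_KEY>", "X-Namespace": "my-namespace"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # e.g. source_object_id, root_object_id, __pipeline_version
```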
### Multi-Level Decomposition

Some extractors produce multiple documents per object (e.g., video → scenes, PDF → pages). For a scene document cut from a video, the lineage fields look like this:

- `source_object_id` → points to the `video_file.mp4` object
- `root_object_id` → same as `source_object_id` (unless the video itself was derived)
- `parent_document_id` → points to the `full_video_summary` document (if hierarchical)
- `segment_metadata` → `{ start_time: 0.0, end_time: 15.0 }`
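When debugging a decomposed video, it is often useful to pull every document sharing a root; a `qdrant-client` scroll sketch (collection name and IDs illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# Fetch all scene documents that trace back to the same root object.
points, _next_offset = client.scroll(
    collection_name="col_video_scenes",
    scroll_filter=Filter(
        must=[FieldCondition(key="root_object_id", match=MatchValue(value="obj_video_file"))]
    ),
    limit=100,
    with_payload=True,
)
for p in points:
    print(p.payload["segment_metadata"])  # e.g. {'start_time': 0.0, 'end_time': 15.0}
```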
## Tenant Isolation

All entities are scoped by:

- `internal_id` (organization) – injected at the API layer from the `Authorization` header
- `namespace_id` – resolved from the `X-Namespace` header

Every Qdrant query is automatically scoped with `internal_id` filters. This ensures hard multi-tenancy without application-level filtering.
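At the storage layer, that scoping amounts to a mandatory filter on every query; a `qdrant-client` sketch with illustrative names and values:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue

client = QdrantClient(url="http://localhost:6333")

# The tenant condition is attached to every search, regardless of user filters.
tenant_filter = Filter(
    must=[FieldCondition(key="internal_id", match=MatchValue(value="org_123"))]
)

hits = client.search(
    collection_name="col_products",
    query_vector=("text_embedding", [0.12, -0.08, 0.33]),  # named-vector query
    query_filter=tenant_filter,
    limit=10,
)
```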
## Schema Evolution

Collections track `__pipeline_version` to handle schema changes:

- Update the collection definition (add/remove extractors, change mappings)
- `pipeline_version` increments automatically
- New batches write documents with the updated version
- Old documents remain queryable with the legacy schema
- Optional: trigger reprocessing to backfill with the new extractors
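As a sketch of the backfill decision, counting documents written under an older version with `qdrant-client` (collection name and version value illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, Range

client = QdrantClient(url="http://localhost:6333")

CURRENT_VERSION = 3  # the collection's current pipeline version

# Documents whose __pipeline_version lags the current schema.
stale = client.count(
    collection_name="col_products",
    count_filter=Filter(
        must=[FieldCondition(key="__pipeline_version", range=Range(lt=CURRENT_VERSION))]
    ),
    exact=True,
)
print(f"{stale.count} documents still on a pre-{CURRENT_VERSION} schema")
```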
## Feature Reuse Across Collections

Feature extractors are reusable. Multiple collections can reference the same extractor with different input mappings: each such collection produces `mixpeek://text_extractor@v1/text_embedding` features, enabling cross-collection search via retrievers that span multiple collection IDs.
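What a cross-collection retriever call might look like, assuming a hypothetical `POST /retrievers/{id}/execute` route; the request shape is illustrative, and the point is that both collections expose the same feature URI:

```python
import requests

resp = requests.post(
    "https://api.mixpeek.com/v1/retrievers/ret_cross_search/execute",  # hypothetical route
    json={
        "collection_ids": ["col_products", "col_reviews"],  # both emit text_embedding
        "feature_address": "mixpeek://text_extractor@v1/text_embedding",
        "query": {"text": "waterproof hiking boots"},
        "limit": 10,
    },
    headers={"Authorization": "Bearer <API_KEY>", "X-Namespace": "my-namespace"},
    timeout=30,
)
resp.raise_for_status()
```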
## Document Lifecycle States

| State | Meaning |
|---|---|
| `pending` | Batch submitted but the Engine hasn't started processing |
| `processing` | Feature extraction in progress |
| `completed` | All extractors succeeded (`__fully_enriched: true`) |
| `partial` | Some extractors failed (`__missing_features` populated) |
| `enriching` | Materialized taxonomy/cluster enrichment running |

Filter on `__fully_enriched: true` in retriever filters to exclude incomplete documents.
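Expressed at the storage layer, that exclusion is a single boolean condition (a `qdrant-client` sketch; the payload key is per the table above):

```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

# Only match documents whose extractors all succeeded.
complete_only = Filter(
    must=[FieldCondition(key="__fully_enriched", match=MatchValue(value=True))]
)
```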
## Best Practices

### Design bucket schemas for validation, not storage

Bucket schemas enforce input shape but don't constrain downstream processing. Keep them minimal and use collection `field_passthrough` to propagate only what's needed.

### Use key_prefix for logical grouping

Organize objects with `key_prefix` (e.g., `/catalog/electronics`, `/users/avatars`) to enable bulk operations and filtering without custom metadata.

### Batch efficiently
Group 100-1000 objects per batch for optimal throughput. Smaller batches add orchestration overhead; larger batches delay feedback and complicate retries.
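A small helper for that guideline, independent of any API specifics; the 500 default is a mid-range choice within the 100-1000 window:

```python
from itertools import islice
from typing import Iterable, Iterator

def chunked(object_ids: Iterable[str], size: int = 500) -> Iterator[list[str]]:
    """Yield lists of at most `size` object IDs per batch."""
    it = iter(object_ids)
    while batch := list(islice(it, size)):
        yield batch

# for batch in chunked(all_object_ids):
#     submit_batch("bkt_catalog", batch)  # see the batch sketch above
```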
### Leverage lineage for debugging

When retrieval results are unexpected, use the lineage APIs to trace documents back to their source objects and inspect the raw input data and processing history.

### Version collections explicitly

When changing extractors or mappings, create a new collection rather than mutating the existing one. This preserves reproducibility and simplifies rollback.
## Example: E-Commerce Product Pipeline
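A compact end-to-end sketch tying the stages together for a product catalog, reusing the hypothetical routes from earlier sections; every identifier, route, and value here is illustrative:

```python
import requests

BASE_URL = "https://api.mixpeek.com/v1"  # illustrative
HEADERS = {"Authorization": "Bearer <API_KEY>", "X-Namespace": "ecommerce"}

# 1. Register a product object in the catalog bucket (hypothetical route).
obj = requests.post(
    f"{BASE_URL}/buckets/bkt_catalog/objects",
    json={
        "key_prefix": "/catalog/electronics",
        "blobs": [
            {"type": "image", "url": "https://cdn.example.com/sku-123.jpg"},
            {"type": "text", "data": "Wireless noise-cancelling headphones"},
        ],
        "metadata": {"sku": "SKU-123", "price": 199.0},
    },
    headers=HEADERS,
    timeout=30,
).json()

# 2. Batch it; the Engine fans out to every collection consuming the bucket.
requests.post(
    f"{BASE_URL}/batches",
    json={"bucket_id": "bkt_catalog", "object_ids": [obj["object_id"]]},
    headers=HEADERS,
    timeout=30,
)

# 3. Once documents reach `completed`, query them through a retriever,
#    excluding anything not fully enriched.
hits = requests.post(
    f"{BASE_URL}/retrievers/ret_products/execute",
    json={"query": {"text": "headphones"}, "filters": {"__fully_enriched": True}},
    headers=HEADERS,
    timeout=30,
).json()
```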
## Next Steps
- Learn how Feature Extractors transform objects into features
- Explore Collections configuration and output schemas
- Review Document Lineage API for tracing provenance
- Understand Namespaces for tenant isolation

