source.type to "collection", you create a tiered DAG that Mixpeek processes in dependency order.
Tiered DAG Execution
- API builds a dependency graph when you submit a batch.
- Ray poller processes Tier 0 collections (bucket sources) in parallel.
- Once Tier 0 completes, Tier 1 collections (collection sources) execute.
- The cycle continues until the deepest tier finishes.
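The tier ordering described above can be sketched in a few lines. This is an illustrative model only, not Mixpeek's implementation: the `compute_tiers` function, the `sources` mapping, and the collection names are hypothetical, and the sketch assumes each collection reads from at most one upstream collection.

```python
from collections import defaultdict


def compute_tiers(sources):
    """Group collections into execution tiers.

    `sources` maps a collection name to the upstream collection it reads
    from, or None when it reads directly from a bucket (Tier 0).
    """
    tiers = {}

    def tier_of(name, seen=()):
        if name in tiers:
            return tiers[name]
        if name in seen:
            raise ValueError(f"cycle detected at {name!r}")
        parent = sources[name]
        # A bucket source is Tier 0; otherwise one tier below the parent.
        tiers[name] = 0 if parent is None else tier_of(parent, seen + (name,)) + 1
        return tiers[name]

    for name in sources:
        tier_of(name)

    grouped = defaultdict(list)
    for name, tier in tiers.items():
        grouped[tier].append(name)
    # Collections within one inner list have no dependency on each other,
    # so they can be processed in parallel.
    return [sorted(grouped[t]) for t in sorted(grouped)]


example = {
    "video_frames": None,                  # bucket source, Tier 0
    "audio_segments": None,                # bucket source, Tier 0
    "face_detections": "video_frames",     # collection source, Tier 1
    "face_embeddings": "face_detections",  # collection source, Tier 2
}
plan = compute_tiers(example)
```

Each inner list of `plan` is one tier; a scheduler would finish every collection in a tier before starting the next.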
root_object_id, source_collection_id, source_document_id, and lineage_path, enabling complete traceability back to the original object. The DAG executes in parallel where possible: Tier 1 collections process simultaneously, then all Tier 2 collections execute once their parents complete.
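As a rough illustration of how these lineage fields could be carried from one tier to the next, consider the sketch below. The four lineage field names come from the text above; everything else is an assumption, including the `derive_child` helper, the `document_id` and `collection_id` keys, and the slash-joined format used for `lineage_path`.

```python
def derive_child(parent, collection_id, document_id):
    """Build the lineage fields for a document produced from `parent`."""
    return {
        "document_id": document_id,
        "collection_id": collection_id,
        # The root object stays the same at every tier.
        "root_object_id": parent["root_object_id"],
        # The immediate parent is recorded as the source.
        "source_collection_id": parent["collection_id"],
        "source_document_id": parent["document_id"],
        # Assumed format: collection IDs joined by "/" from root to this tier.
        "lineage_path": f'{parent["lineage_path"]}/{collection_id}',
    }


# Hypothetical Tier 0 document created directly from a bucket object.
frame_doc = {
    "document_id": "doc_frame_1",
    "collection_id": "video_frames",
    "root_object_id": "obj_123",
    "source_collection_id": None,
    "source_document_id": None,
    "lineage_path": "video_frames",
}

face_doc = derive_child(frame_doc, "face_detections", "doc_face_1")
embedding_doc = derive_child(face_doc, "face_embeddings", "doc_emb_1")
```

Under these assumptions, `embedding_doc` still points at `obj_123` as its root, while its source fields name only the immediate parent in `face_detections`.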
Configuration Pattern
Manifest Flattening
When you submit a batch, Mixpeek:
- Flattens objects into per-extractor row artifacts (Parquet).
- Stores artifacts in S3 (manifest_key, extractor_row_artifacts).
- Passes artifact references to Ray jobs for each collection tier.
- Ensures each extractor processes the correct subset of inputs.
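The flattening step can be pictured as grouping a batch's inputs by the extractor that should receive them. The sketch below is a simplified model under stated assumptions: the object shape (`object_id`, `blobs`, `type`, `url`), the extractor names, and the `flatten_manifest` helper are all hypothetical, and writing the rows to Parquet and uploading them to S3 is left out.

```python
from collections import defaultdict


def flatten_manifest(objects, accepted_types_by_extractor):
    """Return {extractor_name: [row, ...]} for one batch.

    Each row is one (object, blob) pair that the extractor accepts, so
    every extractor sees only the subset of inputs it can process.
    """
    rows = defaultdict(list)
    for obj in objects:
        for blob in obj["blobs"]:
            for extractor, accepted in accepted_types_by_extractor.items():
                if blob["type"] in accepted:
                    rows[extractor].append(
                        {
                            "object_id": obj["object_id"],
                            "blob_type": blob["type"],
                            "url": blob["url"],
                        }
                    )
    return dict(rows)


# Hypothetical batch contents and extractor input types.
batch_objects = [
    {"object_id": "obj_1", "blobs": [{"type": "video", "url": "s3://bucket/a.mp4"}]},
    {"object_id": "obj_2", "blobs": [{"type": "text", "url": "s3://bucket/b.txt"}]},
]
extractors = {
    "frame_extractor": {"video"},
    "text_extractor": {"text"},
}
row_artifacts = flatten_manifest(batch_objects, extractors)
```

In a real pipeline each list in `row_artifacts` would be serialized as one artifact, and only its storage reference would be handed to the job for that extractor.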
Enrichment Within Pipelines
- Add taxonomy applications to collections to materialize enrichment immediately after extractor runs.
- Trigger clustering jobs on collection outputs to assign cluster metadata.
- Downstream collections can reference enriched fields from upstream tiers (e.g., taxonomy tags).
Observability
- GET /v1/collections/{id} shows document_count, taxonomy_count, and retriever_count.
- Ray poller logs and task metadata (/v1/tasks/{task_id}) capture job IDs and timing.
- Use /v1/objects/{object_id}/decomposition-tree to inspect the entire pipeline lineage for a single object.
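A small helper can prepare requests for the three endpoints listed above. The endpoint paths are taken from this page; the base URL is a placeholder, the bearer-token header is an assumed authentication scheme, and using GET for the task and decomposition-tree endpoints is also an assumption. The sketch only builds the requests and does not send them.

```python
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder, replace with your API host


def observability_requests(collection_id, task_id, object_id, api_key):
    """Build (but do not send) one request per observability endpoint."""
    headers = {"Authorization": f"Bearer {api_key}"}  # assumed auth scheme
    paths = {
        "collection": f"/v1/collections/{collection_id}",
        "task": f"/v1/tasks/{task_id}",
        "lineage": f"/v1/objects/{object_id}/decomposition-tree",
    }
    return {
        name: Request(BASE_URL + path, headers=headers, method="GET")
        for name, path in paths.items()
    }


requests_by_name = observability_requests("col_1", "task_9", "obj_123", "YOUR_API_KEY")
```

Passing any of these to `urllib.request.urlopen` would perform the call; inspect the JSON response rather than relying on a particular field layout.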
Upgrade & Versioning Strategy
- Clone a collection with a new name/version (e.g., video_frames_v2) when changing extractors or parameters.
- Update downstream collections to point at the new version.
- Run batches through both versions for A/B testing before deprecating the old one.
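The "update downstream collections" step amounts to rewriting every reference to the old collection. Here is a minimal sketch over an in-memory config, assuming a hypothetical shape in which each collection has a `source` dict with a `type` and, for collection sources, a `collection` key naming its upstream. Only `source.type` appears in this guide; the rest is illustrative.

```python
import copy


def repoint_downstream(collections, old_name, new_name):
    """Return a copy of `collections` with sources moved to the new version."""
    updated = copy.deepcopy(collections)  # leave the caller's config untouched
    for config in updated.values():
        source = config["source"]
        if source.get("type") == "collection" and source.get("collection") == old_name:
            source["collection"] = new_name
    return updated


# Hypothetical pipeline with the old and the cloned collection side by side.
pipeline = {
    "video_frames": {"source": {"type": "bucket"}},
    "video_frames_v2": {"source": {"type": "bucket"}},
    "face_detections": {"source": {"type": "collection", "collection": "video_frames"}},
}
migrated = repoint_downstream(pipeline, "video_frames", "video_frames_v2")
```

Because the function returns a copy, the original `pipeline` can keep feeding the old version while `migrated` drives the new one during A/B testing.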
Best Practices
- Keep each collection focused on a single concern (frame extraction, face detection, embeddings).
- Use metadata passthrough to share relevant fields between tiers.
- Submit batches that contain all objects needed for a multi-tier run to avoid partial states.
- Monitor webhook events to trigger downstream automation once documents land.
- Use retriever JOIN stages to traverse tiers during query time if you prefer on-demand joins over materialization.

