Clone a collection with optional modifications.
Purpose: Creates a NEW collection (with new ID) based on an existing one. This is the recommended way to iterate on collection designs when you need to modify core configuration that PATCH doesn’t allow (source, feature_extractor, field_passthrough).
Example request:
curl --request POST \
--url https://api.mixpeek.com/v1/collections/{collection_identifier}/clone \
--header 'Content-Type: application/json' \
--data '
{
"collection_name": "<string>",
"description": "Cloned from product_embeddings with CLIP v2",
"source": {
"bucket_ids": [
"bkt_marketing_videos"
],
"description": "Single bucket source",
"type": "bucket"
},
"feature_extractor": {
"description": "Text extractor with field passthrough",
"feature_extractor_name": "text_extractor",
"field_passthrough": [
{
"required": true,
"source_path": "title"
},
{
"source_path": "author"
}
],
"input_mappings": {
"text": "content"
},
"parameters": {
"model": "text-embedding-3-small"
},
"version": "v1"
},
"enabled": true,
"metadata": {},
"taxonomy_applications": [
{
"taxonomy_id": "<string>",
"execution_mode": "on_demand",
"target_collection_id": "<string>",
"scroll_filters": {
"AND": [
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
],
"OR": [
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
],
"NOT": [
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
],
"case_sensitive": true
}
}
]
}
'
Example response:
{
"collection": {
"collection_name": "<string>",
"input_schema": {
"properties": {}
},
"output_schema": {
"properties": {}
},
"feature_extractor": {
"feature_extractor_name": "<string>",
"version": "<string>",
"feature_extractor_id": "<string>",
"parameters": {},
"input_mappings": {
"image": "product_image",
"text": "title"
},
"field_passthrough": [
{
"source_path": "<string>",
"target_path": "title",
"default": "Unknown",
"required": false
}
],
"include_all_source_fields": false
},
"source": {
"type": "bucket",
"bucket_ids": [
"bkt_marketing_videos"
],
"collection_id": "col_video_frames",
"collection_ids": [
"col_us_products",
"col_eu_products"
],
"inherited_bucket_ids": [
"bkt_marketing_videos"
],
"source_filters": {
"description": "Filter only video content",
"filters": {
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
}
},
"collection_id": "<string>",
"description": "Video frames extracted at 1 FPS with CLIP embeddings",
"source_bucket_schemas": null,
"source_lineage": [
{
"source_config": {
"type": "bucket",
"bucket_ids": [
"bkt_marketing_videos"
],
"collection_id": "col_video_frames",
"collection_ids": [
"col_us_products",
"col_eu_products"
],
"inherited_bucket_ids": [
"bkt_marketing_videos"
],
"source_filters": {
"description": "Filter only video content",
"filters": {
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
}
},
"feature_extractor": {
"feature_extractor_name": "<string>",
"version": "<string>",
"feature_extractor_id": "<string>",
"parameters": {},
"input_mappings": {
"image": "product_image",
"text": "title"
},
"field_passthrough": [
{
"source_path": "<string>",
"target_path": "title",
"default": "Unknown",
"required": false
}
],
"include_all_source_fields": false
},
"output_schema": {
"properties": {}
}
}
],
"vector_indexes": [
"<unknown>"
],
"payload_indexes": [
"<unknown>"
],
"enabled": true,
"metadata": {
"environment": "production",
"project": "Q4_campaign",
"team": "data-science"
},
"created_at": "2023-11-07T05:31:56Z",
"updated_at": "2023-11-07T05:31:56Z",
"document_count": 0,
"schema_version": 1,
"last_schema_sync": "2023-11-07T05:31:56Z",
"schema_sync_enabled": true,
"taxonomy_applications": null,
"cluster_applications": null
},
"source_collection_id": "<string>"
}
Authorizations
REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.
Headers
REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'
Path Parameters
collection_identifier
Source collection ID or name to clone.
Body
Request to clone a collection with optional modifications.
Purpose: Cloning creates a NEW collection (with new ID) based on an existing one, allowing you to make changes that aren't allowed via PATCH (source, feature_extractor, field_passthrough). This is the recommended way to iterate on collection designs.
All body fields are optional unless noted; any field you omit is copied from the source collection.
REQUIRED. Name for the cloned collection. Must be unique and different from the source collection.
OPTIONAL. Description override. If omitted, copies from source collection.
"Cloned from product_embeddings with CLIP v2"
OPTIONAL. Override source configuration. If omitted, copies from source collection. Allows switching between buckets or collections.
REQUIRED. Type of source for this collection. 'bucket': Process objects from one or more buckets (first-stage processing). 'collection': Process documents from another collection (downstream processing). Use 'bucket' for initial data ingestion, 'collection' for decomposition trees.
Available options: bucket, collection, taxonomy, cluster
bucket_ids
List of bucket IDs when type='bucket'. REQUIRED when type='bucket'. NOT ALLOWED when type='collection'. Can specify one or more buckets to process. Single bucket: Use array with one element ['bkt_id']. Multiple buckets: All buckets MUST have compatible schemas. Schema compatibility validated at collection creation. Compatible schemas have: 1) Same field names, 2) Same field types, 3) Same required status. Documents will include root_bucket_id to track which bucket they came from. Use cases: multi-region data, multi-team consolidation, environment aggregation.
Example: ["bkt_marketing_videos"]
collection_id
Collection ID when type='collection' (single collection). Use this OR collection_ids (not both). REQUIRED when type='collection' and processing single collection. NOT ALLOWED when type='bucket'. The collection will process documents from this upstream collection. The upstream collection's output_schema becomes this collection's input_schema. This enables decomposition trees (multi-stage pipelines). Example: Process frames collection → create scenes collection.
"col_video_frames"
List of collection IDs when type='collection' (multiple collections). Use this OR collection_id (not both). REQUIRED when type='collection' and processing multiple collections. NOT ALLOWED when type='bucket'. Used for operations that consolidate multiple upstream collections. Example: Clustering across multiple collections → cluster output collection. All collections must have compatible schemas for consolidation operations.
1["col_us_products", "col_eu_products"]List of original bucket IDs that source collections originated from. OPTIONAL. Only used when type='collection'. Tracks the complete lineage chain: buckets → collections → derived collections. Extracted from upstream collection metadata at collection creation time. Enables tracing derived collections (like cluster outputs) back to original data sources. Example: Cluster output collection inherits bucket IDs from its source collections. Format: List of bucket IDs with 'bkt_' prefix.
["bkt_marketing_videos"]Optional filters to apply to source data. When specified, only objects/documents matching these filters will be processed by this collection. Filters are evaluated at batch creation time. Uses same LogicalOperator model as list APIs for consistency.
Optional logical filters to apply to source data. Uses LogicalOperator model with AND/OR/NOT support. When specified, only objects/documents matching these filters will be processed by this collection. When null, all source data is processed (no filtering). Filters are consistent across all batch runs for this collection.
Logical AND operation - all conditions must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]
Logical OR operation - at least one condition must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]
Logical NOT operation - all conditions must be false
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]
Whether to perform case-sensitive matching
true
{
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
{
"description": "Filter only video content",
"filters": {
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
}
{
"bucket_ids": ["bkt_marketing_videos"],
"description": "Single bucket source",
"type": "bucket"
}
OPTIONAL. Override feature extractor configuration. If omitted, copies from source collection. This is where you'd change models, parameters, or field_passthrough.
Name of the feature extractor
Version of the feature extractor
Parameters for the feature extractor (model name, thresholds, etc.). See extractor-specific documentation for available parameters.
Mapping from extractor input names to source field paths. Tells the extractor which source fields to process. Example: {'image': 'thumbnail_url', 'text': 'description'}
{ "image": "product_image", "text": "title" }NOT REQUIRED. List of specific fields to pass through from source to output documents. These fields are included alongside extractor-computed features (embeddings, detections, etc.). Empty list = only extractor outputs in documents (default behavior). With entries = specified fields + extractor outputs in documents.
How It Works:
Common Use Cases:
Behavior:
Output Schema: output_schema = field_passthrough fields + extractor output fields Example: ['title', 'category', 'text_extractor_v1_embedding']
Show child attributes
REQUIRED. Path to the source field to copy. Simple fields: Use field name directly (e.g., 'title', 'campaign_id'). Nested fields: Use dot notation (e.g., 'metadata.author', 'config.model.version'). The field must exist in the source bucket schema or upstream collection schema. Without target_path, nested fields are flattened: 'metadata.author' becomes 'author' in output.
OPTIONAL. Target field name in output document. If NOT PROVIDED: Uses source_path name (or last component for nested paths). - 'title' → 'title' - 'metadata.author' → 'author' If PROVIDED: Uses this exact name in output. - source_path='doc_title', target_path='title' → 'title' - source_path='metadata.author', target_path='contributor' → 'contributor' Use cases: - Rename fields for cleaner API schemas - Avoid name conflicts with extractor outputs - Standardize field names across different sources Constraints: - Must not conflict with system fields (document_id, collection_id, etc.) - Must not conflict with extractor output fields - Must be a valid field name (alphanumeric, underscores, hyphens)
"title"
OPTIONAL. Default value if source field doesn't exist or is None. If NOT PROVIDED and field missing: Field is omitted from output document. If PROVIDED and field missing: Field is included with this default value. Type should match expected field type (string, int, list, dict, etc.).
"Unknown"
OPTIONAL. Whether this field MUST exist in source. If True and field missing: Raises validation error, processing fails. If False and field missing: Field omitted (or default used if provided). Use True for: Critical identifiers, required business fields. Use False for: Optional metadata, nice-to-have fields. Default: False (field is optional).
NOT REQUIRED. Whether to include ALL fields from source object/document in output. Default: False (only field_passthrough fields included).
Examples: False: source={a,b,c,d} + passthrough=[a,b] → output={a,b,embedding} True: source={a,b,c,d} + passthrough=[a→x] → output={x,b,c,d,embedding}
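To make the passthrough behavior concrete, here is a minimal sketch. The source object and the embedding field name are illustrative (the actual output field name depends on the extractor and version):
Source object: { "title": "Red Sneakers", "author": "jane", "internal_note": "draft" }
With field_passthrough = [{ "source_path": "title" }] and include_all_source_fields = false, the output document would contain only:
{ "title": "Red Sneakers", "text_extractor_v1_embedding": [0.12, -0.04] }
With include_all_source_fields = true, "author" and "internal_note" would also be carried into the output alongside the extractor fields.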
{
"description": "Text extractor with field passthrough",
"feature_extractor_name": "text_extractor",
"field_passthrough": [
{ "required": true, "source_path": "title" },
{ "source_path": "author" }
],
"input_mappings": { "text": "content" },
"parameters": { "model": "text-embedding-3-small" },
"version": "v1"
}
OPTIONAL. Override enabled status. If omitted, copies from source collection.
OPTIONAL. Override metadata. If omitted, copies from source collection.
OPTIONAL. Override taxonomy applications. If omitted, copies from source collection.
ID of the TaxonomyModel to execute.
'on_demand' executes at query time; 'materialize' materializes during ingestion.
Available options: on_demand, materialize
Optional collection to persist results into when execution_mode is 'materialize'. If omitted, the source collection is updated in-place.
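A minimal sketch of one taxonomy application entry; the taxonomy and collection IDs are hypothetical:
{
  "taxonomy_id": "tax_product_categories",
  "execution_mode": "materialize",
  "target_collection_id": "col_enriched_products"
}
Omitting target_collection_id with execution_mode='materialize' would enrich the source collection in-place, per the behavior described above.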
Additional filters applied when scrolling the source collection before enrichment.
Logical AND operation - all conditions must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]
Logical OR operation - at least one condition must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]
Logical NOT operation - all conditions must be false
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]
Whether to perform case-sensitive matching
true
Successful Response
Response after cloning a collection.
Cloned collection configuration with new collection_id.
REQUIRED. Human-readable name for the collection. Must be unique within the namespace. Used for: Display, lookups (can query by name or ID), organization. Format: Alphanumeric with underscores/hyphens, 3-100 characters. Examples: 'product_embeddings', 'video_frames', 'customer_documents'.
Required string length: 3 - 100
input_schema
REQUIRED (auto-computed from source). Input schema defining fields available to the feature extractor. Source: bucket.bucket_schema (if source.type='bucket') OR upstream_collection.output_schema (if source.type='collection'). Determines: Which fields can be used in input_mappings and field_passthrough. This is the 'left side' of the transformation - what data goes IN. Format: BucketSchema with properties dict. Use for: Validating input_mappings, configuring field_passthrough.
Each entry in properties is a schema field definition for a bucket object field.
Supported data types for bucket schema fields. Types fall into two categories: explicit types (type-safe) and the automatic type (flexible).
Available options: string, number, integer, float, boolean, object, array, date, datetime, json, file, text, image, audio, video, pdf, document, spreadsheet, presentation, dense_vector, sparse_vector, int8_vector, automatic
OPTIONAL. List of example values for this field. Used by Apps to show example inputs in the UI. Provide multiple diverse examples when possible.
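As an illustration, a schema's properties map field names to typed definitions. The exact wire format below is a plausible sketch, not a verbatim API response; the "type" and "examples" keys are assumed from the field descriptions above:
{
  "properties": {
    "title": { "type": "text", "examples": ["Red Sneakers", "Blue Jacket"] },
    "product_image": { "type": "image" },
    "price": { "type": "float" }
  }
}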
REQUIRED (auto-computed at creation). Output schema defining fields in final documents. Computed as: field_passthrough fields + extractor output fields (deterministic). Known IMMEDIATELY when collection is created - no waiting for documents! This is the 'right side' of the transformation - what data comes OUT. Use for: Understanding document structure, building queries, schema validation. Example: {title (passthrough), embedding (extractor output)} = output_schema.
Each entry in properties is a schema field definition for a bucket object field.
Supported data types for bucket schema fields. Types fall into two categories: explicit types (type-safe) and the automatic type (flexible).
Available options: string, number, integer, float, boolean, object, array, date, datetime, json, file, text, image, audio, video, pdf, document, spreadsheet, presentation, dense_vector, sparse_vector, int8_vector, automatic
OPTIONAL. List of example values for this field. Used by Apps to show example inputs in the UI. Provide multiple diverse examples when possible.
REQUIRED. Single feature extractor configuration for this collection. Defines: extractor name/version, input_mappings, field_passthrough, parameters. Task 9 change: ONE extractor per collection (previously supported multiple). For multiple extractors: Create multiple collections and use collection-to-collection pipelines. Use field_passthrough to include additional source fields beyond extractor outputs.
Name of the feature extractor
Version of the feature extractor
Unique identifier for the feature extractor instance, constructed from name + version.
Parameters for the feature extractor (model name, thresholds, etc.). See extractor-specific documentation for available parameters.
A typed value for a feature extractor parameter.
Mapping from extractor input names to source field paths. Tells the extractor which source fields to process. Example: {'image': 'thumbnail_url', 'text': 'description'}
Example: { "image": "product_image", "text": "title" }
field_passthrough
NOT REQUIRED. List of specific fields to pass through from source to output documents. These fields are included alongside extractor-computed features (embeddings, detections, etc.). Empty list = only extractor outputs in documents (default behavior). With entries = specified fields + extractor outputs in documents.
Output schema: output_schema = field_passthrough fields + extractor output fields. Example: ['title', 'category', 'text_extractor_v1_embedding']
REQUIRED. Path to the source field to copy. Simple fields: Use field name directly (e.g., 'title', 'campaign_id'). Nested fields: Use dot notation (e.g., 'metadata.author', 'config.model.version'). The field must exist in the source bucket schema or upstream collection schema. Without target_path, nested fields are flattened: 'metadata.author' becomes 'author' in output.
OPTIONAL. Target field name in output document. If NOT PROVIDED: Uses source_path name (or last component for nested paths). - 'title' → 'title' - 'metadata.author' → 'author' If PROVIDED: Uses this exact name in output. - source_path='doc_title', target_path='title' → 'title' - source_path='metadata.author', target_path='contributor' → 'contributor' Use cases: - Rename fields for cleaner API schemas - Avoid name conflicts with extractor outputs - Standardize field names across different sources Constraints: - Must not conflict with system fields (document_id, collection_id, etc.) - Must not conflict with extractor output fields - Must be a valid field name (alphanumeric, underscores, hyphens)
"title"
OPTIONAL. Default value if source field doesn't exist or is None. If NOT PROVIDED and field missing: Field is omitted from output document. If PROVIDED and field missing: Field is included with this default value. Type should match expected field type (string, int, list, dict, etc.).
"Unknown"
OPTIONAL. Whether this field MUST exist in source. If True and field missing: Raises validation error, processing fails. If False and field missing: Field omitted (or default used if provided). Use True for: Critical identifiers, required business fields. Use False for: Optional metadata, nice-to-have fields. Default: False (field is optional).
NOT REQUIRED. Whether to include ALL fields from source object/document in output. Default: False (only field_passthrough fields included).
Examples: False: source={a,b,c,d} + passthrough=[a,b] → output={a,b,embedding} True: source={a,b,c,d} + passthrough=[a→x] → output={x,b,c,d,embedding}
REQUIRED. Source configuration defining where data comes from. Type 'bucket': Process objects from one or more buckets (tier 1). Type 'collection': Process documents from upstream collection(s) (tier 2+). For multi-bucket sources, all buckets must have compatible schemas. For multi-collection sources, all collections must have compatible schemas. Determines input_schema and enables decomposition trees.
REQUIRED. Type of source for this collection. 'bucket': Process objects from one or more buckets (first-stage processing). 'collection': Process documents from another collection (downstream processing). Use 'bucket' for initial data ingestion, 'collection' for decomposition trees.
Available options: bucket, collection, taxonomy, cluster
bucket_ids
List of bucket IDs when type='bucket'. REQUIRED when type='bucket'. NOT ALLOWED when type='collection'. Can specify one or more buckets to process. Single bucket: Use array with one element ['bkt_id']. Multiple buckets: All buckets MUST have compatible schemas. Schema compatibility validated at collection creation. Compatible schemas have: 1) Same field names, 2) Same field types, 3) Same required status. Documents will include root_bucket_id to track which bucket they came from. Use cases: multi-region data, multi-team consolidation, environment aggregation.
Example: ["bkt_marketing_videos"]
collection_id
Collection ID when type='collection' (single collection). Use this OR collection_ids (not both). REQUIRED when type='collection' and processing single collection. NOT ALLOWED when type='bucket'. The collection will process documents from this upstream collection. The upstream collection's output_schema becomes this collection's input_schema. This enables decomposition trees (multi-stage pipelines). Example: Process frames collection → create scenes collection.
"col_video_frames"
List of collection IDs when type='collection' (multiple collections). Use this OR collection_id (not both). REQUIRED when type='collection' and processing multiple collections. NOT ALLOWED when type='bucket'. Used for operations that consolidate multiple upstream collections. Example: Clustering across multiple collections → cluster output collection. All collections must have compatible schemas for consolidation operations.
1["col_us_products", "col_eu_products"]List of original bucket IDs that source collections originated from. OPTIONAL. Only used when type='collection'. Tracks the complete lineage chain: buckets → collections → derived collections. Extracted from upstream collection metadata at collection creation time. Enables tracing derived collections (like cluster outputs) back to original data sources. Example: Cluster output collection inherits bucket IDs from its source collections. Format: List of bucket IDs with 'bkt_' prefix.
["bkt_marketing_videos"]Optional filters to apply to source data. When specified, only objects/documents matching these filters will be processed by this collection. Filters are evaluated at batch creation time. Uses same LogicalOperator model as list APIs for consistency.
Optional logical filters to apply to source data. Uses LogicalOperator model with AND/OR/NOT support. When specified, only objects/documents matching these filters will be processed by this collection. When null, all source data is processed (no filtering). Filters are consistent across all batch runs for this collection.
Logical AND operation - all conditions must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]
Logical OR operation - at least one condition must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]
Logical NOT operation - all conditions must be false
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]
Whether to perform case-sensitive matching
true
{
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
{
"description": "Filter only video content",
"filters": {
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
}
NOT REQUIRED (auto-generated). Unique identifier for this collection. Used for: API paths, document queries, pipeline references. Format: 'col_' prefix + 10 random alphanumeric characters. Stable after creation - use for all collection references.
NOT REQUIRED. Human-readable description of the collection's purpose. Use for: Documentation, team communication, UI display. Common pattern: Describe what the collection contains and what processing is applied.
"Video frames extracted at 1 FPS with CLIP embeddings"
NOT REQUIRED (auto-computed). Snapshot of bucket schemas at collection creation. Only populated for multi-bucket collections (source.type='bucket' with multiple bucket_ids). Key: bucket_id, Value: BucketSchema at time of collection creation. Used for: Schema compatibility validation, document lineage, debugging. Schema snapshot is immutable - bucket schema changes after collection creation do not affect this. Single-bucket collections may omit this field (schema in input_schema is sufficient).
Schema definition for bucket objects.
IMPORTANT: The bucket schema defines what fields your bucket objects will have; those fields are what a collection's input_mappings and field_passthrough reference.
Without a bucket_schema, collections cannot use input_mappings.
Each entry in properties is a schema field definition for a bucket object field.
Supported data types for bucket schema fields. Types fall into two categories: explicit types (type-safe) and the automatic type (flexible).
Available options: string, number, integer, float, boolean, object, array, date, datetime, json, file, text, image, audio, video, pdf, document, spreadsheet, presentation, dense_vector, sparse_vector, int8_vector, automatic
OPTIONAL. List of example values for this field. Used by Apps to show example inputs in the UI. Provide multiple diverse examples when possible.
null
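A sketch of what source_bucket_schemas might look like for a two-bucket collection; the bucket IDs and fields are hypothetical, and the schema shape follows the sketch shown earlier:
{
  "bkt_us_videos": { "properties": { "title": { "type": "text" } } },
  "bkt_eu_videos": { "properties": { "title": { "type": "text" } } }
}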
NOT REQUIRED (auto-computed). Lineage chain showing complete processing history. Each entry contains: source_config, feature_extractor, output_schema for one tier. Length indicates processing depth (1 = tier 1, 2 = tier 2, etc.). Use for: Understanding multi-tier pipelines, visualizing decomposition trees.
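For example, a two-tier pipeline (bucket → frames collection → scenes collection) would produce two lineage entries; the extractor names below are hypothetical:
[
  {
    "source_config": { "type": "bucket", "bucket_ids": ["bkt_marketing_videos"] },
    "feature_extractor": { "feature_extractor_name": "video_frame_extractor", "version": "v1" },
    "output_schema": { "properties": {} }
  },
  {
    "source_config": { "type": "collection", "collection_id": "col_video_frames" },
    "feature_extractor": { "feature_extractor_name": "scene_extractor", "version": "v1" },
    "output_schema": { "properties": {} }
  }
]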
Configuration of the source for this lineage entry
REQUIRED. Type of source for this collection. 'bucket': Process objects from one or more buckets (first-stage processing). 'collection': Process documents from another collection (downstream processing). Use 'bucket' for initial data ingestion, 'collection' for decomposition trees.
Available options: bucket, collection, taxonomy, cluster
bucket_ids
List of bucket IDs when type='bucket'. REQUIRED when type='bucket'. NOT ALLOWED when type='collection'. Can specify one or more buckets to process. Single bucket: Use array with one element ['bkt_id']. Multiple buckets: All buckets MUST have compatible schemas. Schema compatibility validated at collection creation. Compatible schemas have: 1) Same field names, 2) Same field types, 3) Same required status. Documents will include root_bucket_id to track which bucket they came from. Use cases: multi-region data, multi-team consolidation, environment aggregation.
Example: ["bkt_marketing_videos"]
collection_id
Collection ID when type='collection' (single collection). Use this OR collection_ids (not both). REQUIRED when type='collection' and processing single collection. NOT ALLOWED when type='bucket'. The collection will process documents from this upstream collection. The upstream collection's output_schema becomes this collection's input_schema. This enables decomposition trees (multi-stage pipelines). Example: Process frames collection → create scenes collection.
"col_video_frames"
List of collection IDs when type='collection' (multiple collections). Use this OR collection_id (not both). REQUIRED when type='collection' and processing multiple collections. NOT ALLOWED when type='bucket'. Used for operations that consolidate multiple upstream collections. Example: Clustering across multiple collections → cluster output collection. All collections must have compatible schemas for consolidation operations.
1["col_us_products", "col_eu_products"]List of original bucket IDs that source collections originated from. OPTIONAL. Only used when type='collection'. Tracks the complete lineage chain: buckets → collections → derived collections. Extracted from upstream collection metadata at collection creation time. Enables tracing derived collections (like cluster outputs) back to original data sources. Example: Cluster output collection inherits bucket IDs from its source collections. Format: List of bucket IDs with 'bkt_' prefix.
["bkt_marketing_videos"]Optional filters to apply to source data. When specified, only objects/documents matching these filters will be processed by this collection. Filters are evaluated at batch creation time. Uses same LogicalOperator model as list APIs for consistency.
Optional logical filters to apply to source data. Uses LogicalOperator model with AND/OR/NOT support. When specified, only objects/documents matching these filters will be processed by this collection. When null, all source data is processed (no filtering). Filters are consistent across all batch runs for this collection.
Logical AND operation - all conditions must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against; may be a literal or a dynamic reference. For dynamic values, provide the dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'. Example: "dynamic"
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]
Logical OR operation - at least one condition must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against; may be a literal or a dynamic reference. For dynamic values, provide the dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'. Example: "dynamic"
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]
Logical NOT operation - all conditions must be false
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against; may be a literal or a dynamic reference. For dynamic values, provide the dot-notation path to the value in the runtime query request, e.g., 'inputs.user_id'. Example: "dynamic"
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]
Whether to perform case-sensitive matching
true
{
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
{
"description": "Filter only video content",
"filters": {
"AND": [
{
"field": "blobs.type",
"operator": "eq",
"value": "video"
}
]
}
}
Single feature extractor applied at this stage
Name of the feature extractor
Version of the feature extractor
Unique identifier for the feature extractor instance, constructed from name + version.
Parameters for the feature extractor (model name, thresholds, etc.). See extractor-specific documentation for available parameters.
A typed value for a feature extractor parameter.
Mapping from extractor input names to source field paths. Tells the extractor which source fields to process. Example: {'image': 'thumbnail_url', 'text': 'description'}
Example: { "image": "product_image", "text": "title" }
field_passthrough
NOT REQUIRED. List of specific fields to pass through from source to output documents. These fields are included alongside extractor-computed features (embeddings, detections, etc.). Empty list = only extractor outputs in documents (default behavior). With entries = specified fields + extractor outputs in documents.
Output Schema: output_schema = field_passthrough fields + extractor output fields Example: ['title', 'category', 'text_extractor_v1_embedding']
REQUIRED. Path to the source field to copy. Simple fields: Use field name directly (e.g., 'title', 'campaign_id'). Nested fields: Use dot notation (e.g., 'metadata.author', 'config.model.version'). The field must exist in the source bucket schema or upstream collection schema. Without target_path, nested fields are flattened: 'metadata.author' becomes 'author' in output.
OPTIONAL. Target field name in output document. If NOT PROVIDED: Uses source_path name (or last component for nested paths). - 'title' → 'title' - 'metadata.author' → 'author' If PROVIDED: Uses this exact name in output. - source_path='doc_title', target_path='title' → 'title' - source_path='metadata.author', target_path='contributor' → 'contributor' Use cases: - Rename fields for cleaner API schemas - Avoid name conflicts with extractor outputs - Standardize field names across different sources Constraints: - Must not conflict with system fields (document_id, collection_id, etc.) - Must not conflict with extractor output fields - Must be a valid field name (alphanumeric, underscores, hyphens)
"title"
OPTIONAL. Default value if source field doesn't exist or is None. If NOT PROVIDED and field missing: Field is omitted from output document. If PROVIDED and field missing: Field is included with this default value. Type should match expected field type (string, int, list, dict, etc.).
"Unknown"
OPTIONAL. Whether this field MUST exist in source. If True and field missing: Raises validation error, processing fails. If False and field missing: Field omitted (or default used if provided). Use True for: Critical identifiers, required business fields. Use False for: Optional metadata, nice-to-have fields. Default: False (field is optional).
NOT REQUIRED. Whether to include ALL fields from source object/document in output. Default: False (only field_passthrough fields included).
Examples: False: source={a,b,c,d} + passthrough=[a,b] → output={a,b,embedding} True: source={a,b,c,d} + passthrough=[a→x] → output={x,b,c,d,embedding}
Output schema produced by this processing stage
Each entry in properties is a schema field definition for a bucket object field.
Supported data types for bucket schema fields. Types fall into two categories: explicit types (type-safe) and the automatic type (flexible).
Available options: string, number, integer, float, boolean, object, array, date, datetime, json, file, text, image, audio, video, pdf, document, spreadsheet, presentation, dense_vector, sparse_vector, int8_vector, automatic
OPTIONAL. List of example values for this field. Used by Apps to show example inputs in the UI. Provide multiple diverse examples when possible.
NOT REQUIRED (auto-computed from extractor). Vector indexes for semantic search. Populated from feature_extractor.required_vector_indexes. Defines: Which embeddings are indexed, dimensions, distance metrics. Use for: Understanding search capabilities, debugging vector queries.
NOT REQUIRED (auto-computed from extractor + namespace). Payload indexes for filtering. Enables efficient filtering on metadata fields, timestamps, IDs. Populated from: extractor requirements + namespace defaults. Use for: Understanding which fields support fast filtering.
NOT REQUIRED (defaults to True). Whether the collection accepts new documents. False: Collection exists but won't process new objects. True: Active and processing. Use for: Temporarily disabling collections without deletion.
NOT REQUIRED. Additional user-defined metadata for the collection. Arbitrary key-value pairs for custom organization, tracking, configuration. Not used by the platform - purely for user purposes. Common uses: team ownership, project tags, deployment environment.
{
"environment": "production",
"project": "Q4_campaign",
"team": "data-science"
}
Timestamp when the collection was created. Automatically set by the system when the collection is first saved to the database.
Timestamp when the collection was last updated. Automatically updated by the system whenever the collection is modified.
Number of documents currently in the collection. Automatically maintained by the system as documents are added or removed. Used for: Sorting, filtering, and pagination in list endpoints. Updated during batch processing and document operations.
0
Version number for the output_schema. Increments automatically when schema is updated via document sampling. Used to track schema evolution and trigger downstream collection schema updates.
Timestamp of last automatic schema sync from document sampling. Used to debounce schema updates (prevents thrashing).
Whether automatic schema discovery and sync is enabled for this collection. When True, schema is periodically updated by sampling documents. When False, schema remains fixed at creation time.
NOT REQUIRED. List of taxonomies to apply to documents in this collection. Each entry specifies: taxonomy_id, execution_mode (materialize/on_demand), optional filters. Materialized: Enrichments persisted to documents during ingestion. On-demand: Enrichments computed during retrieval (via taxonomy_join stages). Empty/null if no taxonomies attached. Use for: Categorization, hierarchical classification.
ID of the TaxonomyModel to execute.
'on_demand' executes at query time; 'materialize' materializes during ingestion.
Available options: on_demand, materialize
Optional collection to persist results into when execution_mode is 'materialize'. If omitted, the source collection is updated in-place.
Additional filters applied when scrolling the source collection before enrichment.
Logical AND operation - all conditions must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "name",
"operator": "eq",
"value": "John"
},
{
"field": "age",
"operator": "gte",
"value": 30
}
]
Logical OR operation - at least one condition must be true
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "status",
"operator": "eq",
"value": "active"
},
{
"field": "role",
"operator": "eq",
"value": "admin"
}
]
Logical NOT operation - all conditions must be false
Represents a single filter condition, with a field, a comparison operator, and a value to compare against.
field: Field name to filter on
value: Value to compare against
operator: Comparison operator. Available options: eq, ne, gt, lt, gte, lte, in, nin, contains, starts_with, ends_with, regex, exists, is_null, text
Example:
[
{
"field": "department",
"operator": "eq",
"value": "HR"
},
{
"field": "location",
"operator": "eq",
"value": "remote"
}
]
Whether to perform case-sensitive matching
true
null
NOT REQUIRED. List of clusters to automatically execute when batch processing completes. Each entry specifies: cluster_id, auto_execute_on_batch, min_document_threshold, cooldown_seconds. Clusters enrich source documents with cluster assignments (cluster_id, cluster_label, etc.). Empty/null if no clusters attached. Use for: Segmentation, grouping, pattern discovery.
ID of the cluster to execute (must exist and use this collection as input)
Automatically execute cluster when batch processing completes for this collection. If False, cluster must be executed manually via API.
Minimum number of documents required before executing cluster. If document_count < threshold, clustering is skipped. Useful to avoid clustering on small datasets.
Minimum time (in seconds) between automatic cluster executions. Prevents excessive re-clustering on frequent batch completions. Default: 3600 seconds (1 hour).
null
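A sketch of one cluster application entry, using the fields described above; the cluster ID is hypothetical:
[
  {
    "cluster_id": "clu_product_segments",
    "auto_execute_on_batch": true,
    "min_document_threshold": 1000,
    "cooldown_seconds": 3600
  }
]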
ID of the source collection that was cloned.