POST /v1/collections

Create Collection
curl --request POST \
  --url https://api.mixpeek.com/v1/collections \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --header 'X-Namespace: <x-namespace>' \
  --data '{
  "collection_name": "product_embeddings",
  "description": "Generate image embeddings with passthrough fields",
  "enabled": true,
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "field_passthrough": [
      {
        "required": true,
        "source_path": "title"
      },
      {
        "source_path": "category"
      }
    ],
    "input_mappings": {
      "image": "product_image"
    },
    "parameters": {
      "model": "clip-vit-base-patch32"
    },
    "version": "v1"
  },
  "source": {
    "bucket_id": "bkt_products",
    "type": "bucket"
  }
}'
{
  "collection_id": "col_a1b2c3d4e5",
  "collection_name": "product_embeddings",
  "description": "Generate image embeddings with passthrough fields",
  "enabled": true,
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "field_passthrough": [
      {
        "required": true,
        "source_path": "title"
      },
      {
        "source_path": "category"
      }
    ],
    "input_mappings": {
      "image": "product_image"
    },
    "version": "v1"
  },
  "input_schema": {
    "properties": {
      "product_image": {
        "type": "image"
      },
      "title": {
        "type": "string"
      },
      "category": {
        "type": "string"
      }
    }
  },
  "output_schema": {
    "properties": {
      "title": {
        "type": "string"
      },
      "category": {
        "type": "string"
      },
      "image_extractor_v1_embedding": {
        "type": "array"
      }
    }
  },
  "source": {
    "bucket_id": "bkt_products",
    "type": "bucket"
  }
}

Authorizations

Authorization
string
header
required

Bearer token authentication using your API key. Format: 'Bearer your_api_key'. To get an API key, create an account at mixpeek.com/start and generate a key in your account settings.

Headers

Authorization
string
required

REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.

Examples:

"Bearer sk_live_abc123def456"

"Bearer sk_test_xyz789"

X-Namespace
string
required

REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'

Examples:

"ns_abc123def456"

"production"

"my-namespace"

Body

application/json

Request model for creating a new collection.

Collections process data from buckets or other collections using a single feature extractor.

KEY ARCHITECTURAL CHANGE: Each collection has EXACTLY ONE feature extractor.

  • Use field_passthrough to include additional source fields in output documents
  • Multiple extractors = multiple collections
  • This simplifies processing and makes output schema deterministic
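
The "multiple extractors = multiple collections" rule means chaining: a second collection uses the first as its source. A minimal sketch of two chained definitions (all names and IDs here are hypothetical):

```python
# Two chained collection definitions: tier 1 processes a bucket, tier 2
# processes tier 1's output documents. Names/IDs are hypothetical.
tier1 = {
    "collection_name": "video_frames",
    "source": {"type": "bucket", "bucket_id": "bkt_videos"},
    "feature_extractor": {
        "feature_extractor_name": "video_extractor",
        "version": "v1",
        "input_mappings": {"video": "video_url"},
    },
}

tier2 = {
    "collection_name": "frame_embeddings",
    # Collection source: consumes the documents tier 1 produces.
    "source": {"type": "collection", "collection_id": "col_video_frames"},
    "feature_extractor": {
        "feature_extractor_name": "image_extractor",
        "version": "v1",
        "input_mappings": {"image": "frame"},
    },
}
```

Each collection still has exactly one extractor; the pipeline effect comes from the tier-2 collection reading the tier-1 collection's documents.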

CRITICAL: To use input_mappings:

  1. Your source bucket MUST have a bucket_schema defined
  2. The input_mappings reference fields from that bucket_schema
  3. The system validates that mapped fields exist in the source schema

Example workflow:

  1. Create bucket with schema: { "properties": { "image": {"type": "image"}, "title": {"type": "string"} } }
  2. Upload objects conforming to that schema
  3. Create collection with:
    • input_mappings: { "image": "image" }
    • field_passthrough: [{"source_path": "title"}]
  4. Output documents will have both extractor outputs AND passthrough fields

Schema Computation:

  • output_schema is computed IMMEDIATELY when collection is created
  • output_schema = field_passthrough fields + extractor output fields
  • No waiting for documents to be processed!
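
The schema computation above can be sketched as a pure function. The extractor-output naming pattern (`{extractor}_{version}_{field}`) matches the example response; the passthrough type carryover is simplified here to strings, whereas the platform carries types over from input_schema:

```python
# Sketch of the deterministic output_schema computation described above.
def compute_output_schema(field_passthrough, extractor_name, version, extractor_outputs):
    props = {}
    # Passthrough fields keep their (possibly renamed) path.
    for fp in field_passthrough:
        name = fp.get("target_path", fp["source_path"])
        props[name] = {"type": "string"}  # real types come from input_schema
    # Extractor outputs are namespaced by extractor name and version.
    for out_name, out_type in extractor_outputs.items():
        props[f"{extractor_name}_{version}_{out_name}"] = {"type": out_type}
    return {"properties": props}


schema = compute_output_schema(
    [{"source_path": "title"}],
    "text_extractor", "v1",
    {"embedding": "array"},
)
```

Because both inputs are known at creation time, the result is available immediately, before any document is processed.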
collection_name
string
required

Name of the collection to create

Examples:

"video_frames"

"product_embeddings"

source
object
required

Source configuration (bucket or collection) for this collection

Examples:
{
  "bucket_ids": ["bkt_marketing_videos"],
  "description": "Single bucket source",
  "type": "bucket"
}
{
  "bucket_ids": [
    "bkt_us_products",
    "bkt_eu_products",
    "bkt_asia_products"
  ],
  "description": "Multi-bucket source (region consolidation)",
  "type": "bucket"
}
{
  "bucket_ids": [
    "bkt_marketing_assets",
    "bkt_sales_assets",
    "bkt_support_assets"
  ],
  "description": "Multi-bucket source (team consolidation)",
  "type": "bucket"
}
{
  "collection_id": "col_video_frames",
  "description": "Collection source (decomposition tree)",
  "type": "collection"
}
feature_extractor
object
required

Single feature extractor for this collection. Use field_passthrough within the extractor config to include additional source fields. For multiple extractors, create multiple collections and use collection-to-collection pipelines.

Examples:
{
  "description": "Text extractor with field passthrough",
  "feature_extractor_name": "text_extractor",
  "field_passthrough": [
    { "required": true, "source_path": "title" },
    { "source_path": "author" }
  ],
  "input_mappings": { "text": "content" },
  "parameters": { "model": "text-embedding-3-small" },
  "version": "v1"
}
{
  "description": "Video extractor with campaign metadata",
  "feature_extractor_name": "video_extractor",
  "field_passthrough": [
    { "source_path": "campaign_id" },
    { "source_path": "duration_seconds" }
  ],
  "input_mappings": { "video": "video_url" },
  "parameters": { "fps": 1 },
  "version": "v1"
}
{
  "description": "Image extractor with all source fields",
  "feature_extractor_name": "image_extractor",
  "include_all_source_fields": true,
  "input_mappings": { "image": "product_image" },
  "parameters": { "model": "clip-vit-base-patch32" },
  "version": "v1"
}
{
  "description": "Text extractor with field renaming",
  "feature_extractor_name": "text_extractor",
  "field_passthrough": [
    {
      "source_path": "metadata.author",
      "target_path": "author"
    },
    {
      "source_path": "metadata.created_at",
      "target_path": "created_at"
    }
  ],
  "input_mappings": { "text": "content" },
  "version": "v1"
}
description
string | null

Description of the collection

input_schema
object | null

Input schema for the collection. If not provided, it is inferred from the source bucket's bucket_schema or the source collection's output_schema. REQUIRED for input_mappings to work: it defines which fields can be mapped to feature extractors, using the same schema definition format as bucket objects.

IMPORTANT: The bucket schema defines what fields your bucket objects will have. This schema is REQUIRED if you want to:

  1. Create collections that use input_mappings to process your bucket data
  2. Validate object structure before ingestion
  3. Enable type-safe data pipelines

The schema defines the custom fields that will be used in:

  • Blob properties (e.g., "content", "thumbnail", "transcript")
  • Object metadata structure
  • Blob data structures

Example workflow:

  1. Create bucket WITH schema defining your data structure
  2. Upload objects that conform to that schema
  3. Create collections that map schema fields to feature extractors

Without a bucket_schema, collections cannot use input_mappings.
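
The validation rule above can be sketched as a simplified check, assuming a flat bucket_schema (the platform's actual validation may do more, e.g. type checking):

```python
# Simplified version of the check described above: each bucket field
# referenced by input_mappings must exist in the bucket_schema.
bucket_schema = {
    "properties": {
        "image": {"type": "image"},
        "title": {"type": "string"},
    }
}


def validate_input_mappings(input_mappings: dict, schema: dict) -> None:
    missing = [field for field in input_mappings.values()
               if field not in schema["properties"]]
    if missing:
        raise ValueError(f"input_mappings reference unknown fields: {missing}")


validate_input_mappings({"image": "image"}, bucket_schema)  # passes silently
```

Mapping an extractor input to a field the schema does not declare raises an error, which is why a missing bucket_schema blocks input_mappings entirely.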

enabled
boolean
default:true

Whether the collection is enabled

metadata
object | null

Additional metadata for the collection

Response

Successful Response

Response model for collection endpoints.

collection_name
string
required

REQUIRED. Human-readable name for the collection. Must be unique within the namespace. Used for: Display, lookups (can query by name or ID), organization. Format: Alphanumeric with underscores/hyphens, 3-100 characters. Examples: 'product_embeddings', 'video_frames', 'customer_documents'.

Required string length: 3 - 100
Examples:

"video_frames"

"product_embeddings"

"customer_documents"

input_schema
object
required

REQUIRED (auto-computed from source). Input schema defining fields available to the feature extractor. Source: bucket.bucket_schema (if source.type='bucket') OR upstream_collection.output_schema (if source.type='collection'). Determines: Which fields can be used in input_mappings and field_passthrough. This is the 'left side' of the transformation - what data goes IN. Format: BucketSchema with properties dict. Use for: Validating input_mappings, configuring field_passthrough.

output_schema
object
required

REQUIRED (auto-computed at creation). Output schema defining fields in final documents. Computed as: field_passthrough fields + extractor output fields (deterministic). Known IMMEDIATELY when collection is created - no waiting for documents! This is the 'right side' of the transformation - what data comes OUT. Use for: Understanding document structure, building queries, schema validation. Example: a passthrough field 'title' plus an extractor output 'embedding' yields an output_schema containing both.

feature_extractor
object
required

REQUIRED. Single feature extractor configuration for this collection. Defines: extractor name/version, input_mappings, field_passthrough, parameters. Each collection has exactly ONE extractor (earlier versions supported multiple). For multiple extractors: create multiple collections and use collection-to-collection pipelines. Use field_passthrough to include additional source fields beyond extractor outputs.

Examples:
{
  "description": "Text extractor with field passthrough",
  "feature_extractor_name": "text_extractor",
  "field_passthrough": [
    { "required": true, "source_path": "title" },
    { "source_path": "author" }
  ],
  "input_mappings": { "text": "content" },
  "parameters": { "model": "text-embedding-3-small" },
  "version": "v1"
}
{
  "description": "Video extractor with campaign metadata",
  "feature_extractor_name": "video_extractor",
  "field_passthrough": [
    { "source_path": "campaign_id" },
    { "source_path": "duration_seconds" }
  ],
  "input_mappings": { "video": "video_url" },
  "parameters": { "fps": 1 },
  "version": "v1"
}
{
  "description": "Image extractor with all source fields",
  "feature_extractor_name": "image_extractor",
  "include_all_source_fields": true,
  "input_mappings": { "image": "product_image" },
  "parameters": { "model": "clip-vit-base-patch32" },
  "version": "v1"
}
{
  "description": "Text extractor with field renaming",
  "feature_extractor_name": "text_extractor",
  "field_passthrough": [
    {
      "source_path": "metadata.author",
      "target_path": "author"
    },
    {
      "source_path": "metadata.created_at",
      "target_path": "created_at"
    }
  ],
  "input_mappings": { "text": "content" },
  "version": "v1"
}
source
object
required

REQUIRED. Source configuration defining where data comes from. Type 'bucket': Process objects from one or more buckets (tier 1). Type 'collection': Process documents from upstream collection (tier 2+). For multi-bucket sources, all buckets must have compatible schemas. Determines input_schema and enables decomposition trees.

Examples:
{
  "bucket_ids": ["bkt_marketing_videos"],
  "description": "Single bucket source",
  "type": "bucket"
}
{
  "bucket_ids": [
    "bkt_us_products",
    "bkt_eu_products",
    "bkt_asia_products"
  ],
  "description": "Multi-bucket source (region consolidation)",
  "type": "bucket"
}
{
  "bucket_ids": [
    "bkt_marketing_assets",
    "bkt_sales_assets",
    "bkt_support_assets"
  ],
  "description": "Multi-bucket source (team consolidation)",
  "type": "bucket"
}
{
  "collection_id": "col_video_frames",
  "description": "Collection source (decomposition tree)",
  "type": "collection"
}
collection_id
string

NOT REQUIRED (auto-generated). Unique identifier for this collection. Used for: API paths, document queries, pipeline references. Format: 'col_' prefix + 10 random alphanumeric characters. Stable after creation - use for all collection references.

Examples:

"col_a1b2c3d4e5"

"col_xyz789abc"

"col_video_frames"
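
The stated ID format can be checked with a small regex. Note that strings like "col_video_frames" are human-chosen collection_names rather than generated IDs; since lookups accept either a name or an ID, both forms can appear where a collection reference is expected. A sketch:

```python
import re

# Matches the documented generated-ID format: 'col_' + 10 alphanumerics.
COLLECTION_ID = re.compile(r"col_[A-Za-z0-9]{10}")


def looks_like_generated_id(value: str) -> bool:
    """True only for IDs in the documented generated format."""
    return COLLECTION_ID.fullmatch(value) is not None
```

This distinction only matters for tooling that wants to treat IDs and names differently; the API itself resolves both.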

description
string | null

NOT REQUIRED. Human-readable description of the collection's purpose. Use for: Documentation, team communication, UI display. Common pattern: Describe what the collection contains and what processing is applied.

Examples:

"Video frames extracted at 1 FPS with CLIP embeddings"

"Product catalog with image embeddings and metadata"

source_bucket_schemas
object | null

NOT REQUIRED (auto-computed). Snapshot of bucket schemas at collection creation. Only populated for multi-bucket collections (source.type='bucket' with multiple bucket_ids). Key: bucket_id, Value: BucketSchema at time of collection creation. Used for: Schema compatibility validation, document lineage, debugging. Schema snapshot is immutable - bucket schema changes after collection creation do not affect this. Single-bucket collections may omit this field (schema in input_schema is sufficient).

Examples:

null

{
  "bkt_eu_products": {
    "properties": {
      "image": { "type": "image" },
      "title": { "type": "string" },
      "price": { "type": "number" }
    },
    "required": ["image", "title"]
  },
  "bkt_us_products": {
    "properties": {
      "image": { "type": "image" },
      "title": { "type": "string" },
      "price": { "type": "number" }
    },
    "required": ["image", "title"]
  }
}
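
The compatibility requirement for multi-bucket sources ("all buckets must have compatible schemas", per the source field) can be sketched as a pairwise type check over a source_bucket_schemas snapshot; the platform's real validation may be stricter:

```python
# Simplified compatibility check over a source_bucket_schemas snapshot:
# a shared property name must have the same type in every bucket schema.
def schemas_compatible(snapshot: dict) -> bool:
    merged = {}
    for schema in snapshot.values():
        for name, spec in schema["properties"].items():
            if merged.setdefault(name, spec["type"]) != spec["type"]:
                return False  # same field name, conflicting types
    return True


snapshot = {
    "bkt_us_products": {"properties": {"image": {"type": "image"},
                                       "title": {"type": "string"}}},
    "bkt_eu_products": {"properties": {"image": {"type": "image"},
                                       "title": {"type": "string"}}},
}
```

Because the snapshot is taken at collection creation, later changes to a bucket's schema do not retroactively break this check.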
source_lineage
SingleLineageEntry · object[]

NOT REQUIRED (auto-computed). Lineage chain showing complete processing history. Each entry contains: source_config, feature_extractor, output_schema for one tier. Length indicates processing depth (1 = tier 1, 2 = tier 2, etc.). Use for: Understanding multi-tier pipelines, visualizing decomposition trees.

Examples:
[]
[
  {
    "output_schema": {},
    "source_config": {
      "bucket_id": "bkt_videos",
      "type": "bucket"
    }
  }
]
vector_indexes
any[]

NOT REQUIRED (auto-computed from extractor). Vector indexes for semantic search. Populated from feature_extractor.required_vector_indexes. Defines: Which embeddings are indexed, dimensions, distance metrics. Use for: Understanding search capabilities, debugging vector queries.

Examples:
[]
[
  {
    "distance": "cosine",
    "name": "text_extractor_v1_embedding",
    "size": 1024
  }
]
payload_indexes
any[]

NOT REQUIRED (auto-computed from extractor + namespace). Payload indexes for filtering. Enables efficient filtering on metadata fields, timestamps, IDs. Populated from: extractor requirements + namespace defaults. Use for: Understanding which fields support fast filtering.

Examples:
[]
[
  {
    "field_name": "collection_id",
    "type": "keyword"
  }
]
enabled
boolean
default:true

NOT REQUIRED (defaults to True). Whether the collection accepts new documents. False: Collection exists but won't process new objects. True: Active and processing. Use for: Temporarily disabling collections without deletion.

Examples:

true

false

metadata
object | null

NOT REQUIRED. Additional user-defined metadata for the collection. Arbitrary key-value pairs for custom organization, tracking, configuration. Not used by the platform - purely for user purposes. Common uses: team ownership, project tags, deployment environment.

Examples:
{
  "environment": "production",
  "project": "Q4_campaign",
  "team": "data-science"
}

null

created_at
string<date-time> | null

Timestamp when the collection was created. Automatically set by the system when the collection is first saved to the database.

updated_at
string<date-time> | null

Timestamp when the collection was last updated. Automatically updated by the system whenever the collection is modified.

document_count
integer | null

Number of documents in the collection

taxonomy_applications
TaxonomyApplicationConfig · object[] | null

NOT REQUIRED. List of taxonomies to apply to documents in this collection. Each entry specifies: taxonomy_id, execution_mode (materialize/on_demand), optional filters. Materialized: Enrichments persisted to documents during ingestion. On-demand: Enrichments computed during retrieval (via taxonomy_join stages). Empty/null if no taxonomies attached. Use for: Categorization, hierarchical classification.

Examples:

null

[
  {
    "execution_mode": "on_demand",
    "taxonomy_id": "tax_categories"
  },
  {
    "execution_mode": "materialize",
    "taxonomy_id": "tax_brands"
  }
]
taxonomy_count
integer | null

Number of taxonomies connected to this collection

retriever_count
integer | null

Number of retrievers connected to this collection