Schema Design

Well-designed schemas balance validation, flexibility, and performance. This guide covers bucket schema patterns, collection field mappings, and evolution strategies to keep your data model clean and scalable.

Bucket Schema Principles

1. Validate Inputs, Don’t Over-Constrain

Bucket schemas enforce object registration shape but shouldn’t replicate downstream processing logic.

Good:

{
  "schema": {
    "properties": {
      "title": { "type": "text", "required": true },
      "content": { "type": "text", "required": true },
      "category": { "type": "text" },
      "published_at": { "type": "datetime" }
    }
  }
}

Avoid:

{
  "schema": {
    "properties": {
      "title": { "type": "text", "required": true, "min_length": 10, "max_length": 200 },
      "content": { "type": "text", "required": true, "must_contain": ["keyword"] },
      "category": { "type": "text", "enum": ["tech", "business"] }  // Hard to extend
    }
  }
}

Why: Collections can apply transformations and filters. Bucket schemas should validate data integrity, not business rules.

2. Use Nested Objects for Grouping

Group related fields to improve readability and support partial updates:

{
  "schema": {
    "properties": {
      "content": {
        "type": "object",
        "properties": {
          "title": { "type": "text" },
          "body": { "type": "text" },
          "summary": { "type": "text" }
        }
      },
      "metadata": {
        "type": "object",
        "properties": {
          "author": { "type": "text" },
          "tags": { "type": "array" },
          "published_at": { "type": "datetime" }
        }
      }
    }
  }
}

3. Arrays for Multi-Valued Fields

Use arrays for fields that naturally have multiple values:

{
  "tags": { "type": "array", "items": { "type": "text" } },
  "authors": { "type": "array", "items": { "type": "text" } },
  "images": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "url": { "type": "url" },
        "caption": { "type": "text" }
      }
    }
  }
}

4. Separate Mutable and Immutable Fields

Structure schemas to separate fields that change from fields that remain constant after creation:

{
  "immutable": {
    "created_at": { "type": "datetime" },
    "source_system": { "type": "text" },
    "original_filename": { "type": "text" }
  },
  "mutable": {
    "status": { "type": "text" },
    "assignee": { "type": "text" },
    "priority": { "type": "number" }
  }
}

This pattern clarifies which fields can be updated via PATCH operations.
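As a rough illustration, an application could check a PATCH body against the two groups before sending it. This is a minimal client-side sketch: the field sets mirror the example above, and the helper name `check_patch` is hypothetical, not part of the Mixpeek API.

```python
# Hypothetical client-side guard; field names mirror the example schema above.
IMMUTABLE_FIELDS = {"created_at", "source_system", "original_filename"}
MUTABLE_FIELDS = {"status", "assignee", "priority"}


def check_patch(patch: dict) -> list[str]:
    """Return a list of problems with a proposed PATCH body (empty if none)."""
    problems = []
    for field in patch:
        if field in IMMUTABLE_FIELDS:
            problems.append(f"{field} is immutable")
        elif field not in MUTABLE_FIELDS:
            problems.append(f"{field} is not a known field")
    return problems
```

An empty list means the body touches only mutable fields and is safe to send.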

Collection Mapping Patterns

1. Use Explicit Input Mappings

Always specify input_mappings explicitly rather than relying on defaults.

Good:

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "input_mappings": {
      "text": "content.body"  // Clear source path
    }
  }
}

Avoid:

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor"
    // Implicitly maps "text" field - fragile if bucket schema changes
  }
}

2. Passthrough Only What’s Needed

Use field_passthrough to selectively propagate metadata:

{
  "field_passthrough": [
    { "source_path": "metadata.category" },
    { "source_path": "metadata.tags" },
    { "source_path": "metadata.published_at" }
  ]
}

Don’t passthrough:

Large text blobs (duplicate storage)
Sensitive fields not needed for retrieval
Computed fields that can be derived on-demand
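To make the effect of selective passthrough concrete, the sketch below resolves dotted source_path values against a sample object and copies only the listed fields. It is an illustration, not Mixpeek's implementation: the helper names are hypothetical, and keying the output by the last path segment is an assumption.

```python
from typing import Any

_MISSING = object()  # sentinel so that a stored None is still passed through


def resolve_path(obj: dict, dotted_path: str) -> Any:
    """Walk a dotted path such as 'metadata.tags'; return _MISSING if absent."""
    current: Any = obj
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return _MISSING
        current = current[key]
    return current


def apply_passthrough(obj: dict, field_passthrough: list[dict]) -> dict:
    """Copy only the listed source paths (assumption: keyed by last segment)."""
    out = {}
    for entry in field_passthrough:
        path = entry["source_path"]
        value = resolve_path(obj, path)
        if value is not _MISSING:
            out[path.split(".")[-1]] = value
    return out
```

Fields that are not listed, such as a large content.body, never reach the output, and paths that are absent from the object are skipped.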

3. Namespace Feature Outputs

If multiple extractors produce similar outputs, use unique names:

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "output_namespace": "en"  // Produces mixpeek://text_extractor@v1/en/text_embedding
  }
}

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "output_namespace": "es",
    "parameters": { "model": "multilingual-e5-large-instruct", "language": "es" }
  }
}

This enables language-specific retrievers without collection duplication.

4. Leverage Chunking Strategies

Match chunking to content type:

Content Type	Strategy	Rationale
Blog posts	`paragraph`	Preserves narrative flow
Documentation	`sentence`	Precise Q&A matching
Transcripts	`time_window` (60s)	Natural speech boundaries
Code	`function`	Semantic units

{
  "parameters": {
    "chunk_strategy": "paragraph",
    "chunk_size": 512,
    "chunk_overlap": 50
  }
}
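The interaction between chunk_size and chunk_overlap can be sketched with a fixed-size sliding window. This is not the paragraph strategy itself, and it assumes both parameters count whitespace-separated words; the actual extractor may count tokens or characters instead.

```python
def chunk_words(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Seven words, windows of 4 with 1 shared word:
# chunk_words("a b c d e f g", 4, 1) returns ["a b c d", "d e f g"]
```

A larger overlap repeats more context across chunk boundaries at the cost of more chunks to embed and store.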

Schema Evolution

Adding Fields (Non-Breaking)

New optional fields are safe:

// Before
{
  "schema": {
    "properties": {
      "title": { "type": "text" }
    }
  }
}

// After (safe)
{
  "schema": {
    "properties": {
      "title": { "type": "text" },
      "subtitle": { "type": "text" }  // Optional, non-breaking
    }
  }
}

Existing objects remain valid; new objects can include subtitle.

Making Fields Required (Breaking)

Existing objects may lack the field, so making it required needs a three-step migration:

// Step 1: Add field as optional
{ "description": { "type": "text" } }

// Step 2: Backfill existing objects
PATCH /v1/buckets/{bucket_id}/objects/{object_id}
{ "metadata": { "description": "Default description" } }

// Step 3: Make required
{ "description": { "type": "text", "required": true } }
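The backfill in step 2 can be planned offline before any requests are sent. The sketch below assumes each object is a dict with an `object_id` key and an optional `metadata` dict, following the PATCH body shown above; that shape and the helper name `plan_backfill` are assumptions for illustration.

```python
def plan_backfill(objects: list[dict], field: str, default: str) -> list[tuple[str, dict]]:
    """Return (object_id, patch_body) pairs for objects that lack `field`."""
    patches = []
    for obj in objects:
        metadata = obj.get("metadata") or {}
        if metadata.get(field) in (None, ""):
            # Body mirrors the PATCH example in step 2.
            patches.append((obj["object_id"], {"metadata": {field: default}}))
    return patches
```

Each pair maps to one PATCH /v1/buckets/{bucket_id}/objects/{object_id} call. Move to step 3 only once a rerun of the plan comes back empty.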

Changing Field Types (Breaking)

Create a new field instead of mutating:

// Before
{ "price": { "type": "text" } }  // "19.99"

// Migration (add new field)
{ "price_numeric": { "type": "number" } }  // 19.99

// Deprecate old field
{ "price": { "type": "text", "deprecated": true } }
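Populating price_numeric from the existing text values needs a conversion that tolerates bad input. A minimal sketch, assuming prices are plain decimal strings such as "19.99" with no currency symbols or thousands separators:

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def to_price_numeric(price_text: Optional[str]) -> Optional[float]:
    """Parse a text price such as '19.99'; return None if it cannot be parsed."""
    if price_text is None:
        return None
    try:
        value = Decimal(price_text.strip())
    except InvalidOperation:
        return None
    # Reject NaN and Infinity, which Decimal parses without raising.
    return float(value) if value.is_finite() else None
```

Returning None rather than raising lets a backfill script log unparseable values and continue, so the old price field stays as the record of what was originally stored.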

Versioning Collections

For major schema changes, create a new collection:

POST /v1/collections
{
  "collection_name": "products-v2",
  "source": { "type": "bucket", "bucket_id": "bkt_products" },
  "feature_extractor": {
    // Updated mappings and extractors
  }
}

Migrate documents:

Keep products-v1 read-only
Process new batches into products-v2
Update retrievers to query both collections during transition
Archive products-v1 after migration

Common Anti-Patterns

❌ Storing Computed Values in Bucket Metadata

Problem:

{
  "metadata": {
    "content": "Sample text...",
    "word_count": 150,  // Computed from content
    "embedding": [0.1, 0.2, ...]  // Computed by extractor
  }
}

Solution: Store only source data in buckets; let extractors compute derived values.

❌ Inconsistent Naming Conventions

Problem:

{
  "CreatedDate": "...",  // PascalCase
  "updated_at": "...",   // snake_case
  "PublishTime": "..."   // Mixed
}

Solution: Enforce consistent naming (prefer snake_case for compatibility).
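A small check can flag keys that break the convention before objects are registered. This sketch handles PascalCase, camelCase, and hyphenated names; the helper names are hypothetical.

```python
import re


def to_snake_case(name: str) -> str:
    """Convert PascalCase, camelCase, or kebab-case names to snake_case."""
    # Underscore between a lowercase letter or digit and a following capital.
    s = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    # Split a run of capitals from a following capitalized word ("HTTPServer").
    s = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", "_", s)
    return s.replace("-", "_").lower()


def non_snake_case_keys(record: dict) -> list[str]:
    """List the top-level keys that are not already snake_case."""
    return [key for key in record if key != to_snake_case(key)]
```

For the problem example above, `non_snake_case_keys` reports CreatedDate and PublishTime and leaves updated_at alone.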

❌ Overusing Nested Objects

Problem:

{
  "data": {
    "content": {
      "main": {
        "text": {
          "body": "..."  // 5 levels deep
        }
      }
    }
  }
}

Solution: Flatten to 2-3 levels max for readability and query simplicity.
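Nesting depth is easy to measure in a pre-deployment check. A minimal sketch that counts levels of nested objects (the function name is hypothetical):

```python
def max_depth(value) -> int:
    """Count levels of nested dicts; a flat dict has depth 1."""
    if not isinstance(value, dict):
        return 0
    return 1 + max((max_depth(child) for child in value.values()), default=0)


too_deep = {"data": {"content": {"main": {"text": {"body": "..."}}}}}
flattened = {"content": {"body": "..."}}

assert max_depth(too_deep) == 5   # matches the "5 levels deep" example
assert max_depth(flattened) == 2  # within the suggested 2-3 level limit
```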

❌ Missing Timestamps

Problem: No created_at or updated_at fields.

Solution: Always include audit timestamps:

{
  "created_at": { "type": "datetime", "required": true },
  "updated_at": { "type": "datetime" }
}

❌ Hardcoding Enum Values

Problem:

{
  "status": { "type": "text", "enum": ["draft", "published"] }
}

Adding "archived" requires schema migration.

Solution: Use a flexible text field plus application-level validation or taxonomy enrichment.
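Application-level validation of this kind might look like the sketch below. The allowed set lives in application code or config, so adding a value is a one-line change with no schema migration; the names here are illustrative only.

```python
# Hypothetical allow-list kept in application config rather than the bucket schema.
ALLOWED_STATUSES = {"draft", "published", "archived"}


def validate_status(value: str) -> str:
    """Normalize a status and reject values outside the allow-list."""
    normalized = value.strip().lower()
    if normalized not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {value!r}")
    return normalized
```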

Validation Best Practices

Use Required Fields Sparingly

Mark fields required only if absolutely necessary for downstream processing:

{
  "title": { "type": "text", "required": true },  // Extractors need this
  "tags": { "type": "array" }  // Optional, but useful
}

Validate Externally

For complex validation (e.g., “URL must be from allowed domains”), validate in your application before calling Mixpeek.
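The allowed-domains rule, for instance, can be checked with the standard library before an object is registered. The allow-list below is a placeholder.

```python
from urllib.parse import urlparse

# Placeholder allow-list; replace with the domains your application trusts.
ALLOWED_DOMAINS = {"example.com", "cdn.example.com"}


def is_allowed_url(url: str) -> bool:
    """Accept only http(s) URLs whose host is exactly on the allow-list."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in {"http", "https"} and host in ALLOWED_DOMAINS
```

Comparing the parsed hostname, rather than searching the raw string, avoids accepting URLs that only mention an allowed domain in their path or query.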

Enable Schema Linting

Check schemas before deployment:

# Validate schema before creating bucket
POST /v1/buckets/validate
{
  "schema": { ... }
}

Multi-Collection Strategies

Separate by Modality

Create distinct collections per feature type:

products-text → text embeddings
products-images → visual embeddings
products-metadata → structured data only

Query multiple collections via retrievers for cross-modal search.

Separate by Language

For multilingual content:

docs-en → English embeddings
docs-es → Spanish embeddings
docs-fr → French embeddings

Use retriever stages to route queries by detected language.

Separate by Lifecycle

For content with different retention policies:

logs-hot → last 7 days (fast storage)
logs-warm → 8-30 days (slower storage)
logs-cold → 30+ days (archive)
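Choosing a tier from a record's age could be sketched as follows. The collection names come from the list above; treating exactly 30 days as warm is an assumption, since the ranges above overlap at that boundary.

```python
from datetime import datetime, timedelta, timezone


def collection_for(created_at: datetime, now: datetime) -> str:
    """Pick the lifecycle collection for a record based on its age."""
    age = now - created_at
    if age <= timedelta(days=7):
        return "logs-hot"
    if age <= timedelta(days=30):
        return "logs-warm"
    return "logs-cold"


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
assert collection_for(now - timedelta(days=3), now) == "logs-hot"
assert collection_for(now - timedelta(days=8), now) == "logs-warm"
assert collection_for(now - timedelta(days=45), now) == "logs-cold"
```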

Checklist

Design bucket schema

Include required source fields only
Add audit timestamps (created_at, updated_at)
Group related fields in nested objects
Use arrays for multi-valued fields

Define collection mappings

Explicit input_mappings for all extractors
Selective field_passthrough (no large blobs)
Choose appropriate chunk_strategy
Namespace outputs if extracting multiple times

Plan for evolution

Add optional fields for new requirements
Version collections for breaking changes
Migrate data with backfill scripts
Deprecate old fields gracefully

Validate and test

Lint schemas before deployment
Test with representative sample data
Monitor __fully_enriched rates
Review document payloads in Qdrant

Next Steps

Review Collections for full configuration options
Explore Feature Extractors capabilities
Learn Data Model for entity relationships
Check Buckets API for schema management
