Bucket Schema Principles
1. Validate Inputs, Don’t Over-Constrain
Bucket schemas enforce the shape of registered objects but shouldn't replicate downstream processing logic.
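Good: a sketch that validates structure only. This assumes a JSON-style schema with `properties`, `type`, and `required` keys; the exact format lives in the Buckets API docs, and all field names here are illustrative:

```python
# Validates shape and types; business rules (price ranges, domain
# allow-lists) stay in the application or enrichment pipeline.
good_schema = {
    "properties": {
        "title": {"type": "text", "required": True},
        "price": {"type": "float"},       # no min/max constraints here
        "product_url": {"type": "text"},  # domain checks happen upstream
    }
}
```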
2. Use Nested Objects for Grouping
Group related fields to improve readability and support partial updates:
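A sketch under the same assumed schema format, grouping pricing and shipping fields (names illustrative):

```python
# Related fields grouped into nested objects; each group can be read
# or patched as a unit.
schema = {
    "properties": {
        "title": {"type": "text", "required": True},
        "pricing": {
            "type": "object",
            "properties": {
                "amount": {"type": "float"},
                "currency": {"type": "text"},
            },
        },
        "shipping": {
            "type": "object",
            "properties": {
                "weight_kg": {"type": "float"},
                "carrier": {"type": "text"},
            },
        },
    }
}
```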
3. Arrays for Multi-Valued Fields
Use arrays for fields that naturally have multiple values:
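For example (same assumed format, illustrative names):

```python
# Arrays for fields that naturally hold multiple values.
schema = {
    "properties": {
        "tags": {"type": "array", "items": {"type": "text"}},
        "image_urls": {"type": "array", "items": {"type": "text"}},
        "authors": {"type": "array", "items": {"type": "text"}},
    }
}
```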
4. Separate Mutable and Immutable Fields
Structure schemas to distinguish fields that change from fields that remain constant, so that mutable fields are clean targets for PATCH operations:
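A sketch of the split, again in the assumed schema format with illustrative names:

```python
schema = {
    "properties": {
        # Immutable: set once at registration, never patched.
        "sku": {"type": "text", "required": True},
        "created_at": {"type": "datetime"},
        # Mutable: natural targets for PATCH operations.
        "status": {"type": "text"},
        "price": {"type": "float"},
        "updated_at": {"type": "datetime"},
    }
}
```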
Collection Mapping Patterns
1. Use Explicit Input Mappings
Always specify `input_mappings` explicitly rather than relying on defaults:
Good:
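A sketch assuming extractor configs take an `input_mappings` dict from extractor inputs to bucket fields; the extractor name and config keys are illustrative:

```python
extractor_config = {
    "feature_extractor": "text_embedding",  # illustrative extractor
    "input_mappings": {
        "text": "description",  # extractor input <- bucket field
        "title": "title",
    },
}
# Implicit defaults match by name and break silently when a bucket
# field is renamed; explicit mappings fail loudly instead.
```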
2. Passthrough Only What’s Needed
Use `field_passthrough` to selectively propagate metadata (see the sketch after this list). Avoid passing through:
- Large text blobs (duplicate storage)
- Sensitive fields not needed for retrieval
- Computed fields that can be derived on-demand
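A sketch of selective passthrough, assuming `field_passthrough` takes a list of bucket field names (config keys illustrative):

```python
extractor_config = {
    "feature_extractor": "text_embedding",
    "input_mappings": {"text": "description"},
    "field_passthrough": ["title", "category", "created_at"],
    # Omitted on purpose: "description" (large blob, already embedded),
    # "internal_notes" (sensitive), "word_count" (derivable on demand).
}
```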
3. Namespace Feature Outputs
If multiple extractors produce similar outputs, give each output a unique name:
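For example, two text extractors kept distinct; `output_name` is an assumed key here, so check the extractor configuration docs for the real one:

```python
extractors = [
    {
        "feature_extractor": "text_embedding",
        "input_mappings": {"text": "title"},
        "output_name": "title_embedding",
    },
    {
        "feature_extractor": "text_embedding",
        "input_mappings": {"text": "description"},
        "output_name": "description_embedding",
    },
]
```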
4. Leverage Chunking Strategies
Match chunking to content type (a configuration sketch follows the table):

| Content Type | Strategy | Rationale |
|---|---|---|
| Blog posts | paragraph | Preserves narrative flow | 
| Documentation | sentence | Precise Q&A matching | 
| Transcripts | time_window(60s) | Natural speech boundaries | 
| Code | function | Semantic units | 
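As a configuration sketch, assuming `chunk_strategy` (named in the checklist below) is set per extractor; the exact option syntax is illustrative:

```python
docs_extractor = {
    "feature_extractor": "text_embedding",
    "input_mappings": {"text": "body"},
    "chunk_strategy": "sentence",          # precise Q&A matching
}
transcript_extractor = {
    "feature_extractor": "text_embedding",
    "input_mappings": {"text": "transcript"},
    "chunk_strategy": "time_window(60s)",  # natural speech boundaries
}
```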
Schema Evolution
Adding Fields (Non-Breaking)
New optional fields are safe to add; existing objects simply won't have a value for the new field, e.g., `subtitle`:
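A before/after sketch in the assumed schema format:

```python
schema_v1 = {"properties": {"title": {"type": "text", "required": True}}}

schema_v2 = {
    "properties": {
        **schema_v1["properties"],
        "subtitle": {"type": "text"},  # optional: existing objects simply lack it
    }
}
```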
Making Fields Required (Breaking)
Requires a migration: backfill the field on every existing object before marking it required.
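A sketch of the two-step change (schema format assumed, field illustrative):

```python
# Step 1 (application side): PATCH every existing object that lacks the
# field, e.g. {"category": "uncategorized"}, using a backfill script.
# Step 2: only then require it for new registrations.
schema_before = {"properties": {"category": {"type": "text"}}}
schema_after = {"properties": {"category": {"type": "text", "required": True}}}
```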
Changing Field Types (Breaking)
Create a new field instead of mutating the existing one:
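For instance, migrating `price` from text to float additively (illustrative):

```python
schema = {
    "properties": {
        "price": {"type": "text"},          # legacy field, kept for old objects
        "price_amount": {"type": "float"},  # new field: backfill, then deprecate "price"
    }
}
```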
Versioning Collections
For major schema changes, create a new collection:
- Keep `products-v1` read-only
- Process new batches into `products-v2`
- Update retrievers to query both collections during the transition (see the sketch below)
- Archive `products-v1` after migration
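A sketch of the dual-query step; the retrieval call is injected as a plain callable because the real retriever API is out of scope here:

```python
from typing import Callable, List

# Hypothetical retrieval signature: (collection_id, query) -> list of hits.
Retrieve = Callable[[str, str], List[dict]]

def search_products(query: str, retrieve: Retrieve) -> List[dict]:
    """Fan out to both collection versions during the v1 -> v2 transition."""
    hits: List[dict] = []
    for collection in ("products-v1", "products-v2"):
        hits.extend(retrieve(collection, query))
    return hits
```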
Common Anti-Patterns
❌ Storing Computed Values in Bucket Metadata
Problem: derived values (word counts, scores, anything computable on demand) stored as bucket metadata duplicate storage and go stale when the source changes.

❌ Inconsistent Naming Conventions
Problem: field names mix conventions (e.g., `createdAt` next to `created_at`), which complicates mappings. Solution: pick one convention and stick to it (prefer snake_case for compatibility).
❌ Overusing Nested Objects
Problem: deeply nested structures complicate input mappings and partial updates; keep nesting shallow.

❌ Missing Timestamps
Problem: no `created_at` or `updated_at` fields.
Solution: Always include audit timestamps:
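In the assumed schema format (the `datetime` type name is illustrative):

```python
schema = {
    "properties": {
        "created_at": {"type": "datetime", "required": True},
        "updated_at": {"type": "datetime"},
    }
}
```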
❌ Hardcoding Enum Values
Problem:"archived" requires schema migration.
Solution: Use flexible text field + application-level validation or taxonomy enrichment.
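A sketch of the application-level half, with an illustrative status set:

```python
# The schema keeps "status" as plain text; the allowed set lives in code,
# where adding "archived" is a one-line change instead of a migration.
ALLOWED_STATUSES = {"active", "inactive", "archived"}

def validate_status(value: str) -> None:
    if value not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {value!r}")
```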
Validation Best Practices
Use Required Fields Sparingly
Mark fields as required only if they are absolutely necessary for downstream processing:
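For example (assumed format, illustrative fields):

```python
schema = {
    "properties": {
        "source_url": {"type": "text", "required": True},  # extractors need it
        "title": {"type": "text"},                          # nice to have
        "tags": {"type": "array", "items": {"type": "text"}},
    }
}
```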
Validate Externally
For complex validation (e.g., “URL must be from an allowed domain”), validate in your application before calling Mixpeek, as in the sketch below.
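A sketch of that pre-flight check; the allow-list is illustrative:

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "cdn.example.com"}

def validate_url(url: str) -> None:
    """Application-side check, run before registering the object with Mixpeek."""
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowed: {host!r}")
```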
Enable Schema Linting
Check schemas before deployment:
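If no first-party linter is available, a minimal stand-in that enforces the conventions from this guide (entirely illustrative):

```python
import re

def lint_schema(schema: dict) -> list:
    """Flag schemas that break the conventions recommended above."""
    issues = []
    props = schema.get("properties", {})
    for ts in ("created_at", "updated_at"):
        if ts not in props:
            issues.append(f"missing audit timestamp: {ts}")
    for name in props:
        if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
            issues.append(f"field {name!r} is not snake_case")
    return issues
```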
Multi-Collection Strategies
Separate by Modality
Create distinct collections per feature type:
- `products-text` → text embeddings
- `products-images` → visual embeddings
- `products-metadata` → structured data only
Separate by Language
For multilingual content:
- `docs-en` → English embeddings
- `docs-es` → Spanish embeddings
- `docs-fr` → French embeddings
Separate by Lifecycle
For content with different retention policies:
- `logs-hot` → last 7 days (fast storage)
- `logs-warm` → 8-30 days (slower storage)
- `logs-cold` → 30+ days (archive)
Checklist
1. Design bucket schema
- Include required source fields only
- Add audit timestamps (`created_at`, `updated_at`)
- Group related fields in nested objects
- Use arrays for multi-valued fields
2. Define collection mappings
- Explicit `input_mappings` for all extractors
- Selective `field_passthrough` (no large blobs)
- Choose an appropriate `chunk_strategy`
- Namespace outputs if extracting multiple times
3. Plan for evolution
- Add optional fields for new requirements
- Version collections for breaking changes
- Migrate data with backfill scripts
- Deprecate old fields gracefully
4. Validate and test
- Lint schemas before deployment
- Test with representative sample data
- Monitor `__fully_enriched` rates
- Review document payloads in Qdrant
Next Steps
- Review Collections for full configuration options
- Explore Feature Extractors capabilities
- Learn Data Model for entity relationships
- Check Buckets API for schema management

