Bucket Schema Principles
1. Validate Inputs, Don’t Over-Constrain
Bucket schemas should enforce the shape of object registration, but they shouldn't replicate downstream processing logic. Validate what a field must look like, not how the pipeline will use it.
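A minimal sketch of the contrast, using an illustrative field-type notation rather than the exact Mixpeek schema syntax:

```python
# Good: validates registration shape, nothing more.
good_schema = {
    "title": {"type": "text", "required": True},
    "published_at": {"type": "datetime"},
    "word_count": {"type": "int"},
}

# Over-constrained: bakes processing rules into the schema that the
# pipeline (or application) should own instead.
over_constrained_schema = {
    "title": {"type": "text", "required": True, "max_length": 60},  # SEO rule
    "word_count": {"type": "int", "min": 300},                      # editorial rule
}
```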
2. Use Nested Objects for Grouping
Group related fields to improve readability and support partial updates:
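For example, grouping author details under one nested object (same illustrative notation):

```python
schema = {
    "title": {"type": "text", "required": True},
    "author": {                      # one partial update can target this group
        "type": "object",
        "fields": {
            "name": {"type": "text"},
            "email": {"type": "text"},
            "affiliation": {"type": "text"},
        },
    },
}
```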
3. Arrays for Multi-Valued Fields
Use arrays for fields that naturally have multiple values:
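For example:

```python
schema = {
    "tags": {"type": "array", "items": "text"},        # ["ml", "search"]
    "authors": {"type": "array", "items": "text"},     # documents often have several
    "image_urls": {"type": "array", "items": "text"},
}
```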
4. Separate Mutable and Immutable Fields
Structure schemas to distinguish fields that change from fields that remain constant, so PATCH operations can target the mutable group without touching immutable source data:
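One possible layout, again illustrative:

```python
schema = {
    "source": {                 # immutable after registration
        "type": "object",
        "fields": {
            "origin_url": {"type": "text"},
            "ingested_at": {"type": "datetime"},
        },
    },
    "editorial": {              # mutable: the natural target for PATCH
        "type": "object",
        "fields": {
            "status": {"type": "text"},
            "reviewed_by": {"type": "text"},
        },
    },
}
```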
Collection Mapping Patterns
1. Use Explicit Input Mappings
Always specify `input_mappings` explicitly rather than relying on defaults.
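A good configuration names every extractor input. A sketch, where `input_mappings` is the documented key but the surrounding payload structure is an assumption:

```python
# Illustrative collection configuration: every extractor input is
# mapped to a named bucket field, so nothing depends on defaults.
collection_config = {
    "feature_extractor": "text_embedding",   # hypothetical extractor name
    "input_mappings": {
        "text": "description",   # extractor input <- bucket schema field
        "title": "title",
    },
}
```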
2. Passthrough Only What’s Needed
Use `field_passthrough` to selectively propagate metadata. Avoid passing through the following (a sketch follows the list):
- Large text blobs (duplicate storage)
- Sensitive fields not needed for retrieval
- Computed fields that can be derived on-demand
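An illustrative configuration; the exact `field_passthrough` structure is an assumption:

```python
# Pass through only small, retrieval-relevant metadata.
collection_config = {
    "field_passthrough": ["title", "tags", "published_at"],
    # Deliberately omitted:
    #   raw_html       -> large blob, duplicate storage
    #   customer_email -> sensitive, not needed for retrieval
    #   reading_time   -> derivable on demand
}
```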
3. Namespace Feature Outputs
If multiple extractors produce similar outputs, use unique names:
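For example, two text-embedding extractors over different fields could be disambiguated like this (the `output_name` key is hypothetical):

```python
extractors = [
    {
        "feature_extractor": "text_embedding",
        "input_mappings": {"text": "title"},
        "output_name": "title_embedding",   # unique, self-describing
    },
    {
        "feature_extractor": "text_embedding",
        "input_mappings": {"text": "body"},
        "output_name": "body_embedding",
    },
]
```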
4. Leverage Chunking Strategies
Match chunking to content type:

| Content Type | Strategy | Rationale |
|---|---|---|
| Blog posts | `paragraph` | Preserves narrative flow |
| Documentation | `sentence` | Precise Q&A matching |
| Transcripts | `time_window` (60s) | Natural speech boundaries |
| Code | `function` | Semantic units |
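For instance, a transcript collection might pair its extractor with a 60-second time window (payload structure illustrative):

```python
collection_config = {
    "feature_extractor": "text_embedding",
    "input_mappings": {"text": "transcript"},
    "chunk_strategy": {"type": "time_window", "window_seconds": 60},  # hypothetical keys
}
```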
Schema Evolution
Adding Fields (Non-Breaking)
New optional fields are safe to add; existing objects simply won't have the new `subtitle` field.
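For example, in the same illustrative notation used above:

```python
# v1: original schema
schema_v1 = {
    "title": {"type": "text", "required": True},
}

# v2: adding an optional field is non-breaking; objects registered
# under v1 simply won't have "subtitle".
schema_v2 = {
    "title": {"type": "text", "required": True},
    "subtitle": {"type": "text"},  # optional
}
```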
Making Fields Required (Breaking)
Requires migration: backfill the field on existing objects before marking it required, or their registrations will fail validation.
Changing Field Types (Breaking)
Create a new field instead of mutating the type of an existing one:
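For example, to migrate `price` from text to a numeric type:

```python
schema = {
    "price": {"type": "text"},          # legacy field, deprecate later
    "price_amount": {"type": "float"},  # new field with the intended type
}
# Backfill: copy parsed values from "price" into "price_amount",
# then point readers at the new field before retiring the old one.
```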
Versioning Collections
For major schema changes, create a new collection:
- Keep `products-v1` read-only
- Process new batches into `products-v2`
- Update retrievers to query both collections during the transition
- Archive `products-v1` after migration
Common Anti-Patterns
❌ Storing Computed Values in Bucket Metadata
Problem: values derived from other fields are stored at registration time and go stale when the source data changes. Let feature extractors compute them during processing instead.
❌ Inconsistent Naming Conventions
Problem: field names mix styles (e.g., camelCase and snake_case), making mappings error-prone. Standardize on one convention (snake_case for compatibility).
❌ Overusing Nested Objects
Problem: deeply nested structures are hard to map into extractor inputs and harder to query. Keep nesting shallow.
❌ Missing Timestamps
Problem: no `created_at` or `updated_at` fields.
Solution: Always include audit timestamps:
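For example:

```python
schema = {
    # ...domain fields...
    "created_at": {"type": "datetime", "required": True},  # set once at registration
    "updated_at": {"type": "datetime", "required": True},  # bumped on every PATCH
}
```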
❌ Hardcoding Enum Values
Problem: a hardcoded enum of status values means adding a new one like `"archived"` requires a schema migration.
Solution: Use a flexible text field plus application-level validation or taxonomy enrichment.
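A minimal sketch of the application-level check:

```python
# The schema keeps "status" as free text; the allowed set lives in
# application code, so adding "archived" is a one-line change here
# rather than a schema migration.
ALLOWED_STATUSES = {"draft", "published", "archived"}

def validate_status(obj: dict) -> None:
    status = obj.get("status")
    if status not in ALLOWED_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
```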
Validation Best Practices
Use Required Fields Sparingly
Mark fields required only if absolutely necessary for downstream processing.
Validate Externally
For complex validation (e.g., "URL must be from allowed domains"), validate in your application before calling Mixpeek.
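A sketch of that pre-flight check (the allowlist is yours to define):

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "cdn.example.com"}  # hypothetical allowlist

def check_url(url: str) -> None:
    host = urlparse(url).hostname or ""
    if host not in ALLOWED_DOMAINS:
        raise ValueError(f"domain not allowed: {host}")

check_url("https://example.com/post.html")  # run before registering the object
```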
Enable Schema Linting
Check schemas before deployment.
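If no linter is available in your tooling, validating representative sample objects locally is a reasonable stand-in. Here is a generic JSON Schema check using the `jsonschema` package; this is a substitute technique, not a Mixpeek feature:

```python
from jsonschema import validate, ValidationError  # pip install jsonschema

schema = {
    "type": "object",
    "required": ["title", "created_at"],
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
        "created_at": {"type": "string", "format": "date-time"},
    },
}

sample = {"title": "Hello", "tags": ["ml"], "created_at": "2024-01-01T00:00:00Z"}
try:
    validate(instance=sample, schema=schema)
except ValidationError as e:
    print("schema check failed:", e.message)
```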
Multi-Collection Strategies
Separate by Modality
Create distinct collections per feature type:
- `products-text` → text embeddings
- `products-images` → visual embeddings
- `products-metadata` → structured data only
Separate by Language
For multilingual content:
- `docs-en` → English embeddings
- `docs-es` → Spanish embeddings
- `docs-fr` → French embeddings
Separate by Lifecycle
For content with different retention policies:
- `logs-hot` → last 7 days (fast storage)
- `logs-warm` → 8-30 days (slower storage)
- `logs-cold` → 30+ days (archive)
Checklist
1. Design bucket schema
- Include required source fields only
- Add audit timestamps (`created_at`, `updated_at`)
- Group related fields in nested objects
- Use arrays for multi-valued fields
2. Define collection mappings
- Explicit `input_mappings` for all extractors
- Selective `field_passthrough` (no large blobs)
- Choose an appropriate `chunk_strategy`
- Namespace outputs if extracting multiple times
3. Plan for evolution
- Add optional fields for new requirements
- Version collections for breaking changes
- Migrate data with backfill scripts
- Deprecate old fields gracefully
4. Validate and test
- Lint schemas before deployment
- Test with representative sample data
- Monitor `__fully_enriched` rates
- Review document payloads in Qdrant
Next Steps
- Review Collections for full configuration options
- Explore Feature Extractors capabilities
- Learn Data Model for entity relationships
- Check Buckets API for schema management

