Connect Mixpeek to your cloud object storage for seamless multimodal data ingestion.
Integrate Mixpeek directly with your existing object storage solutions like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage to process and analyze your multimodal data where it lives.
Object storage systems are a fundamental component for storing large amounts of unstructured data, making them ideal upstream sources for Mixpeek. By connecting your buckets, you enable Mixpeek to automatically discover, process, and index your files, unlocking powerful search and analysis capabilities across images, videos, audio, PDFs, and more.
While Mixpeek can process complex, nested data structures within a single bucket connection, a more robust and scalable strategy often involves structuring your data upfront in your object storage, then relying on taxonomy joins to put them together.We rely on the post processing joins to intelligently combine related content after ingestion. This approach offers several advantages over trying to group files during the initial upload:
Explicit Structure
By organizing content into separate buckets or prefixes, you make the relationships between files clear and programmatically accessible. This is more reliable than trying to infer relationships from file names or metadata.
Flexible Joining
Mixpeek’s taxonomy and clustering features allow you to join related content based on multiple criteria (IDs, metadata, semantic similarity) after ingestion, giving you more control over how content is combined.
Scalable Processing
Processing pipelines become simpler and more focused when handling one type of content structure at a time, making it easier to scale and maintain.
Reliable Updates
When new content is added to a structured bucket, Mixpeek can process it independently and then join it with related content, avoiding issues with partial uploads or timing dependencies.
Recommended Approach: Pre-Structured Pipelines
Separate Buckets or Prefixes: Organize related but distinct types of content into separate S3 buckets or dedicated prefixes (folders) within a single bucket.
Example: For analyzing video content, you might store raw videos in s3://my-videos/raw/, extracted transcripts in s3://my-videos/transcripts/, and associated metadata JSON files in s3://my-videos/metadata/.
Multiple Mixpeek Connections: Set up distinct Mixpeek buckets & collections pointing to each specific bucket or prefix.
Join in Mixpeek: After ingestion, use Mixpeek’s enrichment features like Clustering or Taxonomies (joining based on matching IDs or other rules) to link the related pieces of content (e.g., connecting a transcript to its corresponding video and metadata).
This approach is more reliable than alternatives like:
Relying on file naming conventions or metadata (which can be inconsistent)
Using “agent mode” with LLMs to infer relationships (which is experimental and less predictable)
Trying to group files during upload (which can be fragile due to timing issues)
Consider this approach if you are dealing with complex multimodal data where different components (like video, audio, text transcripts, metadata) need to be linked and analyzed together.
Choose your Provider: Select the object storage provider you use (AWS S3, GCS, Azure Blob Storage).
Configure Access: Follow the provider-specific guide to grant Mixpeek secure access to your desired bucket(s). This typically involves setting up appropriate permissions (e.g., read access).
Add Connection in Mixpeek: Use the Mixpeek dashboard or API to add the connection details for your object storage bucket.
Start Processing: Once connected, Mixpeek can begin discovering and processing files according to your configured pipelines.
Ready to connect your data? Select a provider guide above to begin.