Overview
Connect Mixpeek to your cloud object storage for seamless multimodal data ingestion.
Integrate Mixpeek directly with your existing object storage solutions like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage to process and analyze your multimodal data where it lives.
Object storage systems are a fundamental component for storing large amounts of unstructured data, making them ideal upstream sources for Mixpeek. By connecting your buckets, you enable Mixpeek to automatically discover, process, and index your files, unlocking powerful search and analysis capabilities across images, videos, audio, PDFs, and more.
Why Connect Object Storage?
Centralized Data Processing
Process diverse file types stored in your buckets without needing to move data. Mixpeek accesses files directly from your provider.
Scalable Ingestion
Leverage the scalability of cloud object storage. Mixpeek can handle growing volumes of data as your needs evolve.
Automated Workflows
Set up automated pipelines. New files added to connected buckets can be automatically indexed and enriched by Mixpeek.
Secure Access
Utilize secure authentication methods (like IAM roles or access keys) to grant Mixpeek the necessary permissions to access your data.
Supported Providers
Mixpeek supports direct integration with the major cloud object storage providers:
AWS S3
Connect your Amazon Simple Storage Service (S3) buckets.
Google Cloud Storage
Integrate with Google Cloud Storage (GCS) buckets.
Azure Blob Storage
Connect your Azure Blob Storage containers.
Choose your provider above to find specific setup instructions.
Best Practices for Data Structuring
While Mixpeek can process complex, nested data structures within a single bucket connection, a more robust and scalable strategy often involves structuring your data upfront in your object storage, then relying on taxonomy joins to put them together.
We rely on the post processing joins to intelligently combine related content after ingestion. This approach offers several advantages over trying to group files during the initial upload:
Explicit Structure
By organizing content into separate buckets or prefixes, you make the relationships between files clear and programmatically accessible. This is more reliable than trying to infer relationships from file names or metadata.
Flexible Joining
Mixpeek’s taxonomy and clustering features allow you to join related content based on multiple criteria (IDs, metadata, semantic similarity) after ingestion, giving you more control over how content is combined.
Scalable Processing
Processing pipelines become simpler and more focused when handling one type of content structure at a time, making it easier to scale and maintain.
Reliable Updates
When new content is added to a structured bucket, Mixpeek can process it independently and then join it with related content, avoiding issues with partial uploads or timing dependencies.
Recommended Approach: Pre-Structured Pipelines
- Separate Buckets or Prefixes: Organize related but distinct types of content into separate S3 buckets or dedicated prefixes (folders) within a single bucket.
- Example: For analyzing video content, you might store raw videos in
s3://my-videos/raw/
, extracted transcripts ins3://my-videos/transcripts/
, and associated metadata JSON files ins3://my-videos/metadata/
.
- Example: For analyzing video content, you might store raw videos in
- Multiple Mixpeek Connections: Set up distinct Mixpeek buckets & collections pointing to each specific bucket or prefix.
- Join in Mixpeek: After ingestion, use Mixpeek’s enrichment features like Clustering or Taxonomies (joining based on matching IDs or other rules) to link the related pieces of content (e.g., connecting a transcript to its corresponding video and metadata).
This approach is more reliable than alternatives like:
- Relying on file naming conventions or metadata (which can be inconsistent)
- Using “agent mode” with LLMs to infer relationships (which is experimental and less predictable)
- Trying to group files during upload (which can be fragile due to timing issues)
Consider this approach if you are dealing with complex multimodal data where different components (like video, audio, text transcripts, metadata) need to be linked and analyzed together.
Getting Started
- Choose your Provider: Select the object storage provider you use (AWS S3, GCS, Azure Blob Storage).
- Configure Access: Follow the provider-specific guide to grant Mixpeek secure access to your desired bucket(s). This typically involves setting up appropriate permissions (e.g., read access).
- Add Connection in Mixpeek: Use the Mixpeek dashboard or API to add the connection details for your object storage bucket.
- Start Processing: Once connected, Mixpeek can begin discovering and processing files according to your configured pipelines.
Ready to connect your data? Select a provider guide above to begin.