Overview - Mixpeek

Integrate Mixpeek directly with your existing object storage solutions like AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage to process and analyze your multimodal data where it lives.

Object storage systems are a fundamental component for storing large amounts of unstructured data, making them ideal upstream sources for Mixpeek. By connecting your buckets, you enable Mixpeek to automatically discover, process, and index your files, unlocking powerful search and analysis capabilities across images, videos, audio, PDFs, and more.

Why Connect Object Storage?

Centralized Data Processing

Process diverse file types stored in your buckets without needing to move data. Mixpeek accesses files directly from your provider.

Scalable Ingestion

Leverage the scalability of cloud object storage. Mixpeek can handle growing volumes of data as your needs evolve.

Automated Workflows

Set up automated pipelines. New files added to connected buckets can be automatically indexed and enriched by Mixpeek.

Secure Access

Utilize secure authentication methods (like IAM roles or access keys) to grant Mixpeek the necessary permissions to access your data.

Supported Providers

Mixpeek supports direct integration with the major cloud object storage providers:

AWS S3

Connect your Amazon Simple Storage Service (S3) buckets.

Google Cloud Storage

Integrate with Google Cloud Storage (GCS) buckets.

Azure Blob Storage

Connect your Azure Blob Storage containers.

Choose your provider above to find specific setup instructions.

Best Practices for Data Structuring

While Mixpeek can process complex, nested data structures within a single bucket connection, a more robust and scalable strategy is often to structure your data upfront in object storage and then use taxonomy joins to link the pieces together.

This strategy relies on post-processing joins to combine related content after ingestion. It offers several advantages over trying to group files during the initial upload:

Explicit Structure

By organizing content into separate buckets or prefixes, you make the relationships between files clear and programmatically accessible. This is more reliable than trying to infer relationships from file names or metadata.

Flexible Joining

Mixpeek’s taxonomy and clustering features allow you to join related content based on multiple criteria (IDs, metadata, semantic similarity) after ingestion, giving you more control over how content is combined.

Scalable Processing

Processing pipelines become simpler and more focused when handling one type of content structure at a time, making it easier to scale and maintain.

Reliable Updates

When new content is added to a structured bucket, Mixpeek can process it independently and then join it with related content, avoiding issues with partial uploads or timing dependencies.

Recommended Approach: Pre-Structured Pipelines

Separate Buckets or Prefixes: Organize related but distinct types of content into separate S3 buckets or dedicated prefixes (folders) within a single bucket.
- Example: For analyzing video content, you might store raw videos in s3://my-videos/raw/, extracted transcripts in s3://my-videos/transcripts/, and associated metadata JSON files in s3://my-videos/metadata/.
Multiple Mixpeek Connections: Set up a separate Mixpeek bucket and collection for each storage bucket or prefix.
Join in Mixpeek: After ingestion, use Mixpeek’s enrichment features like Clustering or Taxonomies (joining based on matching IDs or other rules) to link the related pieces of content (e.g., connecting a transcript to its corresponding video and metadata).

This approach is more reliable than alternatives like:

Relying on file naming conventions or metadata (which can be inconsistent)
Using “agent mode” with LLMs to infer relationships (which is experimental and less predictable)
Trying to group files during upload (which can be fragile due to timing issues)

Consider this approach if you are dealing with complex multimodal data where different components (like video, audio, text transcripts, metadata) need to be linked and analyzed together.
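The prefix layout and ID-based join described above can be illustrated with a small, self-contained sketch. It is not Mixpeek's API: the `PREFIX_ROLES` mapping, the `group_by_id` and `incomplete` helpers, and the sample keys are all hypothetical, and it assumes related files share a file stem (for example `abc123`) across prefixes. It only shows why a pre-structured layout makes the relationships easy to compute.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from storage prefix to content role,
# mirroring the raw/, transcripts/, metadata/ example layout.
PREFIX_ROLES = {
    "raw/": "video",
    "transcripts/": "transcript",
    "metadata/": "metadata",
}


def group_by_id(keys):
    """Group object keys by the file stem they share across prefixes."""
    groups = defaultdict(dict)
    for key in keys:
        for prefix, role in PREFIX_ROLES.items():
            if key.startswith(prefix):
                # "raw/abc123.mp4" -> "abc123" (assumed shared ID)
                stem = PurePosixPath(key).stem
                groups[stem][role] = key
                break
    return dict(groups)


def incomplete(groups):
    """Return the IDs that are missing at least one expected role."""
    expected = set(PREFIX_ROLES.values())
    return sorted(i for i, parts in groups.items() if set(parts) != expected)


# Sample keys, relative to the bucket root (illustrative only).
keys = [
    "raw/abc123.mp4",
    "transcripts/abc123.txt",
    "metadata/abc123.json",
    "raw/def456.mp4",
    "metadata/def456.json",
]

groups = group_by_id(keys)
pending = incomplete(groups)  # IDs whose related files have not all arrived
```

Because each prefix is handled on its own, an ID with a missing piece (here `def456`, which has no transcript yet) can be detected and joined later rather than failing at upload time.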

Getting Started

Choose your Provider: Select the object storage provider you use (AWS S3, GCS, Azure Blob Storage).
Configure Access: Follow the provider-specific guide to grant Mixpeek secure access to your desired bucket(s). This typically involves setting up appropriate permissions (e.g., read access).
Add Connection in Mixpeek: Use the Mixpeek dashboard or API to add the connection details for your object storage bucket.
Start Processing: Once connected, Mixpeek can begin discovering and processing files according to your configured pipelines.
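As a rough illustration of the "Configure Access" step for AWS S3, the sketch below builds a read-only IAM policy document scoped to one bucket and, optionally, one prefix. The exact permissions Mixpeek requires are defined in the provider-specific guide, so treat the action list here as an assumption. The `read_only_policy` helper and the `my-videos` bucket name are hypothetical; the `s3:ListBucket` and `s3:GetObject` actions and the ARN formats are standard AWS.

```python
import json


def read_only_policy(bucket, prefix=""):
    """Build a read-only S3 policy document (assumed permission set)."""
    # Listing is granted on the bucket itself.
    list_statement = {
        "Effect": "Allow",
        "Action": ["s3:ListBucket"],
        "Resource": [f"arn:aws:s3:::{bucket}"],
    }
    if prefix:
        # Restrict listing to keys under the given prefix.
        list_statement["Condition"] = {
            "StringLike": {"s3:prefix": [f"{prefix}*"]}
        }

    # Reading is granted on the objects under the prefix.
    get_statement = {
        "Effect": "Allow",
        "Action": ["s3:GetObject"],
        "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
    }

    return {
        "Version": "2012-10-17",
        "Statement": [list_statement, get_statement],
    }


policy = read_only_policy("my-videos", "raw/")
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

Scoping the policy to a single prefix pairs naturally with the one-connection-per-prefix layout: each connection can be given access only to the content it ingests.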

Ready to connect your data? Select a provider guide above to begin.
