What is Mixpeek?

Your app probably stores unstructured and structured data across various datatypes: documents, images, video, and audio, not the least of which is text.

Unstructured data tends to live in an object store like Amazon S3, Azure Blob Storage, or Google Cloud Storage, while structured data lives in a database like Postgres or MongoDB.

Mixpeek lets you treat both your object store and transactional database as a single entity.

How does it work?

Mixpeek is a multimodal pipeline development kit. A pipeline is simply a chain of ML models applied to your objects as they’re uploaded, with the results then sent into your transactional database.

This allows you to structure your unstructured data using custom, AI-powered logic in real time.

[Mixpeek pipeline diagram]

You can think of it as if Fivetran and SageMaker had a baby, with “guaranteed execution” in between.

A Simple Example

Say you upload an image, dog.png, to your S3 bucket, and you want to extract tags and create an embedding of the image itself.

You’d create a Mixpeek pipeline, which is a serverless function that combines ML models and gets invoked whenever there’s a new object in your bucket.

Here’s an example pipeline that creates a description, embedding and tags.

from mixpeek.client import Mixpeek

def function(event, context):
  # the keyword name for the object URL may differ in the actual SDK
  mixpeek = Mixpeek(api_key="API_KEY", object_url=event.object_url)

  # create a description
  description = mixpeek.extract.text(model="openai/gpt-4o")
  # create an embedding
  embedding = mixpeek.embed.image(model="openai/clip-vit-base-patch32")
  # create tags
  tags = mixpeek.extract.text(model="microsoft/conditional-detr-resnet-50")

  return {
    "object_url": event.object_url,
    "text": description,
    "tags": tags,
    "embedding": embedding
  }

Once we create this pipeline, we connect our S3 bucket as the source, connect our MongoDB collection as the destination, and then enable it.

Your pipeline object would look like this:

{
  "alias": "dog-pipeline",
  "enabled": true,
  "source": {
    "connection_id": "123",
    "bucket": "dogs"
  },
  "destination": {
    "connection_id": "321",
    "collection": "dog_objects"
  }
}
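
For completeness, here is roughly what registering that pipeline configuration could look like from code. This is only a sketch: the https://api.mixpeek.com/pipelines endpoint and the bearer-token header are assumptions for illustration, not the documented API.

import requests

pipeline = {
  "alias": "dog-pipeline",
  "enabled": True,
  "source": {"connection_id": "123", "bucket": "dogs"},
  "destination": {"connection_id": "321", "collection": "dog_objects"}
}

response = requests.post(
  "https://api.mixpeek.com/pipelines",          # assumed endpoint
  headers={"Authorization": "Bearer API_KEY"},  # assumed auth scheme
  json=pipeline,
)
print(response.json())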

Previously, when connecting the S3 bucket, we would have created an AWS IAM role, which opens a listener on your S3 bucket dogs. From then on, every new object gets sent through the pipeline-as-code we defined above and then into our MongoDB collection dog_objects (which was also instantiated previously).

That’s really it! Mixpeek is designed to be “set and forget”: you never have to think about processing your S3 bucket again.

Here’s an example of the output for the S3 object dog.png that gets sent into your MongoDB collection:

{
  "object_url": "s3://dog.png",
  "text": "australian shepherd",
  "tags": ["dog", "shepherd"],
  "embedding": [0, 1, 2, 3],
  "metadata": {
    "pipeline_version": "v1"
  }
}
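
Once documents like this land in MongoDB, you can query them like any other collection. Here’s a minimal sketch using pymongo; the connection string and the mixpeek database name are placeholders for your own setup.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["mixpeek"]["dog_objects"]      # assumed database name

# find every object the pipeline tagged as a shepherd
for doc in collection.find({"tags": "shepherd"}, {"object_url": 1, "text": 1}):
  print(doc["object_url"], "-", doc["text"])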

What does it enable?

Since you’ll have fresh vectors, metadata, and extracted content, you can design hyper-targeted queries that span all your use cases:

  • RAG (Retrieval Augmented Generation)
  • Recommendation Systems
  • Hybrid Search Engines

All without having to think about data prep again. You can even modify your pipeline, and the new version will be recorded in the metadata.pipeline_version key, so you can filter the output data by the pipeline code version that produced it.
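
As an illustration, if your destination is MongoDB Atlas, a hybrid query could combine vector similarity with a filter on metadata.pipeline_version. The sketch below assumes an Atlas Vector Search index named vector_index on the embedding field (with metadata.pipeline_version indexed for filtering), placeholder connection details, and an illustrative query vector.

from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["mixpeek"]["dog_objects"]  # placeholders

# query_vector would come from embedding your query with the same model the
# pipeline used (the dimensions must match the stored vectors)
query_vector = [0.12, -0.03, 0.44]  # illustrative values only

results = collection.aggregate([
  {
    "$vectorSearch": {
      "index": "vector_index",        # assumed Atlas Vector Search index name
      "path": "embedding",
      "queryVector": query_vector,
      "numCandidates": 100,
      "limit": 5,
      # only consider documents produced by the current pipeline version
      "filter": {"metadata.pipeline_version": {"$eq": "v2"}}
    }
  },
  {"$project": {"object_url": 1, "text": 1, "tags": 1}}
])

for doc in results:
  print(doc)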

Use the Methods directly

You can also use the extract, embed and generate methods outside of a pipeline.

from mixpeek.client import Mixpeek

mixpeek = Mixpeek(api_key="API_KEY")

output = mixpeek.embed.text("lorem ipsum", model="mixedbread-ai/mxbai-embed-large-v1")
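
As a quick way to see what the embedding is good for, you can compare two embeddings directly. This sketch assumes embed.text returns a plain list of floats; check the actual response shape in the API reference.

import math

from mixpeek.client import Mixpeek

mixpeek = Mixpeek(api_key="API_KEY")

# assumes embed.text returns a plain list of floats
a = mixpeek.embed.text("lorem ipsum", model="mixedbread-ai/mxbai-embed-large-v1")
b = mixpeek.embed.text("dolor sit amet", model="mixedbread-ai/mxbai-embed-large-v1")

# cosine similarity between the two embeddings
dot = sum(x * y for x, y in zip(a, b))
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))
print(dot / (norm_a * norm_b))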

Mixpeek cloud is currently in private beta. To use the API, you need to register for an API key, and an engineer will contact you.