What is Mixpeek?

Your software likely stores unstructured and structured data across various datatypes: documents, images, video and audio not the least of which is text.

Unstructured data tends to live in an object store like Amazon S3, Azure Blob Store or Google Cloud Storage. Then structured in your database like Postgres or MongoDB.

Mixpeek allows you to setup data processing pipelines, so as new files are added to your object store like pdfs, pictures, videos, and recordings you’ll immediately have the structured, application-ready data to work with. Always.

How does it work?

It is a chain of user-defined tasks that are applied to your objects as they’re added to your own storage, processed then sent to your transactional database.

This allows you to structure your unstructured data using custom, powerful logic in real-time.

mixpeek diagram

Eventually Consistent: By using distributed queues, we ensure all objects get processed and atomically inserted, updated or deleted in your database.

Completely Synchronized: It allows you to treat your object storage and database as one single entity, minimizing context shifting.

Guaranteed Execution: Every object storage change, regardless of how many pipelines or how intense the pipeline is completed, even under high load or with failures.

Infinitely Flexible: Any combination of extract, embed and generate methods that span all four modalities to chain into pipelines.

A Simple Example

Say you upload an image: dog.png to your S3 bucket and you want to extract the tags and create an embedding of the image itself.

You’d create a Mixpeek pipeline that combines ML models and gets invoked whenever there’s a new object in your bucket.

Your first pipeline

Here’s an example pipeline that creates an embedding and tags.

def function(event, context):
  mixpeek = Mixpeek("API_KEY")

  # create an embedding
  embedding = mixpeek.embed.image(model_id="openai/clip-vit-base-patch32")
  # create tags
  tags = mixpeek.extract.text(model_id="microsoft/conditional-detr-resnet-50")

  return {
    "file_url": event.object_url,
    "tags": tags,
    "embedding": embedding

Create pipeline documentation

Once we create this pipeline then connect our S3 bucket and finally MongoDB collection then enable it.

Your first connection

Every new account comes with a preconfigured S3 and MongoDB connection so you can get started right away. Alternatively, you bring your own.

default connections

Create connection documentation

Setup pipelines

Two options for pipeline creation

Previously, we would have created an AWS IAM role, which opens up a listener on your S3 bucket dogs, every new object gets sent through the pipeline-as-code we defined above and then sent into our MongoDB collection: dog_objects (which was also instantiated previously).

That’s really it! Mixpeek is designed to be “set and forget”, never think about processing your S3 bucket again.

Sample output

Here’s an example output from the S3 object: dog.png that gets sent into your MongoDB collection:

  "object_url": "s3://dog.png",
  "text": "australian shepherd",
  "tags": ["dog", "shepherd"],
  "embedding": [0, 1, 2, 3],
  "metadata": {
    "pipeline_version": "v1"

Use the Methods directly

You can also use the extract, embed and generate methods outside of a pipeline.

from mixpeek import Mixpeek

mixpeek = Mixpeek("API_KEY")

output = mixpeek.embed.text(
  input="lorem ipsum",

To use the API, you need to register an API key and an engineer will contact you.

What does it enable?

Since you’ll have fresh vectors, metadata and extracted contents you can design hyper-targeted queries that span all your use cases:

  • RAG (Retrieval Augmented Generation)
  • Recommendation Systems
  • Hybrid Search Engines

All without having to think about data prep again. You can even modify your pipeline, and the new version will be appended to the metadata.pipeline_version key, so you can filter your pipeline code changes against the output data.

Was this page helpful?