Objects are the ingestion entry point. They live inside buckets, reference one or more blobs (files or JSON), and carry metadata that downstream collections preserve in lineage.
Creating an object does not trigger feature extraction. Processing happens when you create and submit a batch.

Object Schema

{
  "key_prefix": "/products/red-sneaker",
  "metadata": {
    "category": "footwear",
    "brand": "Acme"
  },
  "blobs": [
    {
      "property": "product_text",
      "type": "text",
      "data": "Comfortable sneaker with foam sole.",
      "metadata": {
        "language": "en"
      }
    },
    {
      "property": "hero_image",
      "type": "image",
      "data": "https://cdn.example.com/red-sneaker.jpg"
    }
  ]
}
  • key_prefix (optional) – Logical path to help organize downstream documents.
  • metadata (optional) – Arbitrary JSON copied into documents through field passthrough.
  • blobs (required) – Each entry must match a property defined in the bucket schema.
Supported blob types: text, json, image, video, audio, binary. Blob data can be raw content, a base64 payload, or a URL Mixpeek can fetch.
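The schema rules above can be checked locally before anything is sent. The following is a minimal sketch, not part of any Mixpeek SDK: `validate_object` and `schema_properties` (a plain set of property names taken from your bucket schema) are hypothetical names, and the server remains the authority on what is valid.

```python
# Hypothetical local pre-check; the names below are illustrative, not a Mixpeek API.
SUPPORTED_BLOB_TYPES = {"text", "json", "image", "video", "audio", "binary"}


def validate_object(payload, schema_properties):
    """Return a list of problems found in an object payload (empty if none)."""
    problems = []
    blobs = payload.get("blobs")
    if not blobs:
        # blobs is the only required field in the object schema.
        problems.append("blobs is required and must contain at least one entry")
        return problems
    for index, blob in enumerate(blobs):
        prop = blob.get("property")
        if prop not in schema_properties:
            problems.append(f"blobs[{index}]: property {prop!r} is not in the bucket schema")
        if blob.get("type") not in SUPPORTED_BLOB_TYPES:
            problems.append(f"blobs[{index}]: unsupported type {blob.get('type')!r}")
        if "data" not in blob:
            problems.append(f"blobs[{index}]: missing data")
    return problems
```

Calling `validate_object(payload, {"product_text", "hero_image"})` on the example object above would return an empty list, since both blobs name known properties and use supported types.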

Create an Object

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ ... }'
Useful options:
  • skip_duplicates: skip blobs whose content hash matches one already ingested, so identical content is not reprocessed.
  • key_prefix: a logical path that groups related objects.
  • metadata: provide information used later by taxonomies or retrievers.
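The same request can be assembled from Python. This is a sketch under stated assumptions: `build_create_object_request` is a hypothetical helper, and placing `skip_duplicates` as a top-level field in the JSON body is a guess, since this page does not say where the option goes. The function only builds the request; nothing is sent.

```python
import json
import urllib.request


def build_create_object_request(api_url, api_key, namespace, bucket_id, payload,
                                skip_duplicates=False):
    """Build (but do not send) the POST request that creates an object."""
    body = dict(payload)  # shallow copy so the caller's payload is not mutated
    if skip_duplicates:
        # Assumption: the option travels in the request body. Check the API reference.
        body["skip_duplicates"] = True
    return urllib.request.Request(
        url=f"{api_url}/v1/buckets/{bucket_id}/objects",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",
        },
    )
```

To send it, pass the returned object to `urllib.request.urlopen` and decode the JSON response.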

Retrieve Objects

curl -sS -X GET "$MP_API_URL/v1/buckets/<bucket_id>/objects/<object_id>?return_url=true" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
  • return_url=true generates presigned URLs for blobs. The URLs expire after roughly one hour.
  • To list objects with filtering and pagination, call POST /v1/buckets/<bucket_id>/objects/list.
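Both retrieval calls can be sketched as request builders. The helper names here are hypothetical, and because the filter and pagination fields of the list endpoint are not described on this page, the list body is passed through as opaque bytes rather than guessed at.

```python
import urllib.parse
import urllib.request


def _auth_headers(api_key, namespace):
    # Same headers the curl examples send.
    return {"Authorization": f"Bearer {api_key}", "X-Namespace": namespace}


def build_get_object_request(api_url, api_key, namespace, bucket_id, object_id,
                             return_url=False):
    """Build the GET request for one object, optionally asking for presigned URLs."""
    url = f"{api_url}/v1/buckets/{bucket_id}/objects/{object_id}"
    if return_url:
        url += "?" + urllib.parse.urlencode({"return_url": "true"})
    return urllib.request.Request(url, method="GET",
                                  headers=_auth_headers(api_key, namespace))


def build_list_objects_request(api_url, api_key, namespace, bucket_id, body_bytes=b"{}"):
    """Build the POST request for the list endpoint; body_bytes is caller-supplied JSON."""
    headers = _auth_headers(api_key, namespace)
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{api_url}/v1/buckets/{bucket_id}/objects/list",
        data=body_bytes, method="POST", headers=headers)
```

Since presigned URLs are short-lived, it is usually safer to request them right before use than to store them.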

Lineage

Downstream documents retain lineage fields that point back to the source object:
{
  "root_object_id": "obj_123",
  "root_bucket_id": "bkt_catalog",
  "source_type": "bucket",
  "source_object_id": "obj_123",
  "source_blobs": [
    { "blob_id": "blob_abc", "blob_property": "product_text", "blob_type": "text" }
  ]
}
  • source_blobs link back to object blobs (without duplicating large content).
  • document_blobs contain extractor-generated artifacts (e.g., thumbnails).
  • To inspect the entire decomposition tree for an object, call /v1/objects/{object_id}/decomposition-tree.
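The lineage fields make it straightforward to trace documents back to their source. As an illustration, the sketch below groups a list of documents by `root_object_id` and reports which blob properties each object contributed. `source_properties_by_root` is a hypothetical helper, and it assumes the documents are already loaded as dictionaries shaped like the example above.

```python
from collections import defaultdict


def source_properties_by_root(documents):
    """Map each root_object_id to the sorted blob properties its documents came from."""
    by_root = defaultdict(set)
    for doc in documents:
        # source_blobs may be absent on some documents, so default to an empty list.
        for blob in doc.get("source_blobs", []):
            by_root[doc["root_object_id"]].add(blob["blob_property"])
    return {root: sorted(props) for root, props in by_root.items()}
```

Documents that carry no `source_blobs` are simply left out of the result.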

Best Practices

  • Define bucket schemas up front so object validation fails fast.
  • Set metadata that retrievers or taxonomies will use for filtering.
  • Split large uploads into multiple objects rather than a few massive blobs. Smaller objects parallelize better.
  • Use batches (/v1/buckets/{bucket}/batches) to process groups of objects efficiently.
  • Track key prefixes to simplify downstream grouping or deduplication during retrieval.
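As one way to apply the chunking and key-prefix advice, the sketch below splits a long text into several object payloads that share a `key_prefix`. The helper name, the `chunk_index` metadata field, and the fixed character limit are all illustrative assumptions. A real pipeline would likely split on paragraph or section boundaries instead.

```python
def chunk_text_into_objects(text, property_name, key_prefix, max_chars=2000, metadata=None):
    """Split text into object payloads holding at most max_chars characters each."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    objects = []
    for part, start in enumerate(range(0, len(text), max_chars)):
        chunk_metadata = dict(metadata or {})
        chunk_metadata["chunk_index"] = part  # hypothetical field, used to restore order
        objects.append({
            "key_prefix": key_prefix,  # shared prefix keeps the chunks grouped downstream
            "metadata": chunk_metadata,
            "blobs": [{
                "property": property_name,
                "type": "text",
                "data": text[start:start + max_chars],
            }],
        })
    return objects
```

Each returned payload follows the object schema shown earlier, so the chunks can be created one by one and then processed together in a batch.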
Objects are the immutable source of truth. Once registered, they can feed any number of collections, extractors, and enrichment pipelines without re-uploading files.