Batches let you process many objects in one asynchronous job. They store the list of object IDs, generate extractor manifests, and provide a task handle so you can monitor progress.

  1. Create batch – supply object IDs (or create an empty batch and add objects later).
  2. Submit batch – the API flattens manifests into per-extractor Parquet artifacts and writes them to S3.
  3. Engine processes – Ray pollers pick up the batch, execute extractors tier-by-tier, and write documents to Qdrant.
  4. Webhook & cache updates – the Engine emits webhook events, Celery Beat invalidates caches, and collection schemas update.

Create a Batch

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "batch_name": "products-2025-10-28",
    "object_ids": ["obj_abc", "obj_def"]
  }'
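
To script follow-up calls, you can capture the new batch's ID from the response. A minimal sketch, assuming the response JSON exposes a top-level batch_id field (check your actual payload) and jq is installed:

BATCH_ID=$(curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "batch_name": "products-2025-10-28", "object_ids": ["obj_abc", "obj_def"] }' \
  | jq -r '.batch_id')   # "batch_id" field name is an assumption
echo "created batch: $BATCH_ID"
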
Add more objects later:
curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_xyz"] }'

Submit for Processing

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "include_processing_history": true }'
  • include_processing_history=true records each enrichment operation in internal_metadata.processing_history.
  • The response contains a task_id; poll /v1/tasks/<task_id> (a minimal loop follows below) or check the batch resource directly.
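
A minimal polling loop, assuming the task JSON exposes a top-level status field with the values listed under Lifecycle & Status below (jq required):

while true; do
  STATUS=$(curl -sS "$MP_API_URL/v1/tasks/<task_id>" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" | jq -r '.status')   # "status" field name is an assumption
  echo "task status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED) break ;;   # terminal states per the lifecycle table
  esac
  sleep 5   # matches the poller's own 5-second cadence
done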

Lifecycle & Status

Status        Meaning
DRAFT         Created but not submitted
QUEUED        Submitted; waiting for poller pickup
PROCESSING    Ray job running feature extractors
COMPLETED     All extractors finished successfully
FAILED        Extractors or Ray job failed (see error_message)
Status updates synchronize to both the batch resource and the associated task.
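
For a one-shot scripted check against the batch resource, a minimal sketch (the status and error_message field names are assumptions inferred from the table above):

# Hypothetical one-shot check; inspect the raw JSON first to confirm field names.
curl -sS "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  | jq '{status, error_message}'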

Under the Hood

  1. API writes manifest metadata to MongoDB and extractor row artifacts to S3.
  2. Ray poller queries MongoDB every 5 seconds for PENDING batches.
  3. Poller submits a Ray job with manifest details.
  4. Worker downloads artifacts, runs extractors in dependency tiers, and writes documents to Qdrant/MongoDB.
  5. QdrantBatchProcessor emits webhook events and updates collection index signatures.

Monitoring

  • GET /v1/buckets/<bucket_id>/batches/<batch_id> – check batch status and manifest metadata.
  • GET /v1/tasks/<task_id> – track task-level progress (Redis TTL ≈ 24h).
  • Webhook events (collection.documents.written) notify you when documents land.
  • Analytics (coming soon) provide throughput metrics for Extractor + Engine performance.

Scaling Tips

  • Chunk large imports into batches of 1k–10k objects to keep pollers responsive (see the sketch after this list).
  • Parallelize submissions—pollers handle multiple batches concurrently.
  • Use namespaces to isolate environments; pollers are namespace-aware.
  • Retry safely—batch submission and task polling are idempotent.
  • Schedule pipelines with Celery Beat or your orchestrator to submit batches on cron.
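
A sketch of the chunked import mentioned above, assuming one object ID per line in ids.txt and using the add-objects endpoint shown earlier (split and jq required):

split -l 1000 ids.txt chunk_          # 1,000 IDs per chunk file
for f in chunk_*; do
  PAYLOAD=$(jq -R . "$f" | jq -s '{object_ids: .}')   # lines -> {"object_ids": [...]}
  curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
done
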
Batching keeps ingestion resilient and scalable—separate raw uploads from heavy compute, then let the Engine take over on its own schedule.