Batches let you process many objects in one asynchronous job. They store the list of object IDs, generate extractor manifests, and provide a task handle so you can monitor progress.

  1. Create batch – supply object IDs (or create an empty batch and add objects later).
  2. Submit batch – the API flattens manifests into per-extractor Parquet artifacts and writes them to S3.
  3. Engine processes – Ray pollers pick up the batch, execute extractors tier-by-tier, and write documents to Qdrant.
  4. Webhook & cache updates – the Engine emits webhook events, Celery Beat invalidates caches, and collection schemas update.

Create a Batch

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "batch_name": "products-2025-10-28",
    "object_ids": ["obj_abc", "obj_def"]
  }'
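
To script follow-up calls, you can capture the new batch's ID from the response. A minimal sketch, assuming the response JSON exposes a top-level batch_id field (check your actual payload) and jq is installed:

BATCH_ID=$(curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "batch_name": "products-2025-10-28", "object_ids": ["obj_abc", "obj_def"] }' \
  | jq -r '.batch_id')   # "batch_id" field name is an assumption
echo "created batch: $BATCH_ID"
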
Add more objects later:
curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "object_ids": ["obj_xyz"] }'

Submit for Processing

curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/submit" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{ "include_processing_history": true }'
  • include_processing_history=true records each enrichment operation in internal_metadata.processing_history.
  • The response contains a task_id; poll /v1/tasks/<task_id> (a minimal loop follows below) or check the batch resource directly.
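
A minimal polling loop, assuming the task JSON exposes a top-level status field with the values listed under Lifecycle & Status below (jq required):

while true; do
  STATUS=$(curl -sS "$MP_API_URL/v1/tasks/<task_id>" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" | jq -r '.status')   # "status" field name is an assumption
  echo "task status: $STATUS"
  case "$STATUS" in
    COMPLETED|FAILED) break ;;   # terminal states per the lifecycle table
  esac
  sleep 5   # matches the poller's own 5-second cadence
done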

Lifecycle & Status

Status        Meaning
DRAFT         Created but not submitted
QUEUED        Submitted; waiting for poller pickup
PROCESSING    Ray job running feature extractors
COMPLETED     All extractors finished successfully
FAILED        Extractors or Ray job failed (see error_message)
Status updates synchronize to both the batch resource and the associated task.
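
For a one-shot scripted check against the batch resource, a minimal sketch (the status and error_message field names are assumptions inferred from the table above):

# Hypothetical one-shot check; inspect the raw JSON first to confirm field names.
curl -sS "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  | jq '{status, error_message}'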

Under the Hood

  1. API writes manifest metadata to MongoDB and extractor row artifacts to S3.
  2. Ray poller queries MongoDB every 5 seconds for PENDING batches.
  3. Poller submits a Ray job with manifest details.
  4. Worker downloads artifacts, runs extractors in dependency tiers, and writes documents to Qdrant/MongoDB.
  5. QdrantBatchProcessor emits webhook events and updates collection index signatures.

Monitoring

  • GET /v1/buckets/<bucket_id>/batches/<batch_id> – check batch status and manifest metadata.
  • GET /v1/tasks/<task_id> – track task-level progress (Redis TTL ≈ 24h).
  • Webhook events (collection.documents.written) notify you when documents land.
  • Analytics (coming soon) provide throughput metrics for Extractor + Engine performance.

Scaling Tips

  • Chunk large imports into batches of 1k–10k objects to keep pollers responsive (see the sketch after this list).
  • Parallelize submissions—pollers handle multiple batches concurrently.
  • Use namespaces to isolate environments; pollers are namespace-aware.
  • Retry safely—batch submission and task polling are idempotent.
  • Schedule pipelines with Celery Beat or your orchestrator to submit batches on cron.
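
A sketch of the chunked import mentioned above, assuming one object ID per line in ids.txt and using the add-objects endpoint shown earlier (split and jq required):

split -l 1000 ids.txt chunk_          # 1,000 IDs per chunk file
for f in chunk_*; do
  PAYLOAD=$(jq -R . "$f" | jq -s '{object_ids: .}')   # lines -> {"object_ids": [...]}
  curl -sS -X POST "$MP_API_URL/v1/buckets/<bucket_id>/batches/<batch_id>/objects" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
done
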
Batching keeps ingestion resilient and scalable—separate raw uploads from heavy compute, then let the Engine take over on its own schedule.