Mixpeek provides multiple observability surfaces: health endpoints, task metadata, Ray dashboards, analytics APIs, and webhook histories. Combine them to detect regressions early and debug production issues quickly.

Health & Status

  • GET /v1/health – checks MongoDB, Qdrant, Redis, Celery, Engine, and ClickHouse (if analytics is enabled). Returns OK or DEGRADED with per-service errors.
  • Tasks API – /v1/tasks/{task_id} and /v1/tasks/list expose status for batches, clustering jobs, taxonomy materialization, and migrations; every task reports a TaskStatusEnum status (see the polling sketch after this list).
  • Webhooks – webhook events recorded in MongoDB provide a durable log of ingestion and enrichment milestones (collection.documents.written, etc.).
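For automated checks, here is a minimal polling sketch in Python. The base URL, bearer-token header, and response shapes in the comments are assumptions; adjust them to your deployment and the actual API schema.

```python
import requests

BASE_URL = "https://api.mixpeek.com"             # assumption: your API base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # assumption: bearer-token auth

def check_health() -> dict:
    """Call GET /v1/health and surface any degraded services."""
    resp = requests.get(f"{BASE_URL}/v1/health", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    body = resp.json()
    # Assumed shape: {"status": "OK" | "DEGRADED", "services": {...}}
    if body.get("status") != "OK":
        print("DEGRADED:", body.get("services"))
    return body

def check_task(task_id: str) -> str:
    """Call GET /v1/tasks/{task_id} and return its TaskStatusEnum value."""
    resp = requests.get(f"{BASE_URL}/v1/tasks/{task_id}", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Assumed field name for the status value.
    return resp.json().get("status", "UNKNOWN")

check_health()
```

Wire a check like this into your uptime monitor so a DEGRADED response pages before user-facing errors do.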

Engine Monitoring

  • Ray Dashboard (port 8265) – view worker health, task timelines, Serve deployments, resource utilization, and logs.
  • Ray logs – pod logs (Kubernetes) or the Ray CLI provide detailed extractor and clustering output (ray job logs <job_id>); see the sketch after this list for programmatic access.
  • Serve metrics – per-model latency and request counts; scrape them via Prometheus or the Ray metrics endpoint.
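The same information is available programmatically through Ray's job submission API. A minimal sketch, assuming the Ray dashboard is reachable on its default port 8265 and you have a Ray job ID from Mixpeek task metadata:

```python
from ray.job_submission import JobSubmissionClient

# Assumes the Ray dashboard is reachable on the default port 8265.
client = JobSubmissionClient("http://127.0.0.1:8265")

# List recent jobs and their statuses (PENDING, RUNNING, SUCCEEDED, FAILED, ...).
for job in client.list_jobs():
    print(job.submission_id, job.status)

# Fetch the full log output for one job, e.g. a Ray job ID captured from
# Mixpeek task metadata (placeholder below).
logs = client.get_job_logs("<ray_job_id>")
print(logs[-2000:])  # tail the last ~2 KB
```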

Analytics APIs

Enable analytics (ENABLE_ANALYTICS=true) to populate ClickHouse-backed metrics:
  • /v1/analytics/retrievers/{id}/performance – Query volume, latency percentiles
  • /v1/analytics/retrievers/{id}/stages – Stage-level timing and candidate counts
  • /v1/analytics/retrievers/{id}/signals – Cache hits, rerank scores, filter reductions
  • /v1/analytics/retrievers/{id}/cache-performance – Hit/miss rates and latency delta
  • /v1/analytics/retrievers/{id}/slow-queries – Top slow queries with execution context
  • /v1/analytics/usage/summary – Credit and resource usage (billing support)
Use these APIs to populate dashboards or feed alerting systems.
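For example, a minimal alerting sketch against the performance endpoint; the base URL, auth header, and latency_p95_ms field name are assumptions, so adjust them to the actual response schema.

```python
import requests

BASE_URL = "https://api.mixpeek.com"             # assumption
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # assumption

def latency_alert(retriever_id: str, p95_budget_ms: float) -> None:
    """Flag a retriever whose p95 latency exceeds its budget."""
    url = f"{BASE_URL}/v1/analytics/retrievers/{retriever_id}/performance"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    stats = resp.json()
    p95 = stats.get("latency_p95_ms")  # assumed field name; check the real schema
    if p95 is not None and p95 > p95_budget_ms:
        print(f"ALERT: retriever {retriever_id} p95 {p95}ms > budget {p95_budget_ms}ms")

latency_alert("<retriever_id>", p95_budget_ms=500)
```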

Logging & Tracing

  • API layer – structured JSON logs include request IDs, namespace, HTTP status, error codes, and downstream latency.
  • Celery workers – log task execution, retries, and webhook dispatch results.
  • Ray workers – include extractor metrics, batch IDs, and queue stats; aggregate logs centrally for long-term retention.
  • Correlation – propagate x-request-id from API to Engine jobs via additional_data.request_id to stitch traces together.
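A minimal correlation sketch, assuming a bearer-token header and an ingestion endpoint that accepts an additional_data passthrough; the endpoint path in the example call is illustrative only.

```python
import logging
import uuid
import requests

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("mixpeek-client")

BASE_URL = "https://api.mixpeek.com"             # assumption
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # assumption

def submit_with_correlation(endpoint: str, payload: dict) -> dict:
    """Attach one x-request-id to both the HTTP request and the job payload
    so API logs and Engine job logs can be joined on the same ID."""
    request_id = str(uuid.uuid4())
    headers = {**HEADERS, "x-request-id": request_id}
    body = {**payload, "additional_data": {"request_id": request_id}}
    log.info("request_id=%s POST %s", request_id, endpoint)
    resp = requests.post(f"{BASE_URL}{endpoint}", headers=headers, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()

# Illustrative ingestion path; substitute the real endpoint for your workload.
# submit_with_correlation("/v1/ingest", {"documents": []})
```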

Metrics to Track

  • API – Request rate, p95 latency, error rate, rate-limit hits
  • Celery – Queue depth, task execution time, retry count
  • Ray – Worker utilization (CPU/GPU), job duration, Serve requests in flight
  • MongoDB – Operation latency, primary health, replication lag
  • Qdrant – Memory usage, search latency, vector count per namespace
  • Redis – Connection count, command latency, cache hit ratio
Integrate with Prometheus, Datadog, or your preferred metrics stack via existing exporters or custom scrapers.
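As one custom-scraper option, here is a hypothetical Prometheus exporter that converts the /v1/health response into per-service gauges; the base URL, auth header, and response shape are assumptions.

```python
import time
import requests
from prometheus_client import Gauge, start_http_server

BASE_URL = "https://api.mixpeek.com"             # assumption
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # assumption

# 1 when a dependency reports healthy, 0 otherwise.
SERVICE_UP = Gauge("mixpeek_service_up", "Mixpeek dependency health", ["service"])

def scrape_health() -> None:
    resp = requests.get(f"{BASE_URL}/v1/health", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Assumed shape: {"services": {"mongodb": {"status": "OK"}, ...}}
    for name, info in resp.json().get("services", {}).items():
        SERVICE_UP.labels(service=name).set(1 if info.get("status") == "OK" else 0)

if __name__ == "__main__":
    start_http_server(9108)  # port for Prometheus to scrape this exporter
    while True:
        scrape_health()
        time.sleep(30)
```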

Alerting Playbook

  1. Latency spike → check retriever analytics, stage statistics, and Ray Serve load.
  2. Task backlog → inspect Celery queue length, Redis health, and Ray worker availability.
  3. Failed enrichment → query /v1/tasks/list for FAILED tasks, inspect error_message, and review webhook events (see the sketch after this list).
  4. Storage saturation → monitor Qdrant RAM usage and MongoDB disk consumption; scale storage or shard by namespace.
  5. Cache regression → view cache hit-rate endpoint; adjust TTLs or stage cache configuration.
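A minimal failed-task check for step 3, assuming /v1/tasks/list accepts an unfiltered GET and returns task objects with status and error_message fields; confirm both against the real API before relying on it.

```python
import requests

BASE_URL = "https://api.mixpeek.com"             # assumption
HEADERS = {"Authorization": "Bearer <API_KEY>"}  # assumption

def failed_tasks() -> list[dict]:
    """Return tasks currently marked FAILED, with their error messages."""
    resp = requests.get(f"{BASE_URL}/v1/tasks/list", headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Assumed shape: {"tasks": [{"task_id": ..., "status": ..., "error_message": ...}]}
    tasks = resp.json().get("tasks", [])
    return [t for t in tasks if t.get("status") == "FAILED"]

for task in failed_tasks():
    print(task.get("task_id"), task.get("error_message"))
```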

Dashboards to Build

  • API dashboard – health endpoint status, request latency, error breakdown, rate-limit counters.
  • Engine dashboard – Ray worker utilization, job runtime percentiles, extractor throughput, Serve queue depth.
  • Retrieval performance – retriever analytics charts (latency, cache hits, slow queries).
  • Storage dashboard – MongoDB/Redis/Qdrant metrics for capacity planning.
  • Task tracker – open tasks by status, median processing times, failure rates.

Incident Response Tips

  • Keep runbooks for common failures (e.g., extractor timeouts, Qdrant restarts).
  • Use webhook history to confirm whether ingestion completed or stalled.
  • Capture Ray job IDs from task metadata to replay logs quickly.
  • Snapshot retriever and collection configurations when debugging to ensure you’re reproducing the same pipeline.
With health checks, task metadata, analytics APIs, and Ray observability, you can confidently operate Mixpeek in production and catch issues before users notice.