Health & Status
- `GET /v1/health` – checks MongoDB, Qdrant, Redis, Celery, Engine, and ClickHouse (if analytics is enabled). Returns `OK` or `DEGRADED` with per-service errors.
- Tasks API – `/v1/tasks/{task_id}` and `/v1/tasks/list` expose status for batches, clustering jobs, taxonomy materialization, and migrations. All tasks use `TaskStatusEnum`.
- Webhooks – webhook events recorded in MongoDB provide a durable log of ingestion and enrichment milestones (`collection.documents.written`, etc.).
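A minimal polling sketch of these two checks, assuming a bearer-token API and illustrative JSON field names (base URL, auth scheme, and response layout are assumptions, not a documented contract):

```python
import requests

API = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}   # auth scheme is an assumption

# Check overall platform health; the per-service error layout shown is illustrative.
health = requests.get(f"{API}/v1/health", headers=HEADERS, timeout=10).json()
if health.get("status") != "OK":
    print("DEGRADED services:", health.get("errors"))

# Poll a long-running task (batch, clustering job, taxonomy materialization, ...).
task = requests.get(f"{API}/v1/tasks/<task_id>", headers=HEADERS, timeout=10).json()
print(task.get("status"))  # one of the TaskStatusEnum values
```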
Engine Monitoring
- Ray Dashboard (port 8265) – view worker health, task timelines, Serve deployments, resource utilization, and logs.
- Ray logs – pod logs (Kubernetes) or the Ray CLI provide detailed extractor and clustering output (`ray logs <job_id>`).
- Serve metrics – per-model latency and request counts; scrape via Prometheus or the Ray metrics endpoint.
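A quick sketch for spot-checking the Prometheus-format metrics Ray exposes; the head-node hostname and metrics export port are assumptions for your cluster:

```python
import requests

# Ray exports Prometheus-format metrics from each node; adjust host/port
# to match your cluster's metrics export configuration.
RAY_METRICS_URL = "http://ray-head:8080/metrics"  # hypothetical hostname/port

text = requests.get(RAY_METRICS_URL, timeout=10).text
# Surface Serve-related series (request counts, latency) for a quick look.
for line in text.splitlines():
    if "serve" in line and not line.startswith("#"):
        print(line)
```

In production you would normally let Prometheus scrape this endpoint and build dashboards on top, rather than polling it ad hoc.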
Analytics APIs
Enable analytics (`ENABLE_ANALYTICS=true`) to populate ClickHouse-backed metrics:
| Endpoint | Insight |
|---|---|
| `/v1/analytics/retrievers/{id}/performance` | Query volume, latency percentiles |
| `/v1/analytics/retrievers/{id}/stages` | Stage-level timing and candidate counts |
| `/v1/analytics/retrievers/{id}/signals` | Cache hits, rerank scores, filter reductions |
| `/v1/analytics/retrievers/{id}/cache-performance` | Hit/miss rates and latency delta |
| `/v1/analytics/retrievers/{id}/slow-queries` | Top slow queries with execution context |
| `/v1/analytics/usage/summary` | Credit and resource usage (billing support) |
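For example, a hedged sketch of pulling retriever performance numbers; the query parameters and response field names below are assumptions, so inspect the actual payload in your deployment:

```python
import requests

API = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

retriever_id = "<retriever_id>"
resp = requests.get(
    f"{API}/v1/analytics/retrievers/{retriever_id}/performance",
    params={"from": "2024-01-01", "to": "2024-01-31"},  # assumed time-range params
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()
stats = resp.json()
# Field names are illustrative; the endpoint reports query volume and latency percentiles.
print(stats.get("query_count"), stats.get("p95_latency_ms"))
```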
Logging & Tracing
- API layer – structured JSON logs include request IDs, namespace, HTTP status, error codes, and downstream latency.
- Celery workers – log task execution, retries, and webhook dispatch results.
- Ray workers – include extractor metrics, batch IDs, and queue stats; aggregate logs centrally for long-term retention.
- Correlation – propagate `x-request-id` from the API to Engine jobs via `additional_data.request_id` to stitch traces together.
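One way to wire up that correlation; the endpoint path and payload fields here are illustrative, not the platform's actual SDK:

```python
import uuid
import requests

API = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

# Generate (or reuse) a request ID and send it both as the x-request-id header
# and inside additional_data so Engine-side logs carry the same identifier.
request_id = str(uuid.uuid4())
payload = {
    "documents": [],                                   # illustrative field
    "additional_data": {"request_id": request_id},
}
resp = requests.post(
    f"{API}/v1/collections/<collection_id>/documents",  # hypothetical endpoint
    json=payload,
    headers={**HEADERS, "x-request-id": request_id},
    timeout=30,
)
print(request_id, resp.status_code)  # grep logs across API, Celery, and Ray by this ID
```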
Metrics to Track
| Component | Key Metrics |
|---|---|
| API | Request rate, p95 latency, error rate, rate-limit hits |
| Celery | Queue depth, task execution time, retry count |
| Ray | Worker utilization (CPU/GPU), job duration, Serve requests in flight |
| MongoDB | Operation latency, primary health, replication lag |
| Qdrant | Memory usage, search latency, vector count per namespace |
| Redis | Connection count, command latency, cache hit ratio |
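Some of these metrics come from existing exporters; others are easy to sample directly. For instance, Celery queue depth can be read straight from the broker, as in this sketch that assumes Redis is the broker and the default `celery` queue name:

```python
import redis

# Connect to the Redis instance that backs Celery (URL is an assumption).
r = redis.Redis.from_url("redis://localhost:6379/0")

# With the Redis broker, each Celery queue is a list; LLEN gives its backlog.
for queue in ("celery",):  # add your named queues here
    depth = r.llen(queue)
    print(f"{queue}: {depth} pending tasks")
```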
Alerting Playbook
- Latency spike → check retriever analytics, stage statistics, and Ray Serve load.
- Task backlog → inspect Celery queue length, Redis health, and Ray worker availability.
- Failed enrichment → query `/v1/tasks/list` for `FAILED` tasks, inspect `error_message`, and review webhook events (see the sketch after this list).
- Storage saturation → monitor Qdrant RAM usage and MongoDB disk consumption; scale storage or shard by namespace.
- Cache regression → view cache hit-rate endpoint; adjust TTLs or stage cache configuration.
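A small sketch for the failed-enrichment check; the `status` filter parameter and response shape are assumptions:

```python
import requests

API = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

# List recent tasks and keep only failures; the filter parameter name is assumed.
resp = requests.get(
    f"{API}/v1/tasks/list",
    params={"status": "FAILED"},
    headers=HEADERS,
    timeout=10,
)
resp.raise_for_status()
for task in resp.json().get("tasks", []):       # response field names are illustrative
    print(task.get("task_id"), task.get("error_message"))
```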
Dashboards to Build
- API dashboard – health endpoint status, request latency, error breakdown, rate-limit counters.
- Engine dashboard – Ray worker utilization, job runtime percentiles, extractor throughput, Serve queue depth.
- Retrieval performance – retriever analytics charts (latency, cache hits, slow queries).
- Storage dashboard – MongoDB/Redis/Qdrant metrics for capacity planning.
- Task tracker – open tasks by status, median processing times, failure rates.
Incident Response Tips
- Keep runbooks for common failures (e.g., extractor timeouts, Qdrant restarts).
- Use webhook history to confirm whether ingestion completed or stalled.
- Capture Ray job IDs from task metadata to replay logs quickly (see the sketch after this list).
- Snapshot retriever and collection configurations when debugging to ensure you’re reproducing the same pipeline.
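For the Ray-log tip, a hedged sketch; the metadata field holding the Ray job ID is an assumption for your deployment:

```python
import subprocess
import requests

API = "https://api.example.com"                 # hypothetical base URL
HEADERS = {"Authorization": "Bearer <token>"}

# Fetch the task record and pull the Ray job ID out of its metadata
# (the "ray_job_id" field name is illustrative).
task = requests.get(f"{API}/v1/tasks/<task_id>", headers=HEADERS, timeout=10).json()
ray_job_id = task.get("metadata", {}).get("ray_job_id")

if ray_job_id:
    # Replay the job's logs via the Ray CLI mentioned under Engine Monitoring.
    subprocess.run(["ray", "logs", ray_job_id], check=False)
```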

