Benchmarks replay real user sessions against multiple retriever configurations to measure how changes affect search quality before deploying to production.

Create a Benchmark

curl -X POST "$MP_API_URL/v1/retrievers/benchmarks" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{
    "benchmark_name": "semantic_vs_hybrid",
    "baseline_retriever_id": "ret_semantic_v1",
    "candidate_retriever_ids": ["ret_hybrid_v1"],
    "session_filter": {
      "created_after": "2025-01-01T00:00:00Z",
      "min_interactions": 1
    },
    "session_count": 100
  }'
| Parameter | Description |
| --- | --- |
| baseline_retriever_id | Your current production retriever |
| candidate_retriever_ids | Configurations to test against the baseline |
| session_filter | Which historical sessions to replay |
| session_count | Number of sessions to include |
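
The same request can be assembled programmatically. The following is a minimal sketch, not an official client: the helper name `build_benchmark_request` is hypothetical, the endpoint path and field names are copied from the curl example, and the `Content-Type: application/json` header and the two client-side sanity checks are assumptions rather than documented API rules.

```python
import json
import urllib.request


def build_benchmark_request(api_url, api_key, namespace, name, baseline_id,
                            candidate_ids, session_filter, session_count):
    """Build (but do not send) the POST request that creates a benchmark."""
    # Client-side sanity checks only; the API may enforce different rules.
    if not candidate_ids:
        raise ValueError("at least one candidate retriever is required")
    if session_count <= 0:
        raise ValueError("session_count must be positive")

    payload = {
        "benchmark_name": name,
        "baseline_retriever_id": baseline_id,
        "candidate_retriever_ids": list(candidate_ids),
        "session_filter": dict(session_filter),
        "session_count": session_count,
    }
    return urllib.request.Request(
        url=f"{api_url.rstrip('/')}/v1/retrievers/benchmarks",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "X-Namespace": namespace,
            "Content-Type": "application/json",  # assumed, see lead-in
        },
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the payload easy to inspect or log before any sessions are replayed.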

Execute Benchmark

curl -X POST "$MP_API_URL/v1/retrievers/benchmarks/{benchmark_id}/execute" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
Execution progresses through: PENDING → BUILDING_SESSIONS → REPLAYING → COMPUTING_METRICS → COMPLETED
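
Because execution moves through several stages, a client typically polls until the benchmark reaches COMPLETED. The sketch below shows one way to do that. It assumes a caller-supplied `fetch_status` function (hypothetical) that returns the current `status` string, for example by calling the Get Results endpoint. The poll interval and timeout are illustrative, and any status outside the five listed stages is treated as an error because this page documents no other states.

```python
import time

# Stage names in the order this page lists them.
STAGES = ["PENDING", "BUILDING_SESSIONS", "REPLAYING",
          "COMPUTING_METRICS", "COMPLETED"]


def wait_for_completion(fetch_status, poll_seconds=5.0, timeout_seconds=600.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until it returns COMPLETED.

    Returns the distinct stages observed, in the order they were seen.
    """
    seen = []
    deadline = clock() + timeout_seconds
    while True:
        status = fetch_status()
        if status not in STAGES:
            # Undocumented state (possibly a failure): stop rather than spin.
            raise RuntimeError(f"unexpected benchmark status: {status!r}")
        if not seen or seen[-1] != status:
            seen.append(status)
        if status == "COMPLETED":
            return seen
        if clock() >= deadline:
            raise TimeoutError(
                f"benchmark still {status} after {timeout_seconds}s")
        sleep(poll_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays.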

Get Results

curl "$MP_API_URL/v1/retrievers/benchmarks/{benchmark_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
{
  "benchmark_id": "bench_abc123",
  "status": "COMPLETED",
  "comparison": {
    "baseline": {
      "retriever_id": "ret_semantic_v1",
      "precision_at_10": 0.72,
      "mrr": 0.81,
      "avg_latency_ms": 145
    },
    "candidates": [
      {
        "retriever_id": "ret_hybrid_v1",
        "precision_at_10": 0.78,
        "mrr": 0.85,
        "avg_latency_ms": 168,
        "delta": { "precision_at_10": "+8.3%", "mrr": "+4.9%" }
      }
    ]
  }
}
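
The `delta` values in this sample are consistent with a relative change against the baseline, (candidate - baseline) / baseline. That reading is inferred from the numbers rather than stated by the API, so treat the following as a sketch for checking results by hand. The function name `relative_delta` is hypothetical, and the latency delta is computed here only for illustration; it does not appear in the sample response.

```python
def relative_delta(baseline, candidate):
    """Format the relative change from baseline to candidate, e.g. '+8.3%'."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    change = (candidate - baseline) / baseline * 100
    return f"{change:+.1f}%"


# Values copied from the sample response.
baseline = {"precision_at_10": 0.72, "mrr": 0.81, "avg_latency_ms": 145}
candidate = {"precision_at_10": 0.78, "mrr": 0.85, "avg_latency_ms": 168}

deltas = {m: relative_delta(baseline[m], candidate[m]) for m in baseline}
print(deltas)
# {'precision_at_10': '+8.3%', 'mrr': '+4.9%', 'avg_latency_ms': '+15.9%'}
```

Computing the latency change alongside the quality metrics makes the trade-off explicit: in this sample the hybrid candidate ranks better but is roughly 16% slower on average.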

Metrics

| Metric | Description |
| --- | --- |
| precision_at_k | Fraction of the top-K results that were relevant |
| recall_at_k | Fraction of all relevant items that appear in the top-K |
| mrr | Mean Reciprocal Rank: average of 1/rank of the first relevant result |
| ndcg_at_k | Normalized Discounted Cumulative Gain over the top-K |
| avg_latency_ms | Average execution time in milliseconds |
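
For reference, the ranking metrics can be written out using their standard textbook definitions with binary relevance. This is a sketch, not the service's implementation: how the benchmark handles graded relevance, ties, or result lists shorter than K is not documented here, and the function names are illustrative.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k positions filled by relevant items."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1/position of the first relevant item, or 0.0 if none is returned."""
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of (ranked, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


ranked = ["d3", "d1", "d7", "d2"]   # results in returned order
relevant = {"d1", "d2", "d9"}       # items the user interacted with
print(precision_at_k(ranked, relevant, 4))  # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5
```

Here two of the four results are relevant, so precision at 4 is 0.5, and the first relevant result sits at position 2, so its reciprocal rank is 0.5.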

Session Filters

| Filter | Description |
| --- | --- |
| created_after / created_before | Time range for sessions |
| retriever_ids | Only sessions from specific retrievers |
| min_interactions | Minimum number of user interactions (clicks, feedback) |
| has_positive_feedback | Sessions with positive signals only |
Sessions with interactions provide ground truth for relevance. Use min_interactions: 1 to ensure meaningful comparison data.
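
To make the filter semantics concrete, the sketch below applies a `session_filter` to locally held session records. Everything about the `Session` type is hypothetical, and two behaviors are assumptions: the time bounds are treated as exclusive, and all conditions that are present are combined with AND. The service's actual matching rules may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Session:
    """Hypothetical local view of a recorded session."""
    retriever_id: str
    created_at: datetime      # timezone-aware
    interactions: int         # clicks plus feedback events
    positive_feedback: bool


def _parse(ts):
    # Accept the trailing 'Z' used in the example timestamps.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def matches(session, flt):
    """Return True if the session passes every condition present in flt."""
    if "created_after" in flt and session.created_at <= _parse(flt["created_after"]):
        return False
    if "created_before" in flt and session.created_at >= _parse(flt["created_before"]):
        return False
    if "retriever_ids" in flt and session.retriever_id not in flt["retriever_ids"]:
        return False
    if session.interactions < flt.get("min_interactions", 0):
        return False
    if flt.get("has_positive_feedback") and not session.positive_feedback:
        return False
    return True


session = Session("ret_semantic_v1",
                  datetime(2025, 2, 1, tzinfo=timezone.utc), 2, False)
flt = {"created_after": "2025-01-01T00:00:00Z", "min_interactions": 1}
print(matches(session, flt))  # True
```

A session with zero interactions would fail the `min_interactions: 1` condition, which is the case the note above is guarding against: without interactions there is nothing to judge relevance by.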

Limitations

  • Replays the recorded query only; the real-time user context of the original session is not reproduced
  • Relevance is inferred from historical interactions
  • Results may vary if collection data has changed since the original sessions were recorded