Create a Benchmark
| Parameter | Description |
|---|---|
| `baseline_retriever_id` | Your current production retriever |
| `candidate_retriever_ids` | Configurations to test against the baseline |
| `session_filter` | Which historical sessions to replay |
| `session_count` | Number of sessions to include |
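The exact endpoint and response shape depend on your deployment; as a minimal sketch, assuming a hypothetical `POST /benchmarks` endpoint that accepts these parameters as JSON (the host, retriever IDs, and the `id` response field are all placeholders):

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical host

payload = {
    "baseline_retriever_id": "retriever_prod_v1",        # current production retriever
    "candidate_retriever_ids": ["retriever_hybrid_v2"],  # configurations to test
    "session_filter": {"min_interactions": 1},           # which historical sessions to replay
    "session_count": 500,                                # number of sessions to include
}

resp = requests.post(f"{BASE_URL}/benchmarks", json=payload, timeout=30)
resp.raise_for_status()
benchmark_id = resp.json()["id"]  # assumed response field
print(f"Created benchmark {benchmark_id}")
```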
Execute Benchmark
`PENDING` → `BUILDING_SESSIONS` → `REPLAYING` → `COMPUTING_METRICS` → `COMPLETED`
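Execution is asynchronous, so clients typically poll until the run reaches a terminal state. A minimal polling sketch, assuming a hypothetical `GET /benchmarks/{id}` endpoint that returns a `status` field (a `FAILED` error state is also an assumption):

```python
import time

import requests

TERMINAL_STATES = {"COMPLETED", "FAILED"}  # FAILED is an assumed error state

def wait_for_benchmark(base_url: str, benchmark_id: str, poll_seconds: float = 5.0) -> str:
    """Poll the (hypothetical) status endpoint until the run finishes."""
    while True:
        resp = requests.get(f"{base_url}/benchmarks/{benchmark_id}", timeout=30)
        resp.raise_for_status()
        status = resp.json()["status"]  # e.g. PENDING, BUILDING_SESSIONS, REPLAYING, ...
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
```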
Get Results
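Once the run reaches `COMPLETED`, fetch the per-retriever metrics. A sketch assuming a hypothetical `GET /benchmarks/{id}/results` endpoint whose response keys mirror the metric names listed below (the response shape is an assumption):

```python
import requests

BASE_URL = "https://api.example.com"  # hypothetical host
benchmark_id = "bm_123"               # returned by the creation call

resp = requests.get(f"{BASE_URL}/benchmarks/{benchmark_id}/results", timeout=30)
resp.raise_for_status()
results = resp.json()  # assumed shape: {"baseline": {...}, "candidates": {retriever_id: {...}}}

baseline = results["baseline"]
for retriever_id, metrics in results["candidates"].items():
    delta = metrics["ndcg_at_k"] - baseline["ndcg_at_k"]
    print(f"{retriever_id}: ndcg_at_k={metrics['ndcg_at_k']:.3f} ({delta:+.3f} vs baseline)")
```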
Metrics
| Metric | Description |
|---|---|
| `precision_at_k` | Fraction of the top-K results that were relevant |
| `recall_at_k` | Fraction of all relevant items found in the top K |
| `mrr` | Mean Reciprocal Rank: the mean of 1/rank of the first relevant result, averaged across sessions |
| `ndcg_at_k` | Normalized Discounted Cumulative Gain over the top K |
| `avg_latency_ms` | Average query execution time, in milliseconds |
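These are the standard information-retrieval definitions. For reference, a self-contained sketch of how they could be computed for one replayed query with binary relevance; this is an illustration, not the service's exact implementation. Averaging the per-query reciprocal rank over all sessions yields `mrr`.

```python
import math

def metrics_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> dict[str, float]:
    """Compute precision@k, recall@k, reciprocal rank, and NDCG@k for one query."""
    top_k = ranked_ids[:k]
    hits = [doc_id in relevant_ids for doc_id in top_k]

    precision = sum(hits) / k
    recall = sum(hits) / len(relevant_ids) if relevant_ids else 0.0

    # Reciprocal rank of the first relevant result (0 if none is retrieved).
    rr = next((1.0 / rank for rank, doc_id in enumerate(ranked_ids, start=1)
               if doc_id in relevant_ids), 0.0)

    # Binary-gain DCG at k, normalized by the ideal DCG (all relevant items ranked first).
    dcg = sum(hit / math.log2(i + 2) for i, hit in enumerate(hits))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    ndcg = dcg / idcg if idcg else 0.0

    return {"precision_at_k": precision, "recall_at_k": recall,
            "reciprocal_rank": rr, "ndcg_at_k": ndcg}
```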
Session Filters
| Filter | Description |
|---|---|
| `created_after` / `created_before` | Time range for sessions |
| `retriever_ids` | Only sessions from specific retrievers |
| `min_interactions` | Minimum number of user interactions (clicks, feedback) |
| `has_positive_feedback` | Only sessions with positive relevance signals |
Sessions with interactions provide ground truth for relevance. Set `min_interactions: 1` to ensure meaningful comparison data.
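For example, a filter that replays only recent, interaction-bearing sessions from the production retriever (the field names are as listed above; the wrapping structure and values are assumptions):

```python
# Hypothetical session_filter: one month of sessions from the production
# retriever that carry at least one click or feedback event.
session_filter = {
    "created_after": "2025-01-01T00:00:00Z",
    "created_before": "2025-02-01T00:00:00Z",
    "retriever_ids": ["retriever_prod_v1"],
    "min_interactions": 1,
}
```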
Limitations
- Replays the original query, not the real-time user context
- Relevance is inferred from historical interactions
- Results may vary if the collection's data has changed since the original sessions

