Metrics
| Metric | Formula | Prioritize When |
|---|---|---|
| Precision@K | (relevant in top K) / K | You need high accuracy in top results |
| Recall@K | (relevant in top K) / (total relevant) | You need to find all relevant items |
| F1@K | Harmonic mean of Precision and Recall | You need a balance of both |
| NDCG@K | Normalized Discounted Cumulative Gain | Ranking order matters (position-sensitive) |
| MAP | Mean Average Precision across queries | Overall retrieval quality |
| MRR | 1 / (position of first relevant result) | Users care about the first good result |
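For reference, the sketch below shows how these metrics can be computed for a single query's ranked result list with binary relevance judgments. The helper functions are illustrative only and are not part of any product API.

```python
import math
from typing import Sequence, Set

def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def f1_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of Precision@K and Recall@K."""
    p, r = precision_at_k(ranked, relevant, k), recall_at_k(ranked, relevant, k)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG: discounted gain of this ranking over the ideal gain."""
    dcg = sum(1.0 / math.log2(pos + 2)   # pos is 0-indexed, so discount is log2(rank + 1)
              for pos, doc_id in enumerate(ranked[:k]) if doc_id in relevant)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(min(len(relevant), k)))
    return dcg / idcg if idcg else 0.0

def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1 / position of the first relevant result; 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: 2 of the 3 relevant docs are retrieved in the top 4.
ranked = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d2", "d3"}
print(precision_at_k(ranked, relevant, 4))  # 0.5
print(recall_at_k(ranked, relevant, 4))     # 0.666...
print(reciprocal_rank(ranked, relevant))    # 1.0 (first result is relevant)
```

MRR and MAP are dataset-level metrics: the mean of the reciprocal rank and of average precision, respectively, taken over all queries in the evaluation.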
Interpreting Scores
- NDCG@10 = 0.89 — Your top-10 ranking achieves 89% of the ideal ordering's discounted gain. Most relevant docs appear near the top.
- Precision@10 = 0.75 — 7.5 out of every 10 results are relevant. Users see mostly relevant content.
- MRR = 0.93 — On average, the first relevant result appears between position 1 and 2. Users find what they need quickly.
- Recall@20 = 0.60 — You’re surfacing 60% of all relevant documents in the top 20. Good for discovery; may need a higher K or better embeddings for completeness.
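To make the MRR figure concrete, here is one hypothetical distribution of first-relevant positions that would produce a score of about 0.93; the percentages are invented purely for illustration.

```python
# If the first relevant result sits at rank 1 for 86% of queries and at rank 2
# for the remaining 14% (a made-up split), the mean reciprocal rank is:
mrr = 0.86 * (1 / 1) + 0.14 * (1 / 2)
print(mrr)  # ≈ 0.93
```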
Running an Evaluation
Create a ground truth dataset
A dataset is a collection of queries, each paired with the document IDs that are considered relevant.
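The exact dataset schema depends on the API; as a rough sketch, each entry pairs a query with the IDs of its relevant documents. The field names below are assumptions, not the documented schema.

```python
# Illustrative ground truth dataset: each query mapped to the documents judged relevant.
# "query" and "relevant_doc_ids" are placeholder field names.
ground_truth = [
    {"query": "how do I reset my password", "relevant_doc_ids": ["doc_104", "doc_221"]},
    {"query": "refund policy for annual plans", "relevant_doc_ids": ["doc_88"]},
    {"query": "export analytics data to CSV", "relevant_doc_ids": ["doc_12", "doc_15", "doc_311"]},
]
```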
Run the evaluation
Execute your retriever against every query in the dataset and compute metrics. This returns a task_id and an evaluation_id; the evaluation runs asynchronously.
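Below is a minimal sketch of the submit-and-poll flow, assuming an HTTP API. The base URL, endpoint paths, request field names, and response shape are assumptions for illustration; only the task_id, evaluation_id, the available metrics, and the metrics_by_k output come from this page.

```python
import time
import requests

BASE_URL = "https://api.example.com"  # placeholder, not the real endpoint

# Submit the evaluation run; per the docs it returns a task_id and an evaluation_id
# and executes asynchronously. Field names here are illustrative.
response = requests.post(
    f"{BASE_URL}/evaluations",
    json={
        "dataset": "support-queries-v1",    # ground truth dataset to evaluate against
        "retriever": "hybrid-bm25-dense",   # retriever configuration under test
        "k_values": [5, 10, 20],
        "metrics": ["precision", "recall", "f1", "map", "ndcg", "mrr"],
    },
)
job = response.json()
task_id, evaluation_id = job["task_id"], job["evaluation_id"]

# Poll until the asynchronous run finishes, then read the computed metrics.
while True:
    result = requests.get(f"{BASE_URL}/evaluations/{evaluation_id}").json()
    if result.get("status") in ("completed", "failed"):
        break
    time.sleep(5)

print(result.get("metrics_by_k"))
```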
Comparing Retriever Configurations
Run the same dataset against different retrievers, then compare the metrics_by_k results side by side. If the candidate configuration shows higher NDCG@10 and MAP, it is producing better rankings.
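As an illustration of that side-by-side comparison, the snippet below assumes metrics_by_k maps each K value to per-metric scores (that shape is a guess) and prints the delta for each metric.

```python
# Hypothetical metrics_by_k payloads from a baseline run and a candidate run.
# The shape (K value -> metric name -> score) is an assumption for illustration.
baseline = {"10": {"ndcg": 0.84, "map": 0.71, "precision": 0.70}}
candidate = {"10": {"ndcg": 0.89, "map": 0.75, "precision": 0.73}}

def compare_at_k(base: dict, cand: dict, k: str = "10") -> None:
    """Print each shared metric at K with the change from baseline to candidate."""
    for metric in sorted(set(base[k]) & set(cand[k])):
        print(f"{metric}@{k}: {base[k][metric]:.3f} -> {cand[k][metric]:.3f} "
              f"({cand[k][metric] - base[k][metric]:+.3f})")

compare_at_k(baseline, candidate)
```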
Configuration Reference
- Dataset: Name of the ground truth dataset to evaluate against.
- K values: The K values for computing Precision@K, Recall@K, NDCG@K, and F1@K. Include the K values that match your UI (e.g., if you show 10 results per page, include 10).
- Metrics: Which metrics to compute. Available: precision, recall, f1, map, ndcg, mrr.
Best Practices
- Use 50+ queries — Small datasets produce noisy metrics. Aim for at least 50 representative queries.
- Include head, torso, and tail queries — Don’t just test popular queries. Include rare, long-tail queries that are harder for the retriever.
- Test multiple K values — A retriever might have great Precision@5 but poor Recall@20. Multiple K values reveal the full picture.
- Re-evaluate after changes — Run evaluations after changing fusion strategy, adding rerank stages, or updating embeddings.
- Version your datasets — Keep ground truth datasets stable over time so you can track metric trends across retriever versions.
Evaluations vs Interactions vs Benchmarks
- Evaluations: Offline measurement against curated ground truth datasets. Run before deploying changes and after updating embeddings, fusion strategy, or stages. Input: ground truth query-document pairs you create. Output: NDCG, Precision, Recall, MAP, MRR, and F1 at multiple K values.
- Interactions: Signals from user behavior that can be used to build ground truth (see Interaction Signals under Related).
- Benchmarks: Live session replay comparison (see Benchmarks under Related).
Related
- Benchmarks — live session replay comparison
- Analytics — production monitoring and slow query detection
- Interaction Signals — building ground truth from user behavior
- API Reference: Run Evaluation — full API specification
- API Reference: Create Dataset — dataset creation API

