Evaluations answer the question “How good is my retriever?” by running the queries in a ground truth dataset through the retriever and comparing its results against known-relevant documents.
Evaluation workflow: Ground truth → Execute retriever → Compare → Metrics

Metrics

| Metric | Formula | Prioritize When |
|---|---|---|
| Precision@K | (relevant in top K) / K | You need high accuracy in the top results |
| Recall@K | (relevant in top K) / (total relevant) | You need to find all relevant items |
| F1@K | Harmonic mean of Precision@K and Recall@K | You need a balance of both |
| NDCG@K | Normalized Discounted Cumulative Gain | Ranking order matters (position-sensitive) |
| MAP | Mean Average Precision across queries | Overall retrieval quality |
| MRR | Mean of 1 / (rank of first relevant result) across queries | Users care about the first good result |
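
As a quick worked example: if a query has 4 relevant documents and the retriever returns 3 of them in its top 10, with the first relevant hit at position 2, then Precision@10 = 3/10 = 0.30, Recall@10 = 3/4 = 0.75, F1@10 = 2 × (0.30 × 0.75) / (0.30 + 0.75) ≈ 0.43, and the query's reciprocal rank (which MRR averages across queries) is 1/2 = 0.50.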

Interpreting Scores

  • NDCG@10 = 0.89 — Your top-10 ranking captures 89% of an ideal ordering. Most relevant docs appear near the top.
  • Precision@10 = 0.75 — On average, 7.5 of every 10 results are relevant. Users see mostly relevant content.
  • MRR = 0.93 — On average, the first relevant result appears between position 1 and 2. Users find what they need quickly.
  • Recall@20 = 0.60 — You’re surfacing 60% of all relevant documents in the top 20. Good for discovery; may need a higher K or better embeddings for completeness.
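
To see why NDCG is position-sensitive, consider an illustrative ranking of three results where positions 1 and 3 are relevant and position 2 is not: DCG@3 = 1/log2(2) + 0/log2(3) + 1/log2(4) = 1 + 0 + 0.5 = 1.5, while the ideal ordering (both relevant documents first) gives IDCG@3 = 1/log2(2) + 1/log2(3) ≈ 1.63, so NDCG@3 = 1.5 / 1.63 ≈ 0.92. Moving the second relevant result up to position 2 raises NDCG@3 to 1.0 even though Precision@3 stays at 2/3.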

Running an Evaluation

1. Create a ground truth dataset

A dataset is a collection of queries, each paired with the document IDs that are considered relevant.
curl -X POST "$MP_API_URL/v1/retrievers/evaluations/datasets" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -d '{
    "dataset_name": "product-search-golden",
    "queries": [
      {
        "query_id": "q1",
        "query_input": {"query": "wireless earbuds"},
        "relevant_documents": ["doc_a1", "doc_a2", "doc_a3"]
      },
      {
        "query_id": "q2",
        "query_input": {"query": "noise canceling headphones"},
        "relevant_documents": ["doc_b1", "doc_b4", "doc_b7"]
      },
      {
        "query_id": "q3",
        "query_input": {"query": "running headphones waterproof"},
        "relevant_documents": ["doc_c2", "doc_c5"]
      }
    ]
  }'
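If your ground truth lives in a spreadsheet export, you can assemble the request payload instead of writing the JSON by hand. A minimal sketch using jq, assuming a tab-separated file where each line is a query followed by a comma-separated list of relevant document IDs (the file name and column layout here are illustrative):
# Build the dataset payload from a TSV of: query<TAB>doc_id1,doc_id2,...
jq -R -s '{
  dataset_name: "product-search-golden",
  queries: (
    split("\n")
    | map(select(length > 0) | split("\t"))
    | to_entries
    | map({
        query_id: ("q" + ((.key + 1) | tostring)),
        query_input: {query: .value[0]},
        relevant_documents: (.value[1] | split(","))
      })
  )
}' ground_truth.tsv > dataset_payload.json
Pass the result to the same endpoint with -d @dataset_payload.json.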
2. Run the evaluation

Execute your retriever against every query in the dataset and compute metrics.
curl -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -d '{
    "dataset_name": "product-search-golden",
    "evaluation_config": {
      "k_values": [1, 5, 10, 20],
      "metrics": ["precision", "recall", "f1", "ndcg", "map", "mrr"]
    }
  }'
This returns a task_id and evaluation_id. The evaluation runs asynchronously.
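If you script the workflow, you can capture the evaluation_id directly from the response (a sketch using jq; the field name follows the description above):
EVALUATION_ID=$(curl -s -X POST "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -H "Content-Type: application/json" \
  -d '{"dataset_name": "product-search-golden"}' | jq -r '.evaluation_id')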
3. Get results

Poll the evaluation until it completes:
curl "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations/{evaluation_id}" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE"
Response:
{
  "evaluation_id": "eval_abc123",
  "status": "completed",
  "query_count": 3,
  "overall_metrics": {
    "precision": 0.75,
    "recall": 0.82,
    "f1_score": 0.78,
    "mean_average_precision": 0.71
  },
  "metrics_by_k": {
    "1": {"precision": 0.67, "recall": 0.28, "ndcg": 0.67, "mrr": 0.67},
    "5": {"precision": 0.73, "recall": 0.65, "ndcg": 0.81, "mrr": 0.89},
    "10": {"precision": 0.75, "recall": 0.82, "ndcg": 0.89, "mrr": 0.93},
    "20": {"precision": 0.68, "recall": 0.95, "ndcg": 0.91, "mrr": 0.93}
  }
}
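When scripting, a small loop can poll this endpoint until the status reaches completed (a sketch; the five-second interval is an arbitrary choice):
# Poll until the evaluation reports status "completed"
while true; do
  STATUS=$(curl -s "$MP_API_URL/v1/retrievers/{retriever_id}/evaluations/{evaluation_id}" \
    -H "Authorization: Bearer $MP_API_KEY" \
    -H "X-Namespace: $MP_NAMESPACE" | jq -r '.status')
  [ "$STATUS" = "completed" ] && break
  sleep 5
done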

Comparing Retriever Configurations

Run the same dataset against different retrievers to compare configurations:
# Evaluate baseline (RRF fusion)
curl -X POST "$MP_API_URL/v1/retrievers/ret_baseline/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -d '{"dataset_name": "product-search-golden"}'

# Evaluate candidate (learned fusion)
curl -X POST "$MP_API_URL/v1/retrievers/ret_learned/evaluations" \
  -H "Authorization: Bearer $MP_API_KEY" \
  -H "X-Namespace: $MP_NAMESPACE" \
  -d '{"dataset_name": "product-search-golden"}'
Then compare the metrics_by_k results side by side. If the candidate shows higher NDCG@10 and MAP, it’s producing better rankings.
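One way to line the runs up is with jq, assuming each completed evaluation response was saved to a file (the file names are illustrative):
# Print NDCG@10 and MAP for each saved evaluation response
jq -r '[input_filename, .metrics_by_k["10"].ndcg, .overall_metrics.mean_average_precision] | @tsv' \
  baseline.json candidate.json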
For live traffic comparison, use Benchmarks, which replay real user sessions. Evaluations are for offline measurement with curated ground truth.

Configuration Reference

dataset_name (string, required)
Name of the ground truth dataset to evaluate against.

evaluation_config.k_values (array, default: [1, 5, 10, 20])
K values for computing Precision@K, Recall@K, NDCG@K, and F1@K. Include the K values that match your UI (e.g., if you show 10 results per page, include 10).

evaluation_config.metrics (array)
Which metrics to compute. Available: precision, recall, f1, map, ndcg, mrr.

Best Practices

  1. Use 50+ queries — Small datasets produce noisy metrics. Aim for at least 50 representative queries.
  2. Include head, torso, and tail queries — Don’t just test popular queries. Include rare, long-tail queries that are harder for the retriever.
  3. Test multiple K values — A retriever might have great Precision@5 but poor Recall@20. Multiple K values reveal the full picture.
  4. Re-evaluate after changes — Run evaluations after changing fusion strategy, adding rerank stages, or updating embeddings.
  5. Version your datasets — Keep ground truth datasets stable over time so you can track metric trends across retriever versions.

Evaluations vs Interactions vs Benchmarks

  • What: Offline measurement against curated ground truth datasets.
  • When: Before deploying changes; after updating embeddings, fusion strategy, or stages.
  • Input: Ground truth query-document pairs you create.
  • Output: NDCG, Precision, Recall, MAP, MRR, F1 at multiple K values.