Sample stage showing random and stratified sampling of results
The Sample stage selects a subset of documents from your results using random, stratified, or systematic sampling. This is useful for creating representative samples, reducing result sets, or ensuring balanced representation across categories.
Stage Category: REDUCE (Samples documents)
Transformation: N documents → S sampled documents (where S ≤ N)

When to Use

| Use Case | Description |
| --- | --- |
| Representative samples | Get a sample of large result sets |
| A/B testing | Random document selection |
| Stratified selection | Equal representation per category |
| Cost reduction | Sample before expensive operations |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Top-N by relevance | limit or rerank |
| Diversity by similarity | mmr |
| Remove duplicates | deduplicate |
| All results needed | Skip sampling |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| method | string | random | Sampling method: random, stratified, systematic |
| size | integer | 10 | Number of samples to return |
| seed | integer | random | Random seed for reproducibility |
| group_field | string | none | Field for stratified sampling |
| samples_per_group | integer | auto | Samples per stratum |

Sampling Methods

| Method | Description | Best For |
| --- | --- | --- |
| random | Uniform random selection | General sampling |
| stratified | Equal samples per group | Category balance |
| systematic | Every Nth document | Ordered data |

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "sample",
  "parameters": {
    "method": "random",
    "size": 20
  }
}

How Sampling Works

Random Sampling

Selects documents with uniform probability:
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(size=3): [D, G, B] (random selection)
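
To make "uniform probability" concrete, the sketch below draws many samples of size 3 from the same 10 documents and measures how often each document is picked. With uniform selection without replacement, each document should appear in about size / N = 3 / 10 of the samples. The function name `random_sample` and the use of Python's standard `random` module are assumptions for illustration; the stage's actual implementation is not described in this document.

```python
import random
from collections import Counter

def random_sample(docs, size, rng):
    """Pick up to `size` documents uniformly at random, without replacement."""
    return rng.sample(docs, min(size, len(docs)))

docs = list("ABCDEFGHIJ")   # the 10 input documents from the example above
rng = random.Random(0)      # fixed seed so the experiment is repeatable
trials = 10_000

hits = Counter()
for _ in range(trials):
    hits.update(random_sample(docs, 3, rng))

# Fraction of trials in which each document was selected (expected: about 0.3).
frequencies = {doc: hits[doc] / trials for doc in docs}
for doc, freq in frequencies.items():
    print(doc, round(freq, 3))
```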

Stratified Sampling

Ensures representation from each group:
Input:
  - Category A: [A1, A2, A3, A4, A5]
  - Category B: [B1, B2, B3]
  - Category C: [C1, C2]

Stratified(samples_per_group=2):
  [A1, A3, B2, B1, C1, C2]
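
A minimal sketch of the equal-allocation behavior shown above, assuming documents are grouped by a key and each group is sampled independently. The names `stratified_sample` and `group_of` are hypothetical, and taking every document from a group smaller than `samples_per_group` is an assumption consistent with Category C in the example.

```python
import random
from collections import defaultdict

def stratified_sample(docs, group_of, samples_per_group, seed=None):
    """Sample up to `samples_per_group` documents from each group."""
    rng = random.Random(seed)

    # Partition the input into strata, keeping first-seen group order.
    strata = defaultdict(list)
    for doc in docs:
        strata[group_of(doc)].append(doc)

    sampled = []
    for members in strata.values():
        # A small group contributes everything it has.
        k = min(samples_per_group, len(members))
        sampled.extend(rng.sample(members, k))
    return sampled

docs = ["A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "C1", "C2"]
result = stratified_sample(docs, group_of=lambda d: d[0], samples_per_group=2, seed=7)
print(result)  # two documents per category, six in total
```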

Systematic Sampling

Selects documents at a fixed interval (every k-th document) in input order:
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Systematic(size=3): [1, 4, 7] (every 3rd document, starting from the first)
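
The example above is consistent with a step of floor(N / size) and a fixed start at the first document. The sketch below assumes exactly that rule; the stage may compute the step or the starting offset differently (some systematic samplers pick a random start), and the name `systematic_sample` is hypothetical.

```python
def systematic_sample(docs, size):
    """Take every step-th document, starting from the first one."""
    if size <= 0 or not docs:
        return []
    # Assumed rule: step = floor(N / size), never less than 1.
    step = max(len(docs) // size, 1)
    # Slice at the fixed interval, then trim any extra trailing pick.
    return docs[::step][:size]

print(systematic_sample(list(range(1, 11)), size=3))  # [1, 4, 7]
```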

Output Schema

{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": {"input": 45, "sampled": 5},
      "clothing": {"input": 35, "sampled": 5},
      "books": {"input": 20, "sampled": 5}
    }
  }
}
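
When consuming this output, it can help to sanity-check the metadata block. The sketch below assumes that the per-stratum `sampled` counts add up to `sample_size` and that the per-stratum `input` counts add up to `total_input`, as they do in the example above. Whether these invariants always hold (for instance, when some documents lack the group field) is not stated here, and `check_sample_metadata` is a hypothetical helper.

```python
def check_sample_metadata(metadata):
    """Return a list of inconsistencies found in a sample stage metadata block."""
    problems = []
    strata = metadata.get("strata", {})

    sampled_total = sum(s["sampled"] for s in strata.values())
    if sampled_total != metadata["sample_size"]:
        problems.append(f"strata sampled sum {sampled_total} != sample_size")

    input_total = sum(s["input"] for s in strata.values())
    if input_total != metadata["total_input"]:
        problems.append(f"strata input sum {input_total} != total_input")

    # No stratum can yield more documents than it received.
    for name, s in strata.items():
        if s["sampled"] > s["input"]:
            problems.append(f"stratum {name!r} sampled more than its input")
    return problems

# The metadata block from the example output above.
metadata = {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
        "electronics": {"input": 45, "sampled": 5},
        "clothing": {"input": 35, "sampled": 5},
        "books": {"input": 20, "sampled": 5},
    },
}
print(check_sample_metadata(metadata))  # [] (no inconsistencies)
```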

Performance

| Metric | Value |
| --- | --- |
| Latency | < 5ms |
| Memory | O(N) |
| Cost | Free |
| Complexity | O(N) random, O(N log N) stratified |

Common Pipeline Patterns

Search + Sample for Testing

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 1000
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "random",
      "size": 50,
      "seed": 42
    }
  }
]

Balanced Category Sample

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 500
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "metadata.category",
      "samples_per_group": 5
    }
  }
]

Sample Before LLM Processing

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 50
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "random",
      "size": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Extract key insights",
      "output_field": "insights"
    }
  }
]

Cluster + Sample Representatives

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "cluster.cluster_id",
      "samples_per_group": 2
    }
  }
]

Multi-Source Balanced Sample

[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "metadata.source",
      "samples_per_group": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Compare perspectives from different sources on this topic"
    }
  }
]

Stratified Sampling Details

Equal Allocation

{
  "method": "stratified",
  "group_field": "metadata.category",
  "samples_per_group": 5
}
Each group contributes 5 samples, or all of its documents if it has fewer than 5.

Proportional Allocation

{
  "method": "stratified",
  "group_field": "metadata.category",
  "size": 30,
  "allocation": "proportional"
}
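The 30 samples are divided among the groups in proportion to each group's share of the input, so larger groups contribute more samples.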
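
As a worked illustration of proportional allocation, the sketch below splits a total of 30 samples across the group sizes used in the Output Schema example (45, 35, and 20 documents). Proportional shares are rarely whole numbers, so some rounding rule is needed. This sketch uses the largest-remainder method, which is an assumption; the document does not say how the stage rounds. The name `proportional_allocation` is hypothetical.

```python
import math

def proportional_allocation(group_sizes, size):
    """Split `size` samples across groups in proportion to group size."""
    total = sum(group_sizes.values())
    if total == 0 or size <= 0:
        return {group: 0 for group in group_sizes}
    size = min(size, total)  # cannot sample more than the input holds

    # Exact (fractional) share for each group.
    quotas = {g: size * n / total for g, n in group_sizes.items()}
    # Start from the rounded-down shares.
    alloc = {g: math.floor(q) for g, q in quotas.items()}

    # Hand out the leftover samples to the largest fractional remainders.
    leftover = size - sum(alloc.values())
    by_remainder = sorted(group_sizes, key=lambda g: quotas[g] - alloc[g], reverse=True)
    for g in by_remainder[:leftover]:
        alloc[g] += 1
    return alloc

counts = {"electronics": 45, "clothing": 35, "books": 20}
allocation = proportional_allocation(counts, size=30)
print(allocation)  # {'electronics': 14, 'clothing': 10, 'books': 6}
```

The exact shares are 13.5, 10.5, and 6.0. The one leftover sample goes to the first group among those tied on the largest remainder, which is electronics here.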

Reproducibility

Use seed for reproducible results:
{
  "method": "random",
  "size": 20,
  "seed": 12345
}
Same seed + same input = same output.
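
"Same input" typically includes the order of the documents, since a seeded generator picks positions rather than identities. The sketch below shows the guarantee with Python's standard `random` module, and one way to make a sample independent of arrival order by sorting on a stable key first. Whether the stage normalizes order internally is not stated here, and `seeded_sample` is a hypothetical helper.

```python
import random

def seeded_sample(docs, size, seed):
    # A fresh generator per call keeps the result independent of global random state.
    return random.Random(seed).sample(docs, min(size, len(docs)))

docs = [f"doc_{i:03d}" for i in range(100)]

run_1 = seeded_sample(docs, size=20, seed=12345)
run_2 = seeded_sample(docs, size=20, seed=12345)
print(run_1 == run_2)  # True: same seed and same input give the same sample

# The same documents arriving in a different order count as a different input.
reordered = docs[::-1]

# Sorting on a stable key before sampling removes the dependence on arrival order.
stable_1 = seeded_sample(sorted(docs), 20, 12345)
stable_2 = seeded_sample(sorted(reordered), 20, 12345)
print(stable_1 == stable_2)  # True
```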

Error Handling

| Error | Behavior |
| --- | --- |
| size > input | Return all documents |
| Empty stratum | Skip that stratum |
| Invalid group_field | Fall back to random |
| size = 0 | Return empty |
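
The fallback rules in the table can be sketched as a single function. This is an illustration under several assumptions: an "invalid" group_field is taken to mean that the dotted path resolves on no document, documents missing the field are left out of stratified results, and the size = 0 check applies regardless of method. The names `sample_stage` and `get_path` are hypothetical. The empty-stratum row is not modeled, because grouping by observed values never produces an empty group.

```python
import random

def get_path(doc, dotted):
    """Resolve a dotted path such as 'metadata.category'; None if any key is missing."""
    current = doc
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

def sample_stage(docs, size=10, group_field=None, samples_per_group=1, seed=None):
    rng = random.Random(seed)
    if size == 0:
        return []                      # size = 0: return empty

    if group_field is not None:
        strata = {}
        for doc in docs:
            value = get_path(doc, group_field)
            if value is not None:
                strata.setdefault(value, []).append(doc)
        if strata:                     # the field resolved on at least one document
            out = []
            for members in strata.values():
                out.extend(rng.sample(members, min(samples_per_group, len(members))))
            return out
        # Invalid group_field: fall through to random sampling.

    if size >= len(docs):
        return list(docs)              # size > input: return all documents
    return rng.sample(docs, size)

docs = [
    {"document_id": f"doc_{i}", "metadata": {"category": "a" if i % 2 else "b"}}
    for i in range(6)
]
print(len(sample_stage(docs, size=100)))                                  # 6
print(len(sample_stage(docs, size=2, group_field="metadata.missing")))    # 2
```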

Sample vs Other Reduction Stages

| Stage | Selection Basis | Deterministic |
| --- | --- | --- |
| sample | Random/Stratified | With seed |
| limit | Position | Yes |
| mmr | Diversity + relevance | Yes |
| deduplicate | Uniqueness | Yes |