The Sample stage selects a subset of documents from your results using random, stratified, or systematic sampling. This is useful for creating representative samples, reducing result sets, or ensuring diversity across categories.
Stage Category: REDUCE (Samples documents)
Transformation: N documents → S sampled documents (where S ≤ N)
When to Use
| Use Case | Description |
| --- | --- |
| Representative samples | Get a sample of large result sets |
| A/B testing | Random document selection |
| Stratified selection | Equal representation per category |
| Cost reduction | Sample before expensive operations |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Top-N by relevance | limit or rerank |
| Diversity by similarity | mmr |
| Remove duplicates | deduplicate |
| All results needed | Skip sampling |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| method | string | random | Sampling method: random, stratified, systematic |
| size | integer | 10 | Number of samples to return |
| seed | integer | random | Random seed for reproducibility |
| group_field | string | none | Field for stratified sampling |
| samples_per_group | integer | auto | Samples per stratum |
Sampling Methods
| Method | Description | Best For |
| --- | --- | --- |
| random | Uniform random selection | General sampling |
| stratified | Equal samples per group | Category balance |
| systematic | Every Nth document | Ordered data |
Configuration Examples
Random Sample
{
  "stage_type": "reduce",
  "stage_id": "sample",
  "parameters": {
    "method": "random",
    "size": 20
  }
}
How Sampling Works
Random Sampling
Selects documents with uniform probability:
Input: [A, B, C, D, E, F, G, H, I, J] (10 docs)
Sample(size=3): [D, G, B] (random selection)
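The uniform selection above can be sketched in a few lines of Python. This is an illustration only, not the stage's actual implementation: the function name `random_sample` is hypothetical, and sampling without replacement is an assumption the source does not state explicitly.

```python
import random
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def random_sample(docs: Sequence[T], size: int, seed: Optional[int] = None) -> list[T]:
    """Pick `size` documents uniformly at random, without replacement (assumed)."""
    rng = random.Random(seed)   # private generator; seed=None gives a fresh, non-reproducible draw
    k = min(size, len(docs))    # an oversize request returns every document
    return rng.sample(list(docs), k)


docs = list("ABCDEFGHIJ")       # the 10 documents from the example above
picked = random_sample(docs, size=3, seed=42)
print(picked)                   # three distinct letters; which ones depends on the seed
```

Using a private `random.Random` instance rather than the module-level functions keeps the draw independent of any other code that touches the global random state.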
Stratified Sampling
Ensures representation from each group:
Input:
- Category A: [A1, A2, A3, A4, A5]
- Category B: [B1, B2, B3]
- Category C: [C1, C2]
Stratified(samples_per_group=2):
[A1, A3, B2, B1, C1, C2]
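A minimal Python sketch of the grouping step, assuming each document is a dict with a flat `category` key (the real stage accepts dotted paths such as `metadata.category`, which this sketch does not resolve). The name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Optional


def stratified_sample(
    docs: list[dict[str, Any]],
    group_field: str,
    samples_per_group: int,
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    """Draw up to `samples_per_group` documents from each distinct group value."""
    rng = random.Random(seed)
    strata: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    for doc in docs:
        strata[doc[group_field]].append(doc)    # bucket documents by group value

    sampled: list[dict[str, Any]] = []
    for members in strata.values():
        k = min(samples_per_group, len(members))  # small groups contribute all they have
        sampled.extend(rng.sample(members, k))
    return sampled


# Same shape as the example above: 5 docs in A, 3 in B, 2 in C.
docs = (
    [{"id": f"A{i}", "category": "A"} for i in range(1, 6)]
    + [{"id": f"B{i}", "category": "B"} for i in range(1, 4)]
    + [{"id": f"C{i}", "category": "C"} for i in range(1, 3)]
)
result = stratified_sample(docs, "category", samples_per_group=2, seed=7)
print([d["id"] for d in result])  # two ids per category, six in total
```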
Systematic Sampling
Selects every Nth document:
Input: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Systematic(size=3): [1, 4, 7] (every 3rd, starting from 1)
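One way to reproduce the example above in Python, assuming the step is the integer quotient `N // size` and selection starts at the first document. The source does not spell out the step formula, so treat both choices as assumptions; `systematic_sample` is a hypothetical name.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def systematic_sample(docs: Sequence[T], size: int) -> list[T]:
    """Take every `step`-th document, starting from the first, up to `size` items."""
    if size <= 0 or not docs:
        return []
    step = max(len(docs) // size, 1)   # assumed step: N // size, never below 1
    return list(docs[::step][:size])   # slice by step, then trim to the requested size


docs = list(range(1, 11))
print(systematic_sample(docs, size=3))  # [1, 4, 7]
```

Because the selection depends only on position, the result is only as meaningful as the input order, which is why the method suits ordered data.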
Output Schema
{
  "documents": [
    {
      "document_id": "doc_123",
      "content": "Sampled document content...",
      "score": 0.85,
      "sample": {
        "method": "stratified",
        "stratum": "electronics",
        "sample_index": 0
      }
    }
  ],
  "metadata": {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
      "electronics": { "input": 45, "sampled": 5 },
      "clothing": { "input": 35, "sampled": 5 },
      "books": { "input": 20, "sampled": 5 }
    }
  }
}
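The `metadata.strata` block makes it easy to check how heavily each group was sampled. The sketch below computes the per-stratum sampling rate from the metadata shown above; `summarize_strata` is a hypothetical helper, not part of any client library.

```python
from typing import Any


def summarize_strata(metadata: dict[str, Any]) -> dict[str, float]:
    """Return sampled / input for each stratum, skipping strata with no input."""
    return {
        name: counts["sampled"] / counts["input"]
        for name, counts in metadata.get("strata", {}).items()
        if counts["input"]
    }


# The metadata object from the output schema above.
metadata = {
    "method": "stratified",
    "total_input": 100,
    "sample_size": 15,
    "strata": {
        "electronics": {"input": 45, "sampled": 5},
        "clothing": {"input": 35, "sampled": 5},
        "books": {"input": 20, "sampled": 5},
    },
}

rates = summarize_strata(metadata)
for name, rate in rates.items():
    print(f"{name}: {rate:.1%} of input sampled")

# The per-stratum counts should add up to the reported sample size.
total_sampled = sum(c["sampled"] for c in metadata["strata"].values())
print(total_sampled == metadata["sample_size"])  # True
```

With equal allocation, smaller strata are sampled at a higher rate (here books at 25% versus electronics at about 11%), which is worth keeping in mind when interpreting aggregate statistics over the sample.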
Performance
| Metric | Value |
| --- | --- |
| Latency | < 5ms |
| Memory | O(N) |
| Cost | Free |
| Complexity | O(N) random, O(N log N) stratified |
Common Pipeline Patterns
Search + Sample for Testing
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 1000
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "random",
      "size": 50,
      "seed": 42
    }
  }
]
Balanced Category Sample
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 500
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "metadata.category",
      "samples_per_group": 5
    }
  }
]
Sample Before LLM Processing
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 50
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "random",
      "size": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Extract key insights",
      "output_field": "insights"
    }
  }
]
Cluster + Sample Representatives
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "cluster",
    "parameters": {
      "num_clusters": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "cluster.cluster_id",
      "samples_per_group": 2
    }
  }
]
Multi-Source Balanced Sample
[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "sample",
    "parameters": {
      "method": "stratified",
      "group_field": "metadata.source",
      "samples_per_group": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "summarize",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Compare perspectives from different sources on this topic"
    }
  }
]
Stratified Sampling Details
Equal Allocation
{
  "method": "stratified",
  "group_field": "metadata.category",
  "samples_per_group": 5
}
Each group contributes 5 samples, or all of its documents if it has fewer than 5.
Proportional Allocation
{
  "method": "stratified",
  "group_field": "metadata.category",
  "size": 30,
  "allocation": "proportional"
}
Each group is sampled in proportion to its share of the input, with size setting the total number of documents returned.
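To make the arithmetic concrete, here is a sketch of how a total of `size` samples could be split across groups in proportion to their sizes. The source does not say how fractional quotas are rounded; this example assumes the largest-remainder method, and `proportional_allocation` is a hypothetical helper.

```python
def proportional_allocation(group_sizes: dict[str, int], size: int) -> dict[str, int]:
    """Split `size` samples across groups in proportion to each group's size.

    Rounding uses the largest-remainder method (an assumption, not documented
    behaviour): floor every quota, then give leftover slots to the groups with
    the largest fractional parts.
    """
    total = sum(group_sizes.values())
    if total == 0 or size <= 0:
        return {group: 0 for group in group_sizes}
    size = min(size, total)  # cannot sample more documents than exist

    # Integer arithmetic avoids floating-point rounding surprises.
    parts = {g: divmod(n * size, total) for g, n in group_sizes.items()}
    alloc = {g: quotient for g, (quotient, _) in parts.items()}
    leftover = size - sum(alloc.values())

    by_remainder = sorted(parts, key=lambda g: parts[g][1], reverse=True)
    for group in by_remainder[:leftover]:
        alloc[group] += 1
    return alloc


# Group sizes borrowed from the Output Schema example: 45 / 35 / 20 documents.
allocation = proportional_allocation({"electronics": 45, "clothing": 35, "books": 20}, size=30)
print(allocation)  # {'electronics': 14, 'clothing': 10, 'books': 6}
```

The exact quotas here are 13.5, 10.5, and 6.0; the single leftover slot goes to the first group among those tied on remainder, so a different tie-breaking rule could give clothing 11 instead.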
Reproducibility
Use seed for reproducible results:
{
  "method": "random",
  "size": 20,
  "seed": 12345
}
Same seed + same input = same output.
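The "same input" half of that rule matters as much as the seed. The sketch below, which uses Python's standard `random` module as a stand-in for the stage's generator (an assumption), shows that a fixed seed reproduces the sample only when the input arrives in the same order.

```python
import random


def sample_ids(ids: list[str], size: int, seed: int) -> list[str]:
    """Draw `size` ids with a generator seeded by `seed`."""
    return random.Random(seed).sample(ids, size)


ids = [f"doc_{i:03d}" for i in range(100)]

first = sample_ids(ids, size=20, seed=12345)
second = sample_ids(ids, size=20, seed=12345)
reordered = sample_ids(list(reversed(ids)), size=20, seed=12345)

print(first == second)     # True: same seed, same input order
print(first == reordered)  # False: same seed, but the input order changed

# Sorting on a stable key before sampling restores reproducibility.
print(sample_ids(sorted(reversed(ids)), size=20, seed=12345) == first)  # True
```

If an upstream stage can return documents in a varying order, sorting on a stable key such as the document ID before the sample stage is one way to keep seeded runs comparable.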
Error Handling
| Error | Behavior |
| --- | --- |
| size > input | Return all documents |
| Empty stratum | Skip that stratum |
| Invalid group_field | Fall back to random |
| size = 0 | Return empty |
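The fallback rules above can be expressed as a small Python function. This is a sketch under several assumptions: documents are flat dicts, an "invalid" `group_field` means no document carries that key, documents missing the key are left out of the stratified draw, and `samples_per_group=1` stands in for the documented `auto` default, whose behaviour the source does not describe. The name `sample_with_fallbacks` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Any, Optional


def sample_with_fallbacks(
    docs: list[dict[str, Any]],
    size: int = 10,
    group_field: Optional[str] = None,
    samples_per_group: int = 1,   # placeholder for "auto" (assumption)
    seed: Optional[int] = None,
) -> list[dict[str, Any]]:
    rng = random.Random(seed)

    if size == 0:                 # size = 0 -> return empty
        return []

    strata: dict[Any, list[dict[str, Any]]] = defaultdict(list)
    if group_field is not None:
        for doc in docs:
            if group_field in doc:
                strata[doc[group_field]].append(doc)

    if strata:
        picked: list[dict[str, Any]] = []
        # Only groups with at least one document exist here,
        # so empty strata are skipped by construction.
        for members in strata.values():
            picked.extend(rng.sample(members, min(samples_per_group, len(members))))
        return picked

    # No group_field, or an invalid one -> fall back to random sampling.
    if size >= len(docs):         # size > input -> return all documents
        return list(docs)
    return rng.sample(docs, size)


docs = [{"id": i, "category": "a" if i < 3 else "b"} for i in range(5)]
print(len(sample_with_fallbacks(docs, size=0)))                                # 0
print(len(sample_with_fallbacks(docs, size=99)))                               # 5
print(len(sample_with_fallbacks(docs, size=2, group_field="no_such_field")))   # 2
```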
Sample vs Other Reduction Stages
| Stage | Selection Basis | Deterministic |
| --- | --- | --- |
| sample | Random/Stratified | With seed |
| limit | Position | Yes |
| mmr | Diversity + relevance | Yes |
| deduplicate | Uniqueness | Yes |