Group By stage showing document aggregation by field values
The Group By stage aggregates documents that share the same value for a specified field, creating logical groups. This is useful for organizing results by category, author, date, or any other attribute.
Stage Category: REDUCE (Groups documents)
Transformation: N documents → G groups (where G = number of unique field values)
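
As a rough illustration of the N documents → G groups transformation, the sketch below groups a list of plain dictionaries by a dotted field path. The helper names `get_path` and `group_by` are hypothetical and are not part of the stage's API; the real stage runs inside the pipeline and may behave differently in its details.

```python
from collections import defaultdict

def get_path(doc, path):
    """Resolve a dotted path such as 'metadata.category'; return None if absent."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def group_by(docs, field):
    """Collect documents that share the same value for `field` (hypothetical helper)."""
    groups = defaultdict(list)
    for doc in docs:
        groups[get_path(doc, field)].append(doc)
    return dict(groups)

# Made-up sample documents for illustration only.
docs = [
    {"document_id": "doc_1", "metadata": {"category": "electronics"}},
    {"document_id": "doc_2", "metadata": {"category": "clothing"}},
    {"document_id": "doc_3", "metadata": {"category": "electronics"}},
]
groups = group_by(docs, "metadata.category")

# 3 documents (N) collapse into 2 groups (G).
print({key: len(members) for key, members in groups.items()})
```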

When to Use

| Use Case | Description |
| --- | --- |
| Category grouping | Group products by category |
| Author aggregation | Group articles by author |
| Date grouping | Group by day/month/year |
| Source organization | Group by data source |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Semantic similarity grouping | cluster |
| Statistical aggregations only | aggregate |
| Removing duplicates | deduplicate |
| Top-N per group | Use with sample |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| field | string | Required | Field to group by |
| max_groups | integer | 100 | Maximum number of groups |
| sort_groups_by | string | count | Sort groups: count, field, score |
| sort_order | string | desc | Group sort order: asc, desc |
| docs_per_group | integer | all | Limit documents per group |
| sort_docs_by | string | score | Sort docs within group |
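
To make the two limiting parameters concrete, here is a minimal sketch of how `max_groups` and `docs_per_group` could be applied to groups that have already been sorted. The function `apply_limits` is hypothetical, and two details are assumptions rather than documented behavior: that truncation happens after sorting, and that `count` keeps reporting the full group size after documents are trimmed.

```python
def apply_limits(groups, max_groups=100, docs_per_group=None):
    """Truncate an already-sorted list of groups (hypothetical helper).

    Each group is a dict with "key", "count", and "documents".
    docs_per_group=None stands in for the documented default of "all".
    """
    limited = []
    for group in groups[:max_groups]:
        documents = group["documents"]
        if docs_per_group is not None:
            documents = documents[:docs_per_group]
        # Assumption: "count" still reports the untrimmed group size.
        limited.append({**group, "documents": documents})
    return limited

# Made-up groups, sorted by count descending.
groups = [
    {"key": "electronics", "count": 3, "documents": ["doc_1", "doc_2", "doc_3"]},
    {"key": "clothing", "count": 2, "documents": ["doc_4", "doc_5"]},
    {"key": "books", "count": 1, "documents": ["doc_6"]},
]
limited = apply_limits(groups, max_groups=2, docs_per_group=1)
print([(g["key"], g["count"], g["documents"]) for g in limited])
```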

Configuration Examples

{
  "stage_type": "reduce",
  "stage_id": "group_by",
  "parameters": {
    "field": "metadata.category"
  }
}

Output Schema

{
  "groups": [
    {
      "key": "electronics",
      "count": 25,
      "documents": [
        {
          "document_id": "doc_123",
          "content": "Latest smartphone review...",
          "score": 0.95,
          "metadata": {"category": "electronics", "price": 999}
        },
        {
          "document_id": "doc_456",
          "content": "Laptop comparison guide...",
          "score": 0.89,
          "metadata": {"category": "electronics", "price": 1299}
        }
      ]
    },
    {
      "key": "clothing",
      "count": 18,
      "documents": [...]
    }
  ],
  "metadata": {
    "total_groups": 5,
    "total_documents": 100,
    "field": "metadata.category"
  }
}
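
A client consuming this output might walk the `groups` array as sketched below. The `result` dictionary is a trimmed, made-up response that follows the schema above, and `summarize` is a hypothetical helper, not something the stage provides.

```python
# Made-up response shaped like the output schema above.
result = {
    "groups": [
        {"key": "electronics", "count": 2, "documents": [
            {"document_id": "doc_123", "score": 0.95},
            {"document_id": "doc_456", "score": 0.89},
        ]},
        {"key": "clothing", "count": 1, "documents": [
            {"document_id": "doc_789", "score": 0.72},
        ]},
    ],
    "metadata": {"total_groups": 2, "total_documents": 3, "field": "metadata.category"},
}

def summarize(result):
    """Return (key, count, best-scoring document_id) for each group."""
    rows = []
    for group in result["groups"]:
        best = max(group["documents"], key=lambda d: d["score"], default=None)
        rows.append((group["key"], group["count"], best["document_id"] if best else None))
    return rows

summary = summarize(result)
for key, count, best_id in summary:
    print(f"{key}: {count} documents, top result {best_id}")
```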

Performance

| Metric | Value |
| --- | --- |
| Latency | 5-20 ms |
| Memory | O(N) |
| Cost | Free |
| Scalability | Efficient |

Common Pipeline Patterns

Search + Group by Category

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.category",
      "docs_per_group": 5
    }
  }
]

Grouped Results with Aggregations

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 200
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.brand",
      "sort_groups_by": "count",
      "max_groups": 10
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "aggregate",
    "parameters": {
      "aggregations": [
        {"type": "avg", "field": "metadata.price", "name": "avg_price"},
        {"type": "avg", "field": "metadata.rating", "name": "avg_rating"}
      ],
      "group_by": "metadata.brand"
    }
  }
]

Enrich + Group by Author

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 100
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "document_enrich",
    "parameters": {
      "collection_id": "authors",
      "lookup_field": "author_id",
      "source_field": "metadata.author_id",
      "result_field": "author"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "group_by",
    "parameters": {
      "field": "author.name",
      "docs_per_group": 3,
      "sort_groups_by": "count"
    }
  }
]

Time-Based Grouping

[
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.date",
        "operator": "gte",
        "value": "2024-01-01"
      }
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "code_execution",
    "parameters": {
      "code": "def transform(doc):\n    metadata = doc.setdefault('metadata', {})\n    date = metadata.get('date') or ''\n    metadata['month'] = date[:7]  # YYYY-MM\n    return doc"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "group_by",
    "parameters": {
      "field": "metadata.month",
      "sort_groups_by": "field",
      "sort_order": "desc"
    }
  }
]
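
The month-extraction step in this pattern can be tried locally before it is embedded as a string in the pipeline. The sketch below is a standalone, defensively written variant of that `transform` function; the sample documents are made up, and how the `code_execution` stage actually invokes the function is not shown here.

```python
def transform(doc):
    """Derive a YYYY-MM bucket from metadata.date, tolerating missing values."""
    metadata = doc.setdefault("metadata", {})
    date = metadata.get("date") or ""
    metadata["month"] = date[:7]  # "2024-03-15" -> "2024-03"
    return doc

# Made-up sample documents; the last one has no metadata at all.
docs = [
    {"metadata": {"date": "2024-03-15"}},
    {"metadata": {"date": "2024-03-02"}},
    {"metadata": {"date": "2024-01-20"}},
    {},
]
months = [transform(doc)["metadata"]["month"] for doc in docs]
print(months)  # ['2024-03', '2024-03', '2024-01', '']
```

Because ISO-style YYYY-MM strings sort lexicographically in chronological order, sorting groups by field value yields a chronological ordering. Note that in this sketch a document without a date ends up with an empty-string month rather than a missing field.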

Group Sorting Options

By Count (default)

{"sort_groups_by": "count", "sort_order": "desc"}
Groups with most documents first.

By Field Value

{"sort_groups_by": "field", "sort_order": "asc"}
Alphabetical or chronological ordering.

By Best Score

{"sort_groups_by": "score", "sort_order": "desc"}
Groups containing highest-scoring documents first.
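
The three sort options can be thought of as three different sort keys over the same list of groups. The sketch below is one plausible reading, not the stage's actual implementation: `sort_groups` is a hypothetical helper, and treating `score` as "best document score in the group" is an interpretation of the description above.

```python
def sort_groups(groups, sort_groups_by="count", sort_order="desc"):
    """Order groups by document count, key value, or best document score."""
    key_funcs = {
        "count": lambda g: g["count"],
        "field": lambda g: g["key"],
        # Assumption: a group's score is that of its highest-scoring document.
        "score": lambda g: max((d["score"] for d in g["documents"]), default=0.0),
    }
    if sort_groups_by not in key_funcs:
        raise ValueError(f"unsupported sort_groups_by: {sort_groups_by!r}")
    return sorted(groups, key=key_funcs[sort_groups_by], reverse=(sort_order == "desc"))

# Made-up groups for illustration.
groups = [
    {"key": "electronics", "count": 3,
     "documents": [{"score": 0.7}, {"score": 0.6}, {"score": 0.5}]},
    {"key": "clothing", "count": 2,
     "documents": [{"score": 0.95}, {"score": 0.4}]},
    {"key": "books", "count": 1,
     "documents": [{"score": 0.8}]},
]

by_count = [g["key"] for g in sort_groups(groups, "count", "desc")]
by_field = [g["key"] for g in sort_groups(groups, "field", "asc")]
by_score = [g["key"] for g in sort_groups(groups, "score", "desc")]
print(by_count, by_field, by_score)
```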

Document Sorting Within Groups

| Sort By | Description |
| --- | --- |
| score | Relevance score (default) |
| metadata.date | Any metadata field |
| _random | Random order |
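
For intuition, ordering documents inside a single group might look like the sketch below. `sort_docs` and `resolve` are hypothetical helpers. The sketch assumes that scores sort highest first, that metadata fields sort ascending, and that every document carries the sort field; none of these are confirmed by the parameter table.

```python
import random

def resolve(doc, path):
    """Follow a dotted path such as 'metadata.date' (assumes the path exists)."""
    value = doc
    for part in path.split("."):
        value = value[part]
    return value

def sort_docs(docs, sort_docs_by="score", rng=random):
    """Order the documents of one group by score, a metadata field, or randomly."""
    if sort_docs_by == "_random":
        shuffled = list(docs)
        rng.shuffle(shuffled)
        return shuffled
    if sort_docs_by == "score":
        return sorted(docs, key=lambda d: d["score"], reverse=True)
    # Assumption: any other value is a dotted field path, sorted ascending.
    return sorted(docs, key=lambda d: resolve(d, sort_docs_by))

# Made-up documents belonging to one group.
docs = [
    {"document_id": "a", "score": 0.5, "metadata": {"date": "2024-01-05"}},
    {"document_id": "b", "score": 0.9, "metadata": {"date": "2024-03-01"}},
    {"document_id": "c", "score": 0.7, "metadata": {"date": "2024-02-10"}},
]
by_score = [d["document_id"] for d in sort_docs(docs, "score")]
by_date = [d["document_id"] for d in sort_docs(docs, "metadata.date")]
shuffled = [d["document_id"] for d in sort_docs(docs, "_random", rng=random.Random(0))]
print(by_score, by_date)
```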

Handling Missing Values

| Behavior | Description |
| --- | --- |
| null key | Documents with missing field grouped as "null" |
| Exclude | Set exclude_null: true to skip |

{
  "stage_type": "reduce",
  "stage_id": "group_by",
  "parameters": {
    "field": "metadata.category",
    "exclude_null": true
  }
}
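
The difference between the two behaviors can be sketched in a few lines. `group_ids` is a hypothetical helper that mimics the described handling: documents whose field is missing land under a "null" key unless `exclude_null` is set. Treating an explicit null value the same as a missing field is an assumption.

```python
def group_ids(docs, field, exclude_null=False):
    """Group document IDs by a dotted field, with optional null exclusion."""
    groups = {}
    for doc in docs:
        value = doc
        for part in field.split("."):
            value = value.get(part) if isinstance(value, dict) else None
        if value is None:
            if exclude_null:
                continue  # skip documents that lack the field
            value = "null"  # collect them under a shared "null" key
        groups.setdefault(value, []).append(doc["document_id"])
    return groups

# Made-up documents: "b" has empty metadata, "c" has none.
docs = [
    {"document_id": "a", "metadata": {"category": "electronics"}},
    {"document_id": "b", "metadata": {}},
    {"document_id": "c"},
]
with_nulls = group_ids(docs, "metadata.category")
without_nulls = group_ids(docs, "metadata.category", exclude_null=True)
print(with_nulls)
print(without_nulls)
```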

Error Handling

| Error | Behavior |
| --- | --- |
| Missing field | Group as "null" or exclude |
| Too many groups | Truncate to max_groups |
| Empty results | Return empty groups array |
| Invalid field path | Stage fails |

Group By vs Cluster

| Aspect | Group By | Cluster |
| --- | --- | --- |
| Grouping basis | Field value | Embedding similarity |
| Groups known | Yes (field values) | No (discovered) |
| Speed | Fast | Slower |
| Use case | Category organization | Theme discovery |