[Diagram: LLM Enrichment stage showing structured data extraction with language models]
The LLM Enrichment stage uses language models to extract structured data from document content. It can identify entities, classify content, extract key information, and generate structured outputs.
Stage Category: APPLY (Enriches documents)
Transformation: N documents → N documents (with extracted data added)

When to Use

| Use Case | Description |
| --- | --- |
| Entity extraction | Extract names, dates, amounts from text |
| Content classification | Categorize documents by topic/type |
| Key information extraction | Pull specific facts from unstructured text |
| Structured output generation | Convert prose to structured data |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Simple field transformation | json_transform |
| Predefined taxonomy classification | taxonomy_enrich |
| Large-scale processing | Pre-process during indexing |
| Real-time low-latency | Use cached extractions |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | Required | LLM model to use |
| prompt | string | Required | Extraction instructions |
| content_field | string | content | Field to analyze |
| output_field | string | extracted | Field for extracted data |
| output_schema | object | null | JSON schema for structured output |
| batch_size | integer | 5 | Documents per LLM call |
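
A sketch with the optional parameters set explicitly. The prompt, the body_text source field, the invoice_fields output field, and the batch size of 10 are illustrative assumptions, not values required or recommended by the stage.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract the invoice number, total amount, and due date.",
    "content_field": "body_text",
    "output_field": "invoice_fields",
    "batch_size": 10
  }
}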

Available Models

| Model | Speed | Quality | Best For |
| --- | --- | --- | --- |
| gpt-4o-mini | Fast | Good | Simple extractions |
| gpt-4o | Medium | Excellent | Complex analysis |
| claude-3-haiku | Fast | Good | Quick processing |
| claude-3-sonnet | Medium | Excellent | Nuanced extraction |

Configuration Examples

{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract all company names and person names mentioned in this document.",
    "output_field": "entities"
  }
}
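
A second sketch using output_schema for structured output. The prompt wording is an assumption; the schema and output_field match the review analysis shown under Output Examples below.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Analyze this product review: extract the sentiment, any rating mentioned, pros, cons, and whether the reviewer would recommend the product.",
    "output_field": "review_analysis",
    "output_schema": {
      "type": "object",
      "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "rating_mentioned": {"type": "number"},
        "pros": {"type": "array", "items": {"type": "string"}},
        "cons": {"type": "array", "items": {"type": "string"}},
        "would_recommend": {"type": "boolean"}
      },
      "required": ["sentiment"]
    }
  }
}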

Output Schema

Define structured output using JSON Schema:
{
  "output_schema": {
    "type": "object",
    "properties": {
      "field_name": {"type": "string"},
      "numeric_field": {"type": "number"},
      "boolean_field": {"type": "boolean"},
      "array_field": {"type": "array", "items": {"type": "string"}},
      "enum_field": {"type": "string", "enum": ["option1", "option2", "option3"]}
    },
    "required": ["field_name"]
  }
}

Supported Types

| Type | Description |
| --- | --- |
| string | Text values |
| number | Numeric values |
| boolean | True/false |
| array | Lists of items |
| object | Nested objects |
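
Types can be nested. The sketch below uses hypothetical field names to show an object-valued property and an array of objects:
{
  "output_schema": {
    "type": "object",
    "properties": {
      "author": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "affiliation": {"type": "string"}
        }
      },
      "citations": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "title": {"type": "string"},
            "year": {"type": "number"}
          }
        }
      }
    }
  }
}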

Output Examples

Without Schema

{
  "document_id": "doc_123",
  "content": "Apple Inc. announced...",
  "entities": "Companies: Apple Inc., Microsoft\nPeople: Tim Cook, Satya Nadella"
}

With Schema

{
  "document_id": "doc_123",
  "content": "Great product, 5 stars!...",
  "review_analysis": {
    "sentiment": "positive",
    "rating_mentioned": 5,
    "pros": ["easy to use", "great value", "fast shipping"],
    "cons": ["packaging could be better"],
    "would_recommend": true
  }
}

Performance

| Metric | Value |
| --- | --- |
| Latency | 300-800ms per batch |
| Batch size | 5 documents default |
| Token usage | ~200 tokens per document |
| Parallelism | Batches processed concurrently |
LLM enrichment is expensive. Consider pre-computing extractions during indexing for frequently accessed data.
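
Where higher per-batch latency is acceptable, raising batch_size reduces the number of LLM calls made for a given result set. A hedged sketch; the batch size of 20 is an assumption, not a tested recommendation.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract the main topic.",
    "output_field": "topic",
    "batch_size": 20
  }
}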

Common Pipeline Patterns

Search + Extract + Filter

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Extract the main topic and sentiment.",
      "output_field": "analysis",
      "output_schema": {
        "type": "object",
        "properties": {
          "topic": {"type": "string"},
          "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]}
        }
      }
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "analysis.sentiment",
        "operator": "eq",
        "value": "positive"
      }
    }
  }
]

Entity Extraction Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrichment",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Extract all entities with their types and relationships.",
      "output_field": "entities",
      "output_schema": {
        "type": "object",
        "properties": {
          "people": {"type": "array", "items": {"type": "object", "properties": {"name": {"type": "string"}, "role": {"type": "string"}}}},
          "organizations": {"type": "array", "items": {"type": "string"}},
          "locations": {"type": "array", "items": {"type": "string"}},
          "dates": {"type": "array", "items": {"type": "string"}}
        }
      }
    }
  }
]

Writing Effective Prompts

Good Prompts

✓ "Extract the product name, price, and key features from this product listing."
✓ "Identify all dates mentioned and their associated events."
✓ "Classify the sentiment as positive, neutral, or negative, and explain why."

Poor Prompts

✗ "Analyze this document" (too vague)
✗ "Get the data" (not specific)
✗ "Tell me about it" (unclear output)
Be specific about what to extract and in what format. When output_schema is provided, the model is instructed to match that structure (see Error Handling below for what happens when validation fails).
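
For example, the sentiment prompt above pairs naturally with an enum-constrained schema. The output_field and property names below are illustrative, not required by the stage.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Classify the sentiment as positive, neutral, or negative, and explain why.",
    "output_field": "sentiment_analysis",
    "output_schema": {
      "type": "object",
      "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "explanation": {"type": "string"}
      },
      "required": ["sentiment"]
    }
  }
}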

Error Handling

| Error | Behavior |
| --- | --- |
| LLM timeout | Retry once, then null result |
| Schema validation fail | Raw text in output_field |
| Rate limit | Automatic backoff |
| Empty content | Skip enrichment |
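
As an illustration of the fallback behaviors above (document shape and field values assumed from the earlier output examples), a schema validation failure leaves the model's raw text in the output field rather than a structured object:
{
  "document_id": "doc_456",
  "content": "Great product, 5 stars!...",
  "review_analysis": "Sentiment: positive. Rating mentioned: 5. Pros: easy to use, great value."
}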