The LLM Enrichment stage uses language models to extract structured data from document content. It can identify entities, classify content, extract key information, and generate structured outputs.
Stage Category: APPLY (Enriches documents)
Transformation: N documents → N documents (with extracted data added)
When to Use
| Use Case | Description |
| --- | --- |
| Entity extraction | Extract names, dates, amounts from text |
| Content classification | Categorize documents by topic/type |
| Key information extraction | Pull specific facts from unstructured text |
| Structured output generation | Convert prose to structured data |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Simple field transformation | json_transform |
| Predefined taxonomy classification | taxonomy_enrich |
| Large-scale processing | Pre-process during indexing |
| Real-time low-latency | Use cached extractions |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model | string | Required | LLM model to use |
| prompt | string | Required | Extraction instructions |
| content_field | string | content | Field to analyze |
| output_field | string | extracted | Field for extracted data |
| output_schema | object | null | JSON schema for structured output |
| batch_size | integer | 5 | Documents per LLM call |
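To show how the parameters fit together, the sketch below spells out every field at its documented default; the prompt text is only a placeholder.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract the key facts from this document.",
    "content_field": "content",
    "output_field": "extracted",
    "output_schema": null,
    "batch_size": 5
  }
}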
Available Models
| Model | Speed | Quality | Best For |
| --- | --- | --- | --- |
| gpt-4o-mini | Fast | Good | Simple extractions |
| gpt-4o | Medium | Excellent | Complex analysis |
| claude-3-haiku | Fast | Good | Quick processing |
| claude-3-sonnet | Medium | Excellent | Nuanced extraction |
Configuration Examples
Basic Entity Extraction
{
"stage_type" : "apply" ,
"stage_id" : "llm_enrichment" ,
"parameters" : {
"model" : "gpt-4o-mini" ,
"prompt" : "Extract all company names and person names mentioned in this document." ,
"output_field" : "entities"
}
}
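The remaining configuration examples are illustrative sketches: the prompts, field names, and schemas are placeholders chosen to match each use case, not fixed values.
Structured Output with Schema
A sketch that produces the review_analysis output shown under "With Schema" below; the schema fields mirror that example.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Analyze this product review. Extract the sentiment, any rating mentioned, pros, cons, and whether the reviewer would recommend the product.",
    "output_field": "review_analysis",
    "output_schema": {
      "type": "object",
      "properties": {
        "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] },
        "rating_mentioned": { "type": "number" },
        "pros": { "type": "array", "items": { "type": "string" } },
        "cons": { "type": "array", "items": { "type": "string" } },
        "would_recommend": { "type": "boolean" }
      }
    }
  }
}
Key Facts Extraction
A sketch that asks for standalone facts and collects them in an array.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract the key facts from this document as short, standalone statements.",
    "output_field": "key_facts",
    "output_schema": {
      "type": "object",
      "properties": {
        "facts": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}
Topic Classification
A sketch matching the topic/sentiment schema used in the pipeline patterns later on this page.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Classify the main topic of this document and the sentiment as positive, neutral, or negative.",
    "output_field": "analysis",
    "output_schema": {
      "type": "object",
      "properties": {
        "topic": { "type": "string" },
        "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] }
      }
    }
  }
}
Contact Information Extraction
A sketch that pulls person names, email addresses, and phone numbers into separate arrays.
{
  "stage_type": "apply",
  "stage_id": "llm_enrichment",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract all contact details: person names, email addresses, and phone numbers.",
    "output_field": "contacts",
    "output_schema": {
      "type": "object",
      "properties": {
        "names": { "type": "array", "items": { "type": "string" } },
        "emails": { "type": "array", "items": { "type": "string" } },
        "phones": { "type": "array", "items": { "type": "string" } }
      }
    }
  }
}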
Output Schema
Define structured output using JSON Schema:
{
"output_schema" : {
"type" : "object" ,
"properties" : {
"field_name" : { "type" : "string" },
"numeric_field" : { "type" : "number" },
"boolean_field" : { "type" : "boolean" },
"array_field" : { "type" : "array" , "items" : { "type" : "string" }},
"enum_field" : { "type" : "string" , "enum" : [ "option1" , "option2" , "option3" ]}
},
"required" : [ "field_name" ]
}
}
Supported Types
| Type | Description |
| --- | --- |
| string | Text values |
| number | Numeric values |
| boolean | True/false |
| array | Lists of items |
| object | Nested objects |
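The schema example above covers string, number, boolean, array, and enum values but not a nested object. A minimal sketch of an object-typed property (the person field and its sub-properties are illustrative) looks like this:
{
  "output_schema": {
    "type": "object",
    "properties": {
      "person": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    }
  }
}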
Output Examples
Without Schema
{
"document_id" : "doc_123" ,
"content" : "Apple Inc. announced..." ,
"entities" : "Companies: Apple Inc., Microsoft \n People: Tim Cook, Satya Nadella"
}
With Schema
{
"document_id" : "doc_123" ,
"content" : "Great product, 5 stars!..." ,
"review_analysis" : {
"sentiment" : "positive" ,
"rating_mentioned" : 5 ,
"pros" : [ "easy to use" , "great value" , "fast shipping" ],
"cons" : [ "packaging could be better" ],
"would_recommend" : true
}
}
Performance
| Metric | Value |
| --- | --- |
| Latency | 300-800ms per batch |
| Batch size | 5 documents (default) |
| Token usage | ~200 tokens per document |
| Parallelism | Batches processed concurrently |
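As a rough worked example using these defaults: enriching 20 retrieved documents with a batch size of 5 means 4 LLM calls and on the order of 4,000 tokens. Run sequentially that would be roughly 1.2-3.2s; since batches are processed concurrently, wall-clock latency should stay closer to a single batch's 300-800ms.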
LLM enrichment is expensive. Consider pre-computing extractions during indexing for frequently accessed data.
Common Pipeline Patterns
The first pattern retrieves documents with semantic search, extracts topic and sentiment, then filters to positive results; the second runs structured entity extraction over hybrid search results.
[
{
"stage_type" : "filter" ,
"stage_id" : "semantic_search" ,
"parameters" : {
"query" : "{{INPUT.query}}" ,
"vector_index" : "text_extractor_v1_embedding" ,
"top_k" : 20
}
},
{
"stage_type" : "apply" ,
"stage_id" : "llm_enrichment" ,
"parameters" : {
"model" : "gpt-4o-mini" ,
"prompt" : "Extract the main topic and sentiment." ,
"output_field" : "analysis" ,
"output_schema" : {
"type" : "object" ,
"properties" : {
"topic" : { "type" : "string" },
"sentiment" : { "type" : "string" , "enum" : [ "positive" , "neutral" , "negative" ]}
}
}
}
},
{
"stage_type" : "filter" ,
"stage_id" : "structured_filter" ,
"parameters" : {
"conditions" : {
"field" : "analysis.sentiment" ,
"operator" : "eq" ,
"value" : "positive"
}
}
}
]
[
{
"stage_type" : "filter" ,
"stage_id" : "hybrid_search" ,
"parameters" : {
"query" : "{{INPUT.query}}" ,
"vector_index" : "text_extractor_v1_embedding" ,
"top_k" : 10
}
},
{
"stage_type" : "apply" ,
"stage_id" : "llm_enrichment" ,
"parameters" : {
"model" : "gpt-4o" ,
"prompt" : "Extract all entities with their types and relationships." ,
"output_field" : "entities" ,
"output_schema" : {
"type" : "object" ,
"properties" : {
"people" : { "type" : "array" , "items" : { "type" : "object" , "properties" : { "name" : { "type" : "string" }, "role" : { "type" : "string" }}}},
"organizations" : { "type" : "array" , "items" : { "type" : "string" }},
"locations" : { "type" : "array" , "items" : { "type" : "string" }},
"dates" : { "type" : "array" , "items" : { "type" : "string" }}
}
}
}
}
]
Writing Effective Prompts
Good Prompts
✓ "Extract the product name, price, and key features from this product listing."
✓ "Identify all dates mentioned and their associated events."
✓ "Classify the sentiment as positive, neutral, or negative, and explain why."
Poor Prompts
✗ "Analyze this document" (too vague)
✗ "Get the data" (not specific)
✗ "Tell me about it" (unclear output)
Be specific about what to extract and in what format. When output_schema is provided, the LLM's output conforms to that structure.
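As a sketch of pairing a specific prompt with a matching schema (the field names here are illustrative), the sentiment prompt above could be constrained like this:
{
  "prompt": "Classify the sentiment as positive, neutral, or negative, and explain why.",
  "output_schema": {
    "type": "object",
    "properties": {
      "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] },
      "explanation": { "type": "string" }
    },
    "required": ["sentiment"]
  }
}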
Error Handling
| Error | Behavior |
| --- | --- |
| LLM timeout | Retry once, then null result |
| Schema validation failure | Raw text in output_field |
| Rate limit | Automatic backoff |
| Empty content | Skip enrichment |
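To illustrate the schema-validation fallback (a hypothetical result, not captured output): if the model's response cannot be validated against output_schema, the document keeps the raw text in output_field instead of the structured object:
{
  "document_id": "doc_123",
  "content": "Great product, 5 stars!...",
  "review_analysis": "Sentiment is positive; the reviewer mentions a 5-star rating and would recommend the product."
}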