The LLM Enrich stage uses language models to extract structured data from document content. It can identify entities, classify content, extract key information, and generate structured outputs.
**Stage Category**: APPLY (enriches documents)
**Transformation**: N documents → N documents (with extracted data added)
When to Use
| Use Case | Description |
|---|---|
| Entity extraction | Extract names, dates, amounts from text |
| Content classification | Categorize documents by topic/type |
| Key information extraction | Pull specific facts from unstructured text |
| Structured output generation | Convert prose to structured data |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Simple field transformation | `json_transform` |
| Predefined taxonomy classification | `taxonomy_enrich` |
| Large-scale processing | Pre-process during indexing |
| Real-time low-latency | Use cached extractions |
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | Required | LLM model to use |
| `prompt` | string | Required | Extraction instructions |
| `content_field` | string | `content` | Field to analyze |
| `output_field` | string | `extracted` | Field for extracted data |
| `output_schema` | object | `null` | JSON schema for structured output |
| `batch_size` | integer | `5` | Documents per LLM call |
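To make the defaults concrete, here is a small illustrative helper (not part of any Mixpeek SDK) that assembles an `llm_enrich` parameter block, enforcing the two required fields and filling in the documented defaults for the optional ones:

```python
# Illustrative only: mirrors the parameter table above.
DEFAULTS = {
    "content_field": "content",
    "output_field": "extracted",
    "output_schema": None,
    "batch_size": 5,
}

def build_llm_enrich_params(model: str, prompt: str, **overrides) -> dict:
    """Return a parameters dict with required fields checked and defaults filled."""
    if not model or not prompt:
        raise ValueError("model and prompt are required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"model": model, "prompt": prompt, **DEFAULTS, **overrides}

params = build_llm_enrich_params("gpt-4o-mini", "Extract entities.")
```

Overrides win over defaults, so passing `batch_size=10` replaces the default of 5.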
Available Models
| Model | Speed | Quality | Best For |
|---|---|---|---|
| `gpt-4o-mini` | Fast | Good | Simple extractions |
| `gpt-4o` | Medium | Excellent | Complex analysis |
| `claude-3-haiku` | Fast | Good | Quick processing |
| `claude-3-sonnet` | Medium | Excellent | Nuanced extraction |
Configuration Examples
Basic Entity Extraction
```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract all company names and person names mentioned in this document.",
    "output_field": "entities"
  }
}
```
Output Schema
Define structured output using JSON Schema:
```json
{
  "output_schema": {
    "type": "object",
    "properties": {
      "field_name": { "type": "string" },
      "numeric_field": { "type": "number" },
      "boolean_field": { "type": "boolean" },
      "array_field": { "type": "array", "items": { "type": "string" } },
      "enum_field": { "type": "string", "enum": ["option1", "option2", "option3"] }
    },
    "required": ["field_name"]
  }
}
```
Supported Types
| Type | Description |
|---|---|
| `string` | Text values |
| `number` | Numeric values |
| `boolean` | True/false |
| `array` | Lists of items |
| `object` | Nested objects |
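The mapping between these schema types and concrete values can be sketched with a toy validator. This is purely illustrative of the type semantics; it is not the actual schema enforcement the stage performs:

```python
# Toy validator for the five supported schema types (illustrative only).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}

def matches(schema: dict, value) -> bool:
    """Check one value against a property schema (type + optional enum/items)."""
    if not TYPE_CHECKS[schema["type"]](value):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if schema["type"] == "array" and "items" in schema:
        return all(matches(schema["items"], item) for item in value)
    return True
```

Note that `number` deliberately rejects booleans, since `True` is an `int` in Python but a distinct type in JSON Schema.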
Output Examples
Without Schema
```json
{
  "document_id": "doc_123",
  "content": "Apple Inc. announced...",
  "entities": "Companies: Apple Inc., Microsoft\nPeople: Tim Cook, Satya Nadella"
}
```
With Schema
```json
{
  "document_id": "doc_123",
  "content": "Great product, 5 stars!...",
  "review_analysis": {
    "sentiment": "positive",
    "rating_mentioned": 5,
    "pros": ["easy to use", "great value", "fast shipping"],
    "cons": ["packaging could be better"],
    "would_recommend": true
  }
}
```
Performance
| Metric | Value |
|---|---|
| Latency | 300-800ms per batch |
| Batch size | 5 documents default |
| Token usage | ~200 tokens per document |
| Parallelism | Batches processed concurrently |
LLM enrichment is expensive. Consider pre-computing extractions during indexing for frequently accessed data.
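A quick back-of-the-envelope estimate using the figures above (batch size 5, ~200 tokens per document, 300-800 ms per batch) shows why:

```python
import math

def enrichment_estimate(num_docs: int, batch_size: int = 5,
                        tokens_per_doc: int = 200,
                        ms_per_batch: tuple = (300, 800)) -> dict:
    """Rough cost/latency estimate from the documented performance figures."""
    batches = math.ceil(num_docs / batch_size)
    return {
        "llm_calls": batches,
        "approx_tokens": num_docs * tokens_per_doc,
        # Batches run concurrently, so wall-clock time is closer to a single
        # batch's latency than to this serial worst case.
        "serial_latency_ms": (batches * ms_per_batch[0], batches * ms_per_batch[1]),
    }

est = enrichment_estimate(100)
```

For 100 documents this is 20 LLM calls and roughly 20,000 tokens per retrieval, which adds up quickly on hot query paths.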
Common Pipeline Patterns
Search, enrich, then filter on the extracted sentiment:

```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 20
        }
      ],
      "final_top_k": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Extract the main topic and sentiment.",
      "output_field": "analysis",
      "output_schema": {
        "type": "object",
        "properties": {
          "topic": { "type": "string" },
          "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] }
        }
      }
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "analysis.sentiment",
      "operator": "eq",
      "value": "positive"
    }
  }
]
```
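Conceptually, the final `attribute_filter` stage keeps only documents whose nested `analysis.sentiment` field equals `"positive"`. A minimal sketch of that behavior (the real stage runs server-side; this is only for intuition):

```python
def get_path(doc: dict, path: str):
    """Resolve a dotted field path like 'analysis.sentiment'; None if missing."""
    value = doc
    for key in path.split("."):
        value = value.get(key) if isinstance(value, dict) else None
        if value is None:
            return None
    return value

def attribute_filter(docs, field, value):
    """Keep documents whose field at `path` equals `value` (operator 'eq')."""
    return [d for d in docs if get_path(d, field) == value]

docs = [
    {"document_id": "a", "analysis": {"sentiment": "positive"}},
    {"document_id": "b", "analysis": {"sentiment": "negative"}},
    {"document_id": "c"},  # enrichment skipped, e.g. empty content
]
kept = attribute_filter(docs, "analysis.sentiment", "positive")
```

Documents where enrichment was skipped (no `analysis` field) fail the comparison and are filtered out.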
Search followed by structured entity extraction:

```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 10
        }
      ],
      "final_top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Extract all entities with their types and relationships.",
      "output_field": "entities",
      "output_schema": {
        "type": "object",
        "properties": {
          "people": { "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string" }, "role": { "type": "string" } } } },
          "organizations": { "type": "array", "items": { "type": "string" } },
          "locations": { "type": "array", "items": { "type": "string" } },
          "dates": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
]
```
Writing Effective Prompts
Good Prompts
✓ "Extract the product name, price, and key features from this product listing."
✓ "Identify all dates mentioned and their associated events."
✓ "Classify the sentiment as positive, neutral, or negative, and explain why."
Poor Prompts
✗ "Analyze this document" (too vague)
✗ "Get the data" (not specific)
✗ "Tell me about it" (unclear output)
Be specific about what to extract and in what format. When `output_schema` is provided, the LLM is constrained to produce output matching that structure.
Bring Your Own Key (BYOK)
Use your own LLM API keys instead of Mixpeek’s default keys. This gives you control over costs, rate limits, and API usage.
Why Use BYOK?
| Benefit | Description |
|---|---|
| Cost Control | Use your own API credits and billing |
| Rate Limits | Use your own rate limits instead of shared ones |
| Compliance | Keep API calls under your own account |
| Key Rotation | Rotate keys without changing retrievers |
Setup
1. Store your API key as a secret

Store your LLM provider API key in the organization secrets vault:

```bash
curl -X POST "https://api.mixpeek.com/v1/organizations/secrets" \
  -H "Authorization: Bearer YOUR_MIXPEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "secret_name": "openai_api_key",
    "secret_value": "sk-proj-abc123..."
  }'
```
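The same secret-creation call can be expressed in Python with the standard library. This sketch only builds the request; actually sending it requires a valid Mixpeek API key:

```python
import json
import urllib.request

def create_secret_request(api_key: str, name: str, value: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example above."""
    payload = json.dumps({"secret_name": name, "secret_value": value}).encode()
    return urllib.request.Request(
        "https://api.mixpeek.com/v1/organizations/secrets",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = create_secret_request("YOUR_MIXPEEK_API_KEY", "openai_api_key", "sk-proj-...")
# urllib.request.urlopen(req) would send it
```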
2. Reference the secret in your stage

Use the `api_key` parameter with template syntax:

```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract key entities from this document.",
    "output_field": "entities",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
BYOK Configuration Example
OpenAI BYOK
```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "prompt": "Summarize this document in 2-3 sentences.",
    "output_field": "summary",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
Supported Providers
| Provider | Secret Name | Example Models |
|---|---|---|
| OpenAI | `openai_api_key` | gpt-4o, gpt-4o-mini |
| Anthropic | `anthropic_api_key` | claude-3-haiku, claude-3-sonnet, claude-3-opus |
| Google | `google_api_key` | gemini-2.0-flash, gemini-1.5-pro |
When `api_key` is not specified, the stage uses Mixpeek's default API keys and usage is charged to your Mixpeek account.
Error Handling
| Error | Behavior |
|---|---|
| LLM timeout | Retry once, then null result |
| Schema validation failure | Raw text placed in `output_field` |
| Rate limit | Automatic backoff |
| Empty content | Enrichment skipped |
| Invalid API key | Error returned with auth failure |
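The timeout row above ("retry once, then null result") can be sketched as client-side retry logic. This is purely illustrative of the documented semantics, not the stage's actual implementation:

```python
import time

def enrich_with_retry(call, retries: int = 1, backoff_s: float = 0.0):
    """Run call() with retries; return None once retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == retries:
                return None  # documented behavior: retry once, then null result
            if backoff_s:
                time.sleep(backoff_s)  # backoff, as used for rate limits

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    raise TimeoutError

result = enrich_with_retry(flaky)
```

With the default of one retry, a persistently failing call is attempted twice before the result falls back to `None`.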