The LLM Enrich stage uses language models to extract structured data from document content. It can identify entities, classify content, extract key information, and generate structured outputs.
**Stage Category**: APPLY (enriches documents)
**Transformation**: N documents → N documents (with extracted data added)
When to Use
| Use Case | Description |
|---|---|
| Entity extraction | Extract names, dates, amounts from text |
| Content classification | Categorize documents by topic/type |
| Key information extraction | Pull specific facts from unstructured text |
| Structured output generation | Convert prose to structured data |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Simple field transformation | `json_transform` |
| Predefined taxonomy classification | `taxonomy_enrich` |
| Large-scale processing | Pre-process during indexing |
| Real-time low-latency | Use cached extractions |
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `model` | string | Required | LLM model to use |
| `prompt` | string | Required | Extraction instructions |
| `content_field` | string | `content` | Field to analyze |
| `output_field` | string | `extracted` | Field for extracted data |
| `output_schema` | object | `null` | JSON schema for structured output |
| `batch_size` | integer | `5` | Documents per LLM call |
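To make the defaults concrete, here is a small illustrative helper (not part of any Mixpeek SDK) that assembles an `llm_enrich` parameter block, enforcing the two required fields and filling in the documented defaults for the optional ones:

```python
# Illustrative only: mirrors the parameter table above.
DEFAULTS = {
    "content_field": "content",
    "output_field": "extracted",
    "output_schema": None,
    "batch_size": 5,
}

def build_llm_enrich_params(model: str, prompt: str, **overrides) -> dict:
    """Return a parameters dict with required fields checked and defaults filled."""
    if not model or not prompt:
        raise ValueError("model and prompt are required")
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"model": model, "prompt": prompt, **DEFAULTS, **overrides}

params = build_llm_enrich_params("gpt-4o-mini", "Extract entities.")
```

Overrides win over defaults, so passing `batch_size=10` replaces the default of 5.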
Available Models
| Model | Speed | Quality | Best For |
|---|---|---|---|
| `gpt-4o-mini` | Fast | Good | Simple extractions |
| `gpt-4o` | Medium | Excellent | Complex analysis |
| `claude-3-haiku` | Fast | Good | Quick processing |
| `claude-3-sonnet` | Medium | Excellent | Nuanced extraction |
Configuration Examples
Basic Entity Extraction
```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract all company names and person names mentioned in this document.",
    "output_field": "entities"
  }
}
```
Output Schema
Define structured output using JSON Schema:
```json
{
  "output_schema": {
    "type": "object",
    "properties": {
      "field_name": { "type": "string" },
      "numeric_field": { "type": "number" },
      "boolean_field": { "type": "boolean" },
      "array_field": { "type": "array", "items": { "type": "string" } },
      "enum_field": { "type": "string", "enum": ["option1", "option2", "option3"] }
    },
    "required": ["field_name"]
  }
}
```
Supported Types
| Type | Description |
|---|---|
| `string` | Text values |
| `number` | Numeric values |
| `boolean` | True/false |
| `array` | Lists of items |
| `object` | Nested objects |
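The mapping between these schema types and concrete values can be sketched with a toy validator. This is purely illustrative of the type semantics; it is not the actual schema enforcement the stage performs:

```python
# Toy validator for the five supported schema types (illustrative only).
TYPE_CHECKS = {
    "string": lambda v: isinstance(v, str),
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
}

def matches(schema: dict, value) -> bool:
    """Check one value against a property schema (type + optional enum/items)."""
    if not TYPE_CHECKS[schema["type"]](value):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if schema["type"] == "array" and "items" in schema:
        return all(matches(schema["items"], item) for item in value)
    return True
```

Note that `number` deliberately rejects booleans, since `True` is an `int` in Python but a distinct type in JSON Schema.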
Output Examples
Without Schema
```json
{
  "document_id": "doc_123",
  "content": "Apple Inc. announced...",
  "entities": "Companies: Apple Inc., Microsoft\nPeople: Tim Cook, Satya Nadella"
}
```
With Schema
```json
{
  "document_id": "doc_123",
  "content": "Great product, 5 stars!...",
  "review_analysis": {
    "sentiment": "positive",
    "rating_mentioned": 5,
    "pros": ["easy to use", "great value", "fast shipping"],
    "cons": ["packaging could be better"],
    "would_recommend": true
  }
}
```
Performance
| Metric | Value |
|---|---|
| Latency | 300-800ms per batch |
| Batch size | 5 documents default |
| Token usage | ~200 tokens per document |
| Parallelism | Batches processed concurrently |
LLM enrichment is expensive. Consider pre-computing extractions during indexing for frequently accessed data.
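A quick back-of-the-envelope estimate using the figures above (batch size 5, ~200 tokens per document, 300-800 ms per batch) shows why:

```python
import math

def enrichment_estimate(num_docs: int, batch_size: int = 5,
                        tokens_per_doc: int = 200,
                        ms_per_batch: tuple = (300, 800)) -> dict:
    """Rough cost/latency estimate from the documented performance figures."""
    batches = math.ceil(num_docs / batch_size)
    return {
        "llm_calls": batches,
        "approx_tokens": num_docs * tokens_per_doc,
        # Batches run concurrently, so wall-clock time is closer to a single
        # batch's latency than to this serial worst case.
        "serial_latency_ms": (batches * ms_per_batch[0], batches * ms_per_batch[1]),
    }

est = enrichment_estimate(100)
```

For 100 documents this is 20 LLM calls and roughly 20,000 tokens per retrieval, which adds up quickly on hot query paths.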
Common Pipeline Patterns
Search, enrich, then filter on the extracted sentiment:

```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 20
        }
      ],
      "final_top_k": 20
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "model": "gpt-4o-mini",
      "prompt": "Extract the main topic and sentiment.",
      "output_field": "analysis",
      "output_schema": {
        "type": "object",
        "properties": {
          "topic": { "type": "string" },
          "sentiment": { "type": "string", "enum": ["positive", "neutral", "negative"] }
        }
      }
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "attribute_filter",
    "parameters": {
      "field": "analysis.sentiment",
      "operator": "eq",
      "value": "positive"
    }
  }
]
```
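Conceptually, the final `attribute_filter` stage keeps only documents whose nested `analysis.sentiment` field equals `"positive"`. A minimal sketch of that behavior (the real stage runs server-side; this is only for intuition):

```python
def get_path(doc: dict, path: str):
    """Resolve a dotted field path like 'analysis.sentiment'; None if missing."""
    value = doc
    for key in path.split("."):
        value = value.get(key) if isinstance(value, dict) else None
        if value is None:
            return None
    return value

def attribute_filter(docs, field, value):
    """Keep documents whose field at `path` equals `value` (operator 'eq')."""
    return [d for d in docs if get_path(d, field) == value]

docs = [
    {"document_id": "a", "analysis": {"sentiment": "positive"}},
    {"document_id": "b", "analysis": {"sentiment": "negative"}},
    {"document_id": "c"},  # enrichment skipped, e.g. empty content
]
kept = attribute_filter(docs, "analysis.sentiment", "positive")
```

Documents where enrichment was skipped (no `analysis` field) fail the comparison and are filtered out.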
Search followed by structured entity extraction:

```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "searches": [
        {
          "feature_uri": "mixpeek://multimodal_extractor@v1/vertex_multimodal_embedding",
          "query": "{{INPUT.query}}",
          "top_k": 10
        }
      ],
      "final_top_k": 10
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "llm_enrich",
    "parameters": {
      "model": "gpt-4o",
      "prompt": "Extract all entities with their types and relationships.",
      "output_field": "entities",
      "output_schema": {
        "type": "object",
        "properties": {
          "people": { "type": "array", "items": { "type": "object", "properties": { "name": { "type": "string" }, "role": { "type": "string" } } } },
          "organizations": { "type": "array", "items": { "type": "string" } },
          "locations": { "type": "array", "items": { "type": "string" } },
          "dates": { "type": "array", "items": { "type": "string" } }
        }
      }
    }
  }
]
```
Writing Effective Prompts
Good Prompts
✓ "Extract the product name, price, and key features from this product listing."
✓ "Identify all dates mentioned and their associated events."
✓ "Classify the sentiment as positive, neutral, or negative, and explain why."
Poor Prompts
✗ "Analyze this document" (too vague)
✗ "Get the data" (not specific)
✗ "Tell me about it" (unclear output)
Be specific about what to extract and in what format. When `output_schema` is provided, the LLM is constrained to produce output matching that structure.
Bring Your Own Key (BYOK)
Use your own LLM API keys instead of Mixpeek’s default keys. This gives you control over costs, rate limits, and API usage.
Why Use BYOK?
| Benefit | Description |
|---|---|
| Cost Control | Use your own API credits and billing |
| Rate Limits | Use your own rate limits instead of shared ones |
| Compliance | Keep API calls under your own account |
| Key Rotation | Rotate keys without changing retrievers |
Setup
1. Store your API key as a secret

Store your LLM provider API key in the organization secrets vault:

```bash
curl -X POST "https://api.mixpeek.com/v1/organizations/secrets" \
  -H "Authorization: Bearer YOUR_MIXPEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "secret_name": "openai_api_key",
    "secret_value": "sk-proj-abc123..."
  }'
```
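The same secret-creation call can be expressed in Python with the standard library. This sketch only builds the request; actually sending it requires a valid Mixpeek API key:

```python
import json
import urllib.request

def create_secret_request(api_key: str, name: str, value: str) -> urllib.request.Request:
    """Build the POST request shown in the curl example above."""
    payload = json.dumps({"secret_name": name, "secret_value": value}).encode()
    return urllib.request.Request(
        "https://api.mixpeek.com/v1/organizations/secrets",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = create_secret_request("YOUR_MIXPEEK_API_KEY", "openai_api_key", "sk-proj-...")
# urllib.request.urlopen(req) would send it
```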
2. Reference the secret in your stage

Use the `api_key` parameter with template syntax:

```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "model": "gpt-4o-mini",
    "prompt": "Extract key entities from this document.",
    "output_field": "entities",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
BYOK Configuration Example
OpenAI BYOK
```json
{
  "stage_type": "apply",
  "stage_id": "llm_enrich",
  "parameters": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "prompt": "Summarize this document in 2-3 sentences.",
    "output_field": "summary",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
Supported Providers
| Provider | Secret Name | Example Models |
|---|---|---|
| OpenAI | `openai_api_key` | gpt-4o, gpt-4o-mini |
| Anthropic | `anthropic_api_key` | claude-3-haiku, claude-3-sonnet, claude-3-opus |
| Google | `google_api_key` | gemini-2.0-flash, gemini-1.5-pro |
When `api_key` is not specified, the stage uses Mixpeek's default API keys and usage is charged to your Mixpeek account.
Error Handling
| Error | Behavior |
|---|---|
| LLM timeout | Retry once, then null result |
| Schema validation failure | Raw text placed in `output_field` |
| Rate limit | Automatic backoff |
| Empty content | Enrichment skipped |
| Invalid API key | Error returned with auth failure |
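The timeout row above ("retry once, then null result") can be sketched as client-side retry logic. This is purely illustrative of the documented semantics, not the stage's actual implementation:

```python
import time

def enrich_with_retry(call, retries: int = 1, backoff_s: float = 0.0):
    """Run call() with retries; return None once retries are exhausted."""
    for attempt in range(retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == retries:
                return None  # documented behavior: retry once, then null result
            if backoff_s:
                time.sleep(backoff_s)  # backoff, as used for rate limits

attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    raise TimeoutError

result = enrich_with_retry(flaky)
```

With the default of one retry, a persistently failing call is attempted twice before the result falls back to `None`.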