The LLM Filter stage uses language models to evaluate document content against specified criteria, filtering based on semantic understanding rather than metadata fields.
Stage Category: FILTER (reduces the document set)
Transformation: N documents → M documents (where M ≤ N, based on LLM evaluation)

When to Use

| Use Case | Description |
| --- | --- |
| Content quality filtering | Remove low-quality or irrelevant content |
| Semantic criteria | Filter by meaning, not just keywords |
| Complex requirements | e.g. "Only technical documentation" |
| Subjective evaluation | Tone, style, or sentiment filtering |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Simple metadata filtering | `structured_filter` (faster) |
| Large result sets (100+) | Too slow; pre-filter first |
| Deterministic rules | `structured_filter` |
| Low-latency requirements | Use metadata filters |

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | string | Required | LLM model to use |
| `criteria` | string | Required | Natural-language filter criteria |
| `content_field` | string | `content` | Field to evaluate |
| `explanation` | boolean | `false` | Include a filtering explanation |
| `batch_size` | integer | `10` | Documents per LLM call |

Available Models

| Model | Speed | Quality | Cost |
| --- | --- | --- | --- |
| `gpt-4o-mini` | Fast | Good | Low |
| `gpt-4o` | Medium | Excellent | Medium |
| `claude-3-haiku` | Fast | Good | Low |
| `claude-3-sonnet` | Medium | Excellent | Medium |

Configuration Examples

{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep only documents that contain technical information about software development"
  }
}
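A configuration exercising the optional parameters from the table above might look like this (the criteria text and values are illustrative):

```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep technical documentation; remove promotional content",
    "content_field": "content",
    "explanation": true,
    "batch_size": 20
  }
}
```

Setting `explanation` to `true` adds an `llm_filter` object to each passing document, as shown in the output schema below.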

Writing Effective Criteria

Good Criteria Examples

✓ "Keep documents about machine learning algorithms and their implementations"
✓ "Filter out content that is primarily promotional or marketing material"
✓ "Include only documents written in the last 5 years about cloud computing"
✓ "Keep technical documentation; remove blog posts and news articles"

Poor Criteria Examples

✗ "Good documents" (too vague)
✗ "Relevant content" (not specific)
✗ "High quality" (subjective without definition)
Be specific about what to include AND exclude. The LLM makes a binary keep/discard decision for each document.

Output Schema

Without Explanation

Documents that pass the filter are returned unchanged:
{
  "document_id": "doc_123",
  "content": "Technical documentation about...",
  "metadata": {...}
}

With Explanation

{
  "document_id": "doc_123",
  "content": "Technical documentation about...",
  "metadata": {...},
  "llm_filter": {
    "passed": true,
    "explanation": "Document contains detailed technical information about API implementation."
  }
}

Filtered Out (not in results)

Documents that don’t match criteria are removed from the result set.
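Conceptually, the stage applies a binary keep/discard predicate to each document. A minimal sketch of these semantics, where a stand-in predicate takes the place of the LLM call:

```python
def llm_filter(documents, passes_criteria):
    """Sketch of the stage's semantics: keep documents for which the
    (LLM-backed) predicate returns True; drop the rest."""
    return [doc for doc in documents if passes_criteria(doc)]

docs = [
    {"document_id": "a", "content": "Step-by-step API tutorial"},
    {"document_id": "b", "content": "Buy now! Limited-time offer!"},
]

# Stand-in predicate; in the real stage an LLM evaluates `criteria`
is_technical = lambda d: "API" in d["content"]

kept = llm_filter(docs, is_technical)  # only "a" survives
```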

Performance

| Metric | Value |
| --- | --- |
| Latency | 200–500 ms per batch |
| Batch size | 10 documents (default) |
| Parallelism | Batches processed concurrently |
| Token usage | ~100 tokens per document |
LLM filtering is expensive and slow. Always apply structured_filter or use search top_k limits to reduce the document set before LLM filtering.
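The figures above support rough back-of-the-envelope estimates. A sketch using the documented defaults (worst-case sequential latency; concurrent batches will finish sooner):

```python
import math

def estimate(n_docs, batch_size=10, max_ms_per_batch=500, tokens_per_doc=100):
    """Rough estimate from the documented figures: number of LLM calls,
    worst-case sequential latency in ms, and approximate token usage."""
    batches = math.ceil(n_docs / batch_size)
    worst_ms = batches * max_ms_per_batch
    tokens = n_docs * tokens_per_doc
    return batches, worst_ms, tokens

print(estimate(50))  # (5, 2500, 5000)
```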

Common Pipeline Patterns

Search + Metadata Filter + LLM Filter

[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.type",
        "operator": "eq",
        "value": "documentation"
      }
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "llm_filter",
    "parameters": {
      "model": "gpt-4o-mini",
      "criteria": "Keep only documents that provide actionable, step-by-step instructions"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 5
    }
  }
]
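The funnel above narrows the set at every stage. A toy sketch of how the reductions compose (the stage functions are stand-ins for illustration, not the Mixpeek API):

```python
# Toy funnel: each stage takes a list of docs and returns a (smaller) list.
def search(docs, top_k):                 # stand-in for semantic_search
    return docs[:top_k]

def structured_filter(docs, doc_type):   # stand-in for the metadata filter
    return [d for d in docs if d["metadata"]["type"] == doc_type]

def llm_filter(docs, keep):              # stand-in for the LLM stage
    return [d for d in docs if keep(d)]

docs = [{"id": i, "metadata": {"type": "documentation" if i % 2 == 0 else "blog"}}
        for i in range(100)]

stage1 = search(docs, top_k=50)                      # 100 -> 50
stage2 = structured_filter(stage1, "documentation")  # 50 -> 25
stage3 = llm_filter(stage2, lambda d: d["id"] < 20)  # 25 -> 10
final = stage3[:5]                                   # limit -> 5
```

The expensive LLM stage only ever sees the 25 documents that survive the cheap filters, which is the point of the pattern.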

Quality + Relevance Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 30
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 15
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "llm_filter",
    "parameters": {
      "model": "gpt-4o-mini",
      "criteria": "Keep only high-quality, authoritative sources. Remove: user-generated content without verification, outdated information (pre-2020), and incomplete documents."
    }
  }
]

Cost Optimization

| Strategy | Impact |
| --- | --- |
| Pre-filter with metadata | Reduce documents before the LLM |
| Use cheaper models | `gpt-4o-mini` vs `gpt-4o` |
| Increase batch size | Fewer API calls |
| Limit input documents | Use `top_k` in search |
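These strategies compound. A quick sketch of the API-call counts, derived from the batch math (the document counts are illustrative):

```python
import math

def llm_calls(n_docs, batch_size):
    """LLM API calls needed for n_docs at a given batch size."""
    return math.ceil(n_docs / batch_size)

# 500 candidate documents, no pre-filtering, default batch size
baseline = llm_calls(500, 10)   # 50 calls

# Pre-filter to 50 docs with structured_filter, raise batch size to 25
optimized = llm_calls(50, 25)   # 2 calls
```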

Bring Your Own Key (BYOK)

Use your own LLM API keys instead of Mixpeek’s default keys. This gives you control over costs, rate limits, and API usage.

Why Use BYOK?

| Benefit | Description |
| --- | --- |
| Cost control | Use your own API credits and billing |
| Rate limits | Use your own rate limits instead of shared ones |
| Compliance | Keep API calls under your own account |
| Key rotation | Rotate keys without changing retrievers |

Setup

1. Store your API key as a secret

Store your LLM provider API key in the organization secrets vault:
curl -X POST "https://api.mixpeek.com/v1/organizations/secrets" \
  -H "Authorization: Bearer YOUR_MIXPEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "secret_name": "openai_api_key",
    "secret_value": "sk-proj-abc123..."
  }'
2. Reference the secret in your stage

Use the api_key parameter with template syntax:
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep only technical documentation",
    "api_key": "{{secrets.openai_api_key}}"
  }
}

BYOK Configuration Example

{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "criteria": "Keep only high-quality, professional content",
    "api_key": "{{secrets.openai_api_key}}"
  }
}

Supported Providers

| Provider | Secret Name Example | Models |
| --- | --- | --- |
| OpenAI | `openai_api_key` | `gpt-4o`, `gpt-4o-mini` |
| Anthropic | `anthropic_api_key` | `claude-3-haiku`, `claude-3-sonnet`, `claude-3-opus` |
| Google | `google_api_key` | `gemini-2.0-flash`, `gemini-1.5-pro` |
When `api_key` is not specified, the stage uses Mixpeek’s default API keys and usage is charged to your Mixpeek account.

Error Handling

| Error | Behavior |
| --- | --- |
| LLM timeout | Retry once, then fail |
| Rate limit | Automatic backoff |
| Invalid model | Stage fails |
| Empty criteria | All documents pass |
| Invalid API key | Error returned with auth failure |
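Callers who invoke the stage directly can mirror this behavior client-side. A minimal retry sketch (the `run_stage` callable and exception types are hypothetical stand-ins):

```python
import time

class RateLimitError(Exception): pass
class LLMTimeout(Exception): pass

def call_with_retries(run_stage, max_attempts=4, initial_backoff=1.0):
    """Sketch mirroring the documented behavior: retry once on an LLM
    timeout, back off exponentially on rate limits, and let other
    errors (e.g. invalid model or API key) fail immediately."""
    timeout_retries = 0
    backoff = initial_backoff
    for _ in range(max_attempts):
        try:
            return run_stage()
        except LLMTimeout:
            if timeout_retries >= 1:  # "retry once, then fail"
                raise
            timeout_retries += 1
        except RateLimitError:
            time.sleep(backoff)       # "automatic backoff"
            backoff *= 2
    raise RuntimeError("retries exhausted")
```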