The LLM Filter stage uses language models to evaluate document content against specified criteria, filtering based on semantic understanding rather than metadata fields.
Stage Category: FILTER (reduces the document set)
Transformation: N documents → M documents (where M ≤ N, based on LLM evaluation)
When to Use
| Use Case | Description |
| --- | --- |
| Content quality filtering | Remove low-quality or irrelevant content |
| Semantic criteria | Filter by meaning, not just keywords |
| Complex requirements | "Only technical documentation" |
| Subjective evaluation | Tone, style, or sentiment filtering |
When NOT to Use
| Scenario | Recommended Alternative |
| --- | --- |
| Simple metadata filtering | `structured_filter` (faster) |
| Large result sets (100+) | Too slow; pre-filter first |
| Deterministic rules | `structured_filter` |
| Low latency requirements | Use metadata filters |
Parameters
| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `model` | string | Required | LLM model to use |
| `criteria` | string | Required | Natural language filter criteria |
| `content_field` | string | `content` | Field to evaluate |
| `explanation` | boolean | `false` | Include filtering explanation |
| `batch_size` | integer | `10` | Documents per LLM call |
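The optional parameters can be combined in a single stage. A sketch that enables explanations and a larger batch (the criteria text here is illustrative):

```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep only documents that describe REST API endpoints",
    "content_field": "content",
    "explanation": true,
    "batch_size": 20
  }
}
```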
Available Models
| Model | Speed | Quality | Cost |
| --- | --- | --- | --- |
| `gpt-4o-mini` | Fast | Good | Low |
| `gpt-4o` | Medium | Excellent | Medium |
| `claude-3-haiku` | Fast | Good | Low |
| `claude-3-sonnet` | Medium | Excellent | Medium |
Configuration Examples
Basic Content Filter
```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep only documents that contain technical information about software development"
  }
}
```
Writing Effective Criteria
Good Criteria Examples
✓ "Keep documents about machine learning algorithms and their implementations"
✓ "Filter out content that is primarily promotional or marketing material"
✓ "Include only documents written in the last 5 years about cloud computing"
✓ "Keep technical documentation; remove blog posts and news articles"
Poor Criteria Examples
✗ "Good documents" (too vague)
✗ "Relevant content" (not specific)
✗ "High quality" (subjective without definition)
Be specific about what to include AND exclude. The LLM makes a binary keep/discard decision for each document.
Output Schema
Without Explanation
Documents that pass the filter are returned unchanged:
```json
{
  "document_id": "doc_123",
  "content": "Technical documentation about...",
  "metadata": { ... }
}
```
With Explanation
```json
{
  "document_id": "doc_123",
  "content": "Technical documentation about...",
  "metadata": { ... },
  "llm_filter": {
    "passed": true,
    "explanation": "Document contains detailed technical information about API implementation."
  }
}
```
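When explanation is enabled, downstream code can read the `llm_filter` annotation on each surviving document. A minimal sketch (the result list below is illustrative, matching the schema above):

```python
# Sketch: read the llm_filter annotation added when "explanation": true.
# The documents below are illustrative sample data, not live API output.
results = [
    {
        "document_id": "doc_123",
        "content": "Technical documentation about...",
        "metadata": {},
        "llm_filter": {
            "passed": True,
            "explanation": "Document contains detailed technical information about API implementation.",
        },
    }
]

# Every returned document passed the filter; collect explanations by id.
explanations = {
    doc["document_id"]: doc["llm_filter"]["explanation"]
    for doc in results
    if "llm_filter" in doc
}
print(explanations["doc_123"])
```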
Filtered Out (not in results)
Documents that don’t match criteria are removed from the result set.
Performance

| Metric | Value |
| --- | --- |
| Latency | 200-500 ms per batch |
| Batch size | 10 documents (default) |
| Token usage | ~100 tokens per document |
| Parallelism | Batches processed concurrently |
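Using the figures above (~100 tokens per document, 10 documents per batch, 200-500 ms per batch), the rough cost of a filter pass can be estimated. A back-of-the-envelope sketch:

```python
import math

def estimate_llm_filter(num_docs, batch_size=10, tokens_per_doc=100,
                        ms_per_batch=(200, 500)):
    """Rough estimate using the figures from the performance table above."""
    batches = math.ceil(num_docs / batch_size)
    tokens = num_docs * tokens_per_doc
    # Batches run concurrently, so wall-clock time is usually closer to one
    # batch's latency; the serial worst case multiplies by the batch count.
    return {
        "batches": batches,
        "approx_tokens": tokens,
        "serial_worst_case_ms": batches * ms_per_batch[1],
    }

print(estimate_llm_filter(50))
```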
LLM filtering is expensive and slow. Always apply `structured_filter` or use search `top_k` limits to reduce the document set before LLM filtering.
Common Pipeline Patterns
```json
[
  {
    "stage_type": "filter",
    "stage_id": "semantic_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 50
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "structured_filter",
    "parameters": {
      "conditions": {
        "field": "metadata.type",
        "operator": "eq",
        "value": "documentation"
      }
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "llm_filter",
    "parameters": {
      "model": "gpt-4o-mini",
      "criteria": "Keep only documents that provide actionable, step-by-step instructions"
    }
  },
  {
    "stage_type": "reduce",
    "stage_id": "limit",
    "parameters": {
      "limit": 5
    }
  }
]
```
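This pattern is a funnel: each stage cheapens the work for the next, so the expensive LLM filter only sees documents that already passed the cheap deterministic filters. A sketch of the narrowing (the intermediate counts and the stand-in predicates are illustrative):

```python
# Sketch of the funnel pattern above: cheap deterministic filters run first,
# and the LLM-style filter sees only what survives them.
docs = [
    {"id": i,
     "type": "documentation" if i % 2 == 0 else "blog",
     "actionable": i % 4 == 0}  # stand-in for the LLM's keep/discard judgment
    for i in range(50)
]

# structured_filter equivalent: cheap metadata check on metadata.type.
stage2 = [d for d in docs if d["type"] == "documentation"]

# llm_filter equivalent: the semantic decision, stubbed with a boolean flag.
stage3 = [d for d in stage2 if d["actionable"]]

# limit equivalent: keep the first 5.
final = stage3[:5]
print(len(docs), "->", len(stage2), "->", len(stage3), "->", len(final))
```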
Quality + Relevance Pipeline
```json
[
  {
    "stage_type": "filter",
    "stage_id": "hybrid_search",
    "parameters": {
      "query": "{{INPUT.query}}",
      "vector_index": "text_extractor_v1_embedding",
      "top_k": 30
    }
  },
  {
    "stage_type": "sort",
    "stage_id": "rerank",
    "parameters": {
      "model": "bge-reranker-v2-m3",
      "top_n": 15
    }
  },
  {
    "stage_type": "filter",
    "stage_id": "llm_filter",
    "parameters": {
      "model": "gpt-4o-mini",
      "criteria": "Keep only high-quality, authoritative sources. Remove: user-generated content without verification, outdated information (pre-2020), and incomplete documents."
    }
  }
]
```
Cost Optimization
| Strategy | Impact |
| --- | --- |
| Pre-filter with metadata | Reduce documents before LLM |
| Use cheaper models | `gpt-4o-mini` vs `gpt-4o` |
| Increase batch size | Fewer API calls |
| Limit input documents | Use `top_k` in search |
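These strategies compound: pre-filtering shrinks the document set, and a cheaper model lowers the per-token price. A sketch of the combined effect (the per-token prices below are placeholders, not real pricing):

```python
# Illustrative only: the prices are placeholders, and tokens_per_doc comes
# from the performance table above (~100 tokens per document).
docs_before_prefilter = 500
docs_after_prefilter = 50      # structured_filter / top_k applied first
tokens_per_doc = 100

def token_cost(num_docs, price_per_1k_tokens):
    return num_docs * tokens_per_doc / 1000 * price_per_1k_tokens

expensive = token_cost(docs_before_prefilter, price_per_1k_tokens=0.01)
cheap = token_cost(docs_after_prefilter, price_per_1k_tokens=0.001)
# 10x fewer documents times a 10x cheaper model: a 100x cost reduction.
print(f"{expensive:.3f} vs {cheap:.3f}")
```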
Bring Your Own Key (BYOK)
Use your own LLM API keys instead of Mixpeek’s default keys. This gives you control over costs, rate limits, and API usage.
Why Use BYOK?
| Benefit | Description |
| --- | --- |
| Cost Control | Use your own API credits and billing |
| Rate Limits | Use your own rate limits instead of shared ones |
| Compliance | Keep API calls under your own account |
| Key Rotation | Rotate keys without changing retrievers |
Setup
Store your API key as a secret

Store your LLM provider API key in the organization secrets vault:

```bash
curl -X POST "https://api.mixpeek.com/v1/organizations/secrets" \
  -H "Authorization: Bearer YOUR_MIXPEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "secret_name": "openai_api_key",
    "secret_value": "sk-proj-abc123..."
  }'
```
Reference the secret in your stage

Use the `api_key` parameter with template syntax:

```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "model": "gpt-4o-mini",
    "criteria": "Keep only technical documentation",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
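Conceptually, `{{secrets.openai_api_key}}` is substituted from the secrets vault when the stage executes. A hypothetical sketch of that substitution (the resolver function and in-memory vault here are illustrative, not Mixpeek internals):

```python
import re

# Hypothetical: substitute {{secrets.NAME}} placeholders from a vault dict.
def resolve_secrets(value, vault):
    return re.sub(
        r"\{\{secrets\.([A-Za-z0-9_]+)\}\}",
        lambda m: vault[m.group(1)],
        value,
    )

vault = {"openai_api_key": "sk-proj-abc123..."}
resolved = resolve_secrets("{{secrets.openai_api_key}}", vault)
print(resolved)  # the stored secret value
```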
BYOK Configuration Example
OpenAI BYOK
```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "provider": "openai",
    "model": "gpt-4o-mini",
    "criteria": "Keep only high-quality, professional content",
    "api_key": "{{secrets.openai_api_key}}"
  }
}
```
Supported Providers
| Provider | Secret Name | Example Models |
| --- | --- | --- |
| OpenAI | `openai_api_key` | `gpt-4o`, `gpt-4o-mini` |
| Anthropic | `anthropic_api_key` | `claude-3-haiku`, `claude-3-sonnet`, `claude-3-opus` |
| Google | `google_api_key` | `gemini-2.0-flash`, `gemini-1.5-pro` |
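Other providers follow the same configuration shape. A sketch for Anthropic, assuming the parameter set mirrors the OpenAI example above:

```json
{
  "stage_type": "filter",
  "stage_id": "llm_filter",
  "parameters": {
    "provider": "anthropic",
    "model": "claude-3-haiku",
    "criteria": "Keep only high-quality, professional content",
    "api_key": "{{secrets.anthropic_api_key}}"
  }
}
```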
When `api_key` is not specified, the stage uses Mixpeek's default API keys and usage is charged to your Mixpeek account.
Error Handling
| Error | Behavior |
| --- | --- |
| LLM timeout | Retry once, then fail |
| Rate limit | Automatic backoff |
| Invalid model | Stage fails |
| Empty criteria | All documents pass |
| Invalid API key | Error returned with auth failure |
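The "retry once, then fail" timeout behavior can be expressed as a generic wrapper. A sketch of that semantics (illustrative, not the platform's implementation):

```python
# Sketch of "retry once, then fail": one retry on timeout, after which the
# second failure propagates to the caller.
def call_with_one_retry(fn):
    try:
        return fn()
    except TimeoutError:
        return fn()  # single retry; a repeated timeout raises normally

attempts = []

def flaky():
    """Stand-in for an LLM call that times out once, then succeeds."""
    attempts.append(1)
    if len(attempts) == 1:
        raise TimeoutError("LLM timeout")
    return "ok"

result = call_with_one_retry(flaky)
print(result)
```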