Feature Extractor Categories
Text Extractors
Dense/sparse embeddings, NER, summarization, sentiment analysis
Vision Extractors
Image embeddings (CLIP), object detection, OCR, scene analysis
Audio Extractors
Transcription (Whisper), speaker diarization, audio embeddings
Multimodal Extractors
Video scene detection, PDF layout analysis, document understanding
Decision Framework
1. Start with Your Query Intent
What will users search for?

| Query Type | Required Extractors | Example |
|---|---|---|
| Text search | text_extractor (dense + sparse) | "Find articles about machine learning" |
| Image similarity | image_extractor (CLIP) | "Find similar product photos" |
| Video moment search | video_extractor + audio_extractor | "Find scenes where someone mentions pricing" |
| Document QA | pdf_extractor + text_extractor | "Which contracts have termination clauses?" |
| Face search | face_extractor (ArcFace) | "Find all videos featuring this person" |
2. Consider Modality Combinations
Single-modality search queries one collection. To search across modalities, pass multiple collections in a single request, e.g. `collection_ids: ["col_text", "col_image"]`.
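As a minimal sketch (the endpoint URL and payload field names beyond `collection_ids` are illustrative assumptions, not a documented API contract), a multi-collection retrieval request might look like:

```python
import requests

# Hypothetical retrieval endpoint; only collection_ids appears in this guide.
API_URL = "https://api.example.com/v1/retrievals"

payload = {
    "query": {"text": "red running shoes"},
    # Search the text and image collections in one request.
    "collection_ids": ["col_text", "col_image"],
    "limit": 10,
}

response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()
for hit in response.json().get("results", []):
    print(hit.get("collection_id"), hit.get("score"))
```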
Model Selection Guide
Text Embeddings
| Model | Dimensions | Languages | Accuracy | Latency | Cost | Best For | 
|---|---|---|---|---|---|---|
| multilingual-e5-base | 768 | 100+ | Good | Fast | Low | High-volume, cost-sensitive | 
| multilingual-e5-large-instruct | 1024 | 100+ | Excellent | Medium | Medium | General-purpose semantic search | 
| bge-large-en-v1.5 | 1024 | English | Excellent | Medium | Medium | English-only, high accuracy | 
| openai/text-embedding-3-large | 3072 | 100+ | Best | Slow | High | Premium quality, multilingual | 
| cohere/embed-english-v3 | 1024 | English | Excellent | Medium | Medium | Domain adaptation | 
- Multilingual? → multilingual-e5-* or openai/text-embedding-3-*
- Budget-constrained? → multilingual-e5-base
- English-only, high accuracy? → bge-large-en-v1.5
- Best possible quality? → openai/text-embedding-3-large
Vision Embeddings
| Model | Use Case | Accuracy | Latency | Cost | 
|---|---|---|---|---|
| clip-vit-base-patch32 | General image similarity | Good | Fast | Low | 
| clip-vit-large-patch14 | High-quality visual search | Excellent | Medium | Medium | 
| dinov2-large | Fine-grained object distinction | Excellent | Slow | High | 
- Product images? → clip-vit-large-patch14 (handles varied backgrounds)
- Medical/satellite imagery? → dinov2-large (fine-grained features)
- High volume? → clip-vit-base-patch32 (cost-effective)
Audio Transcription
| Model | Accuracy | Latency | Cost | Languages | 
|---|---|---|---|---|
| whisper-base | Moderate | Fast | Low | 100+ | 
| whisper-large-v3 | Excellent | Slow | High | 100+ | 
| azure-speech-to-text | Excellent | Medium | Medium | 85 | 
- Clear audio, cost-sensitive? → whisper-base
- Accents, noisy audio? → whisper-large-v3
- Real-time transcription? → azure-speech-to-text (streaming)
Feature Combinations
Scenario 1: E-Commerce Product Search
Objective: Search by text ("red shoes") or by image (upload a photo).

Collections (a configuration sketch follows the list):

- Text collection – product descriptions
  - Extractor: text_extractor@v1
  - Model: multilingual-e5-large-instruct
  - Enable sparse: true (for exact SKU matches)
- Image collection – product photos
  - Extractor: image_extractor@v1
  - Model: clip-vit-large-patch14
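As a sketch only (the field names such as `feature_extractor` and `parameters` are assumptions based on the identifiers above, not a verified schema), the two collections might be configured like this:

```python
# Hypothetical collection definitions for the e-commerce scenario.
product_text_collection = {
    "collection_name": "products_text",
    "feature_extractor": "text_extractor@v1",
    "parameters": {
        "model": "multilingual-e5-large-instruct",
        "enable_sparse": True,  # exact SKU / keyword matches
    },
}

product_image_collection = {
    "collection_name": "products_images",
    "feature_extractor": "image_extractor@v1",
    "parameters": {
        "model": "clip-vit-large-patch14",
    },
}
```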
Scenario 2: Video Content Moderation
Objective: Detect inappropriate content in uploaded videos.

Collections:

- Visual scenes – keyframe analysis
  - Extractor: video_extractor@v1
  - Parameters: scene_detection_threshold: 0.3
- Audio content – transcription for hate speech detection
  - Extractor: audio_extractor@v1
  - Model: whisper-large-v3 (accuracy critical)
  - Enable diarization: true
- On-screen text – detect inappropriate text overlays
  - Extractor: text_extractor@v1
  - Source: OCR from video_extractor outputs

Apply a content-safety taxonomy to flag (a moderation sketch follows this list):
- Violence
- Adult content
- Hate speech
- Graphic imagery
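A minimal sketch of the flagging pass over transcript segments; the `classify` helper and its label scores are hypothetical stand-ins for a hosted content-safety model, not a documented API:

```python
# Illustrative moderation pass over audio_extractor transcript output.
TAXONOMY = ["violence", "adult_content", "hate_speech", "graphic_imagery"]

def classify(text: str) -> dict[str, float]:
    """Hypothetical classifier returning a score per taxonomy label."""
    # In practice this would call a content-safety model.
    return {label: 0.0 for label in TAXONOMY}

def flag_segments(segments: list[dict], threshold: float = 0.8) -> list[dict]:
    """Return segments whose score exceeds the threshold for any label."""
    flagged = []
    for segment in segments:
        scores = classify(segment["text"])
        hits = {k: v for k, v in scores.items() if v >= threshold}
        if hits:
            flagged.append({"start": segment["start"], "labels": hits})
    return flagged
```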
Scenario 3: Legal Contract Analysis
Objective: Extract clauses, entities, and find similar contracts.

Collections:

- Full-text embeddings – semantic clause search
  - Extractor: pdf_extractor@v1 + text_extractor@v1
  - Chunk strategy: paragraph (preserves clause structure)
  - Model: bge-large-en-v1.5 (English legal language)
- Entity extraction – parties, dates, amounts
  - Extractor: text_extractor@v1
  - Enable NER: true
  - Entity types: ["PERSON", "ORG", "DATE", "MONEY", "GPE"]
- Table extraction – financial schedules, payment terms
  - Extractor: table_extractor@v1
  - Detection model: table-transformer

Typical retrieval pipeline (sketched after this list):

- Stage 1: Filter by entity (e.g., "Acme Corp")
- Stage 2: Semantic search for clause type
- Stage 3: LLM generation for summarization
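A minimal sketch of that three-stage pipeline; the three helpers below are illustrative stubs, not documented functions:

```python
# Hypothetical three-stage retrieval over the contract collections.

def filter_by_entity(entity_type: str, value: str) -> list[str]:
    """Stage 1: metadata filter on NER output (e.g., ORG == 'Acme Corp')."""
    return ["doc_001", "doc_002"]  # stub

def semantic_search(query: str, document_ids: list[str], top_k: int) -> list[dict]:
    """Stage 2: dense search restricted to the filtered documents."""
    return [{"text": "Termination clause ...", "score": 0.87}][:top_k]  # stub

def summarize(passages: list[str]) -> str:
    """Stage 3: LLM summarization over the retrieved clauses."""
    return " ".join(passages)  # stub

def find_clauses(entity: str, clause_query: str, top_k: int = 5) -> str:
    candidate_ids = filter_by_entity("ORG", entity)
    hits = semantic_search(clause_query, candidate_ids, top_k)
    return summarize([h["text"] for h in hits])

print(find_clauses("Acme Corp", "termination clauses"))
```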
Feature-Specific Parameters
Text Chunking
| Content Type | Chunk Strategy | Chunk Size | Overlap | 
|---|---|---|---|
| Tweets/short posts | fixed | 256 tokens | 0 | 
| Blog articles | paragraph | 512 tokens | 50 | 
| Documentation | sentence | 256 tokens | 25 | 
| Legal contracts | paragraph | 1024 tokens | 100 | 
| Transcripts | time_window(60s) | Variable | 5s | 
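As a sketch, the "Legal contracts" row expressed as extractor parameters might look like this (the parameter names mirror the table headers but are assumptions, not a verified schema):

```python
# Hypothetical chunking parameters for legal contracts.
chunking_config = {
    "chunk_strategy": "paragraph",  # preserves clause structure
    "chunk_size": 1024,             # tokens
    "chunk_overlap": 100,           # tokens
}
```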
Video Processing
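Only scene_detection_threshold appears elsewhere in this guide; the other parameter names below are illustrative assumptions, sketched for concreteness:

```python
# Hypothetical video_extractor@v1 parameters.
video_params = {
    "scene_detection_threshold": 0.3,  # lower = more, shorter scenes
    "keyframe_interval_s": 2,          # assumed parameter name
    "max_scenes": 500,                 # assumed parameter name
}
```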
OCR Configuration
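Likewise a hedged sketch; none of these OCR parameter names are a verified schema:

```python
# Hypothetical OCR settings for on-screen text extraction.
ocr_params = {
    "languages": ["en"],
    "min_confidence": 0.6,     # drop low-confidence text regions
    "detect_orientation": True,
}
```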
Avoiding Over-Extraction
Anti-Pattern: Extract Everything

Attaching every available extractor to every collection causes:
- High processing cost (6x extractors)
- Slow ingestion
- Storage bloat
- Most features unused
Best Practice: Extract Selectively

Add an extractor only when:
- You have a concrete query use case
- You’ve tested alternatives
- The cost/benefit justifies it
Testing & Validation
Offline Evaluation
Test extractor outputs before production; a clustering sketch for the first check follows this list:

- Embedding quality (clustering, visualization)
- Metadata accuracy (NER entities, OCR text)
- Processing time
- Credit consumption
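For the embedding-quality check, a minimal sketch using scikit-learn, assuming you have exported a sample of embeddings (random vectors stand in for them here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sample of exported embeddings (rows = documents); random as a stand-in.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 768))

labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
# Higher silhouette = tighter, better-separated clusters.
print("silhouette:", silhouette_score(embeddings, labels))
```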
A/B Testing Models
Create parallel collections with different models and compare (a metric sketch follows this list):

- Precision@10, Recall@10
- User click-through rate
- Latency
- Cost per query
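Precision@k and Recall@k are straightforward to compute offline. A minimal sketch, assuming you have relevance-judged results per query:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

# Example: judge one query against collection A's top-10 results.
relevant = {"doc_3", "doc_7", "doc_9"}
results_a = ["doc_3", "doc_1", "doc_7", "doc_2", "doc_9"] + ["x"] * 5
print(precision_at_k(results_a, relevant), recall_at_k(results_a, relevant))
```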
Query Analysis
Identify which features contribute to results; one approach is sketched below.
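A minimal sketch, assuming each hit in a retrieval response carries per-feature scores (the `feature_scores` field is an assumption, not a documented response shape):

```python
from collections import Counter

# Hypothetical retrieval response: each hit carries per-feature scores.
results = [
    {"id": "doc_1", "feature_scores": {"dense": 0.81, "sparse": 0.12}},
    {"id": "doc_2", "feature_scores": {"dense": 0.40, "sparse": 0.77}},
]

# Count which feature dominated each top result.
wins = Counter(max(hit["feature_scores"], key=hit["feature_scores"].get) for hit in results)
print(wins)  # e.g. Counter({'dense': 1, 'sparse': 1})
```

Migration & Reprocessing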
Adding New Extractors
1. Create a new collection with the additional extractor
2. Process recent objects into the new collection
3. Update the retriever to query both collections
4. Archive the old collection after validation
Changing Models
1. Create a new collection with the updated model
2. Reprocess objects in batches
3. Compare retrieval quality (A/B test)
4. Switch the retriever to the new collection
5. Delete the old collection after a transition period
Incremental Updates

For large archives, reprocess in prioritized batches (for example, most recently accessed objects first) rather than all at once; a sketch follows.
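A minimal sketch of prioritized batch reprocessing, assuming hypothetical object metadata and a reprocessing call:

```python
from datetime import datetime, timezone

# Hypothetical archive objects with last-access timestamps.
objects = [
    {"id": "obj_1", "last_accessed": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "obj_2", "last_accessed": datetime(2024, 6, 12, tzinfo=timezone.utc)},
]

BATCH_SIZE = 100

# Most recently accessed objects are reprocessed first.
queue = sorted(objects, key=lambda o: o["last_accessed"], reverse=True)
for i in range(0, len(queue), BATCH_SIZE):
    batch = queue[i : i + BATCH_SIZE]
    # reprocess(batch) would call the (hypothetical) reprocessing API.
    print(f"batch {i // BATCH_SIZE}: {[o['id'] for o in batch]}")
```

Best Practices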
Start simple, add complexity iteratively
Begin with a single text or image extractor. Add multimodal extractors only when you have a clear use case and test data.

Match model size to use case
Use base models for high-volume, cost-sensitive workloads. Reserve large models for accuracy-critical applications.

Enable sparse features for exact matching
Hybrid search (dense + sparse) handles both semantic and keyword queries. Set enable_sparse: true for text extractors.

Test with real user queries
Offline metrics (NDCG, MRR) don’t always correlate with user satisfaction. A/B test with live traffic.

Monitor feature utilization
Track which features contribute to top results. Disable unused extractors to reduce cost.

Version collections explicitly
When changing extractors, create new collections rather than mutating existing ones. This preserves reproducibility.
Next Steps
- Explore Feature Extractors catalog
- Review Model Registry for available models
- Learn Cost Optimization for feature selection impact
- Check Schema Design for input mapping patterns

