Feature extractors transform raw objects into queryable embeddings and structured outputs. Selecting the right combination of extractors, models, and parameters is critical for accuracy, performance, and cost.

Feature Extractor Categories

  • Text Extractors: dense/sparse embeddings, NER, summarization, sentiment analysis
  • Vision Extractors: image embeddings (CLIP), object detection, OCR, scene analysis
  • Audio Extractors: transcription (Whisper), speaker diarization, audio embeddings
  • Multimodal Extractors: video scene detection, PDF layout analysis, document understanding

Decision Framework

1. Start with Your Query Intent

What will users search for?
| Query Type | Required Extractors | Example |
| --- | --- | --- |
| Text search | text_extractor (dense + sparse) | “Find articles about machine learning” |
| Image similarity | image_extractor (CLIP) | “Find similar product photos” |
| Video moment search | video_extractor + audio_extractor | “Find scenes where someone mentions pricing” |
| Document QA | pdf_extractor + text_extractor | “Which contracts have termination clauses?” |
| Face search | face_extractor (ArcFace) | “Find all videos featuring this person” |

2. Consider Modality Combinations

Single-Modality Search:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "input_mappings": { "text": "content" }
  }
}
Hybrid Text Search (Dense + Sparse):
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "parameters": {
      "enable_sparse": true,  // Adds BM25 features
      "model": "multilingual-e5-large-instruct"
    }
  }
}
Multi-Modal Search (Text + Image):
// Collection 1: Text embeddings
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "input_mappings": { "text": "description" }
  }
}

// Collection 2: Image embeddings
{
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "input_mappings": { "image_url": "product_image" }
  }
}
Retriever queries both collections via collection_ids: ["col_text", "col_image"].
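A minimal sketch of such a retriever definition, assuming the creation endpoint accepts collection_ids alongside stages (the top-level request shape here is hypothetical; the hybrid_search stage mirrors the Scenario 1 example below):
POST /v1/retrievers
{
  "retriever_name": "multimodal_search",
  "collection_ids": ["col_text", "col_image"],
  "stages": [
    {
      "stage_name": "hybrid_search",
      "parameters": {
        "queries": [
          { "feature_address": "mixpeek://text_extractor@v1/text_embedding", "weight": 0.5 },
          { "feature_address": "mixpeek://image_extractor@v1/clip_embedding", "weight": 0.5 }
        ]
      }
    }
  ]
}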

Model Selection Guide

Text Embeddings

| Model | Dimensions | Languages | Accuracy | Latency | Cost | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| multilingual-e5-base | 768 | 100+ | Good | Fast | Low | High-volume, cost-sensitive |
| multilingual-e5-large-instruct | 1024 | 100+ | Excellent | Medium | Medium | General-purpose semantic search |
| bge-large-en-v1.5 | 1024 | English | Excellent | Medium | Medium | English-only, high accuracy |
| openai/text-embedding-3-large | 3072 | 100+ | Best | Slow | High | Premium quality, multilingual |
| cohere/embed-english-v3 | 1024 | English | Excellent | Medium | Medium | Domain adaptation |
Selection criteria:
  • Multilingual? → multilingual-e5-* or openai/text-embedding-3-*
  • Budget-constrained? → multilingual-e5-base
  • English-only, high accuracy? → bge-large-en-v1.5
  • Best possible quality? → openai/text-embedding-3-large

Vision Embeddings

| Model | Use Case | Accuracy | Latency | Cost |
| --- | --- | --- | --- | --- |
| clip-vit-base-patch32 | General image similarity | Good | Fast | Low |
| clip-vit-large-patch14 | High-quality visual search | Excellent | Medium | Medium |
| dinov2-large | Fine-grained object distinction | Excellent | Slow | High |
Selection criteria:
  • Product images? → clip-vit-large-patch14 (handles varied backgrounds)
  • Medical/satellite imagery? → dinov2-large (fine-grained features)
  • High volume? → clip-vit-base-patch32 (cost-effective)
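As a sketch, the vision model can be pinned the same way text models are, via the parameters block (assuming image_extractor accepts a model parameter analogous to text_extractor):
{
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "parameters": { "model": "clip-vit-large-patch14" },
    "input_mappings": { "image_url": "product_image" }
  }
}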

Audio Transcription

| Model | Accuracy | Latency | Cost | Languages |
| --- | --- | --- | --- | --- |
| whisper-base | Moderate | Fast | Low | 100+ |
| whisper-large-v3 | Excellent | Slow | High | 100+ |
| azure-speech-to-text | Excellent | Medium | Medium | 85 |
Selection criteria:
  • Clear audio, cost-sensitive? → whisper-base
  • Accents, noisy audio? → whisper-large-v3
  • Real-time transcription? → azure-speech-to-text (streaming)
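A sketch of an audio collection tuned for noisy input (the model and enable_diarization parameter names are assumptions, following the conventions above and the diarization option in Scenario 2 below):
{
  "feature_extractor": {
    "feature_extractor_name": "audio_extractor",
    "parameters": {
      "model": "whisper-large-v3",
      "enable_diarization": true
    }
  }
}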

Feature Combinations

Scenario 1: E-Commerce Product Search

Objective: Search by text (“red shoes”) or image (upload photo)
Collections:
  1. Text collection – product descriptions
    • Extractor: text_extractor@v1
    • Model: multilingual-e5-large-instruct
    • Enable sparse: true (for exact SKU matches)
  2. Image collection – product photos
    • Extractor: image_extractor@v1
    • Model: clip-vit-large-patch14
Retriever:
{
  "stages": [
    {
      "stage_name": "hybrid_search",
      "parameters": {
        "queries": [
          { "feature_address": "mixpeek://text_extractor@v1/text_embedding", "weight": 0.5 },
          { "feature_address": "mixpeek://image_extractor@v1/clip_embedding", "weight": 0.5 }
        ]
      }
    }
  ]
}

Scenario 2: Video Content Moderation

Objective: Detect inappropriate content in uploaded videos
Collections:
  1. Visual scenes – keyframe analysis
    • Extractor: video_extractor@v1
    • Parameters: scene_detection_threshold: 0.3
  2. Audio content – transcription for hate speech detection
    • Extractor: audio_extractor@v1
    • Model: whisper-large-v3 (accuracy critical)
    • Enable diarization: true
  3. On-screen text – detect inappropriate text overlays
    • Extractor: text_extractor@v1
    • Source: OCR from video_extractor outputs
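A sketch of the visual-scenes collection from the list above, using the video parameters documented under Feature-Specific Parameters below:
{
  "feature_extractor": {
    "feature_extractor_name": "video_extractor",
    "parameters": {
      "scene_detection_threshold": 0.3,
      "extract_audio": true
    }
  }
}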
Taxonomy: Apply content-safety-taxonomy to flag:
  • Violence
  • Adult content
  • Hate speech
  • Graphic imagery

Scenario 3: Legal Contract Analysis

Objective: Extract clauses, entities, and find similar contracts
Collections:
  1. Full-text embeddings – semantic clause search
    • Extractor: pdf_extractor@v1 + text_extractor@v1
    • Chunk strategy: paragraph (preserves clause structure)
    • Model: bge-large-en-v1.5 (English legal language)
  2. Entity extraction – parties, dates, amounts
    • Extractor: text_extractor@v1
    • Enable NER: true
    • Entity types: ["PERSON", "ORG", "DATE", "MONEY", "GPE"]
  3. Table extraction – financial schedules, payment terms
    • Extractor: table_extractor@v1
    • Detection model: table-transformer
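A sketch of the entity-extraction collection (the enable_ner and entity_types parameter names are assumptions inferred from the bullets above):
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY", "GPE"]
    }
  }
}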
Retriever:
  • Stage 1: Filter by entity (e.g., “Acme Corp”)
  • Stage 2: Semantic search for clause type
  • Stage 3: LLM generation for summarization
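A sketch of those three stages chained in one pipeline (the stage names here are hypothetical; only hybrid_search appears elsewhere in this guide):
{
  "stages": [
    {
      "stage_name": "entity_filter",
      "parameters": { "filters": { "entities.ORG": "Acme Corp" } }
    },
    {
      "stage_name": "semantic_search",
      "parameters": {
        "queries": [
          { "feature_address": "mixpeek://text_extractor@v1/text_embedding", "weight": 1.0 }
        ]
      }
    },
    {
      "stage_name": "llm_generation",
      "parameters": { "task": "summarize_clauses" }
    }
  ]
}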

Feature-Specific Parameters

Text Chunking

| Content Type | Chunk Strategy | Chunk Size | Overlap |
| --- | --- | --- | --- |
| Tweets/short posts | fixed | 256 tokens | 0 |
| Blog articles | paragraph | 512 tokens | 50 |
| Documentation | sentence | 256 tokens | 25 |
| Legal contracts | paragraph | 1024 tokens | 100 |
| Transcripts | time_window (60s) | Variable | 5s |
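For example, the legal-contracts row would map to parameters like these (chunk_strategy, chunk_size, and chunk_overlap are assumed parameter names):
{
  "parameters": {
    "chunk_strategy": "paragraph",
    "chunk_size": 1024,
    "chunk_overlap": 100
  }
}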

Video Processing

{
  "parameters": {
    "scene_detection_threshold": 0.3,  // Lower = more scenes
    "keyframe_interval": 30,  // Extract 1 frame per 30 frames
    "max_scenes": 100,  // Cap for very long videos
    "extract_audio": true,  // Separate audio for transcription
    "resolution": "720p"  // Lower = faster, cheaper
  }
}

OCR Configuration

{
  "parameters": {
    "ocr_model": "tesseract-v5",  // Fast, moderate accuracy
    // "ocr_model": "cloud-vision",  // Slower, high accuracy
    "languages": ["en", "es"],
    "deskew": true,  // Straighten rotated text
    "denoise": true  // Improve scanned document quality
  }
}

Avoiding Over-Extraction

Anti-Pattern: Extract Everything

{
  "feature_extractors": [
    "text_extractor",
    "image_extractor",
    "video_extractor",
    "audio_extractor",
    "face_extractor",
    "table_extractor"
  ]
}
Problems:
  • High processing cost (6x extractors)
  • Slow ingestion
  • Storage bloat
  • Most features unused

Best Practice: Extract Selectively

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",  // Only what you'll query
    "input_mappings": { "text": "content" }
  }
}
Add more extractors only when:
  1. You have a concrete query use case
  2. You’ve tested alternatives
  3. The cost/benefit justifies it

Testing & Validation

Offline Evaluation

Test extractor outputs before production:
POST /v1/collections/{collection_id}/debug-extraction
{
  "object_id": "obj_sample_001",
  "return_embeddings": true,
  "return_metadata": true
}
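An illustrative (entirely hypothetical) response shape covering the review points below:
{
  "embeddings": {
    "text_embedding": { "dimensions": 1024 }
  },
  "metadata": {
    "entities": [{ "type": "ORG", "text": "Acme Corp" }],
    "ocr_text": "Invoice #1042"
  },
  "processing_time_ms": 840,
  "credits_used": 2
}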
Review:
  • Embedding quality (clustering, visualization)
  • Metadata accuracy (NER entities, OCR text)
  • Processing time
  • Credit consumption

A/B Testing Models

Create parallel collections with different models:
# Collection A: base model
POST /v1/collections
{
  "collection_name": "products-base",
  "feature_extractor": {
    "parameters": { "model": "multilingual-e5-base" }
  }
}

# Collection B: large model
POST /v1/collections
{
  "collection_name": "products-large",
  "feature_extractor": {
    "parameters": { "model": "multilingual-e5-large-instruct" }
  }
}
Compare retriever performance:
  • Precision@10, Recall@10
  • User click-through rate
  • Latency
  • Cost per query

Query Analysis

Identify which features contribute to results:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": { "query": "sample query" },
  "explain": true  // Returns feature contribution scores
}
Response:
{
  "results": [
    {
      "document_id": "doc_123",
      "score": 0.89,
      "feature_scores": {
        "text_embedding": 0.75,
        "bm25_sparse": 0.14
      }
    }
  ]
}
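Features whose contribution scores stay near zero across many queries are candidates for removal; see Avoiding Over-Extraction above.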

Migration & Reprocessing

Adding New Extractors

  1. Create new collection with additional extractor
  2. Process recent objects into new collection
  3. Update retriever to query both collections
  4. Archive old collection after validation

Changing Models

  1. Create new collection with updated model
  2. Reprocess objects in batches
  3. Compare retrieval quality (A/B test)
  4. Switch retriever to new collection
  5. Delete old collection after transition period

Incremental Updates

For large archives, prioritize reprocessing:
# Reprocess most-queried documents first
# (assumes `mixpeek` is an initialized SDK client and the batch targets
# the new collection)
top_docs = mixpeek.analytics.get_top_documents(collection_id, limit=1000)

for doc in top_docs:
    # Reprocess each source object into the new collection
    mixpeek.batches.create(object_ids=[doc["source_object_id"]])

Best Practices

  • Begin with a single text or image extractor; add multimodal extractors only when you have a clear use case and test data.
  • Use base models for high-volume, cost-sensitive workloads; reserve large models for accuracy-critical applications.
  • Hybrid search (dense + sparse) handles both semantic and keyword queries; enable enable_sparse: true for text extractors.
  • Offline metrics (NDCG, MRR) don’t always correlate with user satisfaction; A/B test with live traffic.
  • Track which features contribute to top results, and disable unused extractors to reduce cost.
  • When changing extractors, create new collections rather than mutating existing ones; this preserves reproducibility.

Next Steps