*Figure: Text extractor pipeline showing chunking, E5-Large embedding, and optional LLM extraction.*
The text extractor generates dense vector embeddings from text using the multilingual E5-Large model. It is optimized for semantic search, RAG applications, and general-purpose text retrieval, and supports text chunking/decomposition with multiple splitting strategies. It is fast (~5ms per document), cost-effective (free), and supports 100+ languages.
View extractor details at `api.mixpeek.com/v1/collections/features/extractors/text_extractor_v1`, or fetch them programmatically with `GET /v1/collections/features/extractors/{feature_extractor_id}`.

## When to Use

| Use Case | Description |
| --- | --- |
| Product search | Search products by natural language descriptions |
| FAQ matching | Match user questions to knowledge base articles |
| Document retrieval | Find relevant documents from large corpora |
| Content discovery | Recommend similar content based on semantic similarity |
| RAG chunking | Split documents into chunks for retrieval-augmented generation |
| Multi-language search | Search across 100+ languages with a single model |

## When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Exact phrase matching | colbert_extractor |
| Keyword-heavy queries | splade_extractor |
| High-precision legal/medical search | colbert_extractor |
| Need for explainability (which keywords matched) | splade_extractor |
| Documents with critical technical terms | colbert_extractor |
| Very short texts (1-5 words) | splade_extractor |

## Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| text | string | Yes | Text content to process. Recommended: 10-400 words for optimal quality. Maximum: 512 tokens (~400 words); longer text is truncated. |

```json
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation, 30-hour battery life, and premium sound quality."
}
```
**Input Examples:**

| Type | Example |
| --- | --- |
| Product description | "Premium wireless Bluetooth headphones with active noise cancellation" |
| FAQ question | "How do I reset my password if I forgot it?" |
| Article paragraph | "Machine learning models have revolutionized natural language processing…" |
| User query | "best restaurants near Times Square" |

## Output Schema

| Field | Type | Description |
| --- | --- | --- |
| text | string | The processed text content (full text or chunk) |
| text_extractor_v1_embedding | float[1024] | Dense vector embedding, L2 normalized |

```json
{
  "text": "Premium wireless Bluetooth headphones with active noise cancellation",
  "text_extractor_v1_embedding": [0.023, -0.041, 0.018, ...]
}
```
When chunking is enabled, each chunk becomes a separate document with tracking metadata stored in `metadata` (not in the document payload):

- `chunk_index` – Position of this chunk in the original document
- `chunk_text` – The text content of this chunk
- `total_chunks` – Total number of chunks from the source
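
As an illustrative sketch, a chunked document might look like the following. The metadata field names come from the list above; the exact envelope structure of a chunked document is an assumption for readability, not confirmed by this page:

```json
{
  "text": "Machine learning models have revolutionized natural language processing.",
  "text_extractor_v1_embedding": [0.023, -0.041, 0.018, ...],
  "metadata": {
    "chunk_index": 0,
    "chunk_text": "Machine learning models have revolutionized natural language processing.",
    "total_chunks": 4
  }
}
```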

## Parameters

### Chunking Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| split_by | string | "none" | Strategy for splitting text into chunks |
| chunk_size | integer | 1000 | Target size for each chunk (units depend on split_by) |
| chunk_overlap | integer | 0 | Number of units to overlap between consecutive chunks |

### Split Strategies

| Strategy | Description | Best For |
| --- | --- | --- |
| characters | Split by character count | Uniform sizes, quick testing |
| words | Split by word boundaries | General text, preserves words |
| sentences | Split by sentence boundaries | Q&A, precise retrieval, preserves semantic units |
| paragraphs | Split by paragraph (double newlines) | Articles, documentation, natural structure |
| pages | Split by page breaks | PDFs, paginated documents |
| none | No splitting (default) | Short texts < 400 words |

Recommended chunk sizes:

- `characters`: 500-2000
- `words`: 100-400
- `sentences`: 3-10
- `paragraphs`: 1-3
- `pages`: 1

Chunk overlap: 10-20% of `chunk_size` helps preserve context across boundaries. Example: `chunk_size: 1000` with a `chunk_overlap` of 100-200.
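
As a minimal sketch of these parameters together (the values below are illustrative choices, not defaults), a sentence-based chunking configuration could look like:

```json
{
  "parameters": {
    "split_by": "sentences",
    "chunk_size": 5,
    "chunk_overlap": 1
  }
}
```

Here `chunk_size` and `chunk_overlap` are measured in sentences, matching the 3-10 sentence recommendation above with roughly 20% overlap.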

### LLM Structured Extraction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| response_shape | string \| object | null | Define custom structured output using LLM extraction |
| llm_provider | string | null | LLM provider: openai, google, anthropic |
| llm_model | string | null | Specific model for extraction |

### response_shape Modes

**Natural Language Mode (string):**

```json
{
  "response_shape": "Extract key entities, sentiment (positive/negative/neutral), and main topics from the text"
}
```

The service automatically infers a JSON schema from your description.

**JSON Schema Mode (object):**

```json
{
  "response_shape": {
    "type": "object",
    "properties": {
      "sentiment": {
        "type": "string",
        "enum": ["positive", "negative", "neutral"]
      },
      "entities": {
        "type": "array",
        "items": { "type": "string" }
      },
      "topics": {
        "type": "array",
        "items": { "type": "string" },
        "maxItems": 5
      }
    },
    "required": ["sentiment"]
  }
}
```
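
For orientation, output conforming to the schema above would look like the following. The values are invented for illustration, and this page does not specify where the extracted fields are stored in the resulting document:

```json
{
  "sentiment": "positive",
  "entities": ["Bluetooth headphones", "active noise cancellation"],
  "topics": ["audio", "consumer electronics"]
}
```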

### LLM Provider & Model Options

| Provider | Example Models |
| --- | --- |
| openai | gpt-4o-mini-2024-07-18 (cost-effective), gpt-4o-2024-08-06 (best quality) |
| google | gemini-2.5-flash (fastest), gemini-1.5-flash-001 |
| anthropic | claude-3-5-haiku-20241022 (fast), claude-3-5-sonnet-20241022 (best reasoning) |

## Configuration Examples

```json
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "payload.description"
    },
    "field_passthrough": [
      { "source_path": "metadata.product_id" }
    ],
    "parameters": {}
  }
}
```
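
Building on the example above, here is a hedged sketch that combines chunking with LLM structured extraction. The parameter values are illustrative, and `payload.body` is a hypothetical input field, not one defined by this page:

```json
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "input_mappings": {
      "text": "payload.body"
    },
    "parameters": {
      "split_by": "paragraphs",
      "chunk_size": 2,
      "chunk_overlap": 0,
      "response_shape": "Extract key entities and main topics from the text",
      "llm_provider": "openai",
      "llm_model": "gpt-4o-mini-2024-07-18"
    }
  }
}
```

An empty `parameters` object, as in the first example, falls back to the defaults (`split_by: "none"`, no LLM extraction).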

## Performance & Costs

| Metric | Value |
| --- | --- |
| Embedding latency | ~5ms per document (batched: ~2ms/doc) |
| Query latency | 5-10ms for top-100 results |
| Cost | Free (self-hosted E5-Large) |
| GPU required | No (but 5-10x faster with GPU) |
| Memory | ~4GB per 1M documents |
| Index build | ~1 hour per 10M documents |

LLM extraction adds cost and latency based on provider pricing. Only use it when structured extraction is needed.

Comparison with Other Text Extractors

Featuretext_extractorcolbert_extractorsplade_extractor
Accuracy (BEIR avg)88%92%90%
Speed (per doc)5ms15ms10ms
Storage per doc4KB500KB (125x more)20KB (5x more)
Query Latency< 10ms50-100ms20-30ms
Best ForGeneral searchPrecisionHybrid
Storage Cost (1M docs)$0.40$50$2
Multi-languageExcellentGoodGood
Exact MatchingPoorExcellentExcellent
Semantic MatchingExcellentExcellentGood

## Vector Index

| Property | Value |
| --- | --- |
| Index name | text_extractor_v1_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
| Normalization | L2 normalized |

## Limitations

- **Token limit:** 512 tokens (~400 words). Longer text is automatically truncated.
- **Exact phrases:** Cannot reliably match exact phrases or technical terms.
- **Domain jargon:** Struggles with very domain-specific jargon or acronyms.
- **Terminology variance:** May miss documents that use different terminology for the same concept.
- **Short texts:** Less effective for very short texts (1-5 words), where lexical matching is sufficient.
- **Keyword-heavy queries:** Less effective for queries like "iPhone 15 Pro Max 256GB".