View extractor details at api.mixpeek.com/v1/collections/features/extractors/text_extractor_v1 or fetch them programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
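A minimal sketch of that call in Python, assuming Bearer-token authentication (the auth scheme and header format are assumptions; consult the API reference):

```python
import requests

# Sketch of fetching extractor metadata; Bearer auth is an assumption.
API_KEY = "YOUR_API_KEY"
extractor_id = "text_extractor_v1"

resp = requests.get(
    f"https://api.mixpeek.com/v1/collections/features/extractors/{extractor_id}",
    headers={"Authorization": f"Bearer {API_KEY}"},
)
resp.raise_for_status()
print(resp.json())
```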
When to Use
| Use Case | Description |
|---|---|
| Product search | Search products by natural language descriptions |
| FAQ matching | Match user questions to knowledge base articles |
| Document retrieval | Find relevant documents from large corpora |
| Content discovery | Recommend similar content based on semantic similarity |
| RAG chunking | Split documents into chunks for retrieval-augmented generation |
| Multi-language search | Search across 100+ languages with a single model |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Exact phrase matching | colbert_extractor |
| Keyword-heavy queries | splade_extractor |
| High-precision legal/medical search | colbert_extractor |
| Need for explainability (which keywords matched) | splade_extractor |
| Documents with critical technical terms | colbert_extractor |
| Very short texts (1-5 words) | splade_extractor |
Input Schema
| Field | Type | Required | Description |
|---|---|---|---|
| text | string | Yes | Text content to process. Recommended: 10-400 words for optimal quality. Maximum: 512 tokens (~400 words); longer text is truncated. |
Example inputs:
| Type | Example |
|---|---|
| Product description | "Premium wireless Bluetooth headphones with active noise cancellation" |
| FAQ question | "How do I reset my password if I forgot it?" |
| Article paragraph | "Machine learning models have revolutionized natural language processing…" |
| User query | "best restaurants near Times Square" |
Output Schema
| Field | Type | Description |
|---|---|---|
| text | string | The processed text content (full text or chunk) |
| text_extractor_v1_embedding | float[1024] | Dense vector embedding, L2 normalized |
Metadata (not in the document payload):
- chunk_index – Position of this chunk in the original document
- chunk_text – The text content of this chunk
- total_chunks – Total number of chunks from the source
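For illustration, a hypothetical chunk as it might appear after extraction; all values below are invented, and the embedding is truncated to three components:

```python
# Hypothetical example of one emitted chunk; values are illustrative only.
chunk_document = {
    "text": "Machine learning models have revolutionized natural language processing.",
    "text_extractor_v1_embedding": [0.0123, -0.0456, 0.0789],  # float[1024] in practice, L2 normalized
}

# Chunk metadata travels alongside, not inside, the document payload:
chunk_metadata = {
    "chunk_index": 0,    # position of this chunk in the original document
    "chunk_text": "Machine learning models have revolutionized natural language processing.",
    "total_chunks": 12,  # total number of chunks from the source
}
```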
Parameters
Chunking Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| split_by | string | "none" | Strategy for splitting text into chunks |
| chunk_size | integer | 1000 | Target size for each chunk (units depend on split_by) |
| chunk_overlap | integer | 0 | Number of units to overlap between consecutive chunks |
Split Strategies
| Strategy | Description | Best For |
|---|---|---|
| characters | Split by character count | Uniform sizes, quick testing |
| words | Split by word boundaries | General text, preserves words |
| sentences | Split by sentence boundaries | Q&A, precise retrieval, preserves semantic units |
| paragraphs | Split by paragraph (double newlines) | Articles, documentation, natural structure |
| pages | Split by page breaks | PDFs, paginated documents |
| none | No splitting (default) | Short texts < 400 words |
Recommended chunk_size ranges by strategy:
- characters: 500-2000
- words: 100-400
- sentences: 3-10
- paragraphs: 1-3
- pages: 1
chunk_overlap helps preserve context across chunk boundaries. Example: with chunk_size: 1000, use chunk_overlap: 100-200.
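To make the overlap mechanics concrete, here is a hedged sketch of word-based splitting with overlap; this illustrates the general technique, not Mixpeek's internal implementation:

```python
def split_by_words(text: str, chunk_size: int = 100, chunk_overlap: int = 20) -> list[str]:
    """Split text into word chunks; consecutive chunks share `chunk_overlap` words."""
    assert 0 <= chunk_overlap < chunk_size, "overlap must be smaller than chunk_size"
    words = text.split()
    step = chunk_size - chunk_overlap  # advance by the non-overlapping portion
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; avoid emitting a redundant tail
    return chunks

# A 250-word document with chunk_size=100, chunk_overlap=20 yields chunks
# covering words 0-99, 80-179, and 160-249.
```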
LLM Structured Extraction Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| response_shape | string \| object | null | Define custom structured output using LLM extraction |
| llm_provider | string | null | LLM provider: openai, google, anthropic |
| llm_model | string | null | Specific model for extraction |
response_shape Modes
Natural Language Mode (string): pass response_shape as a plain-language description of the fields to extract (see Configuration Examples below).
LLM Provider & Model Options
| Provider | Example Models |
|---|---|
| openai | gpt-4o-mini-2024-07-18 (cost-effective), gpt-4o-2024-08-06 (best quality) |
| google | gemini-2.5-flash (fastest), gemini-1.5-flash-001 |
| anthropic | claude-3-5-haiku-20241022 (fast), claude-3-5-sonnet-20241022 (best reasoning) |
Configuration Examples
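This document does not show the exact request wrapper for these parameters, so the objects below are hedged sketches built only from the parameters defined above; how they attach to a collection or extractor request is an assumption, so check the Mixpeek API reference.

```python
# Hedged sketch: chunking configuration for long articles.
chunking_config = {
    "split_by": "sentences",
    "chunk_size": 5,     # 5 sentences per chunk (within the recommended 3-10)
    "chunk_overlap": 1,  # repeat 1 sentence across consecutive chunks
}

# Hedged sketch: LLM structured extraction with response_shape in
# natural language mode (a plain-language description of desired fields).
llm_extraction_config = {
    "response_shape": "Extract the product name, brand, and a one-sentence summary.",
    "llm_provider": "openai",
    "llm_model": "gpt-4o-mini-2024-07-18",
}
```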
Performance & Costs
| Metric | Value |
|---|---|
| Embedding latency | ~5ms per document (batched: ~2ms/doc) |
| Query latency | 5-10ms for top-100 results |
| Cost | Free (self-hosted E5-Large) |
| GPU required | No (but 5-10x faster with GPU) |
| Memory | ~4GB per 1M documents |
| Index build | ~1 hour per 10M documents |
Comparison with Other Text Extractors
| Feature | text_extractor | colbert_extractor | splade_extractor |
|---|---|---|---|
| Accuracy (BEIR avg) | 88% | 92% | 90% |
| Speed (per doc) | 5ms | 15ms | 10ms |
| Storage per doc | 4KB | 500KB (125x more) | 20KB (5x more) |
| Query Latency | < 10ms | 50-100ms | 20-30ms |
| Best For | General search | Precision | Hybrid |
| Storage Cost (1M docs) | $0.40 | $50 | $2 |
| Multi-language | Excellent | Good | Good |
| Exact Matching | Poor | Excellent | Excellent |
| Semantic Matching | Excellent | Excellent | Good |
Vector Index
| Property | Value |
|---|---|
| Index name | text_extractor_v1_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
| Normalization | L2 normalized |
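Because the index stores L2-normalized float32 vectors with a cosine metric, cosine similarity reduces to a plain dot product. A small numpy sketch with stand-in vectors:

```python
import numpy as np

# Two illustrative 1024-dim embeddings (random stand-ins for real model output).
rng = np.random.default_rng(0)
query = rng.standard_normal(1024).astype(np.float32)
doc = rng.standard_normal(1024).astype(np.float32)

# L2-normalize, matching the index's normalization.
query /= np.linalg.norm(query)
doc /= np.linalg.norm(doc)

# For unit vectors, cosine similarity equals the dot product.
cosine = float(query @ doc)
print(f"cosine similarity: {cosine:.4f}")
```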
Limitations
- Token limit: 512 tokens (~400 words). Longer text is automatically truncated (a pre-flight check is sketched after this list).
- Exact phrases: Cannot reliably match exact phrases or technical terms.
- Domain jargon: Struggles with very domain-specific jargon or acronyms.
- Terminology variance: May miss documents that use different terminology for the same concept.
- Short texts: Less effective for very short texts (1-5 words) where lexical matching is sufficient.
- Keyword-heavy queries: Less effective for queries like "iPhone 15 Pro Max 256GB".
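To avoid silent truncation at the 512-token limit, you can count tokens client-side. A hedged sketch, assuming the deployed multilingual_e5_large_instruct_v1 model shares a tokenizer with the public intfloat/multilingual-e5-large-instruct checkpoint (an assumption inferred from the inference model name above):

```python
from transformers import AutoTokenizer

# Assumption: the deployed model uses this public checkpoint's tokenizer;
# verify against your deployment before relying on the counts.
tokenizer = AutoTokenizer.from_pretrained("intfloat/multilingual-e5-large-instruct")

def will_be_truncated(text: str, limit: int = 512) -> bool:
    """Return True if `text` exceeds the token limit and would be cut off."""
    return len(tokenizer.encode(text)) > limit

document = "A long product description " * 200
if will_be_truncated(document):
    print("Text exceeds 512 tokens; enable split_by chunking instead.")
```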

