How It Works
When you ingest a document, Mixpeek runs a multi-stage pipeline:

- Content Extraction: Text extraction from native PDFs, with OCR fallback for scanned pages
- Hierarchical Chunking: Documents split into pages, sections, or paragraphs with parent-child relationships
- Semantic Extraction: Document type detection, section classification, and metadata inference
- Multi-Vector Embeddings: Separate embeddings for titles, summaries, and full text
- Indexing: Chunks stored with metadata for filtered vector search
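
To make the hierarchical chunking stage more concrete, here is a minimal sketch in plain Python. It is not Mixpeek's implementation or API: the `Chunk` class, the `chunk_document` function, the form-feed page separator, and the blank-line paragraph rule are all assumptions chosen for illustration. It shows how each chunk can carry a `parent_id` and page metadata so that later stages can filter or walk back up the hierarchy.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical chunk record; field names are illustrative only."""
    chunk_id: int
    level: str                 # "document", "page", or "paragraph"
    text: str
    parent_id: Optional[int]   # None only for the document root
    metadata: dict = field(default_factory=dict)


def chunk_document(text: str, page_sep: str = "\f") -> list[Chunk]:
    """Split text into page and paragraph chunks linked to their parents.

    Assumes pages are separated by a form feed and paragraphs by a blank line.
    """
    ids = count()
    root = Chunk(next(ids), "document", text, None)
    chunks = [root]
    for page_no, page_text in enumerate(text.split(page_sep), start=1):
        page = Chunk(next(ids), "page", page_text, root.chunk_id, {"page": page_no})
        chunks.append(page)
        # Paragraphs are separated by blank lines; skip empty fragments.
        for para in (p.strip() for p in page_text.split("\n\n")):
            if para:
                chunks.append(
                    Chunk(next(ids), "paragraph", para, page.chunk_id, {"page": page_no})
                )
    return chunks


doc = "Invoice 42\n\nTotal due: $100\fTerms\n\nNet 30 days"
chunks = chunk_document(doc)

# Metadata filter: keep only paragraph chunks that came from page 2.
page_two = [c for c in chunks if c.level == "paragraph" and c.metadata["page"] == 2]
```

Storing the parent link on every chunk means a paragraph-level search hit can be expanded to its page or whole document without re-parsing the source file.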
Feature Extractors
| Extractor | Use For |
|---|---|
| pdf_extractor@v1 | Native PDF text, metadata, page chunking |
| document_extractor@v1 | OCR for scanned docs, layout detection |
| table_extractor@v1 | Table detection and cell extraction |
| text_extractor@v1 | Text embeddings, NER, summarization |
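
As a rough guide to reading the table, the sketch below maps a few document properties to extractor names. Only the extractor names come from the table. The `pick_extractors` helper, its parameters, and the idea that extractors are combined in this way are assumptions for illustration, not part of any Mixpeek SDK.

```python
def pick_extractors(is_pdf: bool, is_scanned: bool, has_tables: bool) -> list[str]:
    """Hypothetical helper: choose extractor names from the table above."""
    extractors = []
    if is_scanned:
        # Scanned pages have no text layer, so they need OCR and layout detection.
        extractors.append("document_extractor@v1")
    elif is_pdf:
        # Native PDFs already carry a text layer and metadata.
        extractors.append("pdf_extractor@v1")
    if has_tables:
        extractors.append("table_extractor@v1")
    # Assumed here to run last, on whatever text the earlier steps produced.
    extractors.append("text_extractor@v1")
    return extractors


native_pdf = pick_extractors(is_pdf=True, is_scanned=False, has_tables=True)
scanned_doc = pick_extractors(is_pdf=True, is_scanned=True, has_tables=False)
```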

