Document Intelligence Pipeline

How It Works

When you ingest a document, Mixpeek runs a multi-stage pipeline:
  1. Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
  2. Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
  3. Semantic Extraction — Document type detection, section classification, and metadata inference
  4. Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
  5. Indexing — Chunks stored with metadata for filtered vector search
At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.
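The query-time join can be pictured as grouping hits by the document they came from. The sketch below is a local illustration only, not Mixpeek's implementation; the field names `source_document_id`, `collection`, and `score` are assumptions made for the example.

```python
from collections import defaultdict

def join_by_source(hits):
    """Group hits from several collections under their source document.

    Each group keeps its best score, and groups are returned best first.
    """
    grouped = defaultdict(lambda: {"score": 0.0, "hits": []})
    for hit in hits:
        entry = grouped[hit["source_document_id"]]  # hypothetical field name
        entry["hits"].append(hit)
        entry["score"] = max(entry["score"], hit["score"])
    return sorted(grouped.items(), key=lambda kv: kv[1]["score"], reverse=True)

# Hypothetical hits from a text collection and a table collection.
hits = [
    {"source_document_id": "doc_a", "collection": "contracts-text", "score": 0.82},
    {"source_document_id": "doc_b", "collection": "contracts-text", "score": 0.91},
    {"source_document_id": "doc_a", "collection": "contracts-tables", "score": 0.77},
]
ranked = join_by_source(hits)
```

Here `doc_a` ends up with one text hit and one table hit under a single entry, which is the shape a caller usually wants when presenting results per document.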

Feature Extractors

Extractor                Use For
pdf_extractor@v1         Native PDF text, metadata, page chunking
document_extractor@v1    OCR for scanned docs, layout detection
table_extractor@v1       Table detection and cell extraction
text_extractor@v1        Text embeddings, NER, summarization

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "contracts",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text" },
      "contract_date": { "type": "datetime" }
    }
  }
}
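As a minimal sketch, the same request can be prepared from Python with the standard library. The host in `BASE_URL`, the `API_KEY` value, and the bearer-token header are placeholders and assumptions, not confirmed details of the API; the request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "https://api.example.invalid"  # placeholder host; substitute your API host
API_KEY = "YOUR_API_KEY"                  # placeholder credential

def build_request(path, payload):
    """Return an unsent POST request carrying a JSON body."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",  # assumed auth scheme
        },
        method="POST",
    )

bucket_payload = {
    "bucket_name": "contracts",
    "schema": {
        "properties": {
            "document_url": {"type": "url", "required": True},
            "document_type": {"type": "text"},
            "contract_date": {"type": "datetime"},
        }
    },
}
request = build_request("/v1/buckets", bucket_payload)
# To send it: urllib.request.urlopen(request)
```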

2. Create Collections

For text extraction:
POST /v1/collections
{
  "collection_name": "contracts-text",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" }
    ]
  }
}
For tables:
POST /v1/collections
{
  "collection_name": "contracts-tables",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}
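The two collection bodies differ only in name, extractor, parameters, and passthrough fields. A small helper can build both from one template. This is a sketch that mirrors the JSON above; the helper name `collection_payload` is hypothetical.

```python
def collection_payload(name, bucket_id, extractor, parameters, passthrough=()):
    """Build a collection body for one extractor over a bucket source."""
    feature_extractor = {
        "feature_extractor_name": extractor,
        "version": "v1",
        "input_mappings": {"document_url": "document_url"},
        "parameters": dict(parameters),
    }
    if passthrough:
        # Copy selected bucket fields onto every extracted chunk.
        feature_extractor["field_passthrough"] = [
            {"source_path": path} for path in passthrough
        ]
    return {
        "collection_name": name,
        "source": {"type": "bucket", "bucket_id": bucket_id},
        "feature_extractor": feature_extractor,
    }

text_collection = collection_payload(
    "contracts-text", "bkt_contracts", "pdf_extractor",
    {"chunk_strategy": "page", "enable_ocr_fallback": True},
    passthrough=("document_type", "contract_date"),
)
table_collection = collection_payload(
    "contracts-tables", "bkt_contracts", "table_extractor",
    {"output_format": "json", "min_confidence": 0.7},
)
```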

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-001.pdf"
    }
  ]
}
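When ingesting many files, it helps to generate the object bodies in code so that the `contract_date` strings stay consistent. The sketch below formats a timezone-aware `datetime` as the UTC timestamp shown above; the helper name `object_payload` and the second file name are hypothetical.

```python
from datetime import datetime, timezone

def object_payload(url, document_type, contract_date, key_prefix="/2025/agreements"):
    """Build one ingest body; contract_date must be a timezone-aware datetime."""
    if contract_date.tzinfo is None:
        raise ValueError("contract_date must be timezone-aware")
    stamp = contract_date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "key_prefix": key_prefix,
        "metadata": {"document_type": document_type, "contract_date": stamp},
        "blobs": [{"property": "document_url", "type": "document", "url": url}],
    }

signed_on = datetime(2025, 1, 15, tzinfo=timezone.utc)
payloads = [
    object_payload(f"s3://my-bucket/contracts/vendor-{n:03d}.pdf",
                   "vendor_agreement", signed_on)
    for n in (1, 2)
]
```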

4. Process

Create a batch from the ingested object IDs, then submit it to start processing:
POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
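The two calls chain together: the batch returned by the first request supplies the `batch_id` for the second. The sketch below takes the HTTP call as an injected `post(path, body)` function so the flow can be exercised without a network. The `batch_id` response field and the `fake_post` stand-in are assumptions for illustration.

```python
def process_objects(post, bucket_id, object_ids):
    """Create a batch for the given objects, then submit it."""
    batch = post(f"/v1/buckets/{bucket_id}/batches", {"object_ids": list(object_ids)})
    batch_id = batch["batch_id"]  # assumed name of the ID field in the response
    return post(f"/v1/buckets/{bucket_id}/batches/{batch_id}/submit", None)

# Stand-in for a real HTTP client: records calls and returns canned responses.
calls = []

def fake_post(path, body):
    calls.append((path, body))
    if path.endswith("/batches"):
        return {"batch_id": "batch_123"}
    return {"status": "submitted"}

result = process_objects(fake_post, "bkt_contracts", ["obj_001", "obj_002"])
```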

5. Create a Retriever

POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "{{inputs.document_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    }
  ]
}
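The stage order matters: the filter narrows the candidate set by `document_type`, and the KNN stage then ranks what remains by vector similarity. The following is a plain-Python illustration of that filter-then-rank idea using cosine similarity. It is not how the service is implemented, and the chunk fields and two-dimensional vectors are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_knn(chunks, query_vector, document_type, limit):
    """Keep chunks of one document type, then return the nearest `limit`."""
    candidates = [c for c in chunks if c["document_type"] == document_type]
    ranked = sorted(
        candidates,
        key=lambda c: cosine(c["embedding"], query_vector),
        reverse=True,
    )
    return ranked[:limit]

chunks = [
    {"id": "c1", "document_type": "vendor_agreement", "embedding": [0.9, 0.1]},
    {"id": "c2", "document_type": "nda", "embedding": [1.0, 0.0]},
    {"id": "c3", "document_type": "vendor_agreement", "embedding": [0.0, 1.0]},
]
top = filtered_knn(chunks, [1.0, 0.0], "vendor_agreement", limit=1)
```

Chunk `c2` is the closest vector overall, but it never reaches the ranking step because the filter removes it first.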

6. Query

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement"
  },
  "limit": 10
}
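A client can check inputs against the retriever's `input_schema` before sending the request. The sketch below copies the schema from step 5 and rejects bodies that omit a required input or include an undeclared one. The helper name `execute_body` is hypothetical, and the server may apply its own validation regardless.

```python
INPUT_SCHEMA = {
    "query": {"type": "text", "required": True},
    "document_type": {"type": "text"},
}

def execute_body(inputs, limit=10):
    """Validate inputs against INPUT_SCHEMA and build the execute body."""
    missing = [
        name for name, spec in INPUT_SCHEMA.items()
        if spec.get("required") and name not in inputs
    ]
    if missing:
        raise ValueError(f"missing required inputs: {missing}")
    unknown = sorted(set(inputs) - set(INPUT_SCHEMA))
    if unknown:
        raise ValueError(f"inputs not declared in schema: {unknown}")
    return {"inputs": dict(inputs), "limit": limit}

body = execute_body({
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement",
})
```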

Named Entity Recognition

Enable NER to extract entities like dates, amounts, and names:
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
    }
  }
}
Filter by entity:
{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}
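To make the filter's intent concrete, here is a local sketch of what a `contains` match on `metadata.entities.ORG` could look like. It assumes each chunk stores entities as lists keyed by entity type; both that layout and the list-membership reading of `contains` are assumptions, not documented behavior.

```python
def matches_entity(chunk, entity_type, value):
    """True if `value` appears among the chunk's entities of the given type."""
    entities = chunk.get("metadata", {}).get("entities", {})
    return value in entities.get(entity_type, [])

# Hypothetical chunks as they might look after NER.
chunks = [
    {"id": "c1", "metadata": {"entities": {"ORG": ["Acme Corp"], "MONEY": ["$5,000"]}}},
    {"id": "c2", "metadata": {"entities": {"ORG": ["Globex"]}}},
    {"id": "c3", "metadata": {}},
]
acme_chunks = [c["id"] for c in chunks if matches_entity(c, "ORG", "Acme Corp")]
```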

Multi-Page Assembly

Retrieve all pages from a document using lineage:
GET /v1/documents/{document_id}/lineage
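Once the page-level chunks are retrieved, reassembly is a sort and a join. The sketch below assumes each chunk carries a `page_number` and a `text` field; the actual lineage response shape is not shown in this guide, so treat those names as placeholders.

```python
def assemble_pages(page_chunks):
    """Return the document text with pages in ascending page order."""
    ordered = sorted(page_chunks, key=lambda chunk: chunk["page_number"])
    return "\n\n".join(chunk["text"] for chunk in ordered)

# Hypothetical page chunks, deliberately out of order.
page_chunks = [
    {"page_number": 2, "text": "Either party may terminate with 30 days notice."},
    {"page_number": 1, "text": "This agreement is made between the parties."},
]
full_text = assemble_pages(page_chunks)
```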