Document Intelligence

How It Works

When you ingest a document, Mixpeek runs a multi-stage pipeline:

Content Extraction — Text extraction from native PDFs, OCR fallback for scanned pages
Hierarchical Chunking — Documents split into pages, sections, or paragraphs with parent-child relationships
Semantic Extraction — Document type detection, section classification, and metadata inference
Multi-Vector Embeddings — Separate embeddings for titles, summaries, and full text
Indexing — Chunks stored with metadata for filtered vector search

At query time, the retriever searches across embeddings and joins results from multiple collections (text, tables, entities) back to their source documents.

Feature Extractors

Extractor	Use For
`pdf_extractor@v1`	Native PDF text, metadata, page chunking
`document_extractor@v1`	OCR for scanned docs, layout detection
`table_extractor@v1`	Table detection and cell extraction
`text_extractor@v1`	Text embeddings, NER, summarization

1. Create a Bucket

POST /v1/buckets
{
  "bucket_name": "contracts",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text" },
      "contract_date": { "type": "datetime" }
    }
  }
}

2. Create Collections

For text extraction:

POST /v1/collections
{
  "collection_name": "contracts-text",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" }
    ]
  }
}

For tables:

POST /v1/collections
{
  "collection_name": "contracts-tables",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": { "document_url": "document_url" },
    "parameters": {
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}

3. Ingest Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-001.pdf"
    }
  ]
}

4. Process

POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_001", "obj_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit

5. Create a Retriever

POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "{{inputs.document_type}}"
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    }
  ]
}

6. Query

POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice",
    "document_type": "vendor_agreement"
  },
  "limit": 10
}

Named Entity Recognition

Enable NER to extract entities like dates, amounts, and names:

{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]
    }
  }
}

Filter by entity:

{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}

Multi-Page Assembly

Retrieve all pages from a document using lineage:

GET /v1/documents/{document_id}/lineage

Tutorials

​How It Works

​Feature Extractors

​1. Create a Bucket

​2. Create Collections

​3. Ingest Documents

​4. Process

​5. Create a Retriever

​6. Query

​Named Entity Recognition

​Multi-Page Assembly