Document intelligence transforms unstructured files (PDFs, images, scanned documents) into queryable, analyzable data. This tutorial shows you how to combine OCR, layout analysis, table extraction, and semantic embeddings to unlock insights from document archives.

Object Decomposition

Feature Extractors

Extractor | Capabilities | Use Cases
document_extractor@v1 | OCR (Tesseract/Cloud Vision), layout detection, page segmentation | Scanned documents, invoices, forms
pdf_extractor@v1 | Native PDF text extraction, metadata, page-level chunking | Digital PDFs, reports, papers
table_extractor@v1 | Table detection, cell extraction, structure preservation | Financial statements, data sheets
image_extractor@v1 | Visual embeddings (CLIP), object detection, caption generation | Diagrams, charts, photos in documents
text_extractor@v1 | Text embeddings, named entity recognition (NER), summarization | Extracted text enrichment

Implementation Steps

1. Create a Document Bucket

POST /v1/buckets
{
  "bucket_name": "contracts-archive",
  "description": "Legal contracts and agreements",
  "schema": {
    "properties": {
      "document_url": { "type": "url", "required": true },
      "document_type": { "type": "text", "required": true },
      "contract_date": { "type": "datetime" },
      "parties": { "type": "array" },
      "jurisdiction": { "type": "text" }
    }
  }
}
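The same request can be sent from any HTTP client. Below is a minimal Python sketch using the requests library; the base URL, the bearer-token Authorization header, and the MIXPEEK_API_KEY environment variable are illustrative assumptions, and the body mirrors the JSON above.

import os
import requests

# Assumed base URL and bearer-token auth; adjust to your deployment.
BASE_URL = "https://api.mixpeek.com/v1"
HEADERS = {
    "Authorization": f"Bearer {os.environ['MIXPEEK_API_KEY']}",
    "Content-Type": "application/json",
}

bucket = {
    "bucket_name": "contracts-archive",
    "description": "Legal contracts and agreements",
    "schema": {
        "properties": {
            "document_url": {"type": "url", "required": True},
            "document_type": {"type": "text", "required": True},
            "contract_date": {"type": "datetime"},
            "parties": {"type": "array"},
            "jurisdiction": {"type": "text"},
        }
    },
}

resp = requests.post(f"{BASE_URL}/buckets", json=bucket, headers=HEADERS)
resp.raise_for_status()
print(resp.json())  # keep the returned bucket_id for later steps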

2. Define Multi-Extractor Collections

Text & Layout Collection:
POST /v1/collections
{
  "collection_name": "contracts-text",
  "description": "Extracted text with layout metadata",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "version": "v1",
    "input_mappings": {
      "document_url": "document_url"
    },
    "parameters": {
      "extract_images": true,
      "preserve_layout": true,
      "chunk_strategy": "page",
      "enable_ocr_fallback": true
    },
    "field_passthrough": [
      { "source_path": "document_type" },
      { "source_path": "contract_date" },
      { "source_path": "jurisdiction" }
    ]
  }
}
Table Extraction Collection:
POST /v1/collections
{
  "collection_name": "contracts-tables",
  "description": "Structured table data",
  "source": { "type": "bucket", "bucket_id": "bkt_contracts" },
  "feature_extractor": {
    "feature_extractor_name": "table_extractor",
    "version": "v1",
    "input_mappings": {
      "document_url": "document_url"
    },
    "parameters": {
      "detection_model": "table-transformer",
      "output_format": "json",
      "min_confidence": 0.7
    }
  }
}

3. Register Documents

POST /v1/buckets/{bucket_id}/objects
{
  "key_prefix": "/2025/vendor-agreements",
  "metadata": {
    "document_type": "vendor_agreement",
    "contract_date": "2025-01-15T00:00:00Z",
    "parties": ["Acme Corp", "Supplier Inc"],
    "jurisdiction": "California"
  },
  "blobs": [
    {
      "property": "document_url",
      "type": "document",
      "url": "s3://my-bucket/contracts/vendor-2025-001.pdf"
    }
  ]
}

4. Process Documents

POST /v1/buckets/{bucket_id}/batches
{ "object_ids": ["obj_contract_001", "obj_contract_002"] }

POST /v1/buckets/{bucket_id}/batches/{batch_id}/submit
Processing steps:
  1. Download PDFs from S3
  2. Extract text with OCR fallback for scanned pages
  3. Detect tables and extract structured data
  4. Generate embeddings for each page/section
  5. Create documents with lineage to source objects
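To drive this workflow from a script, the sketch below creates a batch, submits it, and polls until processing finishes. The status endpoint (GET /v1/buckets/{bucket_id}/batches/{batch_id}) and the batch_id/status response fields are assumptions for illustration, not confirmed API surface.

import time
import requests

BASE_URL = "https://api.mixpeek.com/v1"          # assumed base URL
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed auth header
bucket_id = "bkt_contracts"

# Create a batch for the objects to process.
batch = requests.post(
    f"{BASE_URL}/buckets/{bucket_id}/batches",
    json={"object_ids": ["obj_contract_001", "obj_contract_002"]},
    headers=HEADERS,
).json()
batch_id = batch["batch_id"]  # assumed response field

# Submit the batch for processing.
requests.post(
    f"{BASE_URL}/buckets/{bucket_id}/batches/{batch_id}/submit",
    headers=HEADERS,
).raise_for_status()

# Poll until processing finishes (status endpoint and field names are assumptions).
while True:
    status = requests.get(
        f"{BASE_URL}/buckets/{bucket_id}/batches/{batch_id}",
        headers=HEADERS,
    ).json().get("status")
    if status in ("completed", "failed"):
        break
    time.sleep(10)
print("batch finished with status:", status)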

5. Build a Document Search Retriever

POST /v1/retrievers
{
  "retriever_name": "contract-search",
  "collection_ids": ["col_contracts_text", "col_contracts_tables"],
  "input_schema": {
    "properties": {
      "query": { "type": "text", "required": true },
      "document_type": { "type": "text" },
      "date_from": { "type": "datetime" },
      "date_to": { "type": "datetime" }
    }
  },
  "stages": [
    {
      "stage_name": "filter",
      "version": "v1",
      "parameters": {
        "filters": {
          "operator": "and",
          "conditions": [
            {
              "field": "metadata.document_type",
              "operator": "eq",
              "value": "{{inputs.document_type}}"
            },
            {
              "field": "metadata.contract_date",
              "operator": "between",
              "value": ["{{inputs.date_from}}", "{{inputs.date_to}}"]
            }
          ]
        }
      }
    },
    {
      "stage_name": "knn_search",
      "version": "v1",
      "parameters": {
        "feature_address": "mixpeek://pdf_extractor@v1/text_embedding",
        "input_mapping": { "text": "query" },
        "limit": 50
      }
    },
    {
      "stage_name": "llm_generation",
      "version": "v1",
      "parameters": {
        "model": "gpt-4o-mini",
        "prompt": "Summarize the following contract excerpt in 2-3 sentences: {{DOCUMENT.text}}",
        "output_field": "summary"
      }
    }
  ]
}

6. Query Documents

Find relevant clauses:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "termination clauses with 30-day notice period",
    "document_type": "vendor_agreement",
    "date_from": "2025-01-01T00:00:00Z",
    "date_to": "2025-12-31T23:59:59Z"
  },
  "limit": 10
}
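Here is a hedged Python sketch of the same query, including how you might read back the per-result summary produced by the retriever's llm_generation stage; the retriever ID and the results/summary fields in the response are assumptions.

import requests

BASE_URL = "https://api.mixpeek.com/v1"          # assumed
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed
retriever_id = "ret_contract_search"              # hypothetical ID

payload = {
    "inputs": {
        "query": "termination clauses with 30-day notice period",
        "document_type": "vendor_agreement",
        "date_from": "2025-01-01T00:00:00Z",
        "date_to": "2025-12-31T23:59:59Z",
    },
    "limit": 10,
}

resp = requests.post(
    f"{BASE_URL}/retrievers/{retriever_id}/execute",
    json=payload,
    headers=HEADERS,
)
resp.raise_for_status()

# "results" and the per-result "summary" are assumed response fields;
# "summary" comes from the llm_generation stage's output_field.
for hit in resp.json().get("results", []):
    print(hit.get("document_id"), "-", hit.get("summary"))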
Extract tables from specific documents:
POST /v1/collections/col_contracts_tables/documents/list
{
  "filters": {
    "field": "source_object_id",
    "operator": "eq",
    "value": "obj_contract_001"
  }
}
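The table_data structure on each returned document (headers plus rows, as shown later under Output Schema Examples) maps directly onto CSV. A minimal, self-contained sketch assuming that shape:

import csv
import io

def table_data_to_csv(table_data: dict) -> str:
    """Convert extracted table_data (headers + rows) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(table_data["headers"])
    writer.writerows(table_data["rows"])
    return buf.getvalue()

# Example using the shape shown in the Table Document output schema:
doc_metadata = {
    "table_data": {
        "headers": ["Item", "Quantity", "Unit Price", "Total"],
        "rows": [
            ["Widget A", "10", "$25.00", "$250.00"],
            ["Widget B", "5", "$50.00", "$250.00"],
        ],
    }
}
print(table_data_to_csv(doc_metadata["table_data"]))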

Model Evolution & A/B Testing

Experiment with OCR models, chunking strategies, and NER configurations without reprocessing your entire document archive.

Test OCR Models

# Production: Fast OCR
POST /v1/collections
{
  "collection_name": "contracts-text-v1",
  "feature_extractor": {
    "feature_extractor_name": "pdf_extractor",
    "parameters": { 
      "ocr_model": "tesseract-v5",
      "chunk_strategy": "page"
    }
  }
}

# Staging: Higher accuracy for scanned docs
POST /v1/collections
{
  "collection_name": "contracts-text-v2",
  "feature_extractor": {
    "parameters": { 
      "ocr_model": "cloud-vision",
      "chunk_strategy": "paragraph"
    }
  }
}

Test Chunking Strategies

# Baseline: Page-level chunks
POST /v1/collections
{
  "collection_name": "contracts-v1",
  "feature_extractor": {
    "parameters": { 
      "chunk_strategy": "page",
      "chunk_size": 2048
    }
  }
}

# Candidate: Clause-level chunks
POST /v1/collections
{
  "collection_name": "contracts-v2",
  "feature_extractor": {
    "parameters": { 
      "chunk_strategy": "paragraph",
      "chunk_size": 512,
      "chunk_overlap": 100
    }
  }
}

Compare Results

GET /v1/analytics/retrievers/compare?baseline=ret_v1&candidate=ret_v2
Impact:
  • Clause detection: v1 (72%) vs v2 (89%) → better precision
  • OCR accuracy: v1 (94%) vs v2 (98%) → fewer misreads
  • Cost per page: v1 (0.01 credits) vs v2 (0.04 credits) → 4x cost
  • Query success rate: v1 (68%) vs v2 (84%) → justified investment

Migrate Incrementally

# New documents use v2, old remain in v1
PATCH /v1/retrievers/{retriever_id}
{
  "collection_ids": ["col_contracts_v1", "col_contracts_v2"]
}

# Reprocess high-value documents only
POST /v1/buckets/{bucket_id}/batches
{
  "object_ids": ["obj_top_100_contracts"],
  "target_collection_id": "col_contracts_v2"
}

Advanced Patterns

Multi-Page Document Assembly

For documents chunked by page, use lineage to reassemble:
GET /v1/documents/{document_id}/lineage
Returns all pages derived from the same root object, enabling full document reconstruction.
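A hedged reassembly sketch: fetch the lineage for one page document, load each related page, sort by page_number, and join the text. The lineage response shape (here a document_ids list) is an assumption for illustration.

import requests

BASE_URL = "https://api.mixpeek.com/v1"          # assumed
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed
collection_id = "col_contracts_text"
document_id = "doc_page_005"

# Lineage response shape is assumed: a list of documents sharing the root object.
lineage = requests.get(
    f"{BASE_URL}/documents/{document_id}/lineage",
    headers=HEADERS,
).json()

pages = []
for related_id in lineage.get("document_ids", []):   # assumed field name
    doc = requests.get(
        f"{BASE_URL}/collections/{collection_id}/documents/{related_id}",
        headers=HEADERS,
    ).json()
    pages.append(doc["metadata"])

# Sort by page_number (present in the PDF Page Document schema) and join text.
pages.sort(key=lambda m: m.get("page_number", 0))
full_text = "\n\n".join(m.get("text", "") for m in pages)
print(full_text[:500])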

Named Entity Recognition (NER)

Extract entities like dates, amounts, party names:
POST /v1/collections
{
  "feature_extractor": {
    "feature_extractor_name": "text_extractor",
    "version": "v1",
    "parameters": {
      "enable_ner": true,
      "entity_types": ["PERSON", "ORG", "DATE", "MONEY", "GPE"]
    }
  }
}
Documents will include:
{
  "metadata": {
    "entities": {
      "PERSON": ["John Doe", "Jane Smith"],
      "ORG": ["Acme Corp", "Supplier Inc"],
      "MONEY": ["$50,000", "$2,500/month"]
    }
  }
}
Filter by entity:
{
  "filters": {
    "field": "metadata.entities.ORG",
    "operator": "contains",
    "value": "Acme Corp"
  }
}

Document Comparison

Use vector similarity to find similar clauses across contracts:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query": "{{CONTRACT_A_CLAUSE_TEXT}}"
  },
  "filters": {
    "field": "source_object_id",
    "operator": "ne",
    "value": "obj_contract_a"
  },
  "limit": 5
}
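In practice this is a two-step pattern: read the clause text from a page of contract A, then query the retriever while excluding contract A's own pages. A sketch assuming the page text lives at metadata.text as in the output schemas below, with a hypothetical retriever ID and an assumed results field:

import requests

BASE_URL = "https://api.mixpeek.com/v1"          # assumed
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed
collection_id = "col_contracts_text"
retriever_id = "ret_contract_search"              # hypothetical ID

# Step 1: pull the clause text from a page of contract A.
page = requests.get(
    f"{BASE_URL}/collections/{collection_id}/documents/doc_page_005",
    headers=HEADERS,
).json()
clause_text = page["metadata"]["text"]

# Step 2: find similar clauses in other contracts, excluding contract A itself.
resp = requests.post(
    f"{BASE_URL}/retrievers/{retriever_id}/execute",
    json={
        "inputs": {"query": clause_text},
        "filters": {
            "field": "source_object_id",
            "operator": "ne",
            "value": "obj_contract_a",
        },
        "limit": 5,
    },
    headers=HEADERS,
)
for hit in resp.json().get("results", []):        # assumed response field
    print(hit.get("source_object_id"), hit.get("document_id"))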

Template Matching

Create a taxonomy of standard clauses:
POST /v1/taxonomies
{
  "taxonomy_name": "contract-clauses",
  "taxonomy_type": "flat",
  "retriever_id": "ret_clause_templates",
  "input_mappings": {
    "query_embedding": "mixpeek://pdf_extractor@v1/text_embedding"
  },
  "source_collection": {
    "collection_id": "col_standard_clauses",
    "enrichment_fields": [
      { "field_path": "metadata.clause_type", "merge_mode": "append" },
      { "field_path": "metadata.risk_level", "merge_mode": "replace" }
    ]
  }
}
Attach the taxonomy to the contracts collection to classify clauses automatically during ingestion.

For documents containing diagrams, charts, or images, add an image extraction collection that consumes the extracted_images output of pdf_extractor:
POST /v1/collections
{
  "feature_extractor": {
    "feature_extractor_name": "image_extractor",
    "version": "v1",
    "input_mappings": {
      "images": "extracted_images"  # From pdf_extractor
    },
    "parameters": {
      "generate_captions": true,
      "detect_objects": ["chart", "diagram", "table"]
    }
  }
}
Search by image:
POST /v1/retrievers/{retriever_id}/execute
{
  "inputs": {
    "query_image": "s3://my-bucket/reference-chart.jpg"
  },
  "limit": 10
}

Document Summarization Pipeline

Generate executive summaries for long documents:
{
  "stages": [
    {
      "stage_name": "filter",
      "parameters": {
        "filters": {
          "field": "metadata.document_type",
          "operator": "eq",
          "value": "annual_report"
        }
      }
    },
    {
      "stage_name": "llm_generation",
      "parameters": {
        "model": "gpt-4o",
        "prompt": "Read this annual report and extract: 1) Key financial metrics, 2) Strategic initiatives, 3) Risk factors. Document: {{DOCUMENT.text}}",
        "max_tokens": 1000,
        "output_format": "json"
      }
    }
  ]
}

Output Schema Examples

PDF Page Document:
{
  "document_id": "doc_page_005",
  "source_object_id": "obj_contract_001",
  "metadata": {
    "document_type": "vendor_agreement",
    "page_number": 5,
    "total_pages": 12,
    "text": "This agreement may be terminated...",
    "layout": {
      "sections": [
        { "type": "header", "text": "Section 7: Termination" },
        { "type": "paragraph", "text": "..." }
      ]
    }
  },
  "feature_refs": [
    "mixpeek://pdf_extractor@v1/text_embedding"
  ]
}
Table Document:
{
  "document_id": "doc_table_003",
  "source_object_id": "obj_invoice_123",
  "metadata": {
    "table_index": 0,
    "page_number": 2,
    "table_data": {
      "headers": ["Item", "Quantity", "Unit Price", "Total"],
      "rows": [
        ["Widget A", "10", "$25.00", "$250.00"],
        ["Widget B", "5", "$50.00", "$250.00"]
      ]
    },
    "total_amount": "$500.00"
  }
}

Performance Considerations

Optimization | Impact
OCR model selection | Tesseract is fast with moderate accuracy; Cloud Vision is slower but more accurate
Chunk strategy | Page-level chunks are coarser; paragraph-level chunks improve retrieval precision
Enable OCR fallback | Applies only to scanned pages; adds 2-5 s per page
Image extraction | Roughly doubles processing time; disable if diagrams are not needed
Table detection | Resource-intensive; apply only to document types that contain tables

Use Case Examples

  • Invoices and billing: Extract line items, totals, vendor info, and payment terms. Use table extraction for itemized billing. Match invoices to purchase orders via semantic search.
  • Research papers: Index academic papers with citation extraction. Search by abstract, methods, or findings. Cluster related papers and generate literature review summaries.
  • Healthcare records: OCR scanned patient records and extract diagnoses and medications via NER. Enable HIPAA-compliant search with namespace isolation and audit logging.
  • Insurance claims: Extract policy numbers, claim amounts, and incident descriptions. Match claims to policy documents. Flag anomalies with taxonomy-based risk classification.

Compliance & Security

Data Retention

Configure lifecycle policies for sensitive documents:
PATCH /v1/buckets/{bucket_id}
{
  "lifecycle_policy": {
    "delete_after_days": 90,
    "archive_to_cold_storage_after_days": 30
  }
}

Redaction

Use LLM stages to detect and redact PII:
{
  "stage_name": "llm_generation",
  "parameters": {
    "prompt": "Redact all PII (names, SSNs, addresses) from this text: {{DOCUMENT.text}}",
    "output_field": "redacted_text"
  }
}

Access Control

Use namespaces to isolate document sets by department:
# Finance team
X-Namespace: ns_finance

# Legal team
X-Namespace: ns_legal
Configure API keys with permission scoping to restrict access.
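A short sketch of attaching the namespace header per request, assuming bearer-token API keys; permission scoping itself is configured on the keys, not in this snippet.

import requests

BASE_URL = "https://api.mixpeek.com/v1"   # assumed base URL

def client_headers(api_key: str, namespace: str) -> dict:
    """Build per-team request headers: assumed bearer auth plus X-Namespace."""
    return {
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace,
    }

# Finance-team searches only see documents in ns_finance.
resp = requests.post(
    f"{BASE_URL}/retrievers/ret_contract_search/execute",   # hypothetical ID
    json={"inputs": {"query": "payment terms"}, "limit": 5},
    headers=client_headers("<FINANCE_API_KEY>", "ns_finance"),
)
print(resp.status_code)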

Monitoring & Troubleshooting

Track Extraction Quality

Monitor __fully_enriched and __missing_features:
POST /v1/collections/{collection_id}/documents/aggregate
{
  "group_by": ["__fully_enriched"],
  "metrics": ["count"]
}
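A hedged monitoring sketch: run the aggregation and compute the share of documents that are not fully enriched. The aggregate response shape (a groups array with per-group counts) is an assumption.

import requests

BASE_URL = "https://api.mixpeek.com/v1"          # assumed
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed
collection_id = "col_contracts_text"

resp = requests.post(
    f"{BASE_URL}/collections/{collection_id}/documents/aggregate",
    json={"group_by": ["__fully_enriched"], "metrics": ["count"]},
    headers=HEADERS,
)
resp.raise_for_status()

# Assumed response shape: {"groups": [{"__fully_enriched": true, "count": 980}, ...]}
groups = resp.json().get("groups", [])
counts = {g.get("__fully_enriched"): g.get("count", 0) for g in groups}
total = sum(counts.values()) or 1
failure_rate = counts.get(False, 0) / total
print(f"not fully enriched: {failure_rate:.1%}")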
If a high share of documents report __fully_enriched: false:
  • Check OCR quality (increase resolution, use better model)
  • Review extractor logs for errors
  • Verify document formats are supported

Validate Extracted Data

Sample documents and inspect:
GET /v1/collections/{collection_id}/documents/{document_id}
Review metadata.text and metadata.table_data for accuracy.

Next Steps