Figure: Document graph extractor pipeline showing PDF parsing, layout detection, and block extraction
The document graph extractor processes PDFs by extracting spatial blocks and classifying each block's layout (paragraph, table, form, list, header, footer, figure, handwritten). It scores the confidence of every block and can optionally apply VLM correction to low-confidence blocks. It is best suited to archival documents, scanned files, and documents that require spatial understanding.
View extractor details at api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.
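As a small illustration, the endpoint path above can be assembled in Python as follows. The host `https://api.mixpeek.com` is inferred from the link above, and the helper name `extractor_details_url` is hypothetical. Authentication is not described on this page, so it is left out.

```python
from urllib.parse import quote

# Assumption: the API is served from this host, as suggested by the link above.
BASE_URL = "https://api.mixpeek.com"


def extractor_details_url(feature_extractor_id: str) -> str:
    """Build the GET URL for one feature extractor (hypothetical helper)."""
    # Percent-encode the id so unexpected characters cannot break the path.
    encoded_id = quote(feature_extractor_id, safe="")
    return f"{BASE_URL}/v1/collections/features/extractors/{encoded_id}"


print(extractor_details_url("document_graph_extractor_v1"))
```

The resulting URL can be passed to any HTTP client once the required credentials are known.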

Pipeline Steps

  1. Filter Dataset (if collection_id provided)
    • Filter to specified collection
  2. PDF URL Resolution
    • Find PDF URL from row data (data, pdf_url, document_url, file_url, etc.)
    • Convert S3 keys to full S3 URLs if needed
  3. Layout Detection Mode Fork
    • If use_layout_detection=true (NEW - ML-based):
        a. PaddleOCR layout detection (finds ALL elements: text, images, tables)
        b. Skip to Step 4 (object_type already set by detector)
    • If use_layout_detection=false (LEGACY - Text-only):
        a. PyMuPDF span extraction (text with bounding boxes)
        b. Spatial clustering (group nearby spans into logical blocks)
        c. Layout classification (rule-based: paragraph, table, form, etc.)
  4. Confidence Scoring
    • Score extraction quality with A/B/C/D tags
    • Based on OCR quality, spatial coherence, text patterns
  5. Text Cleaning
    • Remove OCR artifacts
    • Normalize whitespace
  6. Page Rendering (conditional: if generate_thumbnails=true OR use_vlm_correction=true)
    • Full page thumbnails at configured DPI
    • Segment-level thumbnails for each block
  7. VLM Correction (conditional: if use_vlm_correction=true AND NOT fast_mode AND confidence C/D)
    • Gemini/OpenAI/Anthropic vision models correct low-confidence text
    • Only applied to blocks with poor extraction quality
  8. Text Embedding (conditional: if run_text_embedding=true)
    • E5-Large embeddings (1024D) for semantic search
  9. Output
    • Block-level documents with text, layout type, bbox, confidence, embeddings
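The conditions attached to the steps above can be summarized in a short sketch. This is not the extractor's code: the step labels and the `planned_steps` helper are hypothetical, and the defaults are copied from the parameter tables on this page.

```python
# Defaults as documented in the Parameters section of this page.
DEFAULTS = {
    "use_layout_detection": False,
    "use_vlm_correction": False,
    "fast_mode": False,
    "generate_thumbnails": True,
    "run_text_embedding": True,
}


def planned_steps(params: dict) -> list:
    """Return hypothetical labels for the steps that would run."""
    p = {**DEFAULTS, **params}
    steps = []
    if p.get("collection_id"):
        steps.append("filter_dataset")
    steps.append("pdf_url_resolution")
    # Step 3 forks on use_layout_detection.
    steps.append("layout_detection_ml" if p["use_layout_detection"]
                 else "layout_detection_legacy")
    steps += ["confidence_scoring", "text_cleaning"]
    # Rendering is needed for thumbnails and for VLM correction.
    if p["generate_thumbnails"] or p["use_vlm_correction"]:
        steps.append("page_rendering")
    # VLM correction is skipped in fast mode; it only touches C/D blocks.
    if p["use_vlm_correction"] and not p["fast_mode"]:
        steps.append("vlm_correction")
    if p["run_text_embedding"]:
        steps.append("text_embedding")
    steps.append("output")
    return steps


print(planned_steps({"use_vlm_correction": True, "fast_mode": True}))
```

Note that with `use_vlm_correction=true` and `fast_mode=true`, pages are still rendered (per the condition on Step 6) even though no correction takes place.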

When to Use

| Use Case | Description |
| --- | --- |
| Archival documents | Extract structured data from scanned historical documents |
| Scanned PDFs | Process documents with mixed text quality |
| Forms processing | Identify and extract form fields, tables, and structured data |
| Document understanding | Analyze document layout and structure |
| Spatial search | Find specific sections or blocks within documents |
| Multi-layout documents | Process documents with complex layouts (reports, contracts, etc.) |

When NOT to Use

| Scenario | Recommended Alternative |
| --- | --- |
| Simple text extraction | text_extractor |
| Images only | image_extractor |
| Video/audio content | multimodal_extractor |
| Born-digital PDFs with perfect text | text_extractor (faster, simpler) |

Input Schema

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| pdf | string | Yes | URL or S3 path to PDF file. Supports multi-page PDFs. |
{
  "pdf": "s3://my-bucket/documents/invoice-2024.pdf"
}
Input Examples:
| Type | Example |
| --- | --- |
| Invoice | s3://documents/invoices/inv-001.pdf |
| Contract | https://cdn.example.com/contracts/lease.pdf |
| Scanned document | s3://archive/scanned/1985-report.pdf |
| Form | s3://forms/application-form.pdf |
Supported Formats: PDF only
Recommended: 150-300 DPI for scanned documents
Max File Size: 100MB per PDF
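The PDF URL resolution step described under Pipeline Steps (look through candidate fields, then expand bare S3 keys) could look roughly like the sketch below. The field list covers only the names this page mentions, and the `default_bucket` argument is an assumption, since the page does not say how the bucket for a bare key is determined.

```python
from typing import Optional

# Field names listed in the pipeline description; the real list is longer ("etc.").
CANDIDATE_FIELDS = ("data", "pdf_url", "document_url", "file_url")


def resolve_pdf_url(row: dict, default_bucket: str) -> Optional[str]:
    """Return the first usable PDF location found in a row, or None."""
    for field in CANDIDATE_FIELDS:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            continue
        value = value.strip()
        # Full URLs and s3:// paths are returned unchanged.
        if value.startswith(("s3://", "http://", "https://")):
            return value
        # Assumption: anything else is a bare S3 key inside default_bucket.
        return f"s3://{default_bucket}/{value.lstrip('/')}"
    return None


print(resolve_pdf_url({"pdf_url": "documents/invoices/inv-001.pdf"}, "my-bucket"))
```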

Output Schema

Each spatial block produces one document with the following fields:
| Field | Type | Description |
| --- | --- | --- |
| text | string | Extracted text content (raw or VLM-corrected) |
| object_type | string | Layout type: paragraph, table, form, list, header, footer, figure, handwritten |
| bbox | object | Bounding box {x, y, width, height} in PDF coordinates |
| page_number | integer | Page number (0-indexed) |
| confidence_tag | string | Confidence grade: A (high), B (good), C (fair), D (poor) |
| confidence_score | number | Confidence score (0.0-1.0) |
| document_graph_extractor_v1_text_embedding | float[1024] | E5-Large text embedding (if enabled) |
| page_image_url | string | Full page thumbnail URL (if generated) |
| segment_thumbnail_url | string | Block-specific thumbnail URL (if generated) |
| thumbnail_url | string | Page thumbnail URL (if generated) |
{
  "text": "INVOICE #12345\nDate: January 15, 2024\nAmount Due: $1,250.00",
  "object_type": "header",
  "bbox": {"x": 50, "y": 720, "width": 500, "height": 80},
  "page_number": 0,
  "confidence_tag": "A",
  "confidence_score": 0.95,
  "document_graph_extractor_v1_text_embedding": [0.023, -0.041, ...],
  "page_image_url": "s3://mixpeek/thumbnails/page_0.jpg",
  "segment_thumbnail_url": "s3://mixpeek/thumbnails/seg_0_header.jpg"
}
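A minimal sketch of consuming these block documents in Python, assuming they have already been loaded as dictionaries with the fields listed above. The helper names are hypothetical. The sketch makes no assumption about where the PDF coordinate origin sits; it only adds width and height to the starting corner.

```python
def bbox_corners(bbox: dict) -> tuple:
    """Convert {x, y, width, height} into (x0, y0, x1, y1)."""
    return (bbox["x"], bbox["y"],
            bbox["x"] + bbox["width"], bbox["y"] + bbox["height"])


def select_blocks(blocks: list, object_type: str, accepted_tags=("A", "B")) -> list:
    """Keep blocks of one layout type whose confidence tag is acceptable."""
    return [b for b in blocks
            if b["object_type"] == object_type
            and b["confidence_tag"] in accepted_tags]


blocks = [
    {"text": "INVOICE #12345", "object_type": "header", "confidence_tag": "A",
     "bbox": {"x": 50, "y": 720, "width": 500, "height": 80}, "page_number": 0},
    {"text": "smudged note", "object_type": "handwritten", "confidence_tag": "D",
     "bbox": {"x": 60, "y": 100, "width": 200, "height": 40}, "page_number": 0},
]

for block in select_blocks(blocks, "header"):
    print(block["text"], bbox_corners(block["bbox"]))
```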

Parameters

Layout Detection Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_layout_detection | boolean | false | Use ML-based PaddleOCR layout detection (finds images + tables + text) vs legacy text-only extraction |
| render_dpi | integer | 150 | DPI for PDF page rendering (72-300). Higher = better quality, slower processing |
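To see why higher render_dpi values slow processing, it helps to compute the rendered image size. PDF page dimensions are expressed in points (72 per inch), so pixel size scales linearly with DPI and pixel count scales with its square. The helper below is a hypothetical illustration; the 72-300 range check mirrors the table above.

```python
def page_pixels(width_pt: float, height_pt: float, render_dpi: int = 150) -> tuple:
    """Pixel dimensions of a page rendered at render_dpi (1 pt = 1/72 inch)."""
    if not 72 <= render_dpi <= 300:
        raise ValueError("render_dpi must be between 72 and 300")
    return (round(width_pt * render_dpi / 72),
            round(height_pt * render_dpi / 72))


# US Letter is 612 x 792 points.
for dpi in (72, 150, 300):
    w, h = page_pixels(612, 792, dpi)
    print(dpi, w, h, f"{w * h / 1e6:.1f} MP")
```

Doubling the DPI from 150 to 300 quadruples the number of pixels per page.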

VLM Correction Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| use_vlm_correction | boolean | false | Enable VLM correction for low-confidence blocks (C/D tags) |
| min_confidence_for_vlm | string | "C" | Minimum confidence tag to trigger VLM correction: A, B, C, or D |
| vlm_provider | string | "google" | VLM provider: google, openai, anthropic |
| vlm_model | string | "gemini-2.0-flash" | Specific VLM model for correction |
| fast_mode | boolean | false | Skip VLM correction even if enabled (for faster processing) |
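How these flags interact for a single block can be sketched as below. The function is hypothetical. In particular, it assumes that min_confidence_for_vlm means "correct blocks graded at this tag or worse", which matches the default of "C" triggering on C and D but is not spelled out on this page.

```python
TAG_ORDER = "ABCD"  # best to worst


def needs_vlm_correction(confidence_tag: str, *,
                         use_vlm_correction: bool = False,
                         fast_mode: bool = False,
                         min_confidence_for_vlm: str = "C") -> bool:
    """Decide whether one block would be sent to the VLM (illustrative only)."""
    # fast_mode overrides use_vlm_correction.
    if not use_vlm_correction or fast_mode:
        return False
    # Assumption: the threshold tag and every worse tag trigger correction.
    return TAG_ORDER.index(confidence_tag) >= TAG_ORDER.index(min_confidence_for_vlm)


for tag in TAG_ORDER:
    print(tag, needs_vlm_correction(tag, use_vlm_correction=True))
```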

Clustering Parameters (Legacy mode only)

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| vertical_threshold | number | 10.0 | Vertical distance threshold for grouping text spans |
| horizontal_threshold | number | 5.0 | Horizontal distance threshold for grouping text spans |
| min_text_length | integer | 1 | Minimum text length to include in blocks |
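To give a feel for what the two thresholds control, here is one simple way to group spans: treat two spans as neighbors when both their horizontal and vertical gaps fall within the thresholds, then merge neighbors transitively with union-find. This is an illustrative stand-in, not the extractor's actual clustering algorithm, and the `Span` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def _gap(a0: float, a1: float, b0: float, b1: float) -> float:
    """Distance between two 1-D intervals; 0 when they overlap."""
    return max(0.0, max(a0, b0) - min(a1, b1))


def cluster_spans(spans, vertical_threshold=10.0, horizontal_threshold=5.0):
    """Group spans whose gaps are within both thresholds (transitively)."""
    parent = list(range(len(spans)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            a, b = spans[i], spans[j]
            if (_gap(a.x0, a.x1, b.x0, b.x1) <= horizontal_threshold
                    and _gap(a.y0, a.y1, b.y0, b.y1) <= vertical_threshold):
                parent[find(j)] = find(i)

    groups = {}
    for i, span in enumerate(spans):
        groups.setdefault(find(i), []).append(span)
    return list(groups.values())


spans = [
    Span("Hello", 0, 0, 50, 10),
    Span("world", 52, 0, 100, 10),     # 2 units to the right of "Hello"
    Span("next line", 0, 15, 50, 25),  # 5 units below "Hello"
    Span("far away", 0, 200, 50, 210),
]
blocks = cluster_spans(spans)
print([[s.text for s in block] for block in blocks])
```

Raising either threshold merges more spans into each block; lowering them splits text into smaller blocks.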

Confidence & Embedding Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| base_confidence | number | 0.8 | Base confidence score for extracted blocks |
| run_text_embedding | boolean | true | Generate E5-Large embeddings for semantic search |

Thumbnail Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| generate_thumbnails | boolean | true | Generate page and segment thumbnails |
| thumbnail_dpi | integer | 72 | DPI for thumbnail generation |
| thumbnail_mode | string | "fit" | Thumbnail resize mode: fit, fill, crop |

Configuration Examples

{
  "feature_extractor": {
    "feature_extractor_name": "document_graph_extractor",
    "version": "v1",
    "input_mappings": {
      "pdf": "payload.document_url"
    },
    "field_passthrough": [
      { "source_path": "metadata.invoice_id" },
      { "source_path": "metadata.vendor" }
    ],
    "parameters": {
      "use_layout_detection": true,
      "render_dpi": 150,
      "generate_thumbnails": true,
      "run_text_embedding": true
    }
  }
}
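When configurations are generated from code, a small builder can catch out-of-range values before the request is sent. The sketch below is hypothetical: it reproduces the structure of the example above and checks only the constraints stated in the parameter tables (render_dpi between 72 and 300, and the three listed VLM providers).

```python
import json

VLM_PROVIDERS = {"google", "openai", "anthropic"}


def build_extractor_config(pdf_field: str, **parameters) -> dict:
    """Assemble a feature_extractor config after basic validation."""
    if not 72 <= parameters.get("render_dpi", 150) <= 300:
        raise ValueError("render_dpi must be between 72 and 300")
    if parameters.get("vlm_provider", "google") not in VLM_PROVIDERS:
        raise ValueError("vlm_provider must be google, openai, or anthropic")
    return {
        "feature_extractor": {
            "feature_extractor_name": "document_graph_extractor",
            "version": "v1",
            "input_mappings": {"pdf": pdf_field},
            "parameters": parameters,
        }
    }


# A variant aimed at low-quality scans: higher DPI plus VLM correction.
config = build_extractor_config(
    "payload.document_url",
    use_layout_detection=True,
    render_dpi=300,
    use_vlm_correction=True,
    vlm_provider="google",
    vlm_model="gemini-2.0-flash",
)
print(json.dumps(config, indent=2))
```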

Performance & Costs

| Metric | Value |
| --- | --- |
| Processing speed | ~1-5 pages/sec (depends on DPI and features enabled) |
| Layout detection | ~500ms per page (PaddleOCR) |
| VLM correction | ~2s per low-confidence block |
| Embedding generation | ~5ms per block |
| Cost (minimal) | ~$0.001/page (text extraction only) |
| Cost (with VLM) | ~$0.01-$0.05/page (depends on number of low-confidence blocks) |
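These figures can be combined into a rough budget for a batch. The sketch below is back-of-the-envelope arithmetic only. It assumes the per-stage latencies simply add up with no batching or parallelism, and that the "with VLM" cost is the total per page rather than a surcharge; neither is stated on this page.

```python
def estimate_cost_usd(pages: int, use_vlm: bool = False) -> tuple:
    """Return a (low, high) cost range in USD from the approximate rates above."""
    if use_vlm:
        return (pages * 0.01, pages * 0.05)
    return (pages * 0.001, pages * 0.001)


def estimate_seconds(pages: int, blocks: int, low_confidence_blocks: int) -> float:
    """Serial estimate for ML layout detection + VLM correction + embeddings."""
    layout = pages * 0.5                  # ~500ms per page
    vlm = low_confidence_blocks * 2.0     # ~2s per low-confidence block
    embedding = blocks * 0.005            # ~5ms per block
    return layout + vlm + embedding


low, high = estimate_cost_usd(1000, use_vlm=True)
print(f"1000 pages with VLM: ${low:.2f} to ${high:.2f}")
print(f"{estimate_seconds(pages=10, blocks=200, low_confidence_blocks=5):.1f} s")
```

In this estimate VLM correction dominates as soon as a document has more than a handful of low-confidence blocks per page, which is why fast_mode exists.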

Vector Index

| Property | Value |
| --- | --- |
| Index name | document_graph_extractor_v1_text_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
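For reference, the cosine metric used by this index compares the direction of two vectors and ignores their magnitude. A dependency-free sketch follows; it uses short vectors for readability, whereas the index itself stores 1024-dimensional embeddings.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


query = [0.2, 0.1, 0.7]
block_a = [0.4, 0.2, 1.4]   # same direction as the query, twice the length
block_b = [0.7, -0.2, 0.0]
print(cosine_similarity(query, block_a), cosine_similarity(query, block_b))
```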

Layout Types

The extractor classifies blocks into these layout types:
| Type | Description | Example Use Case |
| --- | --- | --- |
| paragraph | Body text blocks | Article content, descriptions |
| table | Tabular data | Financial tables, data grids |
| form | Form fields and labels | Application forms, surveys |
| list | Bulleted or numbered lists | Requirements, instructions |
| header | Page headers | Document titles, section headers |
| footer | Page footers | Page numbers, disclaimers |
| figure | Images and captions | Charts, diagrams, photos |
| handwritten | Handwritten text | Signatures, annotations |

Confidence Tags

Extraction quality is graded with confidence tags:
| Tag | Confidence | Description | Action |
| --- | --- | --- | --- |
| A | 0.9-1.0 | Excellent | No correction needed |
| B | 0.7-0.9 | Good | Minor issues, usually acceptable |
| C | 0.5-0.7 | Fair | VLM correction recommended |
| D | 0.0-0.5 | Poor | VLM correction strongly recommended |
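The mapping from score to tag can be expressed as a simple threshold ladder. Because the ranges in the table share their endpoints (0.9 appears under both A and B), the sketch below assumes each lower bound is inclusive; the extractor may treat the boundaries differently.

```python
def confidence_tag(score: float) -> str:
    """Map a confidence score in [0.0, 1.0] to a tag (lower bounds inclusive)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.9:
        return "A"
    if score >= 0.7:
        return "B"
    if score >= 0.5:
        return "C"
    return "D"


for score in (0.95, 0.8, 0.6, 0.2):
    print(score, confidence_tag(score))
```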

Comparison: ML Layout Detection vs Legacy

| Feature | ML Layout Detection | Legacy Text-Only |
| --- | --- | --- |
| Finds images | ✅ Yes | ❌ No |
| Finds tables | ✅ Yes (better accuracy) | ⚠️ Basic heuristics |
| Processing speed | Slower (~500ms/page) | Faster (~100ms/page) |
| Best for | Complex layouts, scanned docs | Simple text-only PDFs |
| Model | PaddleOCR | PyMuPDF + heuristics |

Limitations

  • PDF only: Does not process images, Word docs, or other formats
  • Memory intensive: Large PDFs (100+ pages) may require increased memory
  • VLM costs: VLM correction adds significant cost for low-confidence documents
  • Language support: OCR works best with Latin scripts; non-Latin may have reduced accuracy
  • Handwriting: Handwritten text detection is experimental and less reliable