View extractor details at api.mixpeek.com/v1/collections/features/extractors/document_graph_extractor_v1 or fetch programmatically with GET /v1/collections/features/extractors/{feature_extractor_id}.

Pipeline Steps
1. Filter Dataset (if collection_id provided)
   - Filter to specified collection
2. PDF URL Resolution
   - Find PDF URL from row data (data, pdf_url, document_url, file_url, etc.)
   - Convert S3 keys to full S3 URLs if needed
3. Layout Detection Mode Fork
   - If use_layout_detection=true (NEW, ML-based):
     a. PaddleOCR layout detection (finds ALL elements: text, images, tables)
     b. Skip to Step 4 (object_type already set by detector)
   - If use_layout_detection=false (LEGACY, text-only):
     a. PyMuPDF span extraction (text with bounding boxes)
     b. Spatial clustering (group nearby spans into logical blocks)
     c. Layout classification (rule-based: paragraph, table, form, etc.)
4. Confidence Scoring
   - Score extraction quality with A/B/C/D tags
   - Based on OCR quality, spatial coherence, text patterns
5. Text Cleaning
   - Remove OCR artifacts
   - Normalize whitespace
6. Page Rendering (conditional: if generate_thumbnails=true OR use_vlm_correction=true)
   - Full page thumbnails at configured DPI
   - Segment-level thumbnails for each block
7. VLM Correction (conditional: if use_vlm_correction=true AND NOT fast_mode AND confidence C/D)
   - Gemini/OpenAI/Anthropic vision models correct low-confidence text
   - Only applied to blocks with poor extraction quality
8. Text Embedding (conditional: if run_text_embedding=true)
   - E5-Large embeddings (1024D) for semantic search
9. Output
   - Block-level documents with text, layout type, bbox, confidence, embeddings
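
The conditional steps above can be summarized as a small planning function. This is a sketch for reasoning about which stages a given parameter set enables, not the extractor's implementation: the function name plan_stages and the stage labels are hypothetical, while the parameter names and defaults are taken from this page.

```python
from typing import Dict, List


def plan_stages(params: Dict[str, object]) -> List[str]:
    """Return the pipeline stages that would run for a given parameter set."""
    # Defaults copied from the parameter tables on this page.
    p = {
        "use_layout_detection": False,
        "generate_thumbnails": True,
        "use_vlm_correction": False,
        "fast_mode": False,
        "run_text_embedding": True,
        **params,
    }
    stages: List[str] = []
    if p.get("collection_id"):
        stages.append("filter_dataset")
    stages.append("pdf_url_resolution")
    if p["use_layout_detection"]:
        # ML mode: the detector already assigns object_type.
        stages.append("paddleocr_layout_detection")
    else:
        # Legacy mode: text spans, then clustering, then rule-based classification.
        stages += ["pymupdf_span_extraction", "spatial_clustering", "layout_classification"]
    stages += ["confidence_scoring", "text_cleaning"]
    if p["generate_thumbnails"] or p["use_vlm_correction"]:
        stages.append("page_rendering")
    if p["use_vlm_correction"] and not p["fast_mode"]:
        # Within this stage, only blocks tagged C or D are corrected.
        stages.append("vlm_correction")
    if p["run_text_embedding"]:
        stages.append("text_embedding")
    stages.append("output")
    return stages


print(plan_stages({"use_layout_detection": True, "use_vlm_correction": True}))
```

Note that enabling use_vlm_correction turns on page rendering even when generate_thumbnails is false, and that fast_mode overrides use_vlm_correction.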
When to Use
| Use Case | Description |
|---|---|
| Archival documents | Extract structured data from scanned historical documents |
| Scanned PDFs | Process documents with mixed text quality |
| Forms processing | Identify and extract form fields, tables, and structured data |
| Document understanding | Analyze document layout and structure |
| Spatial search | Find specific sections or blocks within documents |
| Multi-layout documents | Process documents with complex layouts (reports, contracts, etc.) |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Simple text extraction | text_extractor |
| Images only | image_extractor |
| Video/audio content | multimodal_extractor |
| Born-digital PDFs with perfect text | text_extractor (faster, simpler) |
Input Schema
| Field | Type | Required | Description |
|---|---|---|---|
| pdf | string | Yes | URL or S3 path to PDF file. Supports multi-page PDFs. |

| Type | Example |
|---|---|
| Invoice | s3://documents/invoices/inv-001.pdf |
| Contract | https://cdn.example.com/contracts/lease.pdf |
| Scanned document | s3://archive/scanned/1985-report.pdf |
| Form | s3://forms/application-form.pdf |
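
The PDF URL Resolution step (look for a URL in the row, then expand bare S3 keys) could be approximated as follows. This is a minimal sketch under stated assumptions: the helper name resolve_pdf_url and the default_bucket argument are hypothetical, and the field list covers only the names this page mentions before its "etc.", so the real extractor may check more.

```python
from typing import Mapping, Optional

# Field names listed in the pipeline description; the real list may be longer.
CANDIDATE_FIELDS = ("data", "pdf_url", "document_url", "file_url")


def resolve_pdf_url(row: Mapping[str, object], default_bucket: str) -> Optional[str]:
    """Return the first usable PDF location found in a row, or None."""
    for field in CANDIDATE_FIELDS:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            continue
        value = value.strip()
        if value.startswith(("http://", "https://", "s3://")):
            return value  # already a full URL
        # Assumption: anything else is a bare S3 key relative to default_bucket.
        return f"s3://{default_bucket}/{value.lstrip('/')}"
    return None


print(resolve_pdf_url({"file_url": "invoices/inv-001.pdf"}, "documents"))
```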
Output Schema
Each spatial block produces one document with the following fields:

| Field | Type | Description |
|---|---|---|
| text | string | Extracted text content (raw or VLM-corrected) |
| object_type | string | Layout type: paragraph, table, form, list, header, footer, figure, handwritten |
| bbox | object | Bounding box {x, y, width, height} in PDF coordinates |
| page_number | integer | Page number (0-indexed) |
| confidence_tag | string | Confidence grade: A (high), B (good), C (fair), D (poor) |
| confidence_score | number | Confidence score (0.0-1.0) |
| document_graph_extractor_v1_text_embedding | float[1024] | E5-Large text embedding (if enabled) |
| page_image_url | string | Full page thumbnail URL (if generated) |
| segment_thumbnail_url | string | Block-specific thumbnail URL (if generated) |
| thumbnail_url | string | Page thumbnail URL (if generated) |
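
To make the output shape concrete, here is a sketch that types a block document and picks out the blocks worth a manual look. The field names follow the table above; the sample values, the Block and BBox type names, and the blocks_needing_review helper are invented for illustration.

```python
from typing import List, TypedDict


class BBox(TypedDict):
    x: float
    y: float
    width: float
    height: float


class Block(TypedDict, total=False):
    text: str
    object_type: str
    bbox: BBox
    page_number: int  # 0-indexed, per the output schema
    confidence_tag: str
    confidence_score: float


def blocks_needing_review(blocks: List[Block]) -> List[Block]:
    """Return blocks tagged C or D, ordered by page and then by ascending score."""
    flagged = [b for b in blocks if b.get("confidence_tag") in ("C", "D")]
    return sorted(flagged, key=lambda b: (b["page_number"], b["confidence_score"]))


# Made-up sample data, not real extractor output.
SAMPLE_BLOCKS: List[Block] = [
    {"text": "Quarterly report", "object_type": "header", "page_number": 0,
     "bbox": {"x": 72, "y": 40, "width": 300, "height": 18},
     "confidence_tag": "A", "confidence_score": 0.96},
    {"text": "Tota1 revenue", "object_type": "table", "page_number": 1,
     "bbox": {"x": 72, "y": 200, "width": 400, "height": 120},
     "confidence_tag": "C", "confidence_score": 0.61},
    {"text": "(illegible note)", "object_type": "handwritten", "page_number": 0,
     "bbox": {"x": 400, "y": 700, "width": 120, "height": 30},
     "confidence_tag": "D", "confidence_score": 0.32},
]

for block in blocks_needing_review(SAMPLE_BLOCKS):
    print(block["page_number"], block["confidence_tag"], block["text"])
```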
Parameters
Layout Detection Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_layout_detection | boolean | false | Use ML-based PaddleOCR layout detection (finds images + tables + text) vs legacy text-only extraction |
| render_dpi | integer | 150 | DPI for PDF page rendering (72-300). Higher = better quality, slower processing |
VLM Correction Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| use_vlm_correction | boolean | false | Enable VLM correction for low-confidence blocks (C/D tags) |
| min_confidence_for_vlm | string | "C" | Minimum confidence tag to trigger VLM correction: A, B, C, or D |
| vlm_provider | string | "google" | VLM provider: google, openai, anthropic |
| vlm_model | string | "gemini-2.0-flash" | Specific VLM model for correction |
| fast_mode | boolean | false | Skip VLM correction even if enabled (for faster processing) |
Clustering Parameters (Legacy mode only)
| Parameter | Type | Default | Description |
|---|---|---|---|
| vertical_threshold | number | 10.0 | Vertical distance threshold for grouping text spans |
| horizontal_threshold | number | 5.0 | Horizontal distance threshold for grouping text spans |
| min_text_length | integer | 1 | Minimum text length to include in blocks |
Confidence & Embedding Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| base_confidence | number | 0.8 | Base confidence score for extracted blocks |
| run_text_embedding | boolean | true | Generate E5-Large embeddings for semantic search |
Thumbnail Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| generate_thumbnails | boolean | true | Generate page and segment thumbnails |
| thumbnail_dpi | integer | 72 | DPI for thumbnail generation |
| thumbnail_mode | string | "fit" | Thumbnail resize mode: fit, fill, crop |
Configuration Examples
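
As an illustrative sketch (not official presets), two parameter sets built from the tables above might look like the following. The names FAST_TEXT_ONLY, SCANNED_ARCHIVE, and validate are hypothetical, and how the parameters are attached to a collection or request is not covered here; only the parameter names, ranges, and allowed values come from this page.

```python
from typing import Dict, List

# Legacy text-only mode with the expensive options switched off.
FAST_TEXT_ONLY: Dict[str, object] = {
    "use_layout_detection": False,
    "use_vlm_correction": False,
    "generate_thumbnails": False,
    "run_text_embedding": True,
}

# ML layout detection plus VLM correction, aimed at scanned material.
SCANNED_ARCHIVE: Dict[str, object] = {
    "use_layout_detection": True,
    "render_dpi": 300,
    "use_vlm_correction": True,
    "min_confidence_for_vlm": "C",
    "vlm_provider": "google",
    "vlm_model": "gemini-2.0-flash",
    "fast_mode": False,
}

# Allowed values as documented in the parameter tables.
ENUMS = {
    "min_confidence_for_vlm": {"A", "B", "C", "D"},
    "vlm_provider": {"google", "openai", "anthropic"},
    "thumbnail_mode": {"fit", "fill", "crop"},
}


def validate(params: Dict[str, object]) -> List[str]:
    """Return a list of problems found against the documented ranges and enums."""
    problems: List[str] = []
    dpi = params.get("render_dpi")
    if dpi is not None and not (isinstance(dpi, int) and 72 <= dpi <= 300):
        problems.append("render_dpi must be an integer between 72 and 300")
    for name, allowed in ENUMS.items():
        if name in params and params[name] not in allowed:
            problems.append(f"{name} must be one of {sorted(allowed)}")
    return problems


print(validate(SCANNED_ARCHIVE))   # no problems expected
print(validate({"render_dpi": 600, "thumbnail_mode": "stretch"}))
```

Checking values locally like this catches out-of-range settings before a processing run is started; it does not replace whatever validation the API performs.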
Performance & Costs
| Metric | Value |
|---|---|
| Processing speed | ~1-5 pages/sec (depends on DPI and features enabled) |
| Layout detection | ~500ms per page (PaddleOCR) |
| VLM correction | ~2s per low-confidence block |
| Embedding generation | ~5ms per block |
| Cost (minimal) | ~$0.001/page (text extraction only) |
| Cost (with VLM) | ~$0.05/page (depends on # of low-confidence blocks) |
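
The per-component figures in this table can be combined into a rough back-of-the-envelope estimate. The sketch below simply adds the listed latencies and multiplies the per-page cost figures; the function names are hypothetical, the costs are assumed to be in US dollars, and the results are order-of-magnitude guides rather than guarantees.

```python
def estimate_added_seconds(
    pages: int,
    blocks: int,
    low_conf_blocks: int,
    use_layout_detection: bool = False,
    use_vlm_correction: bool = False,
    run_text_embedding: bool = True,
) -> float:
    """Sum the optional per-page and per-block latencies listed in the table."""
    seconds = 0.0
    if use_layout_detection:
        seconds += 0.5 * pages            # ~500ms per page (PaddleOCR)
    if use_vlm_correction:
        seconds += 2.0 * low_conf_blocks  # ~2s per low-confidence block
    if run_text_embedding:
        seconds += 0.005 * blocks         # ~5ms per block
    return seconds


def estimate_cost_usd(pages: int, use_vlm_correction: bool = False) -> float:
    """Apply the approximate per-page cost; the VLM figure varies with block count."""
    per_page = 0.05 if use_vlm_correction else 0.001
    return pages * per_page


# A 10-page scan with 200 blocks, 8 of them tagged C or D.
print(estimate_added_seconds(10, 200, 8, use_layout_detection=True, use_vlm_correction=True))
print(estimate_cost_usd(10, use_vlm_correction=True))
```

For the example inputs, VLM correction dominates both time and cost, which is why restricting it to low-confidence blocks matters.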
Vector Index
| Property | Value |
|---|---|
| Index name | document_graph_extractor_v1_text_embedding |
| Dimensions | 1024 |
| Type | Dense |
| Distance metric | Cosine |
| Datatype | float32 |
| Inference model | multilingual_e5_large_instruct_v1 |
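
Because the index uses cosine distance over dense float32 vectors, ranking blocks against a query embedding reduces to a cosine similarity computation. The sketch below shows that calculation in plain Python on toy 3-dimensional vectors; real embeddings from this extractor have 1024 dimensions, and the function name is hypothetical.

```python
import math
from typing import Sequence

EXPECTED_DIM = 1024  # dimensions of document_graph_extractor_v1_text_embedding


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
query = [1.0, 0.0, 0.0]
block_vectors = {"block_a": [0.9, 0.1, 0.0], "block_b": [0.0, 1.0, 0.0]}
ranked = sorted(block_vectors, key=lambda k: cosine_similarity(query, block_vectors[k]), reverse=True)
print(ranked)
```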
Layout Types
The extractor classifies blocks into these layout types:

| Type | Description | Example Use Case |
|---|---|---|
| paragraph | Body text blocks | Article content, descriptions |
| table | Tabular data | Financial tables, data grids |
| form | Form fields and labels | Application forms, surveys |
| list | Bulleted or numbered lists | Requirements, instructions |
| header | Page headers | Document titles, section headers |
| footer | Page footers | Page numbers, disclaimers |
| figure | Images and captions | Charts, diagrams, photos |
| handwritten | Handwritten text | Signatures, annotations |
Confidence Tags
Extraction quality is graded with confidence tags:

| Tag | Confidence | Description | Action |
|---|---|---|---|
| A | 0.9-1.0 | Excellent | No correction needed |
| B | 0.7-0.9 | Good | Minor issues, usually acceptable |
| C | 0.5-0.7 | Fair | VLM correction recommended |
| D | 0.0-0.5 | Poor | VLM correction strongly recommended |
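
The mapping from score to tag can be written as a short function. Two details here are assumptions rather than documented behavior: the table's ranges share their endpoints (0.9, 0.7, 0.5), and this sketch assigns each boundary value to the higher grade; it also reads min_confidence_for_vlm as "correct blocks graded at this tag or worse", which matches the default of "C" triggering correction for C and D blocks. Both function names are hypothetical.

```python
TAG_ORDER = "ABCD"  # best to worst


def confidence_tag(score: float) -> str:
    """Map a 0.0-1.0 confidence score to an A/B/C/D tag."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score >= 0.9:
        return "A"
    if score >= 0.7:
        return "B"
    if score >= 0.5:
        return "C"
    return "D"


def needs_vlm_correction(tag: str, min_confidence_for_vlm: str = "C") -> bool:
    """True if a block with this tag is at or below the configured threshold tag."""
    return TAG_ORDER.index(tag) >= TAG_ORDER.index(min_confidence_for_vlm)


for score in (0.95, 0.72, 0.55, 0.2):
    tag = confidence_tag(score)
    print(score, tag, needs_vlm_correction(tag))
```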
Comparison: ML Layout Detection vs Legacy
| Feature | ML Layout Detection | Legacy Text-Only |
|---|---|---|
| Finds images | ✅ Yes | ❌ No |
| Finds tables | ✅ Yes (better accuracy) | ⚠️ Basic heuristics |
| Processing speed | Slower (~500ms/page) | Faster (~100ms/page) |
| Best for | Complex layouts, scanned docs | Simple text-only PDFs |
| Model | PaddleOCR | PyMuPDF + heuristics |
Limitations
- PDF only: Does not process images, Word docs, or other formats
- Memory intensive: Large PDFs (100+ pages) may require increased memory
- VLM costs: VLM correction adds significant cost for low-confidence documents
- Language support: OCR works best with Latin scripts; non-Latin may have reduced accuracy
- Handwriting: Handwritten text detection is experimental and less reliable

