Object Decomposition
Feature Extractors
| Extractor | Capabilities | Use Cases |
|---|---|---|
| document_extractor@v1 | OCR (Tesseract/Cloud Vision), layout detection, page segmentation | Scanned documents, invoices, forms |
| pdf_extractor@v1 | Native PDF text extraction, metadata, page-level chunking | Digital PDFs, reports, papers |
| table_extractor@v1 | Table detection, cell extraction, structure preservation | Financial statements, data sheets |
| image_extractor@v1 | Visual embeddings (CLIP), object detection, caption generation | Diagrams, charts, photos in documents |
| text_extractor@v1 | Text embeddings, named entity recognition (NER), summarization | Extracted text enrichment |
Implementation Steps
1. Create a Document Bucket
2. Define Multi-Extractor Collections
Text & Layout Collection:
3. Register Documents
4. Process Documents
- Download PDFs from S3
- Extract text with OCR fallback for scanned pages
- Detect tables and extract structured data
- Generate embeddings for each page/section
- Create documents with lineage to source objects
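The OCR-fallback step in the list above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the `Page` and `PageResult` types, the `min_chars` threshold, and the `fake_ocr` stand-in are all hypothetical, and a real pipeline would call an actual OCR engine instead.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Page:
    number: int
    native_text: str  # text from the PDF's embedded text layer; empty for scans


@dataclass
class PageResult:
    number: int
    text: str
    used_ocr: bool


def extract_pages(
    pages: List[Page],
    ocr: Callable[[Page], str],
    min_chars: int = 20,  # hypothetical threshold for "this page has real text"
) -> List[PageResult]:
    """Prefer native text; fall back to OCR only when a page has too little of it."""
    results = []
    for page in pages:
        text = page.native_text.strip()
        if len(text) >= min_chars:
            results.append(PageResult(page.number, text, used_ocr=False))
        else:
            results.append(PageResult(page.number, ocr(page).strip(), used_ocr=True))
    return results


def fake_ocr(page: Page) -> str:
    # Stand-in for a real OCR call, used here only to keep the sketch runnable.
    return f"[ocr text for page {page.number}]"


pages = [
    Page(1, "Master Services Agreement between Acme and Globex."),
    Page(2, ""),  # scanned page with no text layer
]
results = extract_pages(pages, fake_ocr)
for r in results:
    print(r.number, r.used_ocr, r.text)
```

Keeping the OCR call behind a plain callable means the slower path runs only for pages that need it, which matches the "fallback for scanned pages" wording above.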
5. Build a Document Search Retriever
6. Query Documents
Find relevant clauses:
Model Evolution & A/B Testing
Experiment with OCR models, chunking strategies, and NER configurations without reprocessing your entire document archive.
Test OCR Models
Test Chunking Strategies
Compare Results
- Clause detection: v1 (72%) vs v2 (89%) → better precision
- OCR accuracy: v1 (94%) vs v2 (98%) → fewer misreads
- Cost per page: v1 (0.01 credits) vs v2 (0.04 credits) → 4x cost
- Query success rate: v1 (68%) vs v2 (84%) → justified investment
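The comparison above reduces to simple arithmetic. The sketch below uses only the figures listed; the dictionary keys and the 1,000-page batch size are illustrative choices, not part of any API.

```python
# (v1, v2) percentages taken from the comparison list above.
metrics = {
    "clause_detection_pct": (72, 89),
    "ocr_accuracy_pct": (94, 98),
    "query_success_pct": (68, 84),
}
cost_per_page = (0.01, 0.04)  # credits per page for v1 and v2


def point_gains(m):
    """Percentage-point improvement of v2 over v1 for each metric."""
    return {name: v2 - v1 for name, (v1, v2) in m.items()}


gains = point_gains(metrics)
cost_multiple = round(cost_per_page[1] / cost_per_page[0], 2)
extra_credits_per_1000_pages = round((cost_per_page[1] - cost_per_page[0]) * 1000, 2)

print(gains)
print(cost_multiple, extra_credits_per_1000_pages)
```

Under these numbers, v2 gains 17, 4, and 16 percentage points respectively at 4x the per-page cost, or 30 additional credits per 1,000 pages. Whether that trade is justified depends on how much a failed query costs in your workflow.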
Migrate Incrementally
Advanced Patterns
Multi-Page Document Assembly
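As a rough sketch of lineage-based reassembly, the example below groups page-level documents by their source object and joins them in page order. The field names `source_object_id`, `page_number`, and `text` are assumptions made for illustration; the actual lineage fields may be named differently.

```python
from collections import defaultdict
from typing import Dict, List


def reassemble(page_docs: List[dict]) -> Dict[str, str]:
    """Group page documents by source object, then join their text in page order."""
    by_source = defaultdict(list)
    for doc in page_docs:
        by_source[doc["source_object_id"]].append(doc)
    return {
        source_id: "\n\n".join(
            d["text"] for d in sorted(docs, key=lambda d: d["page_number"])
        )
        for source_id, docs in by_source.items()
    }


# Pages can arrive in any order and from several source objects.
page_docs = [
    {"source_object_id": "obj_1", "page_number": 2, "text": "Page two."},
    {"source_object_id": "obj_2", "page_number": 1, "text": "Other contract."},
    {"source_object_id": "obj_1", "page_number": 1, "text": "Page one."},
]
full_text = reassemble(page_docs)
print(full_text["obj_1"])
```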
For documents chunked by page, use lineage to reassemble:
Named Entity Recognition (NER)
Extract entities like dates, amounts, party names:
Document Comparison
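Clause comparison by vector similarity can be illustrated with plain cosine similarity. The three-dimensional vectors and clause identifiers below are made up for the example; real embeddings would come from the text extractor and have far more dimensions, and a production system would typically use a vector index rather than a linear scan.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(
    query: List[float], clauses: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Rank clause embeddings by similarity to the query embedding."""
    scored = [(clause_id, cosine(query, vec)) for clause_id, vec in clauses.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy embeddings keyed by hypothetical "contract/clause" identifiers.
clauses = {
    "contract_a/termination": [0.9, 0.1, 0.0],
    "contract_b/termination": [0.8, 0.2, 0.1],
    "contract_b/liability": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
ranked = most_similar(query, clauses)
for clause_id, score in ranked:
    print(clause_id, round(score, 3))
```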
Use vector similarity to find similar clauses across contracts:
Template Matching
Create a taxonomy of standard clauses:
Visual Document Search
For documents with diagrams, charts, or images:
Document Summarization Pipeline
Generate executive summaries for long documents:
Output Schema Examples
PDF Page Document:
Performance Considerations
| Optimization | Impact |
|---|---|
| OCR model selection | Tesseract (fast, moderate accuracy) vs Cloud Vision (slower, high accuracy) |
| Chunk strategy | Page-level chunks reduce granularity; paragraph-level increases precision |
| OCR fallback | Enable only for scanned pages; adds 2-5 s per page |
| Image extraction | Doubles processing time; disable if diagrams not needed |
| Table detection | Resource-intensive; apply only to document types with tables |
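The figures in the table can be combined into a back-of-the-envelope time estimate. The sketch below is an assumption-laden illustration: `base_seconds_per_page` is a value you would need to measure yourself, and it assumes that "doubles processing time" applies to the whole job rather than only to pages with images.

```python
def estimate_seconds(
    total_pages: int,
    scanned_pages: int,
    base_seconds_per_page: float,  # hypothetical; measure on your own workload
    image_extraction: bool = False,
):
    """Return a (low, high) estimate of processing time in seconds."""
    if not 0 <= scanned_pages <= total_pages:
        raise ValueError("scanned_pages must be between 0 and total_pages")
    base = total_pages * base_seconds_per_page
    low = base + scanned_pages * 2.0   # OCR fallback adds 2 s per scanned page at best
    high = base + scanned_pages * 5.0  # and 5 s per scanned page at worst
    if image_extraction:
        low, high = low * 2, high * 2  # image extraction roughly doubles the total
    return low, high


low, high = estimate_seconds(total_pages=100, scanned_pages=20, base_seconds_per_page=0.5)
low_img, high_img = estimate_seconds(100, 20, 0.5, image_extraction=True)
print(low, high, low_img, high_img)
```

For a 100-page batch with 20 scanned pages and an assumed 0.5 s base cost per page, this gives 90 to 150 seconds, or 180 to 300 seconds with image extraction enabled.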
Use Case Examples
Legal Contract Analysis
Extract clauses, identify key terms (termination, liability, indemnification), and compare across contracts. Use NER to track parties and dates. Generate risk scores with LLM analysis.
Invoice Processing
Extract line items, totals, vendor info, and payment terms. Use table extraction for itemized billing. Match invoices to purchase orders via semantic search.
Research Paper Discovery
Index academic papers with citation extraction. Search by abstract, methods, or findings. Cluster related papers and generate literature review summaries.
Medical Records Management
OCR scanned patient records, extract diagnoses and medications via NER. Enable HIPAA-compliant search with namespace isolation and audit logging.
Insurance Claims Processing
Extract policy numbers, claim amounts, and incident descriptions. Match claims to policy documents. Flag anomalies with taxonomy-based risk classification.
Compliance & Security
Data Retention
Configure lifecycle policies for sensitive documents:
Redaction
Use LLM stages to detect and redact PII:
Access Control
Use namespaces to isolate document sets by department:
Monitoring & Troubleshooting
Track Extraction Quality
Monitor __fully_enriched and __missing_features:
If the rate of __fully_enriched: false is high:
- Check OCR quality (increase resolution, use better model)
- Review extractor logs for errors
- Verify document formats are supported
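One way to track the rate described above is to aggregate the two fields over a batch of documents. This sketch assumes `__fully_enriched` is a boolean and `__missing_features` is a list of feature names on each document; the feature names in the sample data are invented for illustration.

```python
from collections import Counter
from typing import List


def enrichment_report(docs: List[dict]) -> dict:
    """Summarize how many documents are incomplete and which features they lack."""
    total = len(docs)
    # A document without the flag is treated as not fully enriched.
    incomplete = [d for d in docs if not d.get("__fully_enriched", False)]
    missing = Counter(
        feature for d in incomplete for feature in d.get("__missing_features", [])
    )
    return {
        "incomplete_rate": len(incomplete) / total if total else 0.0,
        "missing_features": dict(missing),
    }


# Sample batch with hypothetical feature names.
docs = [
    {"__fully_enriched": True, "__missing_features": []},
    {"__fully_enriched": False, "__missing_features": ["text_embedding"]},
    {"__fully_enriched": False, "__missing_features": ["text_embedding", "table_data"]},
    {"__fully_enriched": True, "__missing_features": []},
]
report = enrichment_report(docs)
print(report)
```

The per-feature counts point at which extractor to investigate first: a spike in one missing feature usually narrows the problem to a single stage.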
Validate Extracted Data
Sample documents and inspect metadata.text and metadata.table_data for accuracy.
Next Steps
- Explore Feature Extractors for OCR and layout models
- Learn Taxonomies for clause classification
- Review Filters for complex document queries
- Check Security for compliance best practices

