Feature Extractors
Transform and enrich your data through customizable extraction pipelines
Feature extractors are specialized data transformation pipelines that extract meaningful information from your content. They represent the crucial “Transform” stage in the ETL (Extract, Transform, Load) ingestion process. Learn more about the overall architecture and core concepts.
Overview
Feature extractors in Mixpeek are purpose-built data transformation pipelines that convert raw, unstructured content into structured, meaningful features. They operate as a distinct step in the ingestion pipeline, sitting between initial document extraction and final indexing. For more details on data management, see buckets and collections.
Transformation Focus
Convert raw content into structured, searchable features specific to your use case
Pipeline Integration
Seamlessly integrate with document extraction and indexing stages in the ETL process
The Ingestion Pipeline
Extract (E)
Start with documents from either:
- Unindexed documents in a bucket
- Documents in a collection where the index is not being used
Transform (T)
Apply feature extractors to pull out and structure relevant data:
- Configure input handling: individual or grouped documents
- Define output type: single or multiple documents
- Map inputs and outputs between pipeline stages
- Set document handling strategies
Load (L)
Index the extracted features for efficient search and retrieval:
- Create new documents
- Update existing documents
- Merge with existing content
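The three stages above can be pictured with a small, self-contained sketch. This is a toy model, not Mixpeek's implementation: the `run_pipeline` and `word_count_extractor` names are hypothetical, and the reading of `individual` (one extractor call per document) versus `grouped` (one call for the whole set) is an assumption made for illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]


def run_pipeline(
    documents: List[Document],
    extractor: Callable[[List[Document]], List[Document]],
    input_handling: str = "individual",
) -> List[Document]:
    """Toy Extract -> Transform -> Load flow (illustrative only)."""
    # Extract: the caller supplies the source documents.
    if input_handling == "individual":
        batches = [[doc] for doc in documents]  # one extractor call per document
    elif input_handling == "grouped":
        batches = [documents]  # one extractor call for the whole group
    else:
        raise ValueError(f"unknown input handling: {input_handling!r}")

    index: List[Document] = []
    for batch in batches:
        # Transform: run the extractor on the batch.
        features = extractor(batch)
        # Load: store the results as new documents (assumed create_new behavior).
        index.extend(features)
    return index


def word_count_extractor(batch: List[Document]) -> List[Document]:
    """Hypothetical extractor that emits a single output document per batch."""
    total = sum(len(doc["text"].split()) for doc in batch)
    return [{"word_count": str(total)}]


docs = [{"text": "hello world"}, {"text": "feature extractors transform data"}]
individual_result = run_pipeline(docs, word_count_extractor, "individual")
grouped_result = run_pipeline(docs, word_count_extractor, "grouped")
```

Under these assumptions, individual handling yields one output document per input, while grouped handling yields a single output for the whole set.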
Configuration Options
When creating a collection with feature extractors, you can configure:
Feature Extractor Configuration
- feature_extractor_name: Name of the extractor
- version: Version of the extractor
- parameters: Custom parameters for the extractor
- input_mapping: Maps pipeline inputs to extractor inputs
- output_mapping: Maps extractor outputs to pipeline outputs
- document_output_type: Type of output (single or multiple)
- document_input_handling: How documents are provided (individual or grouped)
- document_output_handling: How output is handled (create_new)
For complete API details, see Create Collection API Reference.
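As a rough illustration of how these options fit together, the sketch below models one extractor configuration as a Python dataclass and checks the enumerated fields before the configuration is sent anywhere. The field names and allowed values come from the list above; the `FeatureExtractorConfig` class, the default values, and the example extractor name, version, and parameters are assumptions for illustration and are not part of the Mixpeek SDK.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict

# Allowed values as listed in the configuration options above.
ALLOWED_VALUES = {
    "document_output_type": {"single", "multiple"},
    "document_input_handling": {"individual", "grouped"},
    "document_output_handling": {"create_new"},
}


@dataclass
class FeatureExtractorConfig:
    """Hypothetical local model of one extractor configuration."""

    feature_extractor_name: str
    version: str
    parameters: Dict[str, Any] = field(default_factory=dict)
    input_mapping: Dict[str, str] = field(default_factory=dict)
    output_mapping: Dict[str, str] = field(default_factory=dict)
    # The defaults below are assumptions, not documented defaults.
    document_output_type: str = "single"
    document_input_handling: str = "individual"
    document_output_handling: str = "create_new"

    def __post_init__(self) -> None:
        # Reject values outside the documented options early.
        for name, allowed in ALLOWED_VALUES.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(
                    f"{name} must be one of {sorted(allowed)}, got {value!r}"
                )


# Example values are placeholders, not real extractor names or parameters.
text_config = FeatureExtractorConfig(
    feature_extractor_name="text_extractor",
    version="v1",
    parameters={"max_length": 1000},
    input_mapping={"text": "content"},
    output_mapping={"extracted_text": "text_features"},
)
text_config_dict = asdict(text_config)
```

Validating locally like this catches a misspelled option before a request is made; the authoritative schema remains the Create Collection API Reference.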
Types of Feature Extractors
Common Use Cases
Document Analysis
Extract key information from business documents, contracts, and reports
Content Enrichment
Add structured metadata and features to enhance searchability
Media Processing
Extract features from images, audio, and video content
Data Standardization
Transform varied content into consistent, structured formats
Example: Creating a Collection with Feature Extractors
Here’s an example of how to create a collection that uses a bucket as its source and applies multiple feature extractors. For complete Python SDK documentation, see mixpeek on PyPI.
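A minimal sketch of what such a request body might look like is given here as a plain Python dictionary. It deliberately avoids calling the SDK, because the exact client methods are not shown on this page. The extractor fields follow the configuration options listed earlier; the top-level keys (`collection_name`, `description`, `source`, `bucket_id`, `prefix`, `filters`, `feature_extractors`), the extractor names, and all parameter values are assumptions for illustration only.

```python
import json
from typing import Any, Dict, Optional


def build_collection_payload(
    bucket_id: str, prefix: Optional[str] = None
) -> Dict[str, Any]:
    """Assemble a hypothetical create-collection request body."""
    source: Dict[str, Any] = {
        "type": "bucket",
        "bucket_id": bucket_id,
        "filters": {},  # object filters when the source is a bucket
    }
    if prefix is not None:
        source["prefix"] = prefix  # optional key prefix to narrow the objects

    text_extractor = {
        "feature_extractor_name": "text_extractor",  # placeholder name
        "version": "v1",
        "parameters": {"max_length": 1000},  # placeholder text length limit
        "input_mapping": {"text": "content"},
        "output_mapping": {"extracted_text": "text_features"},
        "document_output_type": "single",
        "document_input_handling": "individual",
        "document_output_handling": "create_new",
    }
    metadata_extractor = {
        "feature_extractor_name": "metadata_extractor",  # placeholder name
        "version": "v1",
        "parameters": {"fields": ["author", "created_at"]},  # placeholder fields
        "input_mapping": {"document": "content"},
        "output_mapping": {"metadata": "metadata_features"},
        "document_output_type": "single",
        "document_input_handling": "individual",
        "document_output_handling": "create_new",
    }
    return {
        "collection_name": "my-documents",
        "description": "Documents processed with text and metadata extractors",
        "source": source,
        "feature_extractors": [text_extractor, metadata_extractor],
    }


payload = build_collection_payload("bucket_123", prefix="reports/")
payload_json = json.dumps(payload, indent=2)
```

The resulting JSON string could then be sent with whichever client you use; check the Create Collection API Reference for the actual field names before relying on this shape.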
This example demonstrates:
- Creating a collection named “my-documents” with a description
- Using a bucket as the source of documents with:
  - Specific bucket ID
  - Optional prefix key for filtering objects
  - Configurable filters for object or document selection (depending on whether the collection source is a bucket or a collection, respectively)
- Applying two feature extractors with complete configurations:
  - Text extractor with:
    - Custom parameters for text length limits
    - Input/output mappings
    - Document handling configurations
  - Metadata extractor with:
    - Parameters for specific metadata fields
    - Custom input/output mappings
    - Document handling strategies
The feature extractor configuration allows you to:
- Specify custom parameters for each extractor
- Map inputs and outputs between pipeline stages
- Control how documents are handled during processing
- Define output types and handling strategies
For more examples and implementation details, see the Studio Walkthrough, API Reference, and Python SDK Documentation.
API Reference
For complete details on implementing and using feature extractors, see our Feature Extractors API Reference. For monitoring and managing extraction tasks, see Tasks API Reference.