Feature extractors are specialized data transformation pipelines that extract meaningful information from your content. They represent the crucial “Transform” stage in the ETL (Extract, Transform, Load) ingestion process. Learn more about the overall architecture and core concepts.

Overview

Feature extractors in Mixpeek are purpose-built data transformation pipelines that convert raw, unstructured content into structured, meaningful features. They operate as a distinct step in the ingestion pipeline, sitting between initial document extraction and final indexing. For more details on data management, see buckets and collections.

Transformation Focus

Convert raw content into structured, searchable features specific to your use case

Pipeline Integration

Seamlessly integrate with document extraction and indexing stages in the ETL process

The Ingestion Pipeline

1. Extract (E)

Start with documents from either:

  • Unindexed documents in a bucket
  • Documents in a collection where the index is not being used

2. Transform (T)

Apply feature extractors to pull out and structure relevant data:

  • Configure input handling: individual or grouped documents
  • Define output type: single or multiple documents
  • Map inputs and outputs between pipeline stages
  • Set document handling strategies

3. Load (L)

Index the extracted features for efficient search and retrieval:

  • Create new documents
  • Update existing documents
  • Merge with existing content
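
To make the three stages concrete, here is a minimal, self-contained sketch in plain Python. The function and field names (extract, transform, load, text_length_extractor) are illustrative assumptions rather than Mixpeek SDK calls; in practice the platform runs these stages for you once a collection with feature extractors is connected to its source.

# Illustrative sketch only: the names below are hypothetical and are not part of
# the Mixpeek SDK. They show how the three ETL stages relate conceptually.

def extract(bucket_objects):
    # E: pull raw, unindexed documents out of the source bucket
    return [{"object_id": obj["id"], "content": obj["raw_text"]} for obj in bucket_objects]

def transform(documents, extractor):
    # T: apply a feature extractor to each document (individual input handling),
    # producing one structured output document per input (single output type)
    return [extractor(doc) for doc in documents]

def load(index, feature_documents):
    # L: index the extracted features so they become searchable;
    # here each feature document simply creates a new index entry
    for doc in feature_documents:
        index[doc["object_id"]] = doc

def text_length_extractor(doc):
    # A stand-in extractor that derives one structured feature from raw text
    return {"object_id": doc["object_id"], "char_count": len(doc["content"])}

index = {}
raw = [{"id": "obj_1", "raw_text": "Quarterly report ..."}]
load(index, transform(extract(raw), text_length_extractor))
print(index)  # {'obj_1': {'object_id': 'obj_1', 'char_count': 20}}

The same shape applies to any extractor: raw content goes in, structured features come out, and the Load stage decides whether those features create new documents, update existing ones, or merge with existing content.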

Configuration Options

When creating a collection with feature extractors, you can configure:

Feature Extractor Configuration

  • feature_extractor_name: Name of the extractor
  • version: Version of the extractor
  • parameters: Custom parameters for the extractor
  • input_mapping: Maps pipeline inputs to extractor inputs
  • output_mapping: Maps extractor outputs to pipeline outputs
  • document_output_type: Type of output (single or multiple)
  • document_input_handling: How documents are provided (individual or grouped)
  • document_output_handling: How output is handled (create_new)

For complete API details, see Create Collection API Reference.
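
The two mapping fields are easiest to see with a small illustration. The sketch below is plain Python with hypothetical field names, not Mixpeek SDK calls; the direction of the keys (extractor field as key, document field as value) is an assumption based on the example later on this page, so treat the API reference as authoritative.

# Hypothetical illustration of input_mapping / output_mapping semantics;
# the field names and the stand-in extractor are assumptions, not Mixpeek APIs.

source_document = {"content": "ACME Master Services Agreement, effective 2024-01-01."}

config = {
    "input_mapping": {"source_text": "content"},                 # extractor input <- document field
    "output_mapping": {"extracted_text": "processed_content"},   # extractor output -> document field
}

def text_extractor(source_text: str) -> dict:
    # Stand-in extractor: returns a structured feature derived from raw text
    return {"extracted_text": source_text.strip().lower()}

# Resolve the input mapping: pull each mapped document field into the extractor argument
extractor_inputs = {k: source_document[v] for k, v in config["input_mapping"].items()}
result = text_extractor(**extractor_inputs)

# Resolve the output mapping: write each extractor output under the mapped document field
indexed_document = {v: result[k] for k, v in config["output_mapping"].items()}
print(indexed_document)  # {'processed_content': 'acme master services agreement, effective 2024-01-01.'}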

Types of Feature Extractors

Common Use Cases

Document Analysis

Extract key information from business documents, contracts, and reports

Content Enrichment

Add structured metadata and features to enhance searchability

Media Processing

Extract features from images, audio, and video content

Data Standardization

Transform varied content into consistent, structured formats

Example: Creating a Collection with Feature Extractors

Here’s an example of how to create a collection that uses a bucket as its source and applies a feature extractor. For complete Python SDK documentation, see mixpeek on PyPI.

import mixpeek
from mixpeek import Mixpeek
import os

# Initialize the Mixpeek client
with Mixpeek(
    token=os.getenv("MIXPEEK_TOKEN", ""),
) as m_client:

    # Create a collection with feature extractors
    res = m_client.collections.create(
        collection_name="my-documents",
        description="A collection of processed business documents",
        source={
            "type": mixpeek.SourceType.BUCKET,
            "bucket_id": "bucket_1234567890",  # ID of the source bucket
            "prefix_key": "documents/",  # Optional prefix to filter bucket objects
            "filters": {
                "and_": [],  # List of AND conditions
                "or_": [],   # List of OR conditions
                "not_": [],  # List of NOT conditions
                "case_sensitive": True,  # Whether filters are case sensitive
            },
        },
        feature_extractors=[
            {
                "feature_extractor_name": "text-extractor",
                "version": "1.0.0",
                "parameters": {
                    "min_length": 100,
                    "max_length": 1000
                },
                "input_mapping": {
                    "source_text": "content"
                },
                "output_mapping": {
                    "extracted_text": "processed_content"
                },
                "document_output_type": "TEXT",
                "document_input_handling": {
                    "type": "DIRECT",
                    "chunk_size": 1024
                },
                "document_output_handling": {
                    "type": "MERGE",
                    "strategy": "APPEND"
                }
            }
        ]
    )

    # Handle response
    print(res)

This example demonstrates:

  • Creating a collection named “my-documents” with a description
  • Using a bucket as the source of documents with:
    • Specific bucket ID
    • Optional prefix key for filtering objects
    • Configurable filters for selecting objects or documents (depending on whether the collection's source is a bucket or another collection, respectively)
  • Applying a feature extractor (text-extractor) with a complete configuration:
    • Custom parameters for text length limits
    • Input/output mappings between pipeline stages
    • Document input and output handling settings

The feature extractor configuration allows you to:

  • Specify custom parameters for each extractor
  • Map inputs and outputs between pipeline stages
  • Control how documents are handled during processing
  • Define output types and handling strategies
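
As a contrast to the text extractor shown above, the sketch below suggests what a configuration with grouped input handling and multiple output documents might look like (for example, a chunking extractor that turns one long document into many chunk documents). The extractor name, parameters, and the specific values ("GROUPED", "MULTIPLE", "CREATE_NEW", "group_by") are illustrative assumptions, not confirmed API values; consult the Feature Extractors API Reference for the accepted options.

# Illustrative configuration only; names and enum-like values are assumptions.
chunking_extractor = {
    "feature_extractor_name": "example-chunking-extractor",  # hypothetical extractor
    "version": "1.0.0",
    "parameters": {"chunk_size": 512, "overlap": 64},
    "input_mapping": {"source_text": "content"},
    "output_mapping": {"chunk_text": "chunk"},
    # Grouped input handling: related documents are passed to the extractor together
    "document_input_handling": {"type": "GROUPED", "group_by": "object_id"},
    # Multiple output documents: each chunk becomes its own document in the collection
    "document_output_type": "MULTIPLE",
    "document_output_handling": {"type": "CREATE_NEW"},
}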

For more examples and implementation details, see the Studio Walkthrough, API Reference, and Python SDK Documentation.

API Reference

For complete details on implementing and using feature extractors, see our Feature Extractors API Reference. For monitoring and managing extraction tasks, see Tasks API Reference.