Features

Features are the extracted data elements that represent the content and characteristics of your documents. They are the building blocks that enable advanced search and retrieval capabilities.

Overview

Features in Mixpeek are structured data elements extracted from your content during processing. They represent specific aspects of your data such as:

Text embeddings
Image descriptors
Audio transcriptions
Video scene information
PDF content structures

Feature Selection

Choose which specific characteristics need to be extracted based on your retrieval and analysis requirements.

Extraction Processing

Run specialized extractors that process the content to generate features using AI models and algorithms.

Feature Storage

Store processed features in optimized feature stores designed for rapid retrieval and similarity searching.

Feature Extractors

Extractors are pre-built pipelines with defined input and output schemas. Within a collection, extractors are queued and run in parallel. Each extractor has optional parameters that can be configured when the collection is defined.

Feature extraction in Mixpeek often follows a pattern similar to pandas' groupby operation (split, aggregate, merge). Just as pandas splits data into groups, applies operations, and combines the results, feature extractors often split content by some characteristic, extract relevant features, and merge them for unified retrieval.

Text Documents

Split: By paragraphs or sections
Aggregate: Extract embeddings, entities, and topics
Merge: Combine semantic vectors with metadata

Images

Split: By regions or objects
Aggregate: Extract visual features and scene descriptions
Merge: Combine visual embeddings with object detections

Audio

Split: By time segments or speakers
Aggregate: Extract transcriptions and audio features
Merge: Combine speech-to-text with audio embeddings

Video

Split: By scenes or frames
Aggregate: Extract visual and audio features per segment
Merge: Combine scene detections with temporal features
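The split, aggregate, merge pattern above can be sketched for a text document in plain Python. This is a minimal illustration, not Mixpeek's implementation: the function names are hypothetical, and simple word counts stand in for the embedding, entity, and topic models a real extractor would call.

```python
from collections import Counter


def split(document: str) -> list[str]:
    """Split: break a document into paragraphs on blank lines."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def aggregate(paragraph: str) -> dict:
    """Aggregate: compute toy features for one paragraph.

    Word counts are a stand-in (an assumption for this sketch) for
    the model-generated features a real extractor would produce.
    """
    words = [w.strip(".,;:!?").lower() for w in paragraph.split()]
    words = [w for w in words if w]
    counts = Counter(words)
    top_term = counts.most_common(1)[0][0] if counts else None
    return {"word_count": len(words), "top_term": top_term}


def merge(features: list[dict], metadata: dict) -> list[dict]:
    """Merge: attach document metadata and paragraph position to each feature."""
    return [{**metadata, "paragraph": i, **f} for i, f in enumerate(features)]


def extract_text_features(document: str, metadata: dict) -> list[dict]:
    """Run split, aggregate, and merge in sequence."""
    return merge([aggregate(p) for p in split(document)], metadata)


doc = "Cats sleep. Cats purr.\n\nDogs bark. Dogs run."
records = extract_text_features(doc, {"source": "pets.txt"})
for record in records:
    print(record)
```

Each output record carries both the per-paragraph features and the document-level metadata, so a single record can be retrieved without going back to the source document.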

Vector Embeddings

High-dimensional numerical representations of content that capture semantic meaning

Metadata Features

Structured data fields such as categories, timestamps, and attributes

Media-Specific Features

Specialized features like image scene classifications, video timestamps, or audio speaker identification

Relational Features

Features that establish connections between different content items

Extraction Process

Features are created through feature extractors as part of processing pipelines. This happens in several stages:

Raw Content Analysis

Content is analyzed based on its type (text, image, video, etc.)

Feature Extraction

Specialized extractors process the content to generate features

Feature Normalization

Features are normalized into consistent formats

Storage

Processed features are stored in optimized feature stores
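The four stages can be sketched as a small pipeline. Everything here is an assumption made for illustration: the extension-to-type mapping, the toy extractors, and the in-memory dictionary used as a store are hypothetical and do not reflect Mixpeek's internals.

```python
import math
from pathlib import Path

# Hypothetical mapping from file extension to content type.
CONTENT_TYPES = {".txt": "text", ".png": "image"}


def analyze(filename: str) -> str:
    """Stage 1: determine the content type from the file extension."""
    return CONTENT_TYPES.get(Path(filename).suffix.lower(), "unknown")


def text_extractor(content: bytes) -> list[float]:
    """Toy text extractor: character count and word count."""
    text = content.decode("utf-8")
    return [float(len(text)), float(len(text.split()))]


def image_extractor(content: bytes) -> list[float]:
    """Toy image extractor: byte length and mean byte value."""
    mean = sum(content) / len(content) if content else 0.0
    return [float(len(content)), mean]


# Stage 2 dispatch table: one extractor per content type.
EXTRACTORS = {"text": text_extractor, "image": image_extractor}


def normalize(vector: list[float]) -> list[float]:
    """Stage 3: scale the vector to unit length (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def process(filename: str, content: bytes, store: dict) -> dict:
    """Run all four stages and write the result to the store (stage 4)."""
    content_type = analyze(filename)
    extractor = EXTRACTORS.get(content_type)
    if extractor is None:
        raise ValueError(f"no extractor for {filename!r}")
    store[filename] = {
        "type": content_type,
        "vector": normalize(extractor(content)),
    }
    return store[filename]


store: dict = {}
feature = process("notes.txt", b"three short words", store)
print(feature["type"], feature["vector"])
```

Normalizing every vector to unit length is one common choice of "consistent format"; it makes vectors from different documents comparable by dot product regardless of the original magnitude.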

Feature Storage

Features are stored in specialized feature stores optimized for efficient retrieval. Unlike traditional database columns, feature stores are designed to handle:

High-dimensional vector data
Efficient similarity searching
Specialized indexes for multimodal content
Rapid retrieval of specific feature types
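To make similarity searching over stored vectors concrete, here is a minimal brute-force sketch. The store layout, the feature type names, and the `search` function are assumptions for illustration; production feature stores typically rely on specialized indexes rather than scanning every vector.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(store: dict, query: list[float], feature_type: str, top_k: int = 2):
    """Rank stored features of one type by similarity to the query vector."""
    scored = [
        (cosine_similarity(query, feature["vector"]), doc_id)
        for doc_id, feature in store.items()
        if feature["type"] == feature_type  # retrieve one feature type only
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


# Hypothetical in-memory store keyed by document ID.
store = {
    "doc-a": {"type": "text_embedding", "vector": [1.0, 0.0, 0.0]},
    "doc-b": {"type": "text_embedding", "vector": [0.6, 0.8, 0.0]},
    "img-c": {"type": "image_embedding", "vector": [1.0, 0.0, 0.0]},
}

results = search(store, [1.0, 0.0, 0.0], "text_embedding")
print(results)
```

Filtering by feature type before scoring keeps text and image vectors apart, since vectors produced by different models are generally not comparable.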

Best Practices

Understand Feature Types

Different content requires different feature types. Text works well with embeddings, while images need visual descriptors.

Feature Composition

Combine multiple features across retrieval stages for more accurate results. Using text and image features together often gives better results than using either alone.
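One way to compose features is a two-stage search: shortlist candidates by text similarity, then rerank the shortlist with a weighted mix of text and image scores. The sketch below assumes unit-length vectors and uses hypothetical names (`two_stage_search`, `text_weight`); it is an illustration of the idea, not Mixpeek's retrieval API.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(items, text_query, image_query, candidates=2, text_weight=0.5):
    """Shortlist by text score, then rerank by a weighted text + image score."""
    # Stage 1: keep the best `candidates` items by text similarity alone.
    shortlist = sorted(
        items, key=lambda item: dot(text_query, item["text"]), reverse=True
    )[:candidates]

    # Stage 2: rerank the shortlist using both feature types.
    def combined(item):
        text_score = dot(text_query, item["text"])
        image_score = dot(image_query, item["image"])
        return text_weight * text_score + (1 - text_weight) * image_score

    return [item["id"] for item in sorted(shortlist, key=combined, reverse=True)]


# Toy items with one text vector and one image vector each.
items = [
    {"id": "a", "text": [1.0, 0.0], "image": [0.0, 1.0]},
    {"id": "b", "text": [0.8, 0.6], "image": [1.0, 0.0]},
    {"id": "c", "text": [0.0, 1.0], "image": [1.0, 0.0]},
]

ranking = two_stage_search(items, text_query=[1.0, 0.0], image_query=[1.0, 0.0])
print(ranking)
```

In this toy data, text similarity alone would rank item `a` first, but item `b` matches the image query as well and moves ahead once both scores are combined.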

Regular Updates

As your content evolves, consider reprocessing to generate updated features with the latest extractors.
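A simple way to decide what to reprocess is to record which model version produced each feature and compare it with the version currently in use. The record fields and the `find_stale` helper below are hypothetical, shown only to sketch the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class FeatureRecord:
    """Hypothetical bookkeeping entry for one extracted feature."""
    document_id: str
    extractor: str
    model_version: str


def find_stale(records: list[FeatureRecord], current_versions: dict) -> list[str]:
    """Return sorted IDs of documents with a feature whose model version
    differs from the current version of its extractor."""
    stale = {
        record.document_id
        for record in records
        # Extractors missing from current_versions are treated as up to date.
        if current_versions.get(record.extractor, record.model_version)
        != record.model_version
    }
    return sorted(stale)


records = [
    FeatureRecord("doc-1", "text_embedding", "v1"),
    FeatureRecord("doc-2", "text_embedding", "v2"),
    FeatureRecord("doc-2", "image_embedding", "v1"),
]
current = {"text_embedding": "v2", "image_embedding": "v1"}

stale = find_stale(records, current)
print(stale)
```

The result is a list of whole documents rather than individual features, which suits a workflow where each document is reprocessed as a unit.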

Next Steps

Now that you understand features in Mixpeek, you can:

Explore Feature Extractors

Learn about the different feature extractors available

Limitations

Extraction Time: Complex feature extraction on large media files may require extended processing time
Model Specificity: Features are tied to the specific model version used during extraction
Storage Limits: Feature stores have capacity limits based on your account tier
Update Constraints: Features cannot be selectively updated; re-extraction of the entire document is required
Processing Dependencies: Feature extraction depends on the availability of third-party models and services
Cross-Feature Compatibility: Not all feature types can be directly compared or combined in search operations
Format Support: Some specialized formats may have limited feature extraction capabilities
