Features are the extracted data elements that represent the content and characteristics of your documents. They are the building blocks that enable advanced search and retrieval capabilities.

Overview

Features in Mixpeek are structured data elements extracted from your content during processing. They represent specific aspects of your data such as:

  • Text embeddings
  • Image descriptors
  • Audio transcriptions
  • Video scene information
  • PDF content structures

1. Feature Selection

Choose which specific characteristics need to be extracted based on your retrieval and analysis requirements.

2. Extraction Processing

Run specialized extractors that process the content to generate features using AI models and algorithms.

3. Feature Storage

Store processed features in optimized feature stores designed for rapid retrieval and similarity searching.

Feature Extractors

Extractors are pre-built pipelines with defined input and output schemas. Within a collection, extractors are queued and run in parallel. Each extractor has optional parameters that you can configure when you define the collection.
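To make the idea of defined schemas and optional parameters concrete, here is a minimal sketch in plain Python. It is not the Mixpeek SDK: ExtractorSpec, resolve_params, validate_input, and the chunk_size parameter are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


# Hypothetical illustration only: these names are not part of any Mixpeek API.
@dataclass(frozen=True)
class ExtractorSpec:
    name: str
    input_schema: dict[str, type]    # field name -> expected Python type
    output_schema: dict[str, type]
    default_params: dict[str, Any] = field(default_factory=dict)

    def resolve_params(self, overrides: Optional[dict[str, Any]] = None) -> dict[str, Any]:
        """Merge optional overrides into the defaults, rejecting unknown keys."""
        overrides = overrides or {}
        unknown = set(overrides) - set(self.default_params)
        if unknown:
            raise ValueError(f"unknown parameters: {sorted(unknown)}")
        return {**self.default_params, **overrides}

    def validate_input(self, record: dict[str, Any]) -> None:
        """Raise if an input field is missing or has the wrong type."""
        for key, expected in self.input_schema.items():
            if key not in record:
                raise KeyError(f"missing input field: {key}")
            if not isinstance(record[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")


text_extractor = ExtractorSpec(
    name="text_embedding_example",
    input_schema={"text": str},
    output_schema={"embedding": list},
    default_params={"chunk_size": 512},
)

# Parameters are fixed once, at the point where the collection would be defined.
params = text_extractor.resolve_params({"chunk_size": 256})
text_extractor.validate_input({"text": "hello"})
```

Rejecting unknown parameter names up front is a design choice of this sketch: it surfaces configuration typos before any content is processed.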

Feature extraction in Mixpeek often follows a pattern similar to pandas’ groupby operation (split-apply-combine), described here as split, aggregate, merge. Just as pandas splits data into groups, applies an operation to each group, and combines the results, a feature extractor can often split content by some characteristic, extract relevant features from each part, and merge them for unified retrieval.

Text Documents

Split: By paragraphs or sections
Aggregate: Extract embeddings, entities, and topics
Merge: Combine semantic vectors with metadata

Images

Split: By regions or objects
Aggregate: Extract visual features and scene descriptions
Merge: Combine visual embeddings with object detections

Audio

Split: By time segments or speakers
Aggregate: Extract transcriptions and audio features
Merge: Combine speech-to-text with audio embeddings

Video

Split: By scenes or frames
Aggregate: Extract visual and audio features per segment
Merge: Combine scene detections with temporal features
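The split, aggregate, merge pattern listed above can be sketched for the text case in a few lines of plain Python. This is a toy under stated assumptions: word counts stand in for real embeddings, entities, and topics, and the function names and the doc_id field are hypothetical.

```python
import re
from collections import Counter


def split_paragraphs(text: str) -> list[str]:
    """Split: break a document into non-empty paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def aggregate(paragraph: str) -> dict:
    """Aggregate: compute toy features for one paragraph.

    Term counts stand in for what a real extractor would produce.
    """
    words = re.findall(r"[a-z0-9']+", paragraph.lower())
    return {"word_count": len(words), "term_counts": Counter(words)}


def merge(features: list[dict], metadata: dict) -> dict:
    """Merge: combine per-paragraph features with document metadata."""
    total_terms = Counter()
    for f in features:
        total_terms.update(f["term_counts"])
    return {
        **metadata,
        "paragraph_count": len(features),
        "word_count": sum(f["word_count"] for f in features),
        "top_terms": [term for term, _ in total_terms.most_common(3)],
    }


doc = (
    "Features power search.\n\n"
    "Features are extracted from content.\n\n"
    "Search uses features."
)
parts = split_paragraphs(doc)
record = merge([aggregate(p) for p in parts], {"doc_id": "doc-1"})
```

The same three-function shape applies to the other media types: only the split rule (regions, time segments, scenes) and the per-part feature computation change.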

Vector Embeddings

High-dimensional numerical representations of content that capture semantic meaning

Metadata Features

Structured data fields such as categories, timestamps, and attributes

Media-Specific Features

Specialized features like image scene classifications, video timestamps, or audio speaker identification

Relational Features

Features that establish connections between different content items
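As a rough illustration of how these four kinds of feature might sit side by side for a single item, the sketch below groups them in one record. FeatureRecord and all of its field names are hypothetical, and the values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureRecord:
    """Hypothetical container grouping the four feature kinds for one item."""
    item_id: str
    embedding: list[float]                                   # vector embedding
    metadata: dict[str, str] = field(default_factory=dict)   # metadata features
    media: dict[str, object] = field(default_factory=dict)   # media-specific features
    related_ids: list[str] = field(default_factory=list)     # relational features


clip = FeatureRecord(
    item_id="video-42",
    embedding=[0.12, -0.53, 0.88],
    metadata={"category": "tutorial", "created": "2024-01-15"},
    media={"scene_label": "indoor", "start_s": 0.0, "end_s": 12.5},
    related_ids=["transcript-42"],
)
```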

Extraction Process

Features are created through feature extractors as part of processing pipelines. This happens in several stages:

1. Raw Content Analysis

Content is analyzed based on its type (text, image, video, etc.).

2. Feature Extraction

Specialized extractors process the content to generate features.

3. Feature Normalization

Features are normalized into consistent formats.

4. Storage

Processed features are stored in optimized feature stores.
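The four stages can be sketched end to end as follows. Everything here is an assumption made for illustration: the MIME-based type check, the two-number toy "extractor", unit-length scaling as the normalization step, and a dictionary standing in for a feature store.

```python
import math

FEATURE_STORE: dict[str, dict] = {}  # in-memory stand-in for a feature store


def analyze(content: dict) -> str:
    """Stage 1: decide the content type, here from a declared MIME type."""
    return content["mime"].split("/")[0]  # "text/plain" -> "text"


def extract(content: dict, kind: str) -> list[float]:
    """Stage 2: toy extractor; a real one would call a model."""
    if kind == "text":
        text = content["data"]
        return [float(len(text)), float(text.count(" ") + 1)]
    raise NotImplementedError(f"no toy extractor for {kind!r}")


def normalize(vector: list[float]) -> list[float]:
    """Stage 3: scale to unit length so vectors share a consistent format."""
    norm = math.sqrt(sum(x * x for x in vector))
    return vector if norm == 0 else [x / norm for x in vector]


def store(item_id: str, kind: str, vector: list[float]) -> None:
    """Stage 4: persist the processed feature."""
    FEATURE_STORE[item_id] = {"type": kind, "vector": vector}


def process(item_id: str, content: dict) -> list[float]:
    """Run all four stages for one item and return the stored vector."""
    kind = analyze(content)
    vector = normalize(extract(content, kind))
    store(item_id, kind, vector)
    return vector


vec = process("doc-1", {"mime": "text/plain", "data": "hello world"})
```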

Feature Storage

Features are stored in specialized feature stores optimized for efficient retrieval. Unlike traditional database columns, feature stores are designed to handle:

  • High-dimensional vector data
  • Efficient similarity searching
  • Specialized indexes for multimodal content
  • Rapid retrieval of specific feature types
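To show what similarity searching over vector data means in practice, here is a brute-force sketch using cosine similarity over an in-memory dictionary. The item names and vectors are invented, and the exhaustive scan is a simplification: production feature stores typically rely on specialized indexes rather than comparing against every stored vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    """Score every stored vector against the query and keep the best k."""
    scored = [(item_id, cosine(query, vec)) for item_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


stored_vectors = {
    "cat-photo": [0.9, 0.1, 0.0],
    "dog-photo": [0.8, 0.3, 0.1],
    "invoice-pdf": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.0, 0.0], stored_vectors, k=2)
result_ids = [item_id for item_id, _ in results]
```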

Best Practices

Understand Feature Types

Different content requires different feature types. Text works well with embeddings, while images need visual descriptors.

Feature Composition

Combine multiple features across retrieval stages for more accurate results. For example, using text and image features together often ranks results better than using either alone.
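One way to picture feature composition is a two-stage search: shortlist candidates with one feature, then rerank the shortlist with a combined score. The sketch below is an assumption-laden toy, not Mixpeek's retrieval API. The two_stage_search function, the image_weight parameter, the dot-product scoring, and the tiny two-dimensional vectors are all invented for illustration.

```python
def dot(a: list[float], b: list[float]) -> float:
    """Dot product used as a simple similarity score."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_search(query_text_vec, query_image_vec, items,
                     shortlist=2, image_weight=0.5):
    """Stage 1 shortlists by text similarity; stage 2 reranks by a weighted sum."""
    # Stage 1: keep the best `shortlist` items by text similarity alone.
    candidates = sorted(
        items,
        key=lambda it: dot(query_text_vec, it["text_vec"]),
        reverse=True,
    )[:shortlist]

    # Stage 2: rerank the shortlist using text and image features together.
    def combined(it):
        text_score = dot(query_text_vec, it["text_vec"])
        image_score = dot(query_image_vec, it["image_vec"])
        return (1 - image_weight) * text_score + image_weight * image_score

    return [it["id"] for it in sorted(candidates, key=combined, reverse=True)]


items = [
    {"id": "a", "text_vec": [1.0, 0.0], "image_vec": [0.0, 1.0]},
    {"id": "b", "text_vec": [0.9, 0.1], "image_vec": [1.0, 0.0]},
    {"id": "c", "text_vec": [0.1, 0.9], "image_vec": [1.0, 0.0]},
]
ranked = two_stage_search([1.0, 0.0], [1.0, 0.0], items)
```

In this toy data, item "a" wins on text alone, but item "b" moves ahead once the image feature is included in the second stage, which is the effect composition is meant to capture.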

Regular Updates

As your content evolves, consider reprocessing to generate updated features with the latest extractors.

Limitations

  • Extraction Time: Complex feature extraction on large media files may require extended processing time
  • Model Specificity: Features are tied to the specific model version used during extraction
  • Storage Limits: Feature stores have capacity limits based on your account tier
  • Update Constraints: Features cannot be selectively updated; re-extraction of the entire document is required
  • Processing Dependencies: Feature extraction depends on the availability of third-party models and services
  • Cross-Feature Compatibility: Not all feature types can be directly compared or combined in search operations
  • Format Support: Some specialized formats may have limited feature extraction capabilities
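The model-specificity and cross-feature compatibility limits suggest checking features before comparing them. The following is a hedged sketch of such a guard: StoredFeature, comparable, needs_reextraction, and the model version strings are hypothetical and do not reflect how Mixpeek records this information.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredFeature:
    """Hypothetical stored feature with the provenance needed for safe comparison."""
    vector: tuple[float, ...]
    feature_type: str    # e.g. "text_embedding"
    model_version: str   # e.g. "example-model-v1" (made-up identifier)


def comparable(a: StoredFeature, b: StoredFeature) -> bool:
    """Two features are comparable only if type, model version, and size match."""
    return (
        a.feature_type == b.feature_type
        and a.model_version == b.model_version
        and len(a.vector) == len(b.vector)
    )


def needs_reextraction(feature: StoredFeature, current_model_version: str) -> bool:
    """Flag features produced by a model version other than the current one."""
    return feature.model_version != current_model_version


old = StoredFeature((0.1, 0.9), "text_embedding", "example-model-v1")
new = StoredFeature((0.2, 0.8), "text_embedding", "example-model-v2")
same = StoredFeature((0.3, 0.7), "text_embedding", "example-model-v2")
```

Under this sketch, a feature flagged by needs_reextraction would be regenerated by re-extracting the whole document, in line with the update constraint listed above.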