Collections store processed documents with a consistent schema. They are the primary containers for structured data that has been extracted from raw objects.

Overview

When raw objects are processed by feature extractors, the resulting structured data is stored in collections as documents.

1. Select Source

Choose whether to use a bucket or another collection as your data source. Buckets provide raw files, while collections provide documents that have already been processed.

2. Select Feature Extractors

Determine which feature extractors will process your source data. These extractors define what information is derived from your content.

3. Pass Properties from Source

Specify which properties from your source (blobs from buckets or fields from collections) should be passed to your feature extractors.

4. Configure Feature Extractors

Customize the settings of each feature extractor to optimize its performance for your specific use case and content types.

5. Configure Processing

Set up additional processing, such as taxonomy application, to further enhance and organize the extracted features and the resulting documents.

A collection's source can be either a bucket or an existing collection. Using an existing collection as the source lets you run further feature extraction downstream of documents that have already been processed.
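The five configuration steps can be pictured as one configuration object. The following is a minimal sketch only: the class names, field names, identifiers such as `bkt_marketing`, and the `min_scene_seconds` setting are hypothetical and are not taken from the platform's API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Source:
    kind: str        # step 1: "bucket" or "collection"
    source_id: str   # hypothetical identifier of the bucket or collection


@dataclass
class FeatureExtractor:
    name: str                                                # step 2
    inputs: dict[str, str] = field(default_factory=dict)     # step 3: extractor input -> source property
    settings: dict[str, Any] = field(default_factory=dict)   # step 4: extractor-specific settings


@dataclass
class CollectionConfig:
    name: str
    source: Source
    feature_extractors: list[FeatureExtractor]
    taxonomies: list[str] = field(default_factory=list)      # step 5: additional processing

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the config looks sane."""
        problems = []
        if self.source.kind not in ("bucket", "collection"):
            problems.append(f"unknown source kind: {self.source.kind!r}")
        return problems


# Hypothetical example: a collection fed by a bucket of marketing videos.
config = CollectionConfig(
    name="video_scenes",
    source=Source(kind="bucket", source_id="bkt_marketing"),
    feature_extractors=[
        FeatureExtractor(
            name="scene_detection",
            inputs={"video": "video_blob"},
            settings={"min_scene_seconds": 2},
        ),
        FeatureExtractor(name="transcription", inputs={"audio": "video_blob"}),
    ],
    taxonomies=["product_categories"],
)

print(config.validate())
```

Keeping the input mapping (step 3) separate from the settings (step 4) mirrors the order of the steps: what an extractor receives is decided before how it behaves.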


Key Concepts

Collection Schema

Collections maintain schema consistency across all their documents. This schema is determined by the pipeline that processes objects and writes to the collection.

// Example document in a collection
{
  "document_id": "doc_pqr678",
  "collection_id": "col_mno345",
  "source_object_id": "obj_ghi789",
  
  // System metadata fields
  "__fully_enriched": true,
  "__missing_features": [],
  "__pipeline_version": 1,
  
  // Document content (determined by pipeline)
  "title": "Red Running Shoes",
  "description": "Lightweight running shoes with cushioned soles for maximum comfort...",
  "detected_objects": ["shoe", "footwear", "red", "sports equipment"],
  "product_category": "footwear",
  "price": 89.99,
  "brand": "SportStep",
  
  // Timestamps
  "created_at": "2023-05-10T14:22:00Z",
  "updated_at": "2023-05-10T14:22:00Z"
}
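To make the idea of schema consistency concrete, here is a small client-side sketch that derives a field-to-type mapping from one reference document and reports where another document deviates from it. It is illustrative only: on the platform the schema is determined by the pipeline, and the helper names `infer_schema` and `schema_violations` are hypothetical.

```python
from typing import Any


def infer_schema(doc: dict[str, Any]) -> dict[str, str]:
    """Map each field name to the Python type name of its value."""
    return {name: type(value).__name__ for name, value in doc.items()}


def schema_violations(doc: dict[str, Any], schema: dict[str, str]) -> list[str]:
    """List the ways `doc` deviates from `schema`; empty means it conforms."""
    problems = []
    for name, expected in schema.items():
        if name not in doc:
            problems.append(f"missing field: {name}")
        else:
            actual = type(doc[name]).__name__
            if actual != expected:
                problems.append(f"{name}: expected {expected}, got {actual}")
    # Sorted so the report is deterministic.
    for name in sorted(doc.keys() - schema.keys()):
        problems.append(f"unexpected field: {name}")
    return problems


reference = {
    "title": "Red Running Shoes",
    "price": 89.99,
    "detected_objects": ["shoe", "footwear"],
}
schema = infer_schema(reference)

# Hypothetical candidate with a string price, a missing field, and an extra field.
candidate = {"title": "Blue Trail Shoes", "price": "89.99", "color": "blue"}
for problem in schema_violations(candidate, schema):
    print(problem)
```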

System Metadata Fields

All documents in collections include these standard system metadata fields:

| Field              | Type    | Description                                                         |
|--------------------|---------|---------------------------------------------------------------------|
| __fully_enriched   | boolean | Indicates if all expected features have been successfully extracted |
| __missing_features | array   | Lists any features that failed to extract during processing         |
| __pipeline_version | integer | Version of the pipeline that processed this document                |
| source_object_id   | string  | Reference to the source object in a bucket                          |
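As an illustration of how these fields might be used, the sketch below scans a list of documents and reports which ones are not fully enriched and which features they lack. The function name `enrichment_report`, the second document, and the feature name `"transcription"` are hypothetical; only the field names come from the table above.

```python
from typing import Any


def enrichment_report(documents: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Map the ID of each incomplete document to its list of missing features."""
    incomplete = {}
    for doc in documents:
        # Treat a missing flag as "not fully enriched" (an assumption of this sketch).
        if not doc.get("__fully_enriched", False):
            incomplete[doc["document_id"]] = list(doc.get("__missing_features", []))
    return incomplete


documents = [
    {
        "document_id": "doc_pqr678",
        "source_object_id": "obj_ghi789",
        "__fully_enriched": True,
        "__missing_features": [],
        "__pipeline_version": 1,
    },
    {
        # Hypothetical document whose transcription step failed.
        "document_id": "doc_abc123",
        "source_object_id": "obj_xyz456",
        "__fully_enriched": False,
        "__missing_features": ["transcription"],
        "__pipeline_version": 1,
    },
]

print(enrichment_report(documents))
```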

Populating Collections

Collections are populated by running feature extractors that process objects and output structured documents.

When you upload an object to a bucket, every collection connected downstream of that bucket invokes its feature extractors on the object. Processing continues until each extractor has either succeeded or failed.

[Diagram: a Marketing Bucket feeding a Video Scenes Collection, with Scene Detection, Transcription, and Visual Analysis feature extractors.]
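The processing flow described above can be sketched as a loop that runs every extractor to completion and records which ones failed. This is a simplified local model under stated assumptions, not the platform's implementation: the extractor functions, the `object_id` key, and the `build_document` helper are hypothetical, while the system metadata field names follow the documented ones.

```python
from typing import Any, Callable

# An extractor takes a source object and returns the fields it extracted.
Extractor = Callable[[dict[str, Any]], dict[str, Any]]


def build_document(
    source_object: dict[str, Any],
    extractors: dict[str, Extractor],
    pipeline_version: int = 1,
) -> dict[str, Any]:
    """Run every extractor; each one either succeeds or is recorded as missing."""
    doc: dict[str, Any] = {"source_object_id": source_object["object_id"]}
    missing = []
    for name, extract in extractors.items():
        try:
            doc.update(extract(source_object))
        except Exception:
            # A failing extractor does not stop the others from running.
            missing.append(name)
    doc["__missing_features"] = missing
    doc["__fully_enriched"] = not missing
    doc["__pipeline_version"] = pipeline_version
    return doc


def scene_detection(obj: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in that returns a fixed result.
    return {"scenes": [{"start": 0.0, "end": 4.5}]}


def transcription(obj: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in that always fails, to show the failure path.
    raise RuntimeError("audio track not found")


doc = build_document(
    {"object_id": "obj_ghi789"},
    {"scene_detection": scene_detection, "transcription": transcription},
)
print(doc["__fully_enriched"], doc["__missing_features"])
```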

Documents

Documents are the structured outputs stored in collections after processing objects through pipelines. Each document maintains a reference to its source object and contains extracted features and metadata.


Limitations

  • Schema Immutability: Once a collection schema is defined, it cannot be modified without creating a new collection
  • Processing Dependencies: Collections depend on the availability and reliability of their configured feature extractors
  • Consistency Requirements: All documents in a collection must conform to the same schema
  • Feature Extraction Failures: If critical features fail to extract, documents will be marked as incomplete
  • Query Performance: Very large collections may require indexing to maintain acceptable query performance
  • Cross-Collection Querying: Native joining between collections with different schemas is limited
  • Update Constraints: Documents in collections are primarily designed for append operations rather than frequent updates
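
Because native joining across collections is limited, one possible workaround is to combine results on the client. The sketch below pairs documents from two collections that reference the same source object. It assumes that both result sets have already been fetched and that both carry `source_object_id`; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict
from typing import Any

Document = dict[str, Any]


def join_on_source_object(
    left_docs: list[Document], right_docs: list[Document]
) -> list[tuple[Document, Document]]:
    """Pair documents from two collections that share a source_object_id."""
    right_by_source: dict[str, list[Document]] = defaultdict(list)
    for doc in right_docs:
        right_by_source[doc["source_object_id"]].append(doc)

    pairs = []
    for left in left_docs:
        for right in right_by_source.get(left["source_object_id"], []):
            pairs.append((left, right))
    return pairs


# Hypothetical documents from two collections with different schemas.
products = [
    {"document_id": "doc_pqr678", "source_object_id": "obj_ghi789", "title": "Red Running Shoes"},
    {"document_id": "doc_stu901", "source_object_id": "obj_jkl012", "title": "Blue Trail Shoes"},
]
transcripts = [
    {"document_id": "doc_vwx234", "source_object_id": "obj_ghi789", "text": "Lightweight and cushioned."},
]

for product, transcript in join_on_source_object(products, transcripts):
    print(product["title"], "|", transcript["text"])
```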