Organize processed documents with consistent schemas for efficient retrieval
Collections store processed documents with a consistent schema. They are the primary containers for structured data that has been extracted from raw objects.
When raw objects are processed by feature extractors, the resulting structured data is stored in collections as documents.
Select Source
Choose whether to use a bucket or another collection as your data source. Buckets provide raw files, while collections offer already processed documents.
Select Feature Extractors
Determine which feature extractors will process your source data. These extractors define what information will be derived from your content.
Pass Properties from Source
Specify which properties from your source (blobs from buckets or fields from collections) should be passed to your feature extractors.
Configure Feature Extractors
Customize settings for each feature extractor to optimize their performance for your specific use case and content types.
Configure Processing
Set up additional processing like taxonomy application to further enhance and organize the extracted features and resulting documents.
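The five setup steps above can be sketched as a single configuration object. This is an illustrative sketch, not a real client API: the class, field names, and extractor names are hypothetical stand-ins for whatever your platform exposes.

```python
from dataclasses import dataclass, field

@dataclass
class CollectionConfig:
    # Step 1 — Select Source: a bucket (raw files) or another collection
    source_type: str                    # "bucket" or "collection"
    source_id: str
    # Step 2 — Select Feature Extractors to run against the source
    feature_extractors: list[str] = field(default_factory=list)
    # Step 3 — Pass Properties from Source (blobs or document fields)
    passed_properties: list[str] = field(default_factory=list)
    # Step 4 — Configure Feature Extractors (per-extractor settings)
    extractor_settings: dict[str, dict] = field(default_factory=dict)
    # Step 5 — Configure Processing (e.g. taxonomy application)
    taxonomies: list[str] = field(default_factory=list)

# Hypothetical example: extract text from PDFs stored in a bucket
config = CollectionConfig(
    source_type="bucket",
    source_id="bkt_invoices",
    feature_extractors=["text_extractor"],
    passed_properties=["blob"],
    extractor_settings={"text_extractor": {"ocr": True}},
    taxonomies=["document_type"],
)
```

Grouping the steps into one object like this makes it easy to see that the source, the extractors, and the properties passed between them are decided together, before any processing runs.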
When defining a collection’s source, you can use a bucket or an existing collection as the input for downstream feature extraction.
Schema Consistency
All documents in a collection share the same schema. Schemas are immutable and are defined by the feature extractors used.
Source Configuration
Collections can use buckets or other collections as their source. The source you select exposes properties (blobs for a bucket, documents for a collection) that can optionally be passed into feature extractors.
Collections maintain schema consistency across all their documents. This schema is determined by the pipeline that processes objects and writes to the collection.
All documents in collections include these standard system metadata fields:
| Field | Type | Description |
| --- | --- | --- |
| `__fully_enriched` | boolean | Indicates whether all expected features have been successfully extracted |
| `__missing_features` | array | Lists any features that failed to extract during processing |
| `__pipeline_version` | integer | Version of the pipeline that processed this document |
| `source_object_id` | string | Reference to the source object in a bucket |
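A processed document therefore carries both its extracted features and the system metadata above. The sketch below shows a hypothetical document and a small helper for checking enrichment status; the feature field and id values are made up for illustration.

```python
# A hypothetical processed document: system metadata plus one extracted feature
document = {
    "source_object_id": "obj_abc123",                    # illustrative id
    "__fully_enriched": True,
    "__missing_features": [],
    "__pipeline_version": 2,
    "summary": "Quarterly invoice from Acme Corp",       # extracted feature
}

def is_fully_processed(doc: dict) -> bool:
    # A document is complete when enrichment finished and no feature failed
    return bool(doc.get("__fully_enriched")) and not doc.get("__missing_features")
```

Filtering on `__fully_enriched` (or inspecting `__missing_features`) is a straightforward way to exclude partially processed documents from retrieval.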
Collections are populated by running feature extractors that process objects and output structured documents.
Once you upload an object, the connected downstream collection(s) invoke their feature extractors. Processing continues through the pipeline until every extractor has either succeeded or failed.
Documents are the structured outputs stored in collections after processing objects through pipelines. Each document maintains a reference to its source object and contains extracted features and metadata.
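Because every document keeps a `source_object_id` reference, you can trace documents back to the object they came from. A minimal sketch, assuming documents are plain dicts with the metadata fields described above:

```python
from collections import defaultdict

def group_by_source(docs: list[dict]) -> dict[str, list[dict]]:
    """Group documents by the source object they were derived from."""
    by_source: dict[str, list[dict]] = defaultdict(list)
    for doc in docs:
        by_source[doc["source_object_id"]].append(doc)
    return dict(by_source)

# Illustrative documents from two source objects
docs = [
    {"source_object_id": "obj_1", "__pipeline_version": 1},
    {"source_object_id": "obj_1", "__pipeline_version": 2},
    {"source_object_id": "obj_2", "__pipeline_version": 2},
]
grouped = group_by_source(docs)
```

Grouping like this is useful for auditing: for example, spotting source objects that were processed by an older `__pipeline_version` and may need reprocessing.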
Collections support caching to improve performance and reduce computational overhead. For detailed information about caching configuration and best practices, see the Caching documentation.