The lists below show all the models currently supported in the SDK for various tasks. Each model is designed to handle a specific type of data or task, such as embedding, reading, describing, or transcribing.

If you plan on storing the raw embeddings yourself, see the embedding storage section.

We add models regularly, so if we’re missing any, reach out.

Embedding Models

Embedding models convert data into numerical vectors, enabling efficient similarity searches and machine learning tasks.

| Name | Modality | Dimensions | Description |
| --- | --- | --- | --- |
| multimodal-v1 | Text, Image, Video | 1408 | Most general-purpose model, but slower and can be less precise |
| clip-v1 | Text, Image | 512 | A versatile model for text and image embeddings |
| vuse-generic-v1 | Text, Video | 768 | Specialized for text and video embeddings |
| splade-v3 | Text | N/A | Full-text model for text-only embeddings |
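If you store the raw embedding vectors yourself, a similarity search typically ranks them by cosine similarity against a query vector. A minimal sketch in pure Python (the toy 3-dimensional vectors are placeholders for the 512- to 1408-dimensional outputs listed above; the SDK itself returns the vectors, this is only the comparison step):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their magnitudes. Ranges from -1 to 1; higher means
    # the embeddings are more alike.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real model outputs (e.g. 512 dims from clip-v1).
query_embedding = [0.1, 0.3, 0.5]
stored_embedding = [0.2, 0.1, 0.4]
score = cosine_similarity(query_embedding, stored_embedding)
```

In practice you would run this comparison against every stored vector (or use a vector index) and return the highest-scoring matches.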

Read Models

Read models extract the text that appears in a visual asset.

| Name | Modality | Description |
| --- | --- | --- |
| video-descriptor-v1 | Video | Extracts key information and metadata from video content |

Describe Models

Describe models generate human-readable descriptions or summaries of input data.

| Name | Modality | Description |
| --- | --- | --- |
| video-descriptor-v1 | Video | Generates detailed descriptions of video content |

Transcribe Models

Transcribe models convert spoken language into written text.

| Name | Modality | Description |
| --- | --- | --- |
| polyglot-v1 | Audio | Transcribes speech from multiple languages |
