Submit Clustering Job

Authorizations

Authorization

string

header

required

Bearer token authentication using your API key. Format: 'Bearer your_api_key'. To get an API key, create an account at mixpeek.com/start and generate a key in your account settings.

Headers

Authorization

string

required

REQUIRED: Bearer token authentication using your API key. Format: 'Bearer sk_xxxxxxxxxxxxx'. You can create API keys in the Mixpeek dashboard under Organization Settings.

Examples:

"Bearer sk_live_abc123def456"

"Bearer sk_test_xyz789"

X-Namespace

string

required

REQUIRED: Namespace identifier for scoping this request. All resources (collections, buckets, taxonomies, etc.) are scoped to a namespace. You can provide either the namespace name or namespace ID. Format: ns_xxxxxxxxxxxxx (ID) or a custom name like 'my-namespace'

Examples:

"ns_abc123def456"

"production"

"my-namespace"

Query Parameters

cluster_id

string | null

Optional ID of an existing cluster document to link this job to.

Body

application/json

Request to execute clustering on one or more collections.

collection_ids

string[]

required

IDs of the collections to cluster together

Minimum length: 1

config

object

required

Clustering configuration including algorithm and parameters

Examples:

{
  "algorithm": "kmeans",
  "algorithm_params": {
    "max_iter": 300,
    "n_clusters": 5,
    "random_state": 42
  },
  "description": "Vector-based clustering with K-means",
  "feature_vector": {
    "feature_address": "mixpeek://text_extractor@v1/text_extractor_v1_embedding"
  },
  "llm_labeling": {
    "enabled": true,
    "model_name": "gpt-4o-mini",
    "provider": "openai"
  },
  "normalize_features": true
}

{
  "algorithm": "hdbscan",
  "algorithm_params": { "min_cluster_size": 10, "min_samples": 5 },
  "description": "Vector-based clustering with HDBSCAN",
  "feature_vector": {
    "feature_address": "mixpeek://image_extractor@v1/image_extractor_v1_embedding"
  },
  "normalize_features": false
}

{
  "algorithm": "attribute_based",
  "attribute_config": {
    "attributes": ["category"],
    "hierarchical_grouping": false
  },
  "description": "Attribute-based clustering (simple category)",
  "llm_labeling": {
    "enabled": true,
    "include_keywords": true,
    "include_summary": true,
    "provider": "openai"
  }
}

{
  "algorithm": "attribute_based",
  "attribute_config": {
    "aggregation_method": "most_frequent",
    "attributes": ["category", "brand"],
    "hierarchical_grouping": true
  },
  "description": "Attribute-based clustering (hierarchical category → brand)"
}

{
  "algorithm": "attribute_based",
  "attribute_config": {
    "attributes": ["metadata.status", "metadata.priority"],
    "hierarchical_grouping": false
  },
  "description": "Attribute-based clustering (nested attributes)"
}
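The examples above fall into two shapes: vector-based configs carry `algorithm_params` and a `feature_vector`, while attribute-based configs carry an `attribute_config`. As a minimal sketch, the helpers below build one config of each shape. The function names `kmeans_config` and `attribute_config` and the input checks are assumptions made for illustration and are not part of the API; the default values are copied from the K-means example, not from documented defaults.

```python
from typing import Any


def kmeans_config(
    feature_address: str,
    n_clusters: int,
    max_iter: int = 300,
    random_state: int = 42,
    normalize: bool = True,
) -> dict[str, Any]:
    """Build a vector-based config shaped like the K-means example."""
    # Assumption: a cluster count below 1 is never meaningful.
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    return {
        "algorithm": "kmeans",
        "algorithm_params": {
            "max_iter": max_iter,
            "n_clusters": n_clusters,
            "random_state": random_state,
        },
        "feature_vector": {"feature_address": feature_address},
        "normalize_features": normalize,
    }


def attribute_config(attributes: list[str], hierarchical: bool = False) -> dict[str, Any]:
    """Build an attribute-based config shaped like the category examples."""
    # Assumption: at least one attribute is needed to group on.
    if not attributes:
        raise ValueError("attributes must contain at least one field name")
    return {
        "algorithm": "attribute_based",
        "attribute_config": {
            "attributes": list(attributes),
            "hierarchical_grouping": hierarchical,
        },
    }
```

Nested fields such as "metadata.status" can be passed as attribute names in the same way as top-level ones, as the last example above does.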

namespace_id

string | null

Namespace ID for the request

internal_id

string | null

Internal ID for the request

sample_size

integer | null

Number of documents to sample for clustering

store_results

boolean

default: true

Whether to store clustering results

include_members

boolean

default: false

Whether to include cluster membership in results

compute_metrics

boolean

default: true

Whether to compute clustering quality metrics

save_artifacts

boolean

default: false

Whether to save clustering artifacts (e.g., parquet) to S3
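To show how the headers, query parameter, and body fields fit together, here is a minimal sketch that assembles the pieces of a request without sending it. The function name `build_clustering_request` and the returned dictionary layout are assumptions made for illustration. The endpoint URL is not stated in this section, so it is deliberately left out.

```python
from typing import Any, Optional

# Optional body fields listed in this section (namespace_id and internal_id omitted for brevity).
OPTIONAL_BODY_FIELDS = {
    "sample_size",
    "store_results",
    "include_members",
    "compute_metrics",
    "save_artifacts",
}


def build_clustering_request(
    api_key: str,
    namespace: str,
    collection_ids: list[str],
    config: dict[str, Any],
    cluster_id: Optional[str] = None,
    **options: Any,
) -> dict[str, Any]:
    """Assemble headers, query params, and JSON body for a clustering job."""
    # collection_ids has a documented minimum length of 1.
    if not collection_ids:
        raise ValueError("collection_ids requires at least one collection ID")
    unknown = set(options) - OPTIONAL_BODY_FIELDS
    if unknown:
        raise ValueError(f"unsupported body fields: {sorted(unknown)}")

    headers = {
        "Authorization": f"Bearer {api_key}",
        "X-Namespace": namespace,  # namespace name or ID
        "Content-Type": "application/json",
    }
    # cluster_id is optional, so it is only sent when provided.
    params = {"cluster_id": cluster_id} if cluster_id is not None else {}
    body = {"collection_ids": list(collection_ids), "config": config, **options}
    return {"headers": headers, "params": params, "json": body}
```

The three returned parts map onto the `headers`, `params`, and `json` arguments that most HTTP clients accept for a POST call.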

Response

Successful Response

Task response model returned by the API.

Extends TaskModel with additional convenience fields for API responses. This is the model returned when you GET /v1/tasks/{task_id}.

Additional Fields: error_message: Convenience field that surfaces errors from additional_data for easier error handling in client code.

Inheritance: Inherits all fields and documentation from TaskModel, including:

- task_id: Unique identifier
- task_type: Operation type
- status: Current status
- inputs: Input parameters
- outputs: Output results
- additional_data: Metadata and context

Storage Architecture: Same as TaskModel - stored in Redis (24hr TTL) with MongoDB fallback.

Usage: This model is automatically returned by task API endpoints. You don't need to construct it manually - just call GET /v1/tasks/{task_id}.

Error Handling: Check the error_message field for a user-friendly error string, or additional_data['error'] for the full error details.

Example Response:

{
  "task_id": "task_abc123",
  "task_type": "api_buckets_batches_process",
  "status": "FAILED",
  "inputs": ["batch_xyz"],
  "outputs": null,
  "additional_data": {
    "error": "Failed to process batch: Object not found",
    "batch_id": "batch_xyz"
  },
  "error_message": "Failed to process batch: Object not found"
}

task_id

string

required

Unique identifier for the task. REQUIRED. Used to poll task status via GET /v1/tasks/{task_id}. This ID is also stored on parent resources (batches, clusters, etc.) for cross-referencing. Format: UUID v4 or custom string identifier.

Examples:

"task_abc123def456"

"550e8400-e29b-41d4-a716-446655440000"

task_type

enum<string>

required

Type of operation this task represents. REQUIRED. Identifies the specific async operation being performed. Used for filtering and categorizing tasks. Common types: api_buckets_batches_process, engine_cluster_build, api_taxonomies_execute. See TaskType enum for complete list of supported operations.

Available options:

api_namespaces_create,

api_buckets_objects_create,

api_buckets_delete,

api_buckets_batches_process,

api_buckets_batches_submit,

api_buckets_uploads_create,

api_buckets_uploads_confirm,

api_buckets_uploads_batch_confirm,

api_taxonomies_create,

api_taxonomies_execute,

api_taxonomies_materialize,

engine_feature_extractor_run,

engine_inference_run,

engine_object_processing,

engine_cluster_build,

thumbnail,

video_segment,

materialize

status

enum<string>

required

Current status of the task. REQUIRED. Indicates the current state of the async operation. Terminal statuses (COMPLETED, FAILED, CANCELED) indicate the task has finished and will not change. Active statuses (PENDING, IN_PROGRESS, PROCESSING) indicate the task is still running and should be polled. Use this field to determine when to stop polling.

Available options:

PENDING,

IN_PROGRESS,

PROCESSING,

COMPLETED,

COMPLETED_WITH_ERRORS,

FAILED,

CANCELED,

UNKNOWN,

SKIPPED,

DRAFT,

ACTIVE,

ARCHIVED,

SUSPENDED
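The polling rule described for `status` can be sketched as a small loop. This is an illustration under stated assumptions, not a reference client: `fetch_task` is a hypothetical callable that wraps GET /v1/tasks/{task_id} and returns the decoded task, and the interval and attempt cap are arbitrary. Only the three statuses the description names as terminal stop the loop; the description does not classify the remaining options, so the attempt cap guards against polling them forever.

```python
import time
from typing import Any, Callable

# Terminal statuses as named in the status description.
TERMINAL_STATUSES = frozenset({"COMPLETED", "FAILED", "CANCELED"})


def poll_task(
    fetch_task: Callable[[str], dict[str, Any]],
    task_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_task until the task reaches a terminal status."""
    for attempt in range(max_attempts):
        task = fetch_task(task_id)
        if task["status"] in TERMINAL_STATUSES:
            return task
        # Wait between attempts, but not after the final one.
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"task {task_id} not terminal after {max_attempts} attempts")
```

Passing `sleep` as a parameter keeps the loop easy to test, since a no-op can replace the real delay.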

inputs

Inputs · array

Input parameters or data used to start the task. OPTIONAL. May include IDs, configuration objects, or file references. Useful for debugging and understanding what data the task processed. Format: List of strings (IDs) or objects (configuration). Example: ['batch_id_123'] or [{'bucket_id': 'bkt_abc', 'config': {...}}]

Examples:

["batch_xyz789"]

["obj_123", "obj_456", "obj_789"]

[
  {
    "bucket_id": "bkt_abc",
    "collection_ids": ["col_1", "col_2"]
  }
]

outputs

Outputs · array

Output results produced by the task. OPTIONAL. Populated when task completes successfully. May include processed file IDs, result metrics, or status summaries. Check this field after task reaches COMPLETED status to get results. Format: List of strings (output IDs) or objects (result data).

Examples:

["document_123", "document_456"]

[
  {
    "failed_count": 2,
    "processed_count": 100,
    "success_rate": 0.98
  }
]

[
  {
    "cluster_id": "cl_abc123",
    "num_clusters": 5
  }
]

additional_data

object | null

Additional metadata and context for the task. OPTIONAL. Contains job IDs, error details, progress info, and other task-specific metadata.

Common fields (all task types):

- 'error': Error message if task failed
- 'job_id': Ray job ID for engine tasks
- 'from_mongodb': True if retrieved from MongoDB fallback (not Redis)

Batch-specific fields (task_type=api_buckets_batches_process):

- 'batch_id': Batch identifier (REQUIRED)
- 'bucket_id': Source bucket identifier (REQUIRED)
- 'namespace_id': Namespace identifier (REQUIRED)
- 'current_tier': Currently processing tier number, 0-indexed (OPTIONAL, None if not started)
- 'total_tiers': Total number of tiers in the batch pipeline (REQUIRED)
- 'collection_ids': Array of ALL collection IDs across all tiers (REQUIRED)
- 'object_count': Number of objects being processed (REQUIRED)
- 'sample_object_ids': First 5 object IDs for debugging/display (OPTIONAL)

Performance Note: Full object_ids array is NOT stored in task metadata to avoid bloating task documents (batches with 10k+ objects would add 200KB+ per task). For full object list, query the batch directly via GET /v1/buckets/{bucket_id}/batches/{batch_id}.

Note: For detailed per-tier status, use GET /v1/buckets/{bucket_id}/batches/{batch_id} to access the tier_tasks[] array which contains individual tier statuses, collection_ids, and timestamps for each tier.

Examples:

{
  "batch_id": "btch_xyz789",
  "bucket_id": "bkt_products",
  "collection_ids": ["col_tier0", "col_tier1", "col_tier2"],
  "current_tier": 1,
  "job_id": "ray_job_123",
  "namespace_id": "ns_abc123",
  "object_count": 10000,
  "sample_object_ids": [
    "obj_001",
    "obj_002",
    "obj_003",
    "obj_004",
    "obj_005"
  ],
  "total_tiers": 3
}

{
  "error": "Failed to process object: Invalid file format",
  "job_id": "123"
}

{
  "cluster_id": "cl_abc",
  "collection_ids": ["col_1"],
  "from_mongodb": true
}

error_message

string | null

Flattened error message for convenient error handling. OPTIONAL. Automatically populated from additional_data['error'] when the task has FAILED status. This is a convenience field - the full error details are always available in additional_data['error']. Use this field for displaying errors to users or logging. Will be None if task has not failed or if no error details are available.

Examples:

"Failed to process batch: Object not found"

"Invalid file format: Expected PDF, got PNG"

"Clustering failed: Insufficient data points"

null
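The error-handling guidance above (check error_message first, then additional_data['error']) can be written as a short helper. This is a minimal sketch; the name `task_error` is an assumption made for illustration, and the task is assumed to be the decoded JSON response as a dictionary.

```python
from typing import Any, Optional


def task_error(task: dict[str, Any]) -> Optional[str]:
    """Return the task's error string, or None when no error is recorded."""
    # Prefer the flattened convenience field.
    message = task.get("error_message")
    if message:
        return message
    # Fall back to the full details; additional_data may be null.
    additional = task.get("additional_data") or {}
    return additional.get("error")
```

For a task that has not failed, both fields are absent or null and the helper returns None.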
