Clustering groups similar documents to power discovery, analytics visuals, and taxonomy bootstrapping. Mixpeek runs clustering as an engine pipeline, stores artifacts (Parquet) for scale, and exposes APIs to create, execute, list, stream, and apply enrichments. New: Automated triggers let you schedule clustering with cron expressions, intervals, events, or conditions—keeping your clusters fresh without manual intervention.

Overview

Structure Discovery

Find groups in one or more collections; optional hierarchical metadata

Scalable Visualization

Landmark-based UMAP generates 2D/3D coords for millions of points

Artifacts & Scale

Results saved as Parquet (centroids, members) for WebGL/Arrow pipelines

LLM Labeling

Optionally generate cluster names, summaries, and keywords

Enrichment

Write back cluster_id membership or create derived collections

Automated Triggers

Schedule clustering with cron, intervals, events, or conditions

Event-Driven

Auto-recluster when documents are added or the data changes significantly

Distributed Processing

Ray map_batches processes millions of points with bounded memory

How it works

1

Trigger or Execute

Start clustering via manual API call or automated trigger (cron, interval, event, conditional)
2

Preprocess

Optional normalization and dimensionality reduction (UMAP, t-SNE, PCA)
3

Cluster

Algorithms like KMeans, DBSCAN, HDBSCAN, Agglomerative, Spectral, GMM, Mean Shift, OPTICS
4

Postprocess

Compute centroids and stats; optional LLM labeling and hierarchical metadata
5

Persist & Notify

Parquet artifacts saved per run_id; webhook events emitted for monitoring
6

Enrich (optional)

Write cluster_id membership and labels back to documents

Multimodal example

Create a cluster definition

  • API: Create Cluster
  • Method: POST
  • Path: /v1/clusters
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_ids": ["col_products_v1"],
    "cluster_name": "products_clip_hdbscan",
    "cluster_type": "vector",
    "vector_config": {
      "feature_extractor_name": "clip_vit_l_14",
      "clustering_method": "hdbscan",
      "hdbscan_parameters": {"min_cluster_size": 10, "min_samples": 5}
    },
    "llm_labeling": {"enabled": true, "model_name": "gpt-4"}
  }'

Execute clustering

  • API: Execute Clustering
  • Method: POST
  • Path: /v1/clusters/execute
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/execute \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_ids": ["col_products_v1"],
    "config": {
      "algorithm": "kmeans",
      "algorithm_params": {"n_clusters": 8, "max_iter": 300},
      "feature_vector": {"feature_address": {"extractor": "clip_vit_l_14", "version": "1.0.0"}},
      "normalize_features": true,
      "dimensionality_reduction": {"method": "umap", "n_components": 2}
    },
    "sample_size": 10000,
    "store_results": true,
    "include_members": false,
    "compute_metrics": true,
    "save_artifacts": true
  }'
Response includes run_id, metrics, and centroid summaries. Artifacts are written under a per‑run S3 prefix.
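An abbreviated response might look like the following (an illustrative sketch; field names beyond run_id, metrics, and the centroid summaries are assumptions, so consult the API Reference for the exact schema):
{
  "run_id": "cl_run_abc",
  "status": "completed",
  "metrics": {"num_clusters": 8, "silhouette_score": 0.42},
  "centroids": [
    {"cluster_id": "cluster_0", "num_members": 1250, "variance": 0.08}
  ],
  "artifacts_prefix": "s3://bucket/.../cl_run_abc/"
}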

Automated clustering with triggers

Instead of manually executing clustering, you can define triggers that automatically run clustering jobs based on schedules, events, or conditions. This ensures your clusters stay fresh without manual intervention.
Triggers are perfect for production workflows where data constantly changes—nightly reclustering, event-driven updates, or condition-based refreshes.

Why use triggers?

Stay Fresh

Keep clusters up-to-date as new documents arrive

Hands-Off

Set once, runs automatically—no manual execution needed

Resource Efficient

Schedule during off-peak hours or when data changes significantly

Production Ready

Built-in failure handling, webhooks, and execution history

Trigger types

Mixpeek supports four trigger types, each suited for different use cases:
  • Cron Triggers
  • Interval Triggers
  • Event Triggers
  • Conditional Triggers
Execute clustering at specific times using cron expressions. Perfect for:
  • Nightly reclustering at 2am
  • Weekly batch processing every Sunday
  • Month-end analytics
curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "execution_config": {
      "collection_ids": ["col_products"],
      "config": {
        "algorithm": "kmeans",
        "algorithm_params": {"n_clusters": 10}
      }
    },
    "trigger_type": "cron",
    "schedule_config": {
      "cron_expression": "0 2 * * *",
      "timezone": "America/New_York"
    },
    "description": "Nightly product clustering at 2am EST"
  }'
Common cron expressions:
  • "0 2 * * *" - Daily at 2:00am
  • "0 */6 * * *" - Every 6 hours
  • "0 0 * * 0" - Every Sunday at midnight
  • "30 14 1 * *" - First day of month at 2:30pm

Managing triggers

Temporarily stop a trigger without deleting it:
# Pause trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/pause \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123"

# Resume trigger
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/resume \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123"
When resumed, the next execution time is recalculated from the current time.
Modify schedule configuration without recreating the trigger:
curl -X PATCH https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "schedule_config": {
      "cron_expression": "0 3 * * *"
    },
    "description": "Updated to 3am"
  }'
Note: Trigger type is immutable—delete and recreate to change type.
Track all executions of a trigger with detailed metrics:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/trig_abc123/history \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "offset": 0,
    "limit": 50,
    "status_filter": "completed"
  }'
Response includes:
  • Job IDs and execution times
  • Status (completed/failed)
  • Execution duration
  • Cluster count and document count
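An abbreviated history entry might look like this (illustrative only; the field names are assumptions based on the bullets above):
{
  "executions": [
    {
      "job_id": "job_def456",
      "executed_at": "2024-01-15T07:00:12Z",
      "status": "completed",
      "duration_seconds": 184,
      "cluster_count": 12,
      "document_count": 48210
    }
  ],
  "total": 31
}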
Find triggers by cluster, type, or status:
curl -X POST https://api.mixpeek.com/v1/clusters/triggers/list \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_id": "cl_xyz789",
    "trigger_type": "cron",
    "status": "active",
    "offset": 0,
    "limit": 50
  }'

Trigger lifecycle and status

1

Active

Trigger is enabled and will fire according to schedule. This is the normal operating state.
2

Paused

Trigger is temporarily disabled but retains configuration. Use pause endpoint to enter this state.
3

Failed

Trigger automatically disabled after 5 consecutive failures. Requires manual resume after fixing issues.
4

Disabled

Trigger soft-deleted via DELETE endpoint. No longer executes but history is preserved.

Failure handling and recovery

Triggers include built-in resilience:
  • Single failures: Logged but trigger continues
  • Consecutive failures: Tracked in trigger metadata
  • 5 consecutive failures: Trigger status changes to failed, requires manual resume
  • Webhook notifications: Sent for each failure with error details
Recovery steps:
  1. Check last_execution_error field via GET endpoint
  2. Fix underlying issue (e.g., invalid config, missing resources)
  3. Update trigger if needed with PATCH endpoint
  4. Resume trigger with POST to /resume endpoint
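For step 1, fetch the trigger and inspect its error state (a sketch assuming the standard GET-by-ID route for triggers):
# Inspect status, consecutive failure count, and last_execution_error
curl https://api.mixpeek.com/v1/clusters/triggers/trig_abc123 \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123"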

Webhook events

Triggers emit lifecycle events you can subscribe to:
  • trigger.created - New trigger created
  • trigger.fired - Trigger fired and created clustering job
  • trigger.execution.completed - Clustering job completed successfully
  • trigger.execution.failed - Clustering job failed
  • trigger.paused / trigger.resumed - State changes
  • trigger.deleted - Trigger removed
Subscribe via the webhooks API to build custom workflows.
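For example, a subscription request might look like this (a sketch; the /v1/webhooks path and payload shape are assumptions, so follow the webhooks API reference for the actual contract):
curl -X POST https://api.mixpeek.com/v1/webhooks \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/hooks/clustering",
    "events": ["trigger.fired", "trigger.execution.completed", "trigger.execution.failed"]
  }'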

Example: Complete automation workflow

Here’s a production-ready setup for a product catalog:
1

Create nightly reclustering

curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_products" \
  -H "Content-Type: application/json" \
  -d '{
    "execution_config": {
      "collection_ids": ["col_catalog"],
      "config": {
        "algorithm": "kmeans",
        "algorithm_params": {"n_clusters": 20},
        "normalize_features": true,
        "llm_labeling": {"enabled": true}
      }
    },
    "trigger_type": "cron",
    "schedule_config": {
      "cron_expression": "0 2 * * *",
      "timezone": "America/New_York"
    },
    "description": "Daily product reclustering with labels"
  }'
2

Add event-based trigger for rapid changes

curl -X POST https://api.mixpeek.com/v1/clusters/triggers \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_products" \
  -H "Content-Type: application/json" \
  -d '{
    "execution_config": {
      "collection_ids": ["col_catalog"],
      "config": {"algorithm": "hdbscan"}
    },
    "trigger_type": "event",
    "schedule_config": {
      "event_type": "documents_added",
      "event_threshold": 500,
      "cooldown_seconds": 3600
    },
    "description": "Quick recluster after 500 new products"
  }'
3

Monitor with webhooks

Subscribe to trigger.execution.completed events to:
  • Track clustering performance over time
  • Alert on cluster count changes
  • Trigger downstream enrichment pipelines
4

Apply enrichment automatically

When trigger completes, use enrichment API to write cluster_id back to documents for filtering and faceting.

Quotas and limits

  • Max active triggers: 50 per namespace
  • Min interval: 300 seconds (5 minutes)
  • Default cooldown: 300 seconds (5 minutes)
  • Polling interval: 60 seconds
  • Max consecutive failures: 5 (auto-disables trigger)

Stream artifacts for UI

  • API: Stream Cluster Data
  • Method: POST
  • Path: /v1/clusters/{cluster_id}/data
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/CLUSTER_ID/data \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_id": "cl_123",
    "include_centroids": true,
    "include_members": true,
    "limit": 1000,
    "offset": 0
  }'
Use this to load centroids and members for visualizations (2D/3D reductions, partitions by cluster_id).
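For example, you can group streamed members by cluster_id before handing them to a renderer. A sketch using jq, assuming the response carries a members array with cluster_id, x, and y fields as in the artifacts example further below:
curl -s -X POST https://api.mixpeek.com/v1/clusters/cl_123/data \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{"cluster_id": "cl_123", "include_centroids": false, "include_members": true, "limit": 1000, "offset": 0}' \
  | jq '.members | group_by(.cluster_id) | map({cluster_id: .[0].cluster_id, points: map([.x, .y])})'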

Scalable visualization with dimensionality reduction

For large clusters, Mixpeek provides engine-powered visualization that generates 2D/3D coordinates for millions of points using landmark-based dimensionality reduction. This enables interactive scatter plots, WebGL renderers, and spatial exploration at scale.
Visualization uses landmark-based UMAP with Nyström approximation—fitting on 2% of points and interpolating the rest. This scales to millions of points while maintaining quality.

Why use engine visualization?

Scales to Millions

Landmark-based DR processes 1M+ points in minutes

Distributed Processing

Ray map_batches processes 5k points per batch across cluster

Cached & Reusable

S3-cached coordinates with signature-based keys—no redundant computation

Production Ready

Hard cap at 10k forces engine delegation for safety

How it works

1

Request visualization

Client requests coordinates via API. If the cluster has more than 10k points, the request is automatically delegated to the engine.
2

Compute signature

Engine generates cache key from cluster ID, method, parameters—ensures idempotence.
3

Check cache

If coordinates already exist in S3 with matching signature, skip computation and return cached URLs.
4

Fit on landmarks

Sample 2% of points (capped at 50k) as landmarks. Fit UMAP or incremental PCA on landmarks only.
5

Transform with Ray

Use Ray map_batches to apply dimensionality reduction to all points in 5k batches—distributed and memory-efficient.
6

Cache and return

Save coordinates to S3 as Parquet. Return URLs to API, which loads and serves to client.

Generate visualization

For datasets with >10k points, visualization is generated automatically when you request artifacts. For smaller datasets, call the engine endpoint directly (a sketch follows the response example below).
  • Automatic (>10k points)
  • Manual (<10k points)
# For large clusters, visualization is automatic
curl https://api.mixpeek.com/v1/clusters/cl_abc123/artifacts?include_members=true \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123"
Response includes coordinates:
{
  "centroids": [...],
  "members": [
    {
      "point_id": "doc_1",
      "cluster_id": "cluster_0",
      "x": 1.5,
      "y": 2.3,
      "payload": {}
    }
  ]
}
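For clusters under the 10k threshold, generate coordinates explicitly before requesting artifacts. A sketch of the manual call (the full path is assumed to follow the /v1/clusters prefix used by the other endpoints; the bare path appears in Troubleshooting below):
curl -X POST https://api.mixpeek.com/v1/clusters/visualization \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "cluster_id": "cl_abc123",
    "method": "umap",
    "n_components": 2
  }'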

Configuration options

UMAP (Uniform Manifold Approximation and Projection)
  • Best for: Preserving both local and global structure
  • Quality: Excellent cluster separation
  • Speed: Moderate (60s for 1M points)
  • Use when: Visualization quality matters most
Incremental PCA
  • Best for: Fast linear projections
  • Quality: Good for high-dimensional data
  • Speed: Fast (30s for 1M points)
  • Use when: Speed matters more than non-linear structure
{
  "method": "umap",        // or "ipca"
  "n_components": 2        // 2D or 3D
}
Controls how many points are used as landmarks for fitting:
  • sample_pct: Percentage of points to use as landmarks (default: 0.02 = 2%)
  • max_landmarks: Cap on landmark count (default: 50,000)
  • k_landmarks: Nearest landmarks for interpolation (default: 15)
Examples:
  • 10k points → 200 landmarks (2%)
  • 100k points → 2k landmarks (2%)
  • 1M points → 20k landmarks (2%)
  • 10M points → 50k landmarks (capped)
{
  "sample_pct": 0.02,
  "max_landmarks": 50000,
  "k_landmarks": 15
}
Tuning guide:
  • More landmarks = better quality, slower processing
  • Fewer landmarks = faster processing, lower quality
  • Sweet spot: 0.02 (2%) for most use cases
Fine-tune UMAP behavior for different visualization needs:
  • n_neighbors: Balance local vs global structure (default: 15)
  • min_dist: Minimum distance between points (default: 0.1)
  • metric: Distance metric (default: “cosine”)
Common configurations:
Tight clusters (detail view):
{
  "umap_n_neighbors": 30,
  "umap_min_dist": 0.01,
  "umap_metric": "cosine"
}
Spread out (overview):
{
  "umap_n_neighbors": 10,
  "umap_min_dist": 0.3,
  "umap_metric": "euclidean"
}
High-dimensional embeddings:
{
  "umap_n_neighbors": 20,
  "umap_min_dist": 0.1,
  "umap_metric": "cosine"  // Best for embeddings
}
Visualization results are cached in S3 with signature-based keys.
Signature includes:
  • Cluster ID
  • Feature name
  • DR method and parameters
  • Component count
Behavior:
  • Same parameters → same signature → cached result
  • Different parameters → new signature → new computation
  • Set force_recompute: true to bypass cache
{
  "force_recompute": false  // Use cached if available
}
Cache location:
s3://bucket/{internal_id}/{namespace_id}/
  engine_cluster_build/{cluster_id}/
    visualization/
      coords_{signature}.parquet
      reducer_{signature}.pkl
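Putting the options together, a full request body might look like this (a sketch; it assumes the engine visualization endpoint from Troubleshooting accepts the flat parameter names listed above):
POST /clusters/visualization
{
  "cluster_id": "cl_abc123",
  "method": "umap",
  "n_components": 2,
  "sample_pct": 0.02,
  "max_landmarks": 50000,
  "k_landmarks": 15,
  "umap_n_neighbors": 15,
  "umap_min_dist": 0.1,
  "umap_metric": "cosine",
  "force_recompute": false
}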

Performance characteristics

  • Processing Times
  • Memory Usage
  • Quality Metrics
Approximate times for landmark-based UMAP on typical hardware:
Dataset Size | Landmarks | Fit Time | Transform Time | Total
1k points    | 100       | 2s       | 0.5s           | 2.5s
10k points   | 500       | 5s       | 2s             | 7s
100k points  | 2k        | 15s      | 10s            | 25s
1M points    | 20k       | 60s      | 60s            | 2m
10M points   | 50k       | 120s     | 300s           | 7m
Notes:
  • Fit time scales with landmark count squared
  • Transform time scales with point count × k_landmarks
  • Times assume 16-core machine

Integration with visualization tools

WebGL Scatter Plots

Use Three.js, Plotly, or deck.gl to render millions of points with GPU acceleration

Observable Plots

Load Parquet directly in Observable notebooks for interactive exploration

Apache Arrow

Use PyArrow or Arrow.js to stream coordinates without full deserialization

Tile-Based Rendering

Bin coordinates into tiles for progressive loading in zoom-enabled UIs
Example: Load in JavaScript
import * as THREE from 'three';

// Fetch cluster members (with reduced coordinates) from the API
const response = await fetch('/v1/clusters/cl_abc/artifacts?include_members=true');
const { members } = await response.json();

// Render one sphere per point with Three.js (Plotly or deck.gl work similarly)
const scene = new THREE.Scene();
members.forEach(point => {
  const geometry = new THREE.SphereGeometry(0.05);
  const material = new THREE.MeshBasicMaterial({
    color: getColorForCluster(point.cluster_id) // your own cluster_id -> color palette helper
  });
  const sphere = new THREE.Mesh(geometry, material);
  // Use z when a 3D reduction was requested; fall back to 0 for 2D
  sphere.position.set(point.x, point.y, point.z || 0);
  scene.add(sphere);
});

Troubleshooting

Cause: The cluster has fewer than 10k points, so no visualization was pre-generated.
Solution:
# Generate visualization first
POST /clusters/visualization
{
  "cluster_id": "cl_abc123",
  "method": "umap",
  "n_components": 2
}

# Then request artifacts
GET /clusters/cl_abc123/artifacts?include_members=true
Symptoms: Clusters overlap, structure is unclear, points are too spread out.
Solutions:
  1. Increase landmarks: Set sample_pct: 0.03 or 0.05
  2. Adjust n_neighbors: Try 20-30 for more global structure
  3. Tighten clusters: Set min_dist: 0.01 for less overlap
  4. Check metric: Use "cosine" for embeddings, "euclidean" for raw features
  5. Try 3D: Set n_components: 3 for complex structures
Symptoms: Visualization takes too long for interactive use.
Solutions:
  1. Reduce landmarks: Set sample_pct: 0.01 (1%)
  2. Use IPCA: Set method: "ipca" for 2x speedup
  3. Check Ray cluster: Ensure sufficient CPUs allocated
  4. Verify caching: Ensure force_recompute: false to use cache
  5. Increase batch size: Larger batches mean less overhead (tune in engine config)
Symptoms: The engine returns an error during visualization generation.
Debugging steps:
  1. Check engine logs for detailed error message
  2. Verify members.parquet exists in S3 for the cluster
  3. Try with force_recompute: true to bypass cache
  4. Check Ray cluster health and resource availability
  5. Verify feature vectors are present in member data
Common causes:
  • Missing or corrupted member artifacts
  • Insufficient memory for landmark count
  • Ray actor failures (check Ray dashboard)

Apply cluster enrichment

  • API: Apply Cluster Enrichment
  • Method: POST
  • Path: /v1/clusters/enrich
  • Reference: API Reference
curl -X POST https://api.mixpeek.com/v1/clusters/enrich \
  -H "Authorization: Bearer $API_KEY" \
  -H "X-Namespace: ns_123" \
  -H "Content-Type: application/json" \
  -d '{
    "clustering_ids": ["cl_run_abc"],
    "source_collection_id": "col_products_v1",
    "target_collection_id": "col_products_enriched_v1",
    "batch_size": 1000,
    "parallelism": 4
  }'
This writes cluster_id membership (and optional labels) back to documents.
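After enrichment, each document in the target collection carries its assignment, roughly like this (an illustrative sketch; the actual field names and placement depend on your collection schema):
{
  "document_id": "doc_1",
  "metadata": {
    "cluster_id": "cluster_0",
    "cluster_label": "trail running shoes"
  }
}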

Manage clusters

Config building blocks

  • Algorithms
  • Reduction & Normalization
  • LLM Labeling
  • Hierarchy
kmeans, dbscan, hdbscan, agglomerative, spectral, gaussian_mixture, mean_shift, optics

Artifacts (Parquet)

Centroids: columns include cluster_id, centroid_vector, num_members, variance, label, summary, keywords, feature_name, feature_dimensions, parent_cluster_id, hierarchy_level, reduction_method, parameters, algorithm, run_id, timestamps
Members: partitioned by cluster_id; includes point_id, reduced coordinates (x, y, optional z), and an optional payload slice for filtering
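Because artifacts are plain Parquet, you can inspect a downloaded copy locally, for example with the DuckDB CLI (assumes you have copied the centroids file to centroids.parquet):
# Largest clusters first, with LLM labels if enabled
duckdb -c "SELECT cluster_id, num_members, label, summary FROM 'centroids.parquet' ORDER BY num_members DESC LIMIT 10"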

Best practices

1

Start with samples

Use sample_size for quick exploration before full runs. Test clustering algorithms and parameters on a subset before scaling up.
2

Automate with triggers

Set up cron triggers for nightly reclustering and event triggers for rapid data changes. This keeps clusters fresh without manual work.
3

Label pragmatically

Enable LLM labeling after validating cluster separation. Labels are expensive—ensure your clusters are meaningful first.
4

Monitor with webhooks

Subscribe to trigger execution events to track performance, alert on anomalies, and chain downstream workflows automatically.
5

Persist visuals

Save artifacts and stream reduced coordinates for UI at scale. Parquet format enables fast loading in WebGL and data viz tools.
6

Close the loop

Convert stable clusters into taxonomies to enrich downstream search. Apply enrichment to write cluster_id back for filtering and faceting.

See also
