[Diagram: Cross Compare stage showing the multi-tier matching cascade between source and reference collections]
The Cross Compare stage compares source documents against a reference collection using a cascading match strategy: exact → fuzzy → semantic → visual. Each match is classified using configurable rules, enabling drift detection, deduplication, and compliance checking workflows.
Stage Category: APPLY (cross-collection comparison)
Transformation: N documents → M finding documents (findings mode) or N documents → N enriched documents (enrich mode)

When to Use

Use Case | Description
Content drift detection | Compare video UI against documentation to find outdated content
Product catalog matching | Match supplier products against internal catalog
Content deduplication | Check new content against existing corpus
Compliance checking | Verify content against requirements or standards
Cross-reference validation | Validate labels, features, or terms across sources

When NOT to Use

Scenario | Recommended Alternative
Simple field joins | document_enrich
External API enrichment | api_call
Single-collection filtering | attribute_filter or feature_search
Semantic similarity search | feature_search

Parameters

Core Parameters

Parameter | Type | Default | Description
reference_collection_id | string | Required | Collection containing reference documents to compare against
source_field | string | content | Field on source documents to extract comparison elements from
reference_field | string | content | Field on reference documents containing comparison content
extraction_mode | string | raw | How to extract elements: raw, lines, labels, or list

Matching Configuration

Parameter | Type | Default | Description
match_tiers | string[] | ["exact", "fuzzy"] | Ordered matching cascade. Stops at first successful match.
fuzzy_threshold | float | 0.75 | Minimum fuzzy score to accept a match
semantic_threshold | float | 0.85 | Minimum semantic similarity to accept
visual_threshold | float | 0.55 | Minimum visual similarity to accept

Classification

Parameter | Type | Default | Description
classifications | object[] | See below | Score-to-label mapping rules (evaluated in order)
no_match_label | string | no_match | Label applied when no tier matches
Default classification rules:
[
  {"min_score": 0.95, "label": "exact_match"},
  {"min_score": 0.85, "label": "close_match"},
  {"min_score": 0.65, "label": "partial_match"},
  {"min_score": 0.0, "label": "no_match"}
]
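
Rules are checked top to bottom and the first rule whose min_score the score meets wins, so order them from highest to lowest threshold. A minimal sketch of that evaluation in Python (the classify helper is illustrative, not part of the stage API):

def classify(score: float, rules: list[dict], no_match_label: str = "no_match") -> str:
    """Return the label of the first rule whose min_score the score meets."""
    for rule in rules:                      # rules are evaluated in the order given
        if score >= rule["min_score"]:
            return rule["label"]
    return no_match_label                   # only reached if no rule matches

rules = [
    {"min_score": 0.95, "label": "exact_match"},
    {"min_score": 0.85, "label": "close_match"},
    {"min_score": 0.65, "label": "partial_match"},
    {"min_score": 0.0,  "label": "no_match"},
]
print(classify(0.87, rules))  # -> "close_match"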

Output Configuration

Parameter | Type | Default | Description
output_mode | string | findings | findings (N-to-M) or enrich (1-to-1)
output_field | string | comparison_results | Field name for results in enrich mode

Visual Comparison

Parameter | Type | Default | Description
include_visual_comparison | boolean | false | Enable visual embedding comparison
text_vector_index | string | intfloat__multilingual_e5_large_instruct | Vector index for semantic matching
image_vector_index | string | google__siglip_base_patch16_224 | SigLIP vector index
structure_vector_index | string | facebook__dinov2_base | DINOv2 vector index
dinov2_weight | float | 0.7 | Weight for DINOv2 in combined visual score
siglip_weight | float | 0.3 | Weight for SigLIP in combined visual score
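
The two visual similarities are blended into a single visual score using these weights. The exact formula is not spelled out above; a weighted sum is the natural reading, sketched here with illustrative names:

def combined_visual_score(dinov2_sim: float, siglip_sim: float,
                          dinov2_weight: float = 0.7, siglip_weight: float = 0.3) -> float:
    """Blend structural (DINOv2) and semantic-visual (SigLIP) similarity into one score."""
    return dinov2_weight * dinov2_sim + siglip_weight * siglip_sim

print(combined_visual_score(0.82, 0.64))  # -> 0.766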

Reference & Source Configuration

Parameter | Type | Default | Description
reference_limit | integer | 200 | Max reference documents to fetch
reference_doc_type | string | null | Filter reference docs by doc_type
source_location_field | string | start_time | Field containing location reference (timestamp, page)
source_doc_type_filter | string | null | Only process source docs with this doc_type
filter_generic_labels | boolean | true | Filter generic UI labels in labels mode

Extraction Modes

raw: Use the field value as a single element. Best for comparing whole content blocks.
{"extraction_mode": "raw"}

Matching Cascade

The matching cascade tries each tier in order and stops at the first successful match:
For each source element:
  ├─ exact:    Case-insensitive string match → score = 1.0
  ├─ fuzzy:    SequenceMatcher ratio ≥ fuzzy_threshold
  ├─ semantic: Vector similarity ≥ semantic_threshold
  └─ visual:   DINOv2 + SigLIP similarity ≥ visual_threshold
If no tier matches, the element receives match_tier: "none" and the no_match_label classification.
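
Expressed as code, the cascade for the exact and fuzzy tiers looks roughly like the sketch below. The semantic and visual tiers query the configured vector indexes and are only stubbed here; function and variable names are illustrative, not the stage's internal API:

from difflib import SequenceMatcher

def match_element(element: str, references: list[str],
                  tiers=("exact", "fuzzy"), fuzzy_threshold: float = 0.75):
    """Try each tier in order; return (tier, score, reference) for the first successful match."""
    for tier in tiers:
        if tier == "exact":
            for ref in references:
                if element.lower() == ref.lower():          # case-insensitive string match
                    return "exact", 1.0, ref
        elif tier == "fuzzy":
            scored = [(SequenceMatcher(None, element.lower(), ref.lower()).ratio(), ref)
                      for ref in references]
            score, best = max(scored)
            if score >= fuzzy_threshold:
                return "fuzzy", score, best
        # "semantic" and "visual" tiers would query the configured vector indexes here
    return "none", 0.0, None

print(match_element("API Keys", ["API Key", "Billing Settings"]))  # -> ('fuzzy', ~0.93, 'API Key')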

Configuration Example

This example configures content drift detection, comparing video scenes against a documentation collection:
{
  "stage_type": "apply",
  "stage_id": "cross_compare",
  "parameters": {
    "reference_collection_id": "col_documentation",
    "source_field": "content",
    "reference_field": "content",
    "extraction_mode": "labels",
    "match_tiers": ["exact", "fuzzy", "semantic"],
    "include_visual_comparison": true,
    "source_doc_type_filter": "scene",
    "source_location_field": "start_time",
    "classifications": [
      {"min_score": 0.95, "label": "current"},
      {"min_score": 0.75, "label": "needs_review"},
      {"min_score": 0.0, "label": "outdated"}
    ]
  }
}

Output Schema

Findings Mode

Each comparison produces a finding document:
{
  "element_type": "text",
  "source_content": "Configure API Keys",
  "source_location": "00:01:23",
  "reference_match": "API Key Configuration",
  "reference_url": "https://docs.example.com/api-keys",
  "match_tier": "fuzzy",
  "match_score": 0.87,
  "classification": "close_match",
  "confidence": 0.92,
  "signals": {
    "context_match": true,
    "workflow_match": false,
    "transcript_match": true
  }
}

Enrich Mode

Comparison results attached as a field on source documents:
{
  "document_id": "doc_source_123",
  "content": "...",
  "comparison_results": [
    {
      "element_type": "text",
      "source_content": "Configure API Keys",
      "match_tier": "fuzzy",
      "match_score": 0.87,
      "classification": "close_match",
      "confidence": 0.92
    }
  ]
}
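
Because enrich mode is 1-to-1, downstream stages or client code can route documents on the attached classification. A small client-side sketch (the sample documents and variable names are illustrative):

enriched_docs = [
    {"document_id": "doc_source_123",
     "comparison_results": [{"classification": "close_match", "match_score": 0.87}]},
    {"document_id": "doc_source_456",
     "comparison_results": [{"classification": "no_match", "match_score": 0.0}]},
]

# Keep only documents where at least one element matched something in the reference collection.
matched = [doc for doc in enriched_docs
           if any(r["classification"] != "no_match" for r in doc["comparison_results"])]
print([doc["document_id"] for doc in matched])   # -> ['doc_source_123']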

Finding Fields

Field | Type | Description
element_type | string | Type of element: text, code, visual, or custom
source_content | string | Content from the source document
source_location | string | Location reference (timestamp, page number)
reference_match | string | Best matching content from reference
reference_url | string | URL or ID of matched reference document
match_tier | string | Tier used: exact, fuzzy, semantic, visual, none
match_score | float | Match score (0.0 - 1.0)
classification | string | Label from classification rules
confidence | float | Multi-signal confidence (0.0 - 1.0)
signals | object | Corroborating signals used in confidence

Performance

Scenario | Expected Latency | Notes
Exact + fuzzy only (50 docs) | 50-200ms | In-memory string matching
With semantic tier (50 docs) | 200-500ms | Qdrant vector queries
With visual comparison (50 docs) | 500-1500ms | Multiple vector queries
Large reference set (200 docs) | 300-800ms | More candidates to compare
Reference documents are fetched once and reused across all source documents. The matching cascade short-circuits at the first successful tier, so ordering match_tiers from fastest to slowest (exact → fuzzy → semantic → visual) is optimal.
Limits:
  • Max source documents per execution: 50
  • Max reference documents fetched: 200 (configurable via reference_limit)

Common Pipeline Patterns

Drift Detection Pipeline

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "features": [{
        "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
        "query_input": "{{INPUT.query}}",
        "top_k": 50
      }],
      "final_top_k": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "cross_compare",
    "parameters": {
      "reference_collection_id": "col_documentation",
      "source_field": "content",
      "reference_field": "content",
      "extraction_mode": "labels",
      "match_tiers": ["exact", "fuzzy", "semantic"],
      "include_visual_comparison": true,
      "classifications": [
        {"min_score": 0.95, "label": "current"},
        {"min_score": 0.75, "label": "needs_review"},
        {"min_score": 0.0, "label": "outdated"}
      ]
    }
  }
]

Catalog Match + Transform

[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "features": [{
        "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
        "query_input": "{{INPUT.query}}",
        "top_k": 30
      }],
      "final_top_k": 30
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "cross_compare",
    "parameters": {
      "reference_collection_id": "col_reference_catalog",
      "source_field": "product_name",
      "reference_field": "product_name",
      "match_tiers": ["exact", "fuzzy"],
      "output_mode": "enrich",
      "output_field": "match_result"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {
      "template": "{\"product\": \"{{ DOC.product_name }}\", \"match_status\": \"{{ DOC.match_result[0].classification }}\", \"score\": {{ DOC.match_result[0].match_score }}}"
    }
  }
]

Error Handling

Error | Behavior
Reference collection not found | Stage fails with an error
No reference documents found | All elements classified as no_match
Vector index not available | Semantic/visual tiers skipped silently
Source field missing on document | Document skipped
Exceeds max_working_documents | Extra documents passed through unchanged

vs Other Enrichment Stages

Feature | cross_compare | document_enrich | api_call
Purpose | Multi-tier comparison with classification | Simple field join/lookup | External API enrichment
Data source | Internal Qdrant collections | Internal Qdrant collections | External HTTP APIs
Matching | Cascading: exact → fuzzy → semantic → visual | Top-1 vector or key match | N/A
Output | Classified findings with scores | Joined fields | API response
Latency | 50-1500ms | 5-20ms | 100-500ms
Best for | Drift detection, dedup, compliance | Cross-collection joins | Third-party data