The Cross Compare stage compares source documents against a reference collection using a cascading match strategy: exact → fuzzy → semantic → visual. Each match is classified using configurable rules, enabling drift detection, deduplication, and compliance checking workflows.
**Stage Category**: APPLY (Cross-collection comparison)

**Transformation**: N documents → M finding documents (findings mode) or N documents → N enriched documents (enrich mode)
When to Use
| Use Case | Description |
|---|---|
| Content drift detection | Compare video UI against documentation to find outdated content |
| Product catalog matching | Match supplier products against internal catalog |
| Content deduplication | Check new content against existing corpus |
| Compliance checking | Verify content against requirements or standards |
| Cross-reference validation | Validate labels, features, or terms across sources |
When NOT to Use
| Scenario | Recommended Alternative |
|---|---|
| Simple field joins | `document_enrich` |
| External API enrichment | `api_call` |
| Single-collection filtering | `attribute_filter` or `feature_search` |
| Semantic similarity search | `feature_search` |
Parameters
Core Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `reference_collection_id` | string | Required | Collection containing reference documents to compare against |
| `source_field` | string | `content` | Field on source documents to extract comparison elements from |
| `reference_field` | string | `content` | Field on reference documents containing comparison content |
| `extraction_mode` | string | `raw` | How to extract elements: `raw`, `lines`, `labels`, or `list` |
Matching Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `match_tiers` | string[] | `["exact", "fuzzy"]` | Ordered matching cascade. Stops at first successful match. |
| `fuzzy_threshold` | float | `0.75` | Minimum fuzzy score to accept a match |
| `semantic_threshold` | float | `0.85` | Minimum semantic similarity to accept |
| `visual_threshold` | float | `0.55` | Minimum visual similarity to accept |
Classification
| Parameter | Type | Default | Description |
|---|---|---|---|
| `classifications` | object[] | See below | Score-to-label mapping rules (evaluated in order) |
| `no_match_label` | string | `no_match` | Label when no tier matches |
Default classification rules:
```json
[
  { "min_score": 0.95, "label": "exact_match" },
  { "min_score": 0.85, "label": "close_match" },
  { "min_score": 0.65, "label": "partial_match" },
  { "min_score": 0.0, "label": "no_match" }
]
```
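Classification rules are evaluated top to bottom, and the first rule whose `min_score` is at or below the match score supplies the label. A minimal sketch of that evaluation (the `classify` helper is illustrative, not part of the stage API):

```python
# Default score-to-label rules, mirroring the stage's documented defaults.
DEFAULT_RULES = [
    {"min_score": 0.95, "label": "exact_match"},
    {"min_score": 0.85, "label": "close_match"},
    {"min_score": 0.65, "label": "partial_match"},
    {"min_score": 0.0, "label": "no_match"},
]

def classify(score: float, rules=DEFAULT_RULES, no_match_label="no_match") -> str:
    """Return the label of the first rule whose min_score the score meets."""
    for rule in rules:
        if score >= rule["min_score"]:
            return rule["label"]
    return no_match_label  # fallback when no rule matches
```

Because rules short-circuit in order, they should always be listed from highest `min_score` to lowest, as in the default set above.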
Output Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `output_mode` | string | `findings` | `findings` (N-to-M) or `enrich` (1-to-1) |
| `output_field` | string | `comparison_results` | Field name for results in enrich mode |
Visual Comparison
| Parameter | Type | Default | Description |
|---|---|---|---|
| `include_visual_comparison` | boolean | `false` | Enable visual embedding comparison |
| `text_vector_index` | string | `intfloat__multilingual_e5_large_instruct` | Vector index for semantic matching |
| `image_vector_index` | string | `google__siglip_base_patch16_224` | SigLIP vector index |
| `structure_vector_index` | string | `facebook__dinov2_base` | DINOv2 vector index |
| `dinov2_weight` | float | `0.7` | Weight for DINOv2 in combined visual score |
| `siglip_weight` | float | `0.3` | Weight for SigLIP in combined visual score |
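The combined visual score is a weighted sum of the two model similarities: `dinov2_weight * dino_similarity + siglip_weight * siglip_similarity`. A minimal sketch, assuming cosine similarity over raw embedding vectors (function names here are hypothetical, not the stage's internals):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def combined_visual_score(dino_src, dino_ref, siglip_src, siglip_ref,
                          dinov2_weight=0.7, siglip_weight=0.3):
    """Weighted blend of DINOv2 (structure) and SigLIP (semantics) similarity."""
    return (dinov2_weight * cosine(dino_src, dino_ref)
            + siglip_weight * cosine(siglip_src, siglip_ref))
```

The default 0.7/0.3 split favors DINOv2's structural signal; if the combined score meets `visual_threshold` (default 0.55), the visual tier accepts the match.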
Reference & Source Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| `reference_limit` | integer | `200` | Max reference documents to fetch |
| `reference_doc_type` | string | `null` | Filter reference docs by `doc_type` |
| `source_location_field` | string | `start_time` | Field containing location reference (timestamp, page) |
| `source_doc_type_filter` | string | `null` | Only process source docs with this `doc_type` |
| `filter_generic_labels` | boolean | `true` | Filter generic UI labels in `labels` mode |
Extraction Modes

`raw`: Use the field value as a single element. Best for comparing whole content blocks.

```json
{ "extraction_mode": "raw" }
```

`lines`: Split by newlines. Each line becomes a comparison element. Useful for step-by-step instructions or structured text.

```json
{ "extraction_mode": "lines" }
```

`labels`: Extract UI/feature labels via pattern matching. Identifies instruction patterns ("Click Settings"), em-dash separators ("Label — description"), and action labels ("Configure X"). Generic labels like "Save", "Cancel", "Next" are filtered by default.

```json
{ "extraction_mode": "labels", "filter_generic_labels": true }
```

`list`: Field is already a list of elements. Used directly without extraction.

```json
{ "extraction_mode": "list" }
```
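The four modes can be pictured as a small dispatcher. This is an illustrative sketch, not the stage's implementation; in particular, real `labels` extraction is pattern-based and far richer than the quoted-string heuristic used here, and the generic-label set is an assumed subset:

```python
import re

# Assumed subset of the generic UI labels filtered by default.
GENERIC_LABELS = {"save", "cancel", "next", "ok", "back"}

def extract_elements(value, mode="raw", filter_generic_labels=True):
    """Turn a field value into a list of comparison elements."""
    if mode == "raw":
        return [value]                      # whole field as one element
    if mode == "lines":
        return [ln.strip() for ln in value.splitlines() if ln.strip()]
    if mode == "list":
        return list(value)                  # field is already a list
    if mode == "labels":
        # Naive stand-in: pull quoted labels, then drop generic ones.
        labels = re.findall(r'["\u201c]([^"\u201d]+)["\u201d]', value)
        if filter_generic_labels:
            labels = [l for l in labels if l.lower() not in GENERIC_LABELS]
        return labels
    raise ValueError(f"unknown extraction_mode: {mode}")
```

Each extracted element is then matched independently through the cascade described below, so the choice of mode controls the granularity of findings.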
Matching Cascade
The matching cascade tries each tier in order and stops at the first successful match:
```
For each source element:
├─ exact: Case-insensitive string match → score = 1.0
├─ fuzzy: SequenceMatcher ratio ≥ fuzzy_threshold
├─ semantic: Vector similarity ≥ semantic_threshold
└─ visual: DINOv2 + SigLIP similarity ≥ visual_threshold
```

If no tier matches, the element receives `match_tier: "none"` and the `no_match_label` classification.
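A minimal sketch of the first two tiers using Python's `difflib.SequenceMatcher` (the semantic and visual tiers, which require vector indexes, are omitted; `match_element` is a hypothetical helper, not the stage API):

```python
from difflib import SequenceMatcher

def match_element(element, references, match_tiers=("exact", "fuzzy"),
                  fuzzy_threshold=0.75):
    """Try each tier in order; return on the first successful match."""
    for tier in match_tiers:
        if tier == "exact":
            for ref in references:
                if element.lower() == ref.lower():   # case-insensitive
                    return {"match_tier": "exact", "match_score": 1.0,
                            "reference_match": ref}
        elif tier == "fuzzy":
            best, best_score = None, 0.0
            for ref in references:
                score = SequenceMatcher(None, element.lower(), ref.lower()).ratio()
                if score > best_score:
                    best, best_score = ref, score
            if best_score >= fuzzy_threshold:
                return {"match_tier": "fuzzy", "match_score": best_score,
                        "reference_match": best}
    # No tier matched: element falls through to the no_match_label.
    return {"match_tier": "none", "match_score": 0.0, "reference_match": None}
```

Because the loop returns on the first hit, cheaper tiers listed earlier shield the expensive ones, which is why the recommended tier ordering runs fastest to slowest.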
Configuration Examples
Content Drift Detection
```json
{
  "stage_type": "apply",
  "stage_id": "cross_compare",
  "parameters": {
    "reference_collection_id": "col_documentation",
    "source_field": "content",
    "reference_field": "content",
    "extraction_mode": "labels",
    "match_tiers": ["exact", "fuzzy", "semantic"],
    "include_visual_comparison": true,
    "source_doc_type_filter": "scene",
    "source_location_field": "start_time",
    "classifications": [
      { "min_score": 0.95, "label": "current" },
      { "min_score": 0.75, "label": "needs_review" },
      { "min_score": 0.0, "label": "outdated" }
    ]
  }
}
```
Output Schema
Findings Mode
Each comparison produces a finding document:
```json
{
  "element_type": "text",
  "source_content": "Configure API Keys",
  "source_location": "00:01:23",
  "reference_match": "API Key Configuration",
  "reference_url": "https://docs.example.com/api-keys",
  "match_tier": "fuzzy",
  "match_score": 0.87,
  "classification": "close_match",
  "confidence": 0.92,
  "signals": {
    "context_match": true,
    "workflow_match": false,
    "transcript_match": true
  }
}
```
Enrich Mode
Comparison results are attached as a field on each source document:
```json
{
  "document_id": "doc_source_123",
  "content": "...",
  "comparison_results": [
    {
      "element_type": "text",
      "source_content": "Configure API Keys",
      "match_tier": "fuzzy",
      "match_score": 0.87,
      "classification": "close_match",
      "confidence": 0.92
    }
  ]
}
```
Finding Fields
| Field | Type | Description |
|---|---|---|
| `element_type` | string | Type of element: `text`, `code`, `visual`, or `custom` |
| `source_content` | string | Content from the source document |
| `source_location` | string | Location reference (timestamp, page number) |
| `reference_match` | string | Best matching content from reference |
| `reference_url` | string | URL or ID of matched reference document |
| `match_tier` | string | Tier used: `exact`, `fuzzy`, `semantic`, `visual`, `none` |
| `match_score` | float | Match score (0.0 to 1.0) |
| `classification` | string | Label from classification rules |
| `confidence` | float | Multi-signal confidence (0.0 to 1.0) |
| `signals` | object | Corroborating signals used in confidence |
Performance

| Scenario | Expected Latency | Notes |
|---|---|---|
| Exact + fuzzy only (50 docs) | 50-200ms | In-memory string matching |
| With semantic tier (50 docs) | 200-500ms | Qdrant vector queries |
| With visual comparison (50 docs) | 500-1500ms | Multiple vector queries |
| Large reference set (200 docs) | 300-800ms | More candidates to compare |
Reference documents are fetched once and reused across all source documents. The matching cascade short-circuits at the first successful tier, so ordering `match_tiers` from fastest to slowest (exact → fuzzy → semantic → visual) is optimal.
Limits:

- Max source documents per execution: 50
- Max reference documents fetched: 200 (configurable via `reference_limit`)
Common Pipeline Patterns
Drift Detection Pipeline
```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "features": [{
        "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
        "query_input": "{{INPUT.query}}",
        "top_k": 50
      }],
      "final_top_k": 50
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "cross_compare",
    "parameters": {
      "reference_collection_id": "col_documentation",
      "source_field": "content",
      "reference_field": "content",
      "extraction_mode": "labels",
      "match_tiers": ["exact", "fuzzy", "semantic"],
      "include_visual_comparison": true,
      "classifications": [
        { "min_score": 0.95, "label": "current" },
        { "min_score": 0.75, "label": "needs_review" },
        { "min_score": 0.0, "label": "outdated" }
      ]
    }
  }
]
```
Catalog Matching Pipeline

```json
[
  {
    "stage_type": "filter",
    "stage_id": "feature_search",
    "parameters": {
      "features": [{
        "feature_uri": "mixpeek://text_extractor@v1/multilingual_e5_large_instruct_v1",
        "query_input": "{{INPUT.query}}",
        "top_k": 30
      }],
      "final_top_k": 30
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "cross_compare",
    "parameters": {
      "reference_collection_id": "col_reference_catalog",
      "source_field": "product_name",
      "reference_field": "product_name",
      "match_tiers": ["exact", "fuzzy"],
      "output_mode": "enrich",
      "output_field": "match_result"
    }
  },
  {
    "stage_type": "apply",
    "stage_id": "json_transform",
    "parameters": {
      "template": "{\"product\": \"{{DOC.product_name}}\", \"match_status\": \"{{DOC.match_result[0].classification}}\", \"score\": {{DOC.match_result[0].match_score}}}"
    }
  }
]
```
Error Handling
| Error | Behavior |
|---|---|
| Reference collection not found | Stage fails with error |
| No reference documents found | All elements classified as `no_match` |
| Vector index not available | Semantic/visual tiers skipped silently |
| Source field missing on document | Document skipped |
| Exceeds `max_working_documents` | Extra documents passed through unchanged |
vs Other Enrichment Stages
| Feature | `cross_compare` | `document_enrich` | `api_call` |
|---|---|---|---|
| Purpose | Multi-tier comparison with classification | Simple field join/lookup | External API enrichment |
| Data source | Internal Qdrant collections | Internal Qdrant collections | External HTTP APIs |
| Matching | Cascading: exact → fuzzy → semantic → visual | Top-1 vector or key match | N/A |
| Output | Classified findings with scores | Joined fields | API response |
| Latency | 50-1500ms | 5-20ms | 100-500ms |
| Best for | Drift detection, dedup, compliance | Cross-collection joins | Third-party data |