In this tutorial, you will:

- Start with unlabeled data
- Use feature extraction to find relevant items
- Manually label a small reference set
- Automatically classify new items based on the reference set
- Create a self-improving system that gets better over time

## Overview
This tutorial demonstrates two approaches to building an auto-labeling system:

- Option A: Unified Approach (Recommended) - a single bucket/collection that grows smarter over time
- Option B: Separate Approach - a dedicated reference set kept separate from production data

Both approaches follow the same core workflow:

- Upload unlabeled data with feature extraction
- Manually label a small reference set (10-20 examples per category)
- Configure taxonomy to auto-label new items based on similarity
- Review and label unknowns to continuously improve

## Use Cases
- Product Recognition: Label product images, auto-tag new inventory
- People Identification: Build a face recognition system from photos
- Document Classification: Categorize documents by type or topic
- Object Detection: Label objects in images for training data

## Option A: Unified Approach (Recommended)
The unified approach uses a single bucket and collection that references itself. As you label items, they immediately become part of the reference set for future matches.

### Step 1: Create Bucket and Collection

Create a bucket and collection with self-referencing taxonomy:
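The exact calls depend on your platform. Below is a minimal sketch using Python's `requests` against a hypothetical REST API; the base URL, endpoint paths, and field names are illustrative assumptions, not a real SDK:

```python
import requests

# Hypothetical endpoints and field names, for illustration only.
BASE = "https://api.example.com/v1"

# Bucket: where raw objects (e.g., images) are uploaded.
bucket = requests.post(f"{BASE}/buckets", json={"name": "products"}).json()

# Collection: processed documents with extracted features.
collection = requests.post(f"{BASE}/collections", json={
    "name": "products",
    "source_bucket_id": bucket["id"],
    "feature_extractors": ["image_embedding"],
}).json()

# Self-referencing taxonomy: match new items against labeled
# documents in this same collection.
requests.post(f"{BASE}/taxonomies", json={
    "name": "product-labels",
    "collection_id": collection["id"],
    "reference_collection_id": collection["id"],  # points at itself
    "confidence_threshold": 0.30,
    "filters": {"label": {"$ne": None}},  # only match labeled items
    "field_passthrough": ["label"],       # copy label onto matches
})
```

The filter on non-null labels is what makes self-reference safe: unlabeled items can never act as match candidates.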
### Step 2: Upload Initial Unlabeled Data

Upload your unlabeled items to the bucket; feature extraction runs automatically on ingest:
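A sketch with the same assumed endpoints:

```python
import pathlib
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
BUCKET_ID = "bkt_123"                # from Step 1

# Upload every image in a folder, with no label attached yet.
for path in pathlib.Path("unlabeled_images").glob("*.jpg"):
    with open(path, "rb") as f:
        requests.post(f"{BASE}/buckets/{BUCKET_ID}/objects",
                      files={"file": (path.name, f)})
```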

### Step 3: Manually Label Reference Set
Query documents and label them (see the sketch after this list). Guidelines:

- Label 10-20 examples per category minimum
- Include diverse examples (angles, lighting, backgrounds)
- Use consistent naming conventions
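
A labeling sketch, again with the assumed endpoints (the document IDs and labels are illustrative):

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
COLLECTION_ID = "col_123"

# Fetch a batch of unlabeled documents for manual review.
docs = requests.post(
    f"{BASE}/collections/{COLLECTION_ID}/documents/query",
    json={"filters": {"label": None}, "limit": 20},
).json()["documents"]

# After inspecting each item, write the label back to its document.
labels = {"doc_001": "Red Running Shoes", "doc_002": "Blue Trail Sneakers"}
for doc_id, label in labels.items():
    requests.patch(f"{BASE}/collections/{COLLECTION_ID}/documents/{doc_id}",
                   json={"label": label})
```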

### Step 4: Upload New Items - Auto-Labeling Works!
Now that you have labeled examples, new uploads are labeled automatically (see the sketch after this list):

- Feature extraction runs on the new image
- Taxonomy searches your labeled items for similar matches
- If similarity > 0.30 → auto-labels (e.g., "Red Running Shoes")
- If similarity < 0.30 → leaves the label as `null` for manual review
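
A sketch of the upload-and-check loop; the `document_id`, `label`, and `confidence` response fields are assumptions:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
BUCKET_ID = "bkt_123"

# Upload a new image; the taxonomy auto-labels it if a close match exists.
with open("new_shoe.jpg", "rb") as f:
    obj = requests.post(f"{BASE}/buckets/{BUCKET_ID}/objects",
                        files={"file": ("new_shoe.jpg", f)}).json()

# Inspect the resulting document: either an auto-label or null.
doc = requests.get(f"{BASE}/documents/{obj['document_id']}").json()
if doc["label"] is not None:
    print(f"Auto-labeled: {doc['label']} (confidence {doc['confidence']:.2f})")
else:
    print("No confident match; queued for manual review")
```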

### Step 5: Review and Label Unknowns
Find items that need manual labeling:
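A sketch using the same assumed query endpoint:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
COLLECTION_ID = "col_123"

# Items the taxonomy could not label confidently (label is still null).
unknowns = requests.post(
    f"{BASE}/collections/{COLLECTION_ID}/documents/query",
    json={"filters": {"label": None}, "limit": 50},
).json()["documents"]

print(f"{len(unknowns)} items awaiting manual review")
```

Every label you add here immediately strengthens future matches, since the taxonomy references the same collection.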

## Option B: Separate Approach

For more control, keep reference data separate from production data:

- Reference bucket/collection: curated, high-quality labeled examples
- Production bucket/collection: all data, with auto-labels

Choose this approach when you:

- Need strict quality control on the reference set
- Want to prevent noisy auto-labels from affecting matching
- Prefer to manually review items before promoting them to the reference set

### Step 1: Create Reference Bucket and Collection

Create a dedicated bucket and collection for the curated reference examples:
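A sketch mirroring Option A's Step 1, but without the self-referencing taxonomy (same assumed endpoints):

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL

ref_bucket = requests.post(f"{BASE}/buckets",
                           json={"name": "reference"}).json()
ref_collection = requests.post(f"{BASE}/collections", json={
    "name": "reference",
    "source_bucket_id": ref_bucket["id"],
    "feature_extractors": ["image_embedding"],
}).json()
```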

### Step 2: Upload and Label Reference Set
Upload 50-100 curated images to the reference bucket and manually label them:
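A sketch reusing the hypothetical endpoints from Option A; the folder layout that encodes each label is also an illustrative assumption:

```python
import pathlib
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
REF_BUCKET_ID = "bkt_ref"            # from Step 1

# Assumed folder layout encodes the label: reference/<label>/<image>.jpg
for path in pathlib.Path("reference").glob("*/*.jpg"):
    with open(path, "rb") as f:
        requests.post(
            f"{BASE}/buckets/{REF_BUCKET_ID}/objects",
            files={"file": (path.name, f)},
            data={"label": path.parent.name},  # e.g., "Red Running Shoes"
        )
```

### Step 3: Create Production Bucket and Collection

Create the production bucket and collection, with a taxonomy that matches against the reference collection rather than against itself. A sketch with the same assumed endpoints:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
REF_COLLECTION_ID = "col_ref"        # curated reference collection

prod_bucket = requests.post(f"{BASE}/buckets",
                            json={"name": "production"}).json()
prod_collection = requests.post(f"{BASE}/collections", json={
    "name": "production",
    "source_bucket_id": prod_bucket["id"],
    "feature_extractors": ["image_embedding"],
}).json()

# The taxonomy matches production items against the curated reference set.
requests.post(f"{BASE}/taxonomies", json={
    "name": "production-labels",
    "collection_id": prod_collection["id"],
    "reference_collection_id": REF_COLLECTION_ID,
    "confidence_threshold": 0.30,
    "field_passthrough": ["label"],
})
```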

### Step 4: Upload Production Data
New uploads auto-label based on the reference set:
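A sketch of production ingest (assumed endpoints):

```python
import pathlib
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
PROD_BUCKET_ID = "bkt_prod"

# Ingest incoming items; the taxonomy labels whatever clears the threshold.
for path in pathlib.Path("incoming").glob("*.jpg"):
    with open(path, "rb") as f:
        requests.post(f"{BASE}/buckets/{PROD_BUCKET_ID}/objects",
                      files={"file": (path.name, f)})
```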

### Step 5: Promote High-Confidence Items to Reference

Periodically review production data and promote high-confidence matches:
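A promotion sketch; the `$gte` filter syntax and `source_document_id` field are assumptions:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
PROD_COLLECTION_ID = "col_prod"
REF_BUCKET_ID = "bkt_ref"

# Find production items that were auto-labeled with very high confidence.
candidates = requests.post(
    f"{BASE}/collections/{PROD_COLLECTION_ID}/documents/query",
    json={"filters": {"confidence": {"$gte": 0.60}}, "limit": 100},
).json()["documents"]

# After a human spot-check, copy approved items into the reference bucket
# so they strengthen future matching.
for doc in candidates:
    requests.post(f"{BASE}/buckets/{REF_BUCKET_ID}/objects", json={
        "source_document_id": doc["id"],
        "label": doc["label"],
    })
```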

## Real-World Examples

### Example 1: Face Recognition System

Label a few photos of each person; new photos of the same people are then auto-tagged:
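A sketch of seeding the reference set (IDs and names are illustrative; the bucket/collection setup matches Option A):

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
COLLECTION_ID = "col_photos"

# Seed labels: a few photos per person. New photos of the same person
# are then auto-labeled by the self-referencing taxonomy.
seed_labels = {
    "doc_101": "Alice", "doc_102": "Alice",
    "doc_201": "Bob",   "doc_202": "Bob",
}
for doc_id, person in seed_labels.items():
    requests.patch(f"{BASE}/collections/{COLLECTION_ID}/documents/{doc_id}",
                   json={"label": person})
```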

### Example 2: Document Classification

The same workflow categorizes documents by type or topic; only the feature extractor changes:
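A sketch; the `text_embedding` extractor name is an assumption:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL

bucket = requests.post(f"{BASE}/buckets", json={"name": "docs"}).json()
collection = requests.post(f"{BASE}/collections", json={
    "name": "docs",
    "source_bucket_id": bucket["id"],
    "feature_extractors": ["text_embedding"],  # text instead of image
}).json()

# From here the workflow is identical to Option A: label 10-20 documents
# per type (invoice, contract, report, ...) and let the taxonomy take over.
```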

## Advanced Configuration
### Tuning Confidence Thresholds
The `confidence_threshold` determines how conservative auto-labeling is:

| Threshold | Behavior | Use Case |
|---|---|---|
| 0.20-0.25 | Aggressive | High recall, more false positives |
| 0.30-0.35 | Balanced | Good starting point |
| 0.40-0.50 | Conservative | High precision, fewer auto-labels |
| 0.60+ | Very strict | Only exact matches |

To tune the threshold (a snippet for adjusting it follows this list):

- Start with `0.30`
- Monitor the false positive rate (wrong auto-labels)
- Check coverage (% of items auto-labeled)
- Adjust based on the cost of errors:
  - High cost of errors (e.g., medical imaging) → higher threshold
  - Low cost of errors (e.g., photo organization) → lower threshold
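
Adjusting the threshold might look like this, assuming taxonomies can be updated in place via the hypothetical API used throughout this tutorial:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
TAXONOMY_ID = "tax_123"              # from your taxonomy setup

# Raise the threshold after observing too many wrong auto-labels.
requests.patch(f"{BASE}/taxonomies/{TAXONOMY_ID}",
               json={"confidence_threshold": 0.40})
```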

### Monitoring & Analytics
Track performance with these queries (see the sketch after this list):

- Auto-label coverage: % of new items auto-labeled
- Manual review queue: # of items with `label: null`
- Confidence distribution: are matches clustered around the threshold?
- False positive rate: sample and manually verify auto-labels
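
A monitoring sketch; the count-only query behavior is an assumption:

```python
import requests

BASE = "https://api.example.com/v1"  # placeholder base URL
COLLECTION_ID = "col_123"

def count(filters):
    """Count documents matching a filter (assumed count-only query)."""
    resp = requests.post(
        f"{BASE}/collections/{COLLECTION_ID}/documents/query",
        json={"filters": filters, "count_only": True},
    )
    return resp.json()["total"]

total = count({})
labeled = count({"label": {"$ne": None}})
print(f"Auto-label coverage: {labeled / max(total, 1):.1%}")
print(f"Manual review queue: {total - labeled} items awaiting labels")
```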

## Best Practices
Reference set quality:

- Include diverse examples (angles, lighting, backgrounds)
- Use consistent naming conventions
- Aim for balanced distribution across categories
- Maintain high-quality, unambiguous images

Taxonomy design:

- Create a labeling style guide
- Consider hierarchical labels: "Shoes > Running > Red"
- Define rules for edge cases
- Version your taxonomy as it evolves

Ongoing maintenance:

- Review unknowns regularly
- Audit auto-labels periodically
- Add corrected examples when the system makes mistakes
- Expand categories as needed

Production deployment:

- Start with a conservative threshold (0.40+)
- Implement human-in-the-loop review for critical applications
- Enable a feedback mechanism for corrections
- A/B test threshold changes

## Troubleshooting
### Too many unlabeled items
Causes: threshold too high, insufficient reference examples, new categories

Solutions:

- Lower `confidence_threshold` to 0.25-0.30
- Add 20+ examples per category to the reference set
- Review and label new categories

### False positives (wrong labels)
Causes: threshold too low, similar categories, poor-quality references

Solutions:

- Raise `confidence_threshold` to 0.40+
- Add diverse examples to distinguish categories
- Clean up the reference set

### System not self-improving
Causes: labels not syncing, configuration issues

Solutions:

- Verify `field_passthrough` includes the label field
- Check retriever filters for non-null labels
- Confirm bucket-to-collection sync is working

## Summary
Workflow:

- Create a bucket and collection with feature extraction
- Upload unlabeled data (50-100 items)
- Manually label a reference set (10-20 per category)
- Create a taxonomy retriever pointing to labeled items
- New uploads auto-label based on similarity
- Review and label unknowns to improve the system

Benefits:

- Start with zero labels and build incrementally
- Automate repetitive labeling
- Self-improving with each manual correction
- Scales from dozens of items to millions

Recommendations:

- Choose the unified (simpler) or separate (more control) approach
- Start with 50-100 reference items
- Test different confidence thresholds (start at 0.30)
- Monitor auto-label quality and adjust

