What is Unified Detection?
Why Unified Detection exists
- Running Find Description (FD) and Find Everything (FE) separately created multiple intermediate fields and potential conflicts (e.g., vlm_detections, sam_detections).
- Unified Detection consolidates all detections into a single source of truth: unified_detections.
- This ensures downstream operators (Find Similar, Modify Detections) operate consistently and reduces human error.
Key insight
Unified Detection bridges semantic information (what objects are) with category-agnostic completeness (everything that looks like an object), forming the foundation for embedding-driven labeling.
How it works
SAM embedding generation
- Computes a dense image embedding per sample (leip_embedding), as in the sketch below.
- These embeddings capture local texture, shape, and boundary information for every pixel.
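A minimal sketch of this step using the segment-anything package. The checkpoint filename is a placeholder and compute_leip_embedding is an illustrative helper name, not the pipeline's actual API; only the leip_embedding field name comes from the text.

```python
import numpy as np
import torch
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def compute_leip_embedding(image_rgb: np.ndarray) -> torch.Tensor:
    """Return SAM's dense image embedding for one sample.

    The result is a (1, 256, 64, 64) feature map: one 256-dim
    descriptor per spatial cell, capturing texture, shape, and
    boundary cues across the whole image.
    """
    predictor.set_image(image_rgb)           # runs the ViT encoder once
    return predictor.get_image_embedding()   # cached encoder output
```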
Find Description (FD)
- Vision-language models (Florence-2, OWL-v2) predict labeled bounding boxes from prompts.
- Overlapping predictions are reconciled, preferring OWL-v2 when confidence matters (see the sketch below).
- Outputs are merged into unified_detections.
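The text doesn't spell out how overlap is measured, so the sketch below assumes IoU with a 0.5 threshold and simply keeps the OWL-v2 box whenever the two models collide; reconcile_fd and the detection-dict shape are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def reconcile_fd(owl_dets, florence_dets, iou_thresh=0.5):
    """Merge the two VLMs' boxes, preferring OWL-v2 on overlap.

    Each detection is a dict with at least 'bbox' and 'label'.
    Florence-2 boxes are kept only where no OWL-v2 box overlaps.
    """
    merged = list(owl_dets)
    for det in florence_dets:
        if all(iou(det["bbox"], kept["bbox"]) < iou_thresh for kept in merged):
            merged.append(det)
    return merged  # written into unified_detections downstream
```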
Find Everything (FE)
- The Segment Anything Model (SAM) proposes additional objects, labeled as unknown.
- Uniform grid points act as prompts to surface visually coherent objects beyond the VLM predictions.
- Post-processing includes non-maximum suppression (NMS) and area filtering, as in the sketch below.
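segment-anything ships SamAutomaticMaskGenerator, which implements exactly this grid-prompt scheme with built-in NMS and area filtering. The parameter values below are illustrative defaults, not the pipeline's tuned settings; sam and image_rgb come from the earlier sketch.

```python
from segment_anything import SamAutomaticMaskGenerator

mask_generator = SamAutomaticMaskGenerator(
    sam,                       # SAM model loaded earlier
    points_per_side=32,        # 32x32 uniform grid of point prompts
    pred_iou_thresh=0.88,      # drop low-quality masks
    box_nms_thresh=0.7,        # NMS over overlapping proposals
    min_mask_region_area=100,  # area filtering, in pixels
)

proposals = mask_generator.generate(image_rgb)

# SAM reports 'bbox' as XYWH; convert to (x1, y1, x2, y2) and tag
# every proposal as "unknown" ahead of the merge step.
fe_detections = [
    {
        "bbox": (p["bbox"][0], p["bbox"][1],
                 p["bbox"][0] + p["bbox"][2], p["bbox"][1] + p["bbox"][3]),
        "label": "unknown",
    }
    for p in proposals
]
```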
Merge rules
- VLM boxes are skipped if they overlap confirmed detections (to avoid overwriting them).
- SAM boxes are added only if novel, labeled as unknown.
- The bounding-box area is computed for each detection.
- Per-detection embeddings are pooled from the SAM features (see the sketch below).
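Putting the rules together, a sketch that reuses iou() from the FD example. Two assumptions to flag: "novel" is taken to mean below an IoU threshold against everything already merged, and per-detection embeddings are mean-pooled from the SAM feature map over each box; the helper names are hypothetical.

```python
import torch

def pool_box_embedding(feat_map: torch.Tensor, bbox, image_wh) -> torch.Tensor:
    """Mean-pool the (1, 256, 64, 64) SAM feature map over one box.

    SAM pads inputs to a square of the longest side before encoding,
    so pixel coordinates are scaled by that side rather than by width
    and height independently. bbox is (x1, y1, x2, y2) in pixels.
    """
    _, _, fh, fw = feat_map.shape
    scale = max(image_wh)
    x1 = int(bbox[0] / scale * fw)
    y1 = int(bbox[1] / scale * fh)
    x2 = max(x1 + 1, int(bbox[2] / scale * fw))
    y2 = max(y1 + 1, int(bbox[3] / scale * fh))
    return feat_map[0, :, y1:y2, x1:x2].mean(dim=(1, 2))  # (256,) vector

def merge_unified(confirmed, vlm_dets, sam_dets, feat_map, image_wh,
                  iou_thresh=0.5):
    """Apply the merge rules to produce unified_detections."""
    unified = list(confirmed)
    # Rule 1: skip VLM boxes that overlap confirmed detections.
    for det in vlm_dets:
        if all(iou(det["bbox"], u["bbox"]) < iou_thresh for u in unified):
            unified.append(det)
    # Rule 2: add SAM boxes only if novel, keeping the "unknown" label.
    for det in sam_dets:
        if all(iou(det["bbox"], u["bbox"]) < iou_thresh for u in unified):
            unified.append({**det, "label": "unknown"})
    # Rules 3-4: bbox area and a pooled embedding for every detection.
    for det in unified:
        x1, y1, x2, y2 = det["bbox"]
        det["area"] = (x2 - x1) * (y2 - y1)
        det["embedding"] = pool_box_embedding(feat_map, det["bbox"], image_wh)
    return unified
```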
Why it matters
- Provides a single, consistent field (unified_detections) for downstream tasks.
- Combines semantic grounding (FD) with object completeness (FE).
- Simplifies SME workflows: fewer fields to track, less chance of overwriting or missing detections.
- Ensures clustering and label propagation (Find Similar) receive a coherent input.