What is Unified Detection?

Why Unified Detection exists

  • Running Find Description (FD) and Find Everything (FE) separately produced multiple intermediate fields (e.g., vlm_detections, sam_detections) and created the potential for conflicts between them.
  • Unified Detection consolidates all detections into a single source of truth: unified_detections.
  • This ensures downstream operators (Find Similar, Modify Detections) operate consistently and reduces human error.

Key insight

Unified Detection bridges semantic information (what objects are) with category-agnostic completeness (everything that looks like an object), forming the foundation for embedding-driven labeling.


How it works

SAM embedding generation

  • Computes a dense image embedding per sample, stored as leip_embedding.
  • These embeddings capture local texture, shape, and boundary information across the whole image (see the sketch below).
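
The docs above don't pin a specific API for this step; as one concrete possibility, here is a minimal sketch using the open-source segment-anything package, assuming the dense embedding is stored under the leip_embedding field (the checkpoint path and model size are placeholders):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM backbone (model size and checkpoint path are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# image: an HxWx3 uint8 RGB array for one sample (stand-in shown here).
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# Dense per-sample embedding: a (1, 256, 64, 64) tensor whose 64x64
# spatial grid covers the image, one 256-d feature vector per location.
leip_embedding = predictor.get_image_embedding().cpu().numpy()
```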

Find Description (FD)

  • Vision-language models (Florence-2, OWL-v2) predict labeled bounding boxes from text prompts.
  • Overlapping predictions are reconciled, with OWL-v2 preferred when confidence scores matter.
  • Outputs are merged into unified_detections (see the sketch below).
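
As a hedged illustration of the OWL-v2 half of this step, the sketch below runs zero-shot detection with an Owlv2 checkpoint published on Hugging Face; the model ID, prompts, and 0.3 threshold are illustrative choices, not values from the pipeline:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("sample.jpg")          # placeholder image path
texts = [["a car", "a pedestrian"]]       # text prompts; one list per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits to labeled boxes in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (w, h)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(texts[0][label.item()], round(score.item(), 3), box.tolist())
```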

Find Everything (FE)

  • Segment Anything Model (SAM) proposes additional objects, labeled as unknown.
  • A uniform grid of point prompts is used to detect visually coherent objects beyond the VLM predictions.
  • Post-processing includes non-maximum suppression (NMS) and area filtering (see the sketch below).
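
A minimal sketch of grid-prompted proposal generation, using segment-anything's SamAutomaticMaskGenerator; the grid density and NMS/area thresholds shown are illustrative, not values taken from the pipeline:

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path

# points_per_side lays a uniform 32x32 grid of point prompts over the image;
# box_nms_thresh runs NMS across the proposals, and min_mask_region_area
# drops tiny regions (all thresholds here are illustrative).
generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,
    pred_iou_thresh=0.88,
    box_nms_thresh=0.7,
    min_mask_region_area=100,
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB sample
proposals = [
    {"label": "unknown", "bbox": m["bbox"], "area": m["area"]}  # bbox is XYWH
    for m in generator.generate(image)
]
```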

Merge rules

  • VLM boxes are skipped if they overlap confirmed detections, so existing work is never overwritten.
  • SAM boxes are added only if novel, and are labeled unknown.
  • Bounding-box area is computed for each detection.
  • Per-detection embeddings are pooled from the SAM feature map (see the sketch below).
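
The exact thresholds and field names aren't given above, so the sketch below assumes XYXY pixel boxes, a 0.5 IoU novelty threshold, and the (256, 64, 64) SAM feature grid from the embedding step; iou, pool_embedding, and merge are hypothetical helpers written for illustration:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def pool_embedding(feats, box, img_w, img_h):
    """Average-pool the SAM feature grid over the box's footprint."""
    c, gh, gw = feats.shape  # e.g. (256, 64, 64)
    x1 = int(box[0] / img_w * gw); x2 = max(x1 + 1, int(np.ceil(box[2] / img_w * gw)))
    y1 = int(box[1] / img_h * gh); y2 = max(y1 + 1, int(np.ceil(box[3] / img_h * gh)))
    return feats[:, y1:y2, x1:x2].mean(axis=(1, 2))

def merge(confirmed, vlm_boxes, sam_boxes, feats, img_w, img_h, thresh=0.5):
    unified = list(confirmed)
    # 1. VLM boxes: skip any that overlap an already-confirmed detection.
    for det in vlm_boxes:
        if all(iou(det["bbox"], u["bbox"]) < thresh for u in unified):
            unified.append(det)
    # 2. SAM boxes: add only if novel, labeled as unknown.
    for det in sam_boxes:
        if all(iou(det["bbox"], u["bbox"]) < thresh for u in unified):
            unified.append({**det, "label": "unknown"})
    # 3. Area and pooled embedding for every detection in the merged list.
    for det in unified:
        x1, y1, x2, y2 = det["bbox"]
        det["area"] = (x2 - x1) * (y2 - y1)
        det["embedding"] = pool_embedding(feats, det["bbox"], img_w, img_h)
    return unified
```

Processing VLM boxes before SAM boxes makes the precedence order explicit: semantic labels win over unknown proposals whenever both describe the same object.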

Why it matters

  • Provides a single, consistent field (unified_detections) for downstream tasks.
  • Combines semantic grounding (FD) with object completeness (FE).
  • Simplifies subject-matter-expert (SME) workflows: fewer fields to track, less chance of overwriting or missing detections.
  • Ensures clustering and label propagation (Find Similar) have a coherent input.