What is Unified Detection?

Why Unified Detection exists

  • Running Find Description (FD) and Find Everything (FE) separately produced multiple intermediate fields (e.g., vlm_detections, sam_detections) and created the potential for conflicts between them.
  • Unified Detection consolidates all detections into a single source of truth: unified_detections.
  • This ensures downstream operators (Find Similar, Modify Detections) operate consistently and reduces human error.

Key insight

Unified Detection bridges semantic information (what objects are) with category-agnostic completeness (everything that looks like an object), forming the foundation for embedding-driven labeling.


How it works

SAM embedding generation

  • Computes a dense image embedding per sample, stored as leip_embedding.
  • These embeddings capture local texture, shape, and boundary information across the whole image (see the sketch below).
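
The docs above don't pin a specific API for this step; as one concrete possibility, here is a minimal sketch using the open-source segment-anything package, assuming the dense embedding is stored under the leip_embedding field (the checkpoint path and model size are placeholders):

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a SAM backbone (model size and checkpoint path are placeholders).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

# image: an HxWx3 uint8 RGB array for one sample (stand-in shown here).
image = np.zeros((480, 640, 3), dtype=np.uint8)
predictor.set_image(image)

# Dense per-sample embedding: a (1, 256, 64, 64) tensor whose 64x64
# spatial grid covers the image, one 256-d feature vector per location.
leip_embedding = predictor.get_image_embedding().cpu().numpy()
```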

Find Description (FD)

  • Vision-language models (Florence-2, OWL-v2) predict labeled bounding boxes from text prompts.
  • Overlapping predictions are reconciled, with OWL-v2 preferred when confidence scores matter.
  • Outputs are merged into unified_detections (see the sketch below).
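
As a hedged illustration of the OWL-v2 half of this step, the sketch below runs zero-shot detection with an Owlv2 checkpoint published on Hugging Face; the model ID, prompts, and 0.3 threshold are illustrative choices, not values from the pipeline:

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("sample.jpg")          # placeholder image path
texts = [["a car", "a pedestrian"]]       # text prompts; one list per image

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits to labeled boxes in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # PIL size is (w, h)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]
for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(texts[0][label.item()], round(score.item(), 3), box.tolist())
```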

Find Everything (FE)

  • Segment Anything Model (SAM) proposes additional objects, labeled as unknown.
  • A uniform grid of point prompts is used to detect visually coherent objects beyond the VLM predictions.
  • Post-processing includes non-maximum suppression (NMS) and area filtering (see the sketch below).
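
A minimal sketch of grid-prompted proposal generation, using segment-anything's SamAutomaticMaskGenerator; the grid density and NMS/area thresholds shown are illustrative, not values taken from the pipeline:

```python
import numpy as np
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path

# points_per_side lays a uniform 32x32 grid of point prompts over the image;
# box_nms_thresh runs NMS across the proposals, and min_mask_region_area
# drops tiny regions (all thresholds here are illustrative).
generator = SamAutomaticMaskGenerator(
    sam,
    points_per_side=32,
    pred_iou_thresh=0.88,
    box_nms_thresh=0.7,
    min_mask_region_area=100,
)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB sample
proposals = [
    {"label": "unknown", "bbox": m["bbox"], "area": m["area"]}  # bbox is XYWH
    for m in generator.generate(image)
]
```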

Merge rules

  • VLM boxes are skipped if they overlap confirmed detections, so existing work is never overwritten.
  • SAM boxes are added only if novel, and are labeled unknown.
  • Bounding-box area is computed for each detection.
  • Per-detection embeddings are pooled from the SAM feature map (see the sketch below).
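
The exact thresholds and field names aren't given above, so the sketch below assumes XYXY pixel boxes, a 0.5 IoU novelty threshold, and the (256, 64, 64) SAM feature grid from the embedding step; iou, pool_embedding, and merge are hypothetical helpers written for illustration:

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) pixel coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def pool_embedding(feats, box, img_w, img_h):
    """Average-pool the SAM feature grid over the box's footprint."""
    c, gh, gw = feats.shape  # e.g. (256, 64, 64)
    x1 = int(box[0] / img_w * gw); x2 = max(x1 + 1, int(np.ceil(box[2] / img_w * gw)))
    y1 = int(box[1] / img_h * gh); y2 = max(y1 + 1, int(np.ceil(box[3] / img_h * gh)))
    return feats[:, y1:y2, x1:x2].mean(axis=(1, 2))

def merge(confirmed, vlm_boxes, sam_boxes, feats, img_w, img_h, thresh=0.5):
    unified = list(confirmed)
    # 1. VLM boxes: skip any that overlap an already-confirmed detection.
    for det in vlm_boxes:
        if all(iou(det["bbox"], u["bbox"]) < thresh for u in unified):
            unified.append(det)
    # 2. SAM boxes: add only if novel, labeled as unknown.
    for det in sam_boxes:
        if all(iou(det["bbox"], u["bbox"]) < thresh for u in unified):
            unified.append({**det, "label": "unknown"})
    # 3. Area and pooled embedding for every detection in the merged list.
    for det in unified:
        x1, y1, x2, y2 = det["bbox"]
        det["area"] = (x2 - x1) * (y2 - y1)
        det["embedding"] = pool_embedding(feats, det["bbox"], img_w, img_h)
    return unified
```

Processing VLM boxes before SAM boxes makes the precedence order explicit: semantic labels win over unknown proposals whenever both describe the same object.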

Why it matters

  • Provides a single, consistent field (unified_detections) for downstream tasks.
  • Combines semantic grounding (FD) with object completeness (FE).
  • Simplifies subject-matter-expert (SME) workflows: fewer fields to track, less chance of overwriting or missing detections.
  • Ensures clustering and label propagation (Find Similar) have a coherent input.