uitag

What it does

A Set-of-Mark (SoM) detection system that converts macOS screenshots into structured element maps. It combines Apple Vision's text and rectangle detection with a fine-tuned YOLO model for icons.
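To make "structured element map" concrete, here is a minimal sketch of how detected elements might be numbered and serialized. The field names (id, kind, label, bbox) and the build_manifest helper are hypothetical illustrations, not uitag's actual output schema.

```python
import json


def build_manifest(elements, image_size):
    """Number elements in reading order and return a JSON-ready dict.

    `elements` is a list of dicts with a "kind", an optional "label",
    and a "bbox" of (x1, y1, x2, y2) pixels. All names are hypothetical.
    """
    # Sort top-to-bottom, then left-to-right, so mark ids follow reading order.
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    marks = [
        {
            "id": i,  # the number drawn on the annotated screenshot
            "kind": e["kind"],
            "label": e.get("label", ""),
            "bbox": list(e["bbox"]),
        }
        for i, e in enumerate(ordered, start=1)
    ]
    width, height = image_size
    return {"image": {"width": width, "height": height}, "elements": marks}


manifest = build_manifest(
    [
        {"kind": "icon", "bbox": (300, 40, 324, 64)},
        {"kind": "text", "label": "File", "bbox": (10, 4, 40, 20)},
    ],
    image_size=(1440, 900),
)
manifest_json = json.dumps(manifest, indent=2)
```

An agent consuming such a manifest can refer to an element by its mark number instead of by pixel coordinates.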

Architecture

Seven-stage pipeline: Apple Vision → YOLO tiled detection → overlap deduplication → OCR correction → text block grouping → SoM annotation → JSON manifest generation.
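The tiling and overlap-deduplication stages can be sketched as follows. This is an illustration under stated assumptions, not uitag's implementation: the Box type, the tile size and overlap, the 0.5 IoU threshold, and the rule that Vision boxes take priority over YOLO boxes are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float
    source: str  # "vision" or "yolo"


def tile_origins(length: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of overlapping tiles covering [0, length).

    Assumes 0 <= overlap < tile.
    """
    if length <= tile:
        return [0]
    step = tile - overlap
    origins = list(range(0, length - tile, step))
    origins.append(length - tile)  # last tile sits flush with the far edge
    return origins


def iou(a: Box, b: Box) -> float:
    """Intersection over union; 0.0 when the boxes do not overlap."""
    iw = min(a.x2, b.x2) - max(a.x1, b.x1)
    ih = min(a.y2, b.y2) - max(a.y1, b.y1)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    return inter / (area_a + area_b - inter)


def deduplicate(vision: list[Box], yolo: list[Box], threshold: float = 0.5) -> list[Box]:
    """Keep every Vision box; add a YOLO box only if it overlaps nothing kept.

    Because each accepted YOLO box joins `kept`, duplicates produced by
    overlapping tiles are also suppressed.
    """
    kept = list(vision)
    for cand in yolo:
        if all(iou(cand, k) < threshold for k in kept):
            kept.append(cand)
    return kept


origins = tile_origins(2000, tile=640, overlap=64)
merged = deduplicate(
    [Box(0, 0, 10, 10, "vision")],
    [Box(1, 1, 10, 10, "yolo"), Box(20, 20, 30, 30, "yolo")],
)
```

Overlapping tiles let small icons near a tile boundary appear whole in at least one tile, at the cost of duplicate detections that the deduplication step then has to remove.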

Key numbers

Metric            Vision-only   Vision + YOLO
Element coverage  57.3%         90.8%
Icon coverage     42.5%         87.6%
Processing time   ~1 s          ~5 s

Benchmarked on ScreenSpot-Pro: 1,581 annotations across 26 professional macOS applications.
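A coverage figure of this kind can be computed as the fraction of ground-truth annotations matched by at least one detection. The sketch below assumes a center-in-box matching rule; the benchmark's actual matching criterion is not stated here, so treat the rule and the function names as hypothetical.

```python
def center_inside(det, ann):
    """True if the center of detection box `det` lies within annotation `ann`.

    Boxes are (x1, y1, x2, y2) tuples in pixels.
    """
    cx = (det[0] + det[2]) / 2
    cy = (det[1] + det[3]) / 2
    return ann[0] <= cx <= ann[2] and ann[1] <= cy <= ann[3]


def coverage(annotations, detections):
    """Fraction of annotations hit by at least one detection (0.0 if none)."""
    if not annotations:
        return 0.0
    covered = sum(
        1 for ann in annotations
        if any(center_inside(det, ann) for det in detections)
    )
    return covered / len(annotations)


# Two annotated targets, one of which has a detection centered inside it.
example = coverage(
    annotations=[(0, 0, 10, 10), (100, 100, 110, 110)],
    detections=[(2, 2, 6, 6)],
)
```

Under this rule, extra detections never lower the score, so coverage measures recall only and says nothing about false positives.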

Technical details

  • Runs entirely on-device (no API calls)
  • Bundled YOLO model: 18 MB, trained on GroundCUA (MIT-licensed)
  • Python 3.10+, macOS required
  • 134 unit tests
  • Install with: pip install uitag

Status

Active. First Hugging Face model published 2026-03-29: uitag-yolo11s-ui-detect-v1