uitag

What it does

A Set-of-Mark (SoM) detection system that converts macOS screenshots into structured element maps. It combines Apple Vision’s text and rectangle detection with a fine-tuned YOLO model for icon detection.

Architecture

Seven-stage pipeline: Apple Vision → YOLO tiled detection → overlap deduplication → OCR correction → text block grouping → SoM annotation → JSON manifest generation.
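
The overlap deduplication stage can be illustrated with a small sketch. The source does not say how uitag merges detections, so what follows is an assumption: a greedy intersection-over-union (IoU) filter that keeps the highest-scoring box and drops any later box overlapping it. The Box class, the source label, the 0.5 threshold, and the sample coordinates are all hypothetical and are not taken from uitag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates with a detector score (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float
    source: str  # hypothetical label, e.g. "vision" or "yolo"


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x2, b.x2) - max(a.x1, b.x1)
    inter_h = min(a.y2, b.y2) - max(a.y1, b.y1)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    return inter / (area_a + area_b - inter)


def dedupe(boxes: list[Box], iou_threshold: float = 0.5) -> list[Box]:
    """Visit boxes from highest to lowest score; keep a box only if it
    overlaps every already-kept box by less than the threshold."""
    kept: list[Box] = []
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


# Made-up detections: the first two describe the same on-screen element.
detections = [
    Box(10, 10, 50, 50, 0.9, "vision"),
    Box(12, 12, 52, 52, 0.8, "yolo"),
    Box(200, 200, 240, 240, 0.7, "yolo"),
]
unique = dedupe(detections)
print([(b.source, b.x1, b.y1) for b in unique])
```

In this sample the second box overlaps the first heavily, so only the first and third survive. A real pipeline might prefer one detector over the other, or merge boxes instead of dropping them; that choice is not documented here.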

Key numbers

Metric	Vision-only	Vision + YOLO
Element coverage	57.3%	90.8%
Icon coverage	42.5%	87.6%
Processing time	~1s	~5s

Benchmarked on ScreenSpot-Pro: 1,581 annotations across 26 professional macOS applications.
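
The trade-off in the table can be worked out directly. This short calculation uses only the figures listed above; the processing times are the approximate values from the table, so the ratio is approximate too.

```python
# Figures copied from the table above (percent coverage, approximate seconds).
vision_only = {"element": 57.3, "icon": 42.5, "seconds": 1.0}
vision_yolo = {"element": 90.8, "icon": 87.6, "seconds": 5.0}

# Coverage gains in percentage points, rounded to the table's precision.
element_gain = round(vision_yolo["element"] - vision_only["element"], 1)
icon_gain = round(vision_yolo["icon"] - vision_only["icon"], 1)

# Relative cost in processing time.
slowdown = vision_yolo["seconds"] / vision_only["seconds"]

print(f"Element coverage: +{element_gain} percentage points")
print(f"Icon coverage: +{icon_gain} percentage points")
print(f"Processing time: about {slowdown:.0f}x")
```

Adding YOLO raises element coverage by 33.5 percentage points and icon coverage by 45.1 percentage points, at roughly five times the processing time.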

Technical details

Runs entirely on-device (no API calls)
Bundled YOLO model: 18 MB, trained on GroundCUA (MIT-licensed)
Requires Python 3.10+ and macOS
134 unit tests
Install: pip install uitag
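
Since the package needs both Python 3.10+ and macOS, a quick pre-install check can save a failed setup. This is a minimal sketch using only the standard library; the function name meets_requirements is hypothetical and is not part of uitag.

```python
import platform
import sys


def meets_requirements(version: tuple[int, int], system: str) -> bool:
    """True when the interpreter is 3.10 or newer and the OS reports as macOS.

    platform.system() returns "Darwin" on macOS.
    """
    return version >= (3, 10) and system == "Darwin"


if __name__ == "__main__":
    ok = meets_requirements(sys.version_info[:2], platform.system())
    print("ready for `pip install uitag`" if ok else "needs Python 3.10+ on macOS")
```

Passing the version and system name as arguments keeps the check easy to test on any machine.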

Status

Active. First Hugging Face model published 2026-03-29: uitag-yolo11s-ui-detect-v1