What it does
Set-of-Mark detection system that converts macOS screenshots into structured element maps. Combines Apple Vision's text and rectangle detection with a fine-tuned YOLO model for icons.
Architecture
Seven-stage pipeline: Apple Vision → YOLO tiled detection → overlap deduplication → OCR correction → text block grouping → SoM annotation → JSON manifest generation.
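To make the tiled detection and overlap deduplication stages concrete, here is a minimal sketch of one common way to do both. It is not uitag's actual implementation: the function names (`make_tiles`, `iou`, `dedupe`), the 640 px tile size, the 20% tile overlap, and the 0.5 IoU threshold are all illustrative assumptions, and the detections are plain dicts rather than any real uitag type.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def make_tiles(width: int, height: int, tile: int = 640,
               overlap: float = 0.2) -> List[Tuple[int, int, int, int]]:
    """Split an image into overlapping tiles (hypothetical sizes)."""
    step = max(1, int(tile * (1 - overlap)))

    def starts(extent: int) -> List[int]:
        # A single tile is enough when the image fits inside it.
        if extent <= tile:
            return [0]
        offsets = list(range(0, extent - tile, step))
        offsets.append(extent - tile)  # last tile sits flush with the edge
        return offsets

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def dedupe(detections: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Greedy deduplication: keep the highest-scoring box, drop any
    later box whose IoU with an already kept box reaches the threshold."""
    kept: List[Dict] = []
    for det in sorted(detections, key=lambda d: d["score"], reverse=True):
        if all(iou(det["box"], k["box"]) < threshold for k in kept):
            kept.append(det)
    return kept


tiles = make_tiles(1000, 640)
detections = [
    {"box": (0, 0, 10, 10), "score": 0.9},
    {"box": (1, 1, 11, 11), "score": 0.8},   # near-duplicate of the first
    {"box": (50, 50, 60, 60), "score": 0.7},
]
unique = dedupe(detections)
```

Tiling keeps small icons large enough for the detector on high-resolution screenshots, at the cost of duplicate boxes where tiles overlap, which is why a deduplication pass follows it.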
Key numbers
| Metric | Vision-only | Vision + YOLO |
|---|---|---|
| Element coverage | 57.3% | 90.8% |
| Icon coverage | 42.5% | 87.6% |
| Processing time | ~1s | ~5s |
Benchmarked on ScreenSpot-Pro: 1,581 annotations across 26 professional macOS applications.
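The matching rule behind the coverage percentages is not stated here. As a sketch of one plausible definition, the example below counts an annotation as covered when its center point falls inside at least one detected box. The rule itself and the names `center_in_box` and `coverage` are assumptions for illustration, and the boxes are made-up toy values, not benchmark data.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def center_in_box(target: Box, detection: Box) -> bool:
    """True if the center of `target` lies inside `detection`."""
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2
    return (detection[0] <= cx <= detection[2]
            and detection[1] <= cy <= detection[3])


def coverage(targets: List[Box], detections: List[Box]) -> float:
    """Percentage of targets matched by at least one detection
    (assumed center-in-box rule, not a confirmed benchmark protocol)."""
    if not targets:
        return 0.0
    hits = sum(1 for t in targets
               if any(center_in_box(t, d) for d in detections))
    return 100.0 * hits / len(targets)


# Toy example: one of two annotated elements is detected.
targets = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(2, 2, 8, 8)]
score = coverage(targets, detections)
```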
Technical details
- Runs entirely on-device (no API calls)
- Bundled YOLO model: 18 MB, trained on GroundCUA (MIT-licensed)
- Python 3.10+, macOS required
- 134 unit tests
pip install uitag
Status
Active. First Hugging Face model published 2026-03-29: uitag-yolo11s-ui-detect-v1