[YOUR VOICE] The Claim
VLM-only UI detection is the wrong default. Splitting text detection (Apple Vision) from icon detection (fine-tuned YOLO) yields better accuracy at roughly 100x lower cost, and the architecture explains why.
The Mechanism
uitag processes screenshots through a seven-stage pipeline:
- Apple Vision for text and rectangle detection
- YOLO tiled detection for icons and non-text UI elements
- Overlap deduplication across both detection sources
- OCR correction for misread labels
- Text block grouping for multi-line elements
- Set-of-Mark annotation with numbered markers
- JSON manifest generation with bounding boxes, labels, and coordinates
The key architectural insight: text detection and icon detection are fundamentally different problems. Apple Vision already solves text with near-perfect accuracy and sub-second latency. Routing only the unsolved problem (icons) to a heavier model keeps total inference time under 5 seconds while covering 90.8% of UI elements.
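The cross-source deduplication step (stage 3) can be sketched as follows. This is a minimal illustration, not uitag's actual code: it assumes boxes as `(x1, y1, x2, y2)` tuples, an IoU threshold of 0.5, and a policy of preferring the Apple Vision text box when the two sources overlap — all of which are hypothetical choices.

```python
# Hypothetical sketch of overlap deduplication across the two detection
# sources. When a YOLO icon box heavily overlaps an Apple Vision text box,
# the text box wins, since Vision's text localization is the stronger
# signal. Function names and the threshold are illustrative.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def dedupe(text_boxes, icon_boxes, thresh=0.5):
    """Keep all text boxes; drop icon boxes that duplicate one of them."""
    kept = list(text_boxes)
    for box in icon_boxes:
        if all(iou(box, t) < thresh for t in text_boxes):
            kept.append(box)
    return kept
```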
The Evidence
ScreenSpot-Pro benchmark
Tested against 1,581 annotations across 26 professional macOS applications. Metric: center-hit (does any detection's bounding box contain the center of the ground-truth target?).
| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + YOLO | 92.7% | 87.6% | 90.8% |
| YOLO only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
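The center-hit metric behind the table above can be sketched in a few lines. This is a generic reimplementation of the stated definition, not uitag's evaluation harness; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
# Center-hit: a ground-truth element counts as covered if any detection's
# bounding box contains the ground-truth box's center point.

def center_hit(detections, gt_box):
    """True if any detection box contains the center of gt_box."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    return any(d[0] <= cx <= d[2] and d[1] <= cy <= d[3] for d in detections)

def center_hit_rate(detections, gt_boxes):
    """Fraction of ground-truth boxes whose centers are covered."""
    hits = sum(center_hit(detections, g) for g in gt_boxes)
    return hits / len(gt_boxes)
```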
Additional out-of-distribution benchmarks (YOLO only, no Apple Vision):
| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU>=0.5 | 94.0% |
| GroundCUA | Precision@IoU>=0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU>=0.5 | 83.5% |
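Recall and precision at IoU >= 0.5, as reported above, follow the standard detection-evaluation recipe: match each ground-truth box to at most one prediction whose IoU clears the threshold. The sketch below uses greedy one-to-one matching as an illustrative simplification; benchmark harnesses may match differently.

```python
# Sketch of Recall/Precision @ IoU >= 0.5 with greedy one-to-one matching.
# Each GT box claims its best-IoU unmatched prediction above the threshold.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall_precision(preds, gts, thresh=0.5):
    """Return (recall, precision) at the given IoU threshold."""
    matched, tp = set(), 0
    for g in gts:
        best, best_iou = None, thresh
        for i, p in enumerate(preds):
            if i not in matched and iou(p, g) >= best_iou:
                best, best_iou = i, iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    return tp / len(gts), tp / len(preds)
```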
What the YOLO model detects
Nine element classes derived from GroundCUA’s annotation taxonomy:
| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |
Training details
| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| mAP@0.5 (val) | 0.792 |
| Model size | 18 MB |
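The tiling geometry implied by the table (640x640 tiles, 20% overlap, so a 512 px stride) can be sketched as below. This is a hypothetical helper, not uitag's code; real pipelines typically also clamp or pad tiles that extend past small images.

```python
# Tile a screenshot into 640x640 windows with 20% overlap (stride 512),
# adding a final edge-aligned tile so the image boundary is fully covered.

def tile_origins(length, tile=640, overlap=0.2):
    """1-D tile start positions along an axis of the given length."""
    stride = int(tile * (1 - overlap))
    origins = list(range(0, max(length - tile, 0) + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # snap last tile to the edge
    return origins

def tiles(width, height, tile=640, overlap=0.2):
    """All (x1, y1, x2, y2) tile windows for a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]
```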
Augmentation choices reflect the domain: UI elements are axis-aligned, so rotation, flipping, and mosaic are disabled. In diagnostic runs, removing geometric augmentation caused overfitting within 6 epochs.
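The augmentation policy above can be expressed as Ultralytics-style training overrides. The key names below are real Ultralytics YOLO train arguments, but treat this as a sketch of the policy rather than uitag's exact training config.

```python
# Augmentation overrides reflecting the policy described above:
# UI elements are axis-aligned, so geometric augmentation is disabled.
AUGMENT_OVERRIDES = {
    "degrees": 0.0,  # no rotation: rotated UIs never occur in practice
    "fliplr": 0.0,   # no horizontal flip: mirrored layouts are unrealistic
    "flipud": 0.0,   # no vertical flip
    "mosaic": 0.0,   # no mosaic: stitched tiles break layout priors
}

# Usage sketch (assumes `ultralytics` is installed and a tiled dataset
# YAML exists; dataset filename is hypothetical):
# from ultralytics import YOLO
# YOLO("yolo11s.pt").train(data="groundcua_tiles.yaml", epochs=100,
#                          imgsz=640, **AUGMENT_OVERRIDES)
```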
On-device, no API dependency
The entire pipeline runs locally on macOS. No API calls, no cloud inference, no data leaving the device.
Source: github.com/swaylenhayes/uitag
Model: huggingface.co/swaylenhayes/uitag-yolo11s-ui-detect-v1
[YOUR VOICE] Implications
MISSING — What this means for the VLM agent ecosystem. Who should care about the detection/intelligence split.
Open Questions
- How does coverage degrade on non-standard UI frameworks (Electron with custom components, game UIs)?
- What’s the minimum training set size for domain-specific YOLO fine-tuning?
- Can the architecture generalize to mobile screenshots (iOS, Android) with a different Vision backend?
Reference Documents
| Document | What it covers |
|---|---|
| uitag README | Full pipeline documentation, installation, benchmarks |
| YOLO model card | Model architecture, training data, performance metrics |
| ScreenSpot-Pro methodology | MISSING — per-application breakdown and evaluation protocol |