[YOUR VOICE] The Claim
VLM-only UI detection is the wrong default. Splitting text detection (Apple Vision) from icon detection (fine-tuned YOLO) yields better accuracy at roughly 100x lower cost, and the architecture explains why.
The Mechanism
uitag processes screenshots through a seven-stage pipeline:
- Apple Vision for text and rectangle detection
- YOLO tiled detection for icons and non-text UI elements
- Overlap deduplication across both detection sources
- OCR correction for misread labels
- Text block grouping for multi-line elements
- Set-of-Mark annotation with numbered markers
- JSON manifest generation with bounding boxes, labels, and coordinates
The key architectural insight: text detection and icon detection are fundamentally different problems. Apple Vision already solves text with near-perfect accuracy and sub-second latency. Routing only the unsolved problem (icons) to a heavier model keeps total inference time under 5 seconds while covering 90.8% of UI elements.
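The cross-source deduplication step (stage 3) can be sketched as follows. This is a minimal illustration, not uitag's actual code: it assumes boxes as `(x1, y1, x2, y2)` tuples, an IoU threshold of 0.5, and a policy of preferring the Apple Vision text box when the two sources overlap — all of which are hypothetical choices.

```python
# Hypothetical sketch of overlap deduplication across the two detection
# sources. When a YOLO icon box heavily overlaps an Apple Vision text box,
# the text box wins, since Vision's text localization is the stronger
# signal. Function names and the threshold are illustrative.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def dedupe(text_boxes, icon_boxes, thresh=0.5):
    """Keep all text boxes; drop icon boxes that duplicate one of them."""
    kept = list(text_boxes)
    for box in icon_boxes:
        if all(iou(box, t) < thresh for t in text_boxes):
            kept.append(box)
    return kept
```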
The Evidence
ScreenSpot-Pro benchmark
Tested against 1,581 annotations across 26 professional macOS applications. Metric: center-hit (does any detection's bounding box contain the center of the ground-truth target?).
| Pipeline | Text | Icon | Overall |
|---|---|---|---|
| Apple Vision + YOLO | 92.7% | 87.6% | 90.8% |
| YOLO only | 82.4% | 75.7% | 80.1% |
| Apple Vision only | 66.4% | 42.5% | 57.3% |
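The center-hit metric behind the table above can be sketched in a few lines. This is a generic reimplementation of the stated definition, not uitag's evaluation harness; boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
# Center-hit: a ground-truth element counts as covered if any detection's
# bounding box contains the ground-truth box's center point.

def center_hit(detections, gt_box):
    """True if any detection box contains the center of gt_box."""
    cx = (gt_box[0] + gt_box[2]) / 2
    cy = (gt_box[1] + gt_box[3]) / 2
    return any(d[0] <= cx <= d[2] and d[1] <= cy <= d[3] for d in detections)

def center_hit_rate(detections, gt_boxes):
    """Fraction of ground-truth boxes whose centers are covered."""
    hits = sum(center_hit(detections, g) for g in gt_boxes)
    return hits / len(gt_boxes)
```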
Additional out-of-distribution benchmarks (YOLO only, no Apple Vision):
| Benchmark | Metric | Score |
|---|---|---|
| GroundCUA (500 images, 30K GT elements) | Recall@IoU>=0.5 | 94.0% |
| GroundCUA | Precision@IoU>=0.5 | 83.6% |
| UI-Vision (1,181 images) | Recall@IoU>=0.5 | 83.5% |
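Recall and precision at IoU >= 0.5, as reported above, follow the standard detection-evaluation recipe: match each ground-truth box to at most one prediction whose IoU clears the threshold. The sketch below uses greedy one-to-one matching as an illustrative simplification; benchmark harnesses may match differently.

```python
# Sketch of Recall/Precision @ IoU >= 0.5 with greedy one-to-one matching.
# Each GT box claims its best-IoU unmatched prediction above the threshold.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def recall_precision(preds, gts, thresh=0.5):
    """Return (recall, precision) at the given IoU threshold."""
    matched, tp = set(), 0
    for g in gts:
        best, best_iou = None, thresh
        for i, p in enumerate(preds):
            if i not in matched and iou(p, g) >= best_iou:
                best, best_iou = i, iou(p, g)
        if best is not None:
            matched.add(best)
            tp += 1
    return tp / len(gts), tp / len(preds)
```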
What the YOLO model detects
Nine element classes derived from GroundCUA’s annotation taxonomy:
| Class | Examples |
|---|---|
| Button | Toolbar buttons, dialog buttons, toggles |
| Menu | Menu bars, context menus, dropdowns |
| Input_Elements | Text fields, search boxes, spinners |
| Navigation | Tabs, breadcrumbs, tree nodes |
| Information_Display | Status bars, tooltips, labels |
| Sidebar | Side panels, nav rails |
| Visual_Elements | Icons, thumbnails, separators |
| Others | Scrollbars, handles, dividers |
| Unknown | Ambiguous elements |
Training details
| Parameter | Value |
|---|---|
| Base model | YOLO11s (pretrained) |
| Dataset | GroundCUA tiled (224K train, 25K val tiles) |
| Tile size | 640x640, 20% overlap |
| Epochs | 100 |
| Hardware | 2x H100 PCIe 80GB (DDP) |
| Wall clock | 19.75 hours |
| mAP@0.5 (val) | 0.792 |
| Model size | 18 MB |
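The tiling geometry implied by the table (640x640 tiles, 20% overlap, so a 512 px stride) can be sketched as below. This is a hypothetical helper, not uitag's code; real pipelines typically also clamp or pad tiles that extend past small images.

```python
# Tile a screenshot into 640x640 windows with 20% overlap (stride 512),
# adding a final edge-aligned tile so the image boundary is fully covered.

def tile_origins(length, tile=640, overlap=0.2):
    """1-D tile start positions along an axis of the given length."""
    stride = int(tile * (1 - overlap))
    origins = list(range(0, max(length - tile, 0) + 1, stride))
    if origins[-1] + tile < length:
        origins.append(length - tile)  # snap last tile to the edge
    return origins

def tiles(width, height, tile=640, overlap=0.2):
    """All (x1, y1, x2, y2) tile windows for a width x height image."""
    return [(x, y, x + tile, y + tile)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]
```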
Augmentation choices reflect the domain: UI elements are axis-aligned, so rotation, flipping, and mosaic are disabled. In diagnostic runs, removing geometric augmentation caused overfitting within 6 epochs.
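The augmentation policy above can be expressed as Ultralytics-style training overrides. The key names below are real Ultralytics YOLO train arguments, but treat this as a sketch of the policy rather than uitag's exact training config.

```python
# Augmentation overrides reflecting the policy described above:
# UI elements are axis-aligned, so geometric augmentation is disabled.
AUGMENT_OVERRIDES = {
    "degrees": 0.0,  # no rotation: rotated UIs never occur in practice
    "fliplr": 0.0,   # no horizontal flip: mirrored layouts are unrealistic
    "flipud": 0.0,   # no vertical flip
    "mosaic": 0.0,   # no mosaic: stitched tiles break layout priors
}

# Usage sketch (assumes `ultralytics` is installed and a tiled dataset
# YAML exists; dataset filename is hypothetical):
# from ultralytics import YOLO
# YOLO("yolo11s.pt").train(data="groundcua_tiles.yaml", epochs=100,
#                          imgsz=640, **AUGMENT_OVERRIDES)
```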
On-device, no API dependency
The entire pipeline runs locally on macOS. No API calls, no cloud inference, no data leaving the device.
Source: github.com/swaylenhayes/uitag
Model: huggingface.co/swaylenhayes/uitag-yolo11s-ui-detect-v1
[YOUR VOICE] Implications
MISSING — What this means for the VLM agent ecosystem. Who should care about the detection/intelligence split.
Open Questions
- How does coverage degrade on non-standard UI frameworks (Electron with custom components, game UIs)?
- What’s the minimum training set size for domain-specific YOLO fine-tuning?
- Can the architecture generalize to mobile screenshots (iOS, Android) with a different Vision backend?
Reference Documents
| Document | What it covers |
|---|---|
| uitag README | Full pipeline documentation, installation, benchmarks |
| YOLO model card | Model architecture, training data, performance metrics |
| ScreenSpot-Pro methodology | MISSING — per-application breakdown and evaluation protocol |