
Benchmarking UI Detection on ScreenSpot-Pro


[YOUR VOICE] The Claim

Most UI detection benchmarks test on clean, synthetic screenshots. ScreenSpot-Pro uses real professional applications, and the gap between synthetic performance and real-world performance is where most tools fall apart.


The Mechanism

ScreenSpot-Pro provides 1,581 annotated UI elements across 26 professional macOS applications. The benchmark tests coverage (what percentage of annotated elements the detector finds) rather than precision alone, because for downstream agent tasks, a missed button is worse than a false positive.
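The coverage-versus-precision distinction can be made concrete with a small sketch. The names here (`Counts`, `coverage`, `precision`) are illustrative, not from the benchmark code; coverage is just recall over ground-truth annotations:

```python
# Hypothetical sketch: coverage vs. precision from match counts.
from dataclasses import dataclass

@dataclass
class Counts:
    matched: int      # detections matched to a ground-truth annotation
    detections: int   # total detections emitted
    annotations: int  # total ground-truth annotations

def coverage(c: Counts) -> float:
    """Fraction of annotated elements the detector found (recall)."""
    return c.matched / c.annotations

def precision(c: Counts) -> float:
    """Fraction of detections that hit an annotated element."""
    return c.matched / c.detections
```

Under this framing, a detector that emits extra boxes loses precision but keeps coverage, while a detector that misses a button loses coverage, which is the failure mode that actually breaks an agent task.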

Evaluation protocol

  1. Run detection on each screenshot
  2. Match detected elements to ground-truth annotations using IoU threshold
  3. Calculate per-application and aggregate coverage rates
  4. Separate text elements from icon elements (different detection mechanisms)
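Steps 2 and 3 above can be sketched as a greedy IoU match. This is an assumed implementation, not the benchmark's actual matching code; the threshold default of 0.5 is a placeholder:

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_coverage(detections, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; returns the fraction of
    ground-truth boxes matched by some detection at IoU >= threshold."""
    matched, used = 0, set()
    for gt in ground_truth:
        best, best_iou = None, threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = iou(det, gt)
            if score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            used.add(best)
            matched += 1
    return matched / len(ground_truth) if ground_truth else 1.0
```

Greedy matching is a simplification (optimal assignment would use Hungarian matching), but it makes the coverage definition unambiguous: each ground-truth box counts at most once.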

The Evidence

Per-category results

| Category | Vision-only | Vision + YOLO | Delta |
| --- | --- | --- | --- |
| Text elements | 68.1% | 72.4% | +4.3pp |
| Icon elements | 42.5% | 87.6% | +45.1pp |
| Overall | 57.3% | 90.8% | +33.5pp |

MISSING: Per-application breakdown (which apps are hardest, which are easiest)

What the numbers mean

The 33.5 percentage point improvement from adding YOLO comes almost entirely from icons. Vision already handles text well. The architecture capitalizes on this asymmetry rather than throwing a general-purpose VLM at both problems.
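The routing implied by that asymmetry can be sketched as follows. Here `ocr_detect` and `yolo_detect` are hypothetical stand-ins for the text detector (Vision) and the icon model (YOLO); the overlap-based dedup is an assumption about how the two streams are merged, not the project's documented behavior:

```python
def box_iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detect_elements(screenshot, ocr_detect, yolo_detect, dedup_iou=0.5):
    """Route text to OCR and icons to YOLO, then merge the two streams."""
    text_boxes = ocr_detect(screenshot)   # Vision already handles text well
    icon_boxes = yolo_detect(screenshot)  # YOLO closes the icon gap
    merged = list(text_boxes)
    # Drop icon boxes that substantially overlap a text box.
    for icon in icon_boxes:
        if all(box_iou(icon, t) < dedup_iou for t in text_boxes):
            merged.append(icon)
    return merged
```

The design point is that each detector only sees the problem it is good at; the merge step exists because icon models sometimes re-detect text labels as icons.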


[YOUR VOICE] Implications

MISSING: What this benchmark approach reveals about how UI detection should be evaluated. Why coverage matters more than precision for agent tasks.


Open Questions

  • How should benchmarks weight different element types (is a missed close button worse than a missed label)?
  • What's the right IoU threshold for agent-relevant detection?
  • How does performance vary across screen resolutions and display scaling?

Reference Documents

| Document | What it covers |
| --- | --- |
| uitag README | Benchmark results and methodology |
| ScreenSpot-Pro dataset | MISSING: dataset construction methodology |
| Per-application breakdown | MISSING: detailed per-app results |