
Benchmarking UI Detection on ScreenSpot-Pro


[YOUR VOICE] The Claim

Most UI detection benchmarks test on clean, synthetic screenshots. ScreenSpot-Pro uses real professional applications, and the gap between synthetic performance and real-world performance is where most tools fall apart.


The Mechanism

ScreenSpot-Pro provides 1,581 annotated UI elements across 26 professional macOS applications. The benchmark tests coverage (what percentage of annotated elements the detector finds) rather than precision alone, because for downstream agent tasks, a missed button is worse than a false positive.
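The coverage-versus-precision distinction can be made concrete with a small sketch. The names here (`Counts`, `coverage`, `precision`) are illustrative, not from the benchmark code; coverage is just recall over ground-truth annotations:

```python
# Hypothetical sketch: coverage vs. precision from match counts.
from dataclasses import dataclass

@dataclass
class Counts:
    matched: int      # detections matched to a ground-truth annotation
    detections: int   # total detections emitted
    annotations: int  # total ground-truth annotations

def coverage(c: Counts) -> float:
    """Fraction of annotated elements the detector found (recall)."""
    return c.matched / c.annotations

def precision(c: Counts) -> float:
    """Fraction of detections that hit an annotated element."""
    return c.matched / c.detections
```

Under this framing, a detector that emits extra boxes loses precision but keeps coverage, while a detector that misses a button loses coverage, which is the failure mode that actually breaks an agent task.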

Evaluation protocol

  1. Run detection on each screenshot
  2. Match detected elements to ground-truth annotations using IoU threshold
  3. Calculate per-application and aggregate coverage rates
  4. Separate text elements from icon elements (different detection mechanisms)
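Steps 2 and 3 above can be sketched as a greedy IoU match. This is an assumed implementation, not the benchmark's actual matching code; the threshold default of 0.5 is a placeholder:

```python
def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_coverage(detections, ground_truth, threshold=0.5):
    """Greedy one-to-one matching; returns the fraction of
    ground-truth boxes matched by some detection at IoU >= threshold."""
    matched, used = 0, set()
    for gt in ground_truth:
        best, best_iou = None, threshold
        for i, det in enumerate(detections):
            if i in used:
                continue
            score = iou(det, gt)
            if score >= best_iou:
                best, best_iou = i, score
        if best is not None:
            used.add(best)
            matched += 1
    return matched / len(ground_truth) if ground_truth else 1.0
```

Greedy matching is a simplification (optimal assignment would use Hungarian matching), but it makes the coverage definition unambiguous: each ground-truth box counts at most once.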

The Evidence

Per-category results

| Category | Vision-only | Vision + YOLO | Delta |
| --- | --- | --- | --- |
| Text elements | 68.1% | 72.4% | +4.3pp |
| Icon elements | 42.5% | 87.6% | +45.1pp |
| Overall | 57.3% | 90.8% | +33.5pp |

MISSING: Per-application breakdown (which apps are hardest, which are easiest)

What the numbers mean

The 33.5 percentage point improvement from adding YOLO comes almost entirely from icons. Vision already handles text well. The architecture capitalizes on this asymmetry rather than throwing a general-purpose VLM at both problems.
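The routing implied by that asymmetry can be sketched as follows. Here `ocr_detect` and `yolo_detect` are hypothetical stand-ins for the text detector (Vision) and the icon model (YOLO); the overlap-based dedup is an assumption about how the two streams are merged, not the project's documented behavior:

```python
def box_iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def detect_elements(screenshot, ocr_detect, yolo_detect, dedup_iou=0.5):
    """Route text to OCR and icons to YOLO, then merge the two streams."""
    text_boxes = ocr_detect(screenshot)   # Vision already handles text well
    icon_boxes = yolo_detect(screenshot)  # YOLO closes the icon gap
    merged = list(text_boxes)
    # Drop icon boxes that substantially overlap a text box.
    for icon in icon_boxes:
        if all(box_iou(icon, t) < dedup_iou for t in text_boxes):
            merged.append(icon)
    return merged
```

The design point is that each detector only sees the problem it is good at; the merge step exists because icon models sometimes re-detect text labels as icons.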


[YOUR VOICE] Implications

MISSING: What this benchmark approach reveals about how UI detection should be evaluated. Why coverage matters more than precision for agent tasks.


Open Questions

  • How should benchmarks weight different element types (is a missed close button worse than a missed label)?
  • What's the right IoU threshold for agent-relevant detection?
  • How does performance vary across screen resolutions and display scaling?

Reference Documents

| Document | What it covers |
| --- | --- |
| uitag README | Benchmark results and methodology |
| ScreenSpot-Pro dataset | MISSING: dataset construction methodology |
| Per-application breakdown | MISSING: detailed per-app results |