[YOUR VOICE] The Claim
Most UI detection benchmarks test on clean, synthetic screenshots. ScreenSpot-Pro uses real professional applications, and the gap between synthetic and real-world performance is where most tools fall apart.
The Mechanism
ScreenSpot-Pro provides 1,581 annotated UI elements across 26 professional macOS applications. The benchmark tests coverage (what percentage of annotated elements the detector finds) rather than precision alone, because for downstream agent tasks, a missed button is worse than a false positive.
Evaluation protocol
- Run detection on each screenshot
- Match detected elements to ground-truth annotations using IoU threshold
- Calculate per-application and aggregate coverage rates
- Separate text elements from icon elements (different detection mechanisms)
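The matching-and-coverage steps above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's published scoring code: the greedy one-to-one matching strategy and the 0.5 IoU threshold are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def coverage(detections, ground_truth, threshold=0.5):
    """Fraction of ground-truth boxes matched by some detection.

    Greedy one-to-one matching; each detection can satisfy at most
    one annotation. threshold=0.5 is an assumed value, not the
    benchmark's documented setting.
    """
    unmatched = list(detections)
    hits = 0
    for gt in ground_truth:
        best = max(unmatched, key=lambda d: iou(d, gt), default=None)
        if best is not None and iou(best, gt) >= threshold:
            hits += 1
            unmatched.remove(best)
    return hits / len(ground_truth) if ground_truth else 1.0
```

Note that coverage, unlike precision, never penalizes extra detections: a detector that fires on everything scores perfectly here, which is exactly the trade-off the benchmark accepts for agent tasks.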
The Evidence
Per-category results
| Category | Vision-only | Vision + YOLO | Delta |
|---|---|---|---|
| Text elements | 68.1% | 72.4% | +4.3pp |
| Icon elements | 42.5% | 87.6% | +45.1pp |
| Overall | 57.3% | 90.8% | +33.5pp |
MISSING: Per-application breakdown (which apps are hardest, which are easiest)
What the numbers mean
The 33.5 percentage point improvement from adding YOLO comes almost entirely from icons. Vision already handles text well. The architecture capitalizes on this asymmetry rather than throwing a general-purpose VLM at both problems.
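The routing described above can be sketched as a merge of two specialist detectors. All names here are hypothetical stand-ins for the actual detector interfaces; the 0.5 overlap threshold for de-duplication is also an assumption.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def detect_elements(screenshot, vision_text_detector, yolo_icon_detector):
    """Route each detector at the element type it handles well, then merge.

    Both detector arguments are hypothetical callables returning
    (box, label) tuples; this is a sketch of the asymmetric routing
    idea, not the project's actual API.
    """
    text_elems = vision_text_detector(screenshot)  # already strong on text
    icon_elems = yolo_icon_detector(screenshot)    # closes the icon gap
    merged = list(text_elems)
    for box, label in icon_elems:
        # Drop icon detections that mostly overlap a text detection,
        # so each screen region yields one element.
        if all(iou(box, t_box) < 0.5 for t_box, _ in text_elems):
            merged.append((box, label))
    return merged
```

The design point is that the merge is cheap: because each detector only covers its own element class, combining them is a de-duplication pass rather than a learned fusion step.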
[YOUR VOICE] Implications
MISSING: What this benchmark approach reveals about how UI detection should be evaluated. Why coverage matters more than precision for agent tasks.
Open Questions
- How should benchmarks weight different element types (is a missed close button worse than a missed label)?
- What's the right IoU threshold for agent-relevant detection?
- How does performance vary across screen resolutions and display scaling?
Reference Documents
| Document | What it covers |
|---|---|
| uitag README | Benchmark results and methodology |
| ScreenSpot-Pro dataset | MISSING: dataset construction methodology |
| Per-application breakdown | MISSING: detailed per-app results |