Why Detection and Intelligence Should Be Separate Layers

[YOUR VOICE] The Claim

Coupling detection and intelligence into a single VLM call is the default in most UI agent frameworks. It is also a major reason UI agents tend to be slow, expensive, and unreliable. The two layers have different failure modes and should be engineered separately.

The Mechanism

Detection asks: what elements exist on this screen? Intelligence asks: which element should I interact with, and how?

These are different problems with different error profiles:

Detection errors are spatial — missed elements, wrong bounding boxes, overlapping regions
Intelligence errors are semantic — clicking the wrong button, misunderstanding context, hallucinating elements

When both run in a single VLM call, there is no intermediate output to inspect, so you can’t diagnose which layer failed. A missed click could be a detection miss or a reasoning error. Separating the layers makes each independently testable and improvable.
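To make the diagnosis point concrete, here is a minimal sketch of how a failed click could be attributed to one layer once the detector's output is available on its own. The `Box` type, the `attribute_failure` function, and the 0.5 IoU threshold are all assumptions made for illustration; none of them come from uitag or Leith.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical type)."""
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def attribute_failure(
    target: Box,
    detected: Sequence[Box],
    chosen_index: Optional[int],
    iou_threshold: float = 0.5,  # assumed threshold, not a project setting
) -> str:
    """Classify one step as 'ok', 'detection_miss', or 'reasoning_error'."""
    # Indices of detected elements that overlap the ground-truth target enough.
    matches = [i for i, box in enumerate(detected) if iou(box, target) >= iou_threshold]
    if not matches:
        # The target never reached the intelligence layer: a spatial failure.
        return "detection_miss"
    if chosen_index not in matches:
        # The target was available but something else (or nothing) was chosen.
        return "reasoning_error"
    return "ok"
```

With a single VLM call there is no `detected` list to inspect, so the first branch cannot be told apart from the second.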

uitag handles detection. Leith handles intelligence. The interface between them is a structured JSON manifest of bounding boxes, labels, and coordinates. Leith never sees raw pixels; it reasons only over this structured data.
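As an illustration of what reasoning over a manifest could look like, the sketch below parses a small manifest, renders it as text for a language model, and maps a chosen element back to a click point. The schema (`screen`, `elements`, `id`, `label`, `bbox`) and both function names are hypothetical; the actual uitag manifest format and the Leith interface are not shown in this document.

```python
import json

# Hypothetical manifest; field names are assumptions, not the uitag schema.
MANIFEST_JSON = """
{
  "screen": {"width": 1440, "height": 900},
  "elements": [
    {"id": 0, "label": "Save", "bbox": [1200, 40, 1280, 72]},
    {"id": 1, "label": "Cancel", "bbox": [1290, 40, 1380, 72]}
  ]
}
"""


def manifest_to_prompt(manifest: dict) -> str:
    """Render each element as one text line the reasoning model can read."""
    lines = []
    for el in manifest["elements"]:
        x1, y1, x2, y2 = el["bbox"]
        lines.append(f'[{el["id"]}] "{el["label"]}" at ({x1}, {y1}, {x2}, {y2})')
    return "\n".join(lines)


def click_point(manifest: dict, element_id: int) -> tuple[int, int]:
    """Return the center of the chosen element's bounding box."""
    for el in manifest["elements"]:
        if el["id"] == element_id:
            x1, y1, x2, y2 = el["bbox"]
            return ((x1 + x2) // 2, (y1 + y2) // 2)
    # An id that is not in the manifest is rejected instead of clicked.
    raise KeyError(f"element {element_id} is not in the manifest")


manifest = json.loads(MANIFEST_JSON)
print(manifest_to_prompt(manifest))
print(click_point(manifest, 0))  # (1240, 56)
```

One consequence of this boundary, under these assumptions: if the reasoning layer names an element that the detector never produced, the lookup fails loudly rather than producing a click on empty space.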

The Evidence

Detection layer (uitag)

90.8% element coverage on ScreenSpot-Pro. Sub-5-second processing. On-device. The detection layer is fast, deterministic, and benchmarkable.

Source: uitag README
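For readers who want to reproduce a coverage-style number, here is a minimal sketch of one way such a metric could be computed. The criterion used (a target counts as covered if the center of any detected box falls inside the ground-truth target box) is an assumption chosen for illustration; the README's exact definition is not reproduced here and may differ.

```python
from typing import Sequence, Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def center(box: BoxT) -> Tuple[float, float]:
    """Center point of a box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def contains(box: BoxT, point: Tuple[float, float]) -> bool:
    """True if the point lies inside the box (edges included)."""
    x1, y1, x2, y2 = box
    px, py = point
    return x1 <= px <= x2 and y1 <= py <= y2


def element_coverage(samples: Sequence[Tuple[BoxT, Sequence[BoxT]]]) -> float:
    """Fraction of samples whose target box is hit by some detected box center.

    Each sample is (ground_truth_target_box, detected_boxes_for_that_screenshot).
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for target, detected in samples
        if any(contains(target, center(d)) for d in detected)
    )
    return hits / len(samples)
```

Because this check needs only the detector's boxes and the ground-truth targets, it can be run without involving the intelligence layer at all.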

Intelligence layer (Leith)

MISSING — Leith performance data on structured input vs. raw screenshot input. CoT suppression results. Multi-signal verification accuracy.

The cost argument

MISSING — Comparative cost analysis: single VLM call (detection + intelligence) vs. split architecture (Vision + YOLO + LLM reasoning on structured data).

[YOUR VOICE] Implications

MISSING — Why this matters beyond this specific project. The broader argument for perceptual separation in agent architectures.

Open Questions

At what complexity threshold does the split architecture lose its advantage?
Can the structured manifest format become a standard interchange format for UI agents?
What’s the right abstraction boundary between detection and intelligence?

Reference Documents

Document	What it covers
uitag	Detection layer implementation
Leith _docs/	MISSING — Intelligence layer architecture and decisions
Architecture decision record	MISSING — Why the split was chosen

Benchmarking UI Detection on ScreenSpot-Pro

GUI-Specialized Apple Silicon VLM Matrix