Multi-Signal Verification for VLM UI Agents

[YOUR VOICE] The Claim

Single-signal UI detection is inherently fragile. Multi-signal verification, in which multiple independent detection methods must agree on an element's identity and location, is the mechanism that makes UI agents reliable enough for real workflows.


The Mechanism

MISSING - How multi-signal verification works in Leith: which signals are compared, how disagreements are resolved, and the tiered fallback chain
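Until that section is filled in, here is a minimal sketch of what a tiered fallback chain could look like. The `locate_element` function, the detector ordering, and the 0.8 confidence threshold are all illustrative assumptions, not Leith's actual implementation.

```python
# Hypothetical sketch of a tiered fallback chain. Detector names and the
# locate_element API are illustrative, not taken from the Leith codebase.

def locate_element(query, detectors, min_confidence=0.8):
    """Try detectors in priority order; return the first confident hit.

    `detectors` is a list of (name, fn) pairs, where fn(query) returns
    either None or a ((x, y, w, h), confidence) tuple.
    """
    for name, detect in detectors:
        result = detect(query)
        if result is None:
            continue  # this tier found nothing; fall through to the next
        box, confidence = result
        if confidence >= min_confidence:
            return name, box
    # All tiers exhausted: surface the failure rather than guess a location.
    return None
```

The key design choice a chain like this encodes is that failure is explicit: when every tier misses or is under-confident, the agent reports that it could not locate the element instead of clicking a low-confidence guess.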

MISSING - The signal types: Apple Vision text, YOLO icon detection, VLM spatial reasoning, OCR confirmation, and coordinate consistency across frames
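A cross-signal agreement check over detectors like these might be sketched as follows. The `Detection` type, the `verify` function, and the 0.5 IoU threshold are hypothetical stand-ins, not Leith's actual API.

```python
from dataclasses import dataclass

# Hypothetical sketch: a detection is confirmed only when multiple
# independent sources agree on both label and location.

@dataclass
class Detection:
    source: str        # e.g. "apple_vision", "yolo", "vlm"
    label: str         # element identity, e.g. "Submit"
    box: tuple         # (x, y, w, h) in screen pixels
    confidence: float

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def verify(detections, min_agreeing=2, iou_threshold=0.5):
    """Return a detection confirmed by >= min_agreeing independent
    sources (same label, overlapping boxes), or None otherwise."""
    for d in detections:
        agreeing = {d.source}
        for other in detections:
            if (other.source != d.source
                    and other.label == d.label
                    and iou(d.box, other.box) >= iou_threshold):
                agreeing.add(other.source)
        if len(agreeing) >= min_agreeing:
            return d
    return None
```

Because agreement requires distinct sources, a single noisy detector cannot confirm itself, which is the property that makes the check more than a confidence filter.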


The Evidence

MISSING - Accuracy comparison: single-signal vs. multi-signal on real UI tasks

MISSING - Failure case analysis: what gets caught by multi-signal that single-signal misses


[YOUR VOICE] Implications

MISSING - The engineering cost of multi-signal verification vs. the reliability gain. When is it worth the latency?


Open Questions

  • How many independent signals are needed before verification reaches diminishing returns?
  • Can multi-signal verification be made fast enough for real-time UI interaction?
  • What happens when all signals agree but are all wrong (correlated failure)?

Reference Documents

Document                    What it covers
Leith _docs/                MISSING - Multi-signal implementation
uitag detection pipeline    Source of two independent detection signals