[YOUR VOICE] The Claim
Confidence scores from VLMs are unreliable for UI interaction tasks. A model that says it's 95% confident about a click target is wrong often enough to be dangerous. Trust needs to be calibrated from behavioral signals: consistency across attempts, agreement between detection methods, and success/failure history.
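The three behavioral signals above can be combined into a single trust score. A minimal sketch, assuming a Beta-style prior over success history and treating consistency and cross-method agreement as multiplicative gates (the class and parameter names are illustrative, not Leith's actual API):

```python
from dataclasses import dataclass

@dataclass
class BehavioralTrust:
    """Trust estimate built from observed behavior, not model confidence."""
    successes: int = 0
    failures: int = 0

    def record(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def score(self, consistency: float, agreement: float) -> float:
        # Beta(1, 1) prior: with no interaction history, trust sits at 0.5
        # instead of inheriting the model's self-reported confidence.
        history = (self.successes + 1) / (self.successes + self.failures + 2)
        # Behavioral signals gate the historical rate: a target that moves
        # between attempts or splits the detectors loses trust.
        return history * consistency * agreement

t = BehavioralTrust()
for outcome in [True, True, False, True]:
    t.record(outcome)
print(t.score(consistency=0.9, agreement=1.0))
```

The point of the design is that the score starts neutral and is moved only by evidence; a VLM's stated confidence never enters the calculation.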
The Mechanism
MISSING: How Leith's trust calibration works: multi-signal verification, behavioral consistency checks, episodic memory for tracking past success rates per element type
MISSING: Why confidence scores fail: specific examples of high-confidence misclicks and low-confidence correct actions
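One piece of the mechanism named above, episodic memory keyed by element type, can be sketched independently of the rest. This is a hypothetical illustration of the idea (the class and method names are assumptions, not Leith's implementation), using Laplace smoothing so unseen element types default to a neutral 0.5:

```python
from collections import defaultdict

class EpisodicMemory:
    """Tracks per-element-type interaction outcomes so trust can be
    conditioned on what kind of widget is being clicked."""

    def __init__(self) -> None:
        # element type -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, element_type: str, success: bool) -> None:
        stats = self._stats[element_type]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, element_type: str) -> float:
        successes, attempts = self._stats[element_type]
        # Laplace smoothing: no history yields 0.5, not 0.0 or 1.0.
        return (successes + 1) / (attempts + 2)

mem = EpisodicMemory()
mem.record("button", True)
mem.record("button", True)
mem.record("dropdown", False)
print(mem.success_rate("button"))
print(mem.success_rate("dropdown"))
```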
The Evidence
MISSING: Comparative data: confidence-based trust vs. behavioral trust calibration on UI task accuracy
[YOUR VOICE] Implications
MISSING: Broader lesson for any system that needs to know when to trust LLM output.
Open Questions
- Can behavioral trust signals transfer across applications (does learning to trust in Safari help in Figma)?
- What's the minimum interaction history needed for reliable calibration?
- How does trust decay over time or across UI updates?
Reference Documents
| Document | What it covers |
|---|---|
| Leith _docs/ | MISSING: Trust calibration implementation |