Trust Calibration Without Confidence Scores

[YOUR VOICE] The Claim

Confidence scores from VLMs are unreliable for UI interaction tasks. A model that says it's 95% confident about a click target is wrong often enough to be dangerous. Trust needs to be calibrated from behavioral signals — consistency across attempts, agreement between detection methods, and success/failure history.
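As a concrete illustration, here is a minimal Python sketch of blending the three behavioral signals named above into a trust score. The class, field names, and equal weighting are assumptions for the sake of the example, not Leith's implementation:

```python
from dataclasses import dataclass

@dataclass
class ElementTrust:
    """Hypothetical per-element trust tracker. Trust is derived from
    observed behavior, never from the model's self-reported confidence."""
    successes: int = 0          # past attempts that worked
    failures: int = 0           # past attempts that did not
    consistency: float = 0.0    # fraction of repeated attempts agreeing on the same target
    method_agreement: float = 0.0  # fraction of detection methods agreeing on the element

    def history_rate(self) -> float:
        # Laplace-smoothed success rate: a fresh element starts at 0.5
        # instead of an undefined 0/0.
        return (self.successes + 1) / (self.successes + self.failures + 2)

    def trust(self) -> float:
        # Equal-weight blend of the three signals; the weights are an
        # illustrative assumption, not a tuned value.
        return (self.history_rate() + self.consistency + self.method_agreement) / 3
```

An agent would gate risky actions on `trust()` crossing a threshold, rather than on whatever percentage the VLM emits.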


The Mechanism

MISSING — How Leith's trust calibration works: multi-signal verification, behavioral consistency checks, episodic memory for tracking past success rates per element type
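The episodic-memory piece can be sketched as a per-element-type success ledger with smoothing; the class and method names here are hypothetical, not Leith's actual API:

```python
from collections import defaultdict

class EpisodicMemory:
    """Hypothetical per-element-type success/failure ledger."""

    def __init__(self) -> None:
        # element type (e.g. "button", "canvas") -> [successes, failures]
        self._log = defaultdict(lambda: [0, 0])

    def record(self, element_type: str, success: bool) -> None:
        self._log[element_type][0 if success else 1] += 1

    def success_rate(self, element_type: str) -> float:
        # Laplace smoothing: an unseen element type reports 0.5,
        # neither trusted nor distrusted until evidence accumulates.
        s, f = self._log[element_type]
        return (s + 1) / (s + f + 2)
```

Over time, element types that reliably succeed (standard buttons, say) diverge from those that often misfire (custom canvases), giving the agent a behavioral prior per element type.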

MISSING — Why confidence scores fail: specific examples of high-confidence misclicks and low-confidence correct actions


The Evidence

MISSING — Comparative data: confidence-based trust vs. behavioral trust calibration on UI task accuracy


[YOUR VOICE] Implications

MISSING — Broader lesson for any system that needs to know when to trust LLM output.


Open Questions

  • Can behavioral trust signals transfer across applications (does learning to trust in Safari help in Figma)?
  • What's the minimum interaction history needed for reliable calibration?
  • How does trust decay over time or across UI updates?

Reference Documents

Document        What it covers
Leith _docs/    MISSING — Trust calibration implementation