[YOUR VOICE] The Claim
Confidence scores from VLMs are unreliable for UI interaction tasks. A model that says it's 95% confident about a click target is wrong often enough to be dangerous. Trust needs to be calibrated from behavioral signals: consistency across attempts, agreement between detection methods, and success/failure history.
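The three behavioral signals above can be combined into a single trust score. A minimal sketch, assuming a Beta-style prior over success history and treating consistency and cross-method agreement as multiplicative gates (the class and parameter names are illustrative, not Leith's actual API):

```python
from dataclasses import dataclass

@dataclass
class BehavioralTrust:
    """Trust estimate built from observed behavior, not model confidence."""
    successes: int = 0
    failures: int = 0

    def record(self, success: bool) -> None:
        if success:
            self.successes += 1
        else:
            self.failures += 1

    def score(self, consistency: float, agreement: float) -> float:
        # Beta(1, 1) prior: with no interaction history, trust sits at 0.5
        # instead of inheriting the model's self-reported confidence.
        history = (self.successes + 1) / (self.successes + self.failures + 2)
        # Behavioral signals gate the historical rate: a target that moves
        # between attempts or splits the detectors loses trust.
        return history * consistency * agreement

t = BehavioralTrust()
for outcome in [True, True, False, True]:
    t.record(outcome)
print(t.score(consistency=0.9, agreement=1.0))
```

The point of the design is that the score starts neutral and is moved only by evidence; a VLM's stated confidence never enters the calculation.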
The Mechanism
MISSING: How Leith's trust calibration works: multi-signal verification, behavioral consistency checks, episodic memory for tracking past success rates per element type
MISSING: Why confidence scores fail: specific examples of high-confidence misclicks and low-confidence correct actions
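One piece of the mechanism named above, episodic memory keyed by element type, can be sketched independently of the rest. This is a hypothetical illustration of the idea (the class and method names are assumptions, not Leith's implementation), using Laplace smoothing so unseen element types default to a neutral 0.5:

```python
from collections import defaultdict

class EpisodicMemory:
    """Tracks per-element-type interaction outcomes so trust can be
    conditioned on what kind of widget is being clicked."""

    def __init__(self) -> None:
        # element type -> [successes, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, element_type: str, success: bool) -> None:
        stats = self._stats[element_type]
        stats[0] += int(success)
        stats[1] += 1

    def success_rate(self, element_type: str) -> float:
        successes, attempts = self._stats[element_type]
        # Laplace smoothing: no history yields 0.5, not 0.0 or 1.0.
        return (successes + 1) / (attempts + 2)

mem = EpisodicMemory()
mem.record("button", True)
mem.record("button", True)
mem.record("dropdown", False)
print(mem.success_rate("button"))
print(mem.success_rate("dropdown"))
```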
The Evidence
MISSING: Comparative data: confidence-based trust vs. behavioral trust calibration on UI task accuracy
[YOUR VOICE] Implications
MISSING: Broader lesson for any system that needs to know when to trust LLM output.
Open Questions
- Can behavioral trust signals transfer across applications (does learning to trust in Safari help in Figma)?
- What's the minimum interaction history needed for reliable calibration?
- How does trust decay over time or across UI updates?
Reference Documents
| Document | What it covers |
|---|---|
| Leith _docs/ | MISSING: Trust calibration implementation |