[YOUR VOICE] The Claim
Chain-of-thought prompting is the default recommendation for complex LLM tasks. But spatial UI tasks (clicking a specific button, reading a specific label, enumerating visible elements) degrade when the model is asked to reason step by step: the reasoning itself introduces spatial hallucinations.
The Mechanism
MISSING: Experimental setup — same UI tasks with and without CoT prompting across multiple VLMs
MISSING: Specific failure patterns observed (coordinate drift during reasoning, element hallucination in enumeration, spatial confusion in multi-step CoT)
MISSING: The suppression technique used in Leith and its effect on accuracy
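While the Leith implementation details remain to be filled in above, the general shape of CoT suppression for spatial tasks can be sketched. The prompt wording, JSON coordinate schema, and `build_prompt` helper below are all hypothetical illustrations, not Leith's actual prompts:

```python
# Hypothetical sketch of CoT suppression for a spatial UI task.
# The exact wording and output schema are illustrative assumptions,
# not the prompts used in Leith.

def build_prompt(task: str, suppress_cot: bool) -> str:
    """Build a VLM prompt for a UI click task.

    With suppress_cot=True, the model is asked for a bare structured
    answer and explicitly told not to reason or describe the screen;
    with False, it is invited to think step by step first.
    """
    if suppress_cot:
        return (
            f"{task}\n"
            'Respond with ONLY a JSON object {"x": int, "y": int}. '
            "Do not explain your answer or describe the screen."
        )
    return (
        f"{task}\n"
        "Think step by step about the screen layout, then give the "
        'target coordinates as JSON {"x": int, "y": int}.'
    )


direct = build_prompt("Click the 'Save' button in the screenshot.", suppress_cot=True)
cot = build_prompt("Click the 'Save' button in the screenshot.", suppress_cot=False)
assert "ONLY" in direct and "step by step" in cot
```

The key design point is that suppression is not just omitting "think step by step" — it actively forbids free-form text, so the model has no channel in which to drift away from the pixels it was shown.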
The Evidence
MISSING: Comparative accuracy table — CoT-enabled vs CoT-suppressed across task types
MISSING: Example failure cases showing spatial hallucination during CoT
[YOUR VOICE] Implications
MISSING: When to use CoT and when to suppress it. The broader lesson about prompt engineering for spatial tasks.
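Assuming the eventual conclusion follows the claim in the opening section, the decision rule can be sketched as a simple dispatch: suppress CoT for perceptual/spatial operations, keep it for multi-step planning. The task-type names below are hypothetical placeholders:

```python
# Hypothetical heuristic encoding the article's claim: CoT degrades
# spatial/perceptual UI tasks but remains useful for planning.
# Task-type labels are illustrative, not from a real taxonomy.
SPATIAL_TASK_TYPES = {"click", "read_label", "enumerate_elements"}


def use_cot(task_type: str) -> bool:
    """Return True if step-by-step reasoning should be requested."""
    return task_type not in SPATIAL_TASK_TYPES


assert not use_cot("click")          # spatial: suppress CoT
assert use_cot("plan_workflow")      # planning: keep CoT
```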
Open Questions
- Is this a VLM architecture limitation or a training data gap?
- Do models fine-tuned on spatial tasks still exhibit this problem?
- What's the minimum reasoning the model needs to complete multi-step UI tasks without CoT?
Reference Documents
| Document | What it covers |
|---|---|
| Leith _docs/ | MISSING: CoT suppression implementation and results |
| Prompt engineering experiments | MISSING: Full experimental protocol |