Better Eyes, Better Thoughts: Why Vision Chain-of-Thought Fails in Medicine

2026-03-02

Yuan Wu, Zongxian Yang, Jiayu Qian, Songpan Gao, Guanxing Chen, Qiankun Li, Yu-An Huang, Zhi-An Huang

Code

github.com/tianyin123/better_eyes_better_thoughts
Official (linked in paper)

Abstract

Large vision-language models (VLMs) often benefit from chain-of-thought (CoT) prompting in general domains, yet its efficacy in medical vision-language tasks remains underexplored. We report a counter-intuitive trend: on medical visual question answering, CoT frequently underperforms direct answering (DirA) across general-purpose and medical-specific models. We attribute this to a medical perception bottleneck: subtle, domain-specific cues can weaken visual grounding, and CoT may compound early perceptual uncertainty rather than correct it. To probe this hypothesis, we introduce two training-free, inference-time grounding interventions: (i) perception anchoring via region-of-interest cues and (ii) description grounding via high-quality textual guidance. Across multiple benchmarks and model families, these interventions improve accuracy, mitigate CoT degradation, and in several settings reverse the CoT-DirA inversion. Our findings suggest that reliable clinical VLMs require robust visual grounding and cross-modal alignment, beyond extending text-driven reasoning chains. Code is available at github.com/tianyin123/better_eyes_better_thoughts.
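
As a rough illustration of the setup the abstract describes, the sketch below assembles a DirA or CoT prompt and optionally prepends the two kinds of grounding cue (a region of interest and a textual description). It is not taken from the paper or its repository: the template wording, the `GroundingCues` and `build_prompt` names, and the pixel-box format for the region are all assumptions made for this example, and the authors' actual prompts and ROI encoding may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical prompt suffixes; the paper's real templates are not given here.
DIRECT_SUFFIX = "Answer with the final answer only."
COT_SUFFIX = "Think step by step, then give the final answer."


@dataclass
class GroundingCues:
    """Optional inference-time cues (assumed representation)."""
    # Region of interest as (x_min, y_min, x_max, y_max) in pixels; assumed format.
    roi: Optional[Tuple[int, int, int, int]] = None
    # Free-text description of the image; assumed to come from an external source.
    description: Optional[str] = None


def build_prompt(question: str, mode: str,
                 cues: Optional[GroundingCues] = None) -> str:
    """Build the text part of a VQA prompt for 'direct' (DirA) or 'cot' mode."""
    if mode not in ("direct", "cot"):
        raise ValueError(f"unknown mode: {mode!r}")

    parts = []
    if cues is not None and cues.roi is not None:
        x0, y0, x1, y1 = cues.roi
        # Perception anchoring: point the model at a region of interest.
        parts.append(
            f"Focus on the image region with corners ({x0}, {y0}) and ({x1}, {y1})."
        )
    if cues is not None and cues.description:
        # Description grounding: supply textual guidance about the image.
        parts.append(f"Image description: {cues.description}")

    parts.append(f"Question: {question}")
    parts.append(COT_SUFFIX if mode == "cot" else DIRECT_SUFFIX)
    return "\n".join(parts)
```

Under these assumptions, comparing `build_prompt(q, "direct")` against `build_prompt(q, "cot")`, with and without a `GroundingCues` argument, yields the four text prompts that a DirA-versus-CoT comparison with and without grounding would send to a model alongside the image. Because the cues are added only to the prompt, no model weights change, which matches the training-free, inference-time framing in the abstract.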
