ReDiT: Re‑evaluating large visual question answering model confidence by defining input scenario Difficulty and applying Temperature mapping
Modafar Al-Shouha, Gábor Szűcs
Code: github.com/modafarshouha/ReDiT (PyTorch)
Abstract
Large models (LMs) have achieved remarkable results in vision-language tasks. Such models are trained on vast amounts of data and then fine-tuned for downstream tasks like visual question answering (VQA). This wide exposure to data, along with the complexity of a multi-modal setup such as VQA, demands an extended definition of what constitutes an out-of-distribution (OOD) condition for these models. Moreover, the input difficulty is expected to influence the model's performance, and this should be reflected in its confidence score. In this work, we primarily address large visual question answering (LVQA) models. We extend the classical boundaries of the OOD definition and introduce 3U-VQA, a novel customizable dataset that simulates various challenges for LVQA models. Moreover, we present a categorical scale for assessing the difficulty of an input scenario. This scale is used to improve the reliability of the answer confidence score by re-evaluating it through an adjusted temperature parameter in the softmax function. Lastly, we study the credibility of our categorization and show that our re-evaluation method helps reduce the overlap between the confidence scores of correct and incorrect LVQA model predictions.
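The core idea of difficulty-based temperature mapping can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DIFFICULTY_TO_T` mapping and its values are hypothetical, assuming only that harder input scenarios are assigned higher temperatures, which flatten the softmax and lower the reported confidence.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T > 1 softens the distribution, T < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical mapping from categorical difficulty to temperature;
# the paper's actual scale and values are not reproduced here.
DIFFICULTY_TO_T = {"easy": 1.0, "medium": 1.5, "hard": 2.5}

def reevaluate_confidence(logits, difficulty):
    """Return the top answer's confidence after difficulty-dependent temperature scaling."""
    t = DIFFICULTY_TO_T[difficulty]
    probs = softmax(logits, temperature=t)
    return float(probs.max())

# Example: the same answer logits yield a lower confidence
# when the input scenario is categorized as harder.
logits = [3.2, 1.1, 0.4]
print(reevaluate_confidence(logits, "easy"))
print(reevaluate_confidence(logits, "hard"))
```

With this scheme, confident-looking scores on difficult inputs are tempered downward, which is the mechanism by which the overlap between correct and incorrect predictions' scores can shrink.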