ReDiT: Re‑evaluating large visual question answering model confidence by defining input scenario Difficulty and applying Temperature mapping
Modafar Al-Shouha, Gábor Szűcs
Code: github.com/modafarshouha/ReDiT (PyTorch)
Abstract
Large models (LMs) have achieved remarkable results in vision-language tasks. Such models are trained on vast amounts of data and then fine-tuned for downstream tasks like visual question answering (VQA). This wide exposure to data, along with the complexity of a multi-modal setup such as VQA, demands an extended definition of what constitutes an out-of-distribution (OOD) condition for these models. Moreover, the input difficulty is expected to influence the model's performance, and this should be reflected in its confidence score. In this work, we primarily address large visual question answering (LVQA) models. We extend the classical boundaries of the OOD definition and introduce 3U-VQA, a novel customizable dataset that simulates various challenges for LVQA models. Moreover, we present a categorical scale for assessing the difficulty of an input scenario. This scale is used to improve the reliability of the answer confidence score by re-evaluating it through an adjusted temperature parameter in the softmax function. Lastly, we study the credibility of our categorization and show that our re-evaluation method helps reduce the overlap between the confidence scores of correct and incorrect LVQA model predictions.
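The core idea of difficulty-based temperature mapping can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `DIFFICULTY_TO_T` mapping and its values are hypothetical, assuming only that harder input scenarios are assigned higher temperatures, which flatten the softmax and lower the reported confidence.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: T > 1 softens the distribution, T < 1 sharpens it."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical mapping from categorical difficulty to temperature;
# the paper's actual scale and values are not reproduced here.
DIFFICULTY_TO_T = {"easy": 1.0, "medium": 1.5, "hard": 2.5}

def reevaluate_confidence(logits, difficulty):
    """Return the top answer's confidence after difficulty-dependent temperature scaling."""
    t = DIFFICULTY_TO_T[difficulty]
    probs = softmax(logits, temperature=t)
    return float(probs.max())

# Example: the same answer logits yield a lower confidence
# when the input scenario is categorized as harder.
logits = [3.2, 1.1, 0.4]
print(reevaluate_confidence(logits, "easy"))
print(reevaluate_confidence(logits, "hard"))
```

With this scheme, confident-looking scores on difficult inputs are tempered downward, which is the mechanism by which the overlap between correct and incorrect predictions' scores can shrink.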