
RCA: Region Conditioned Adaptation for Visual Abductive Reasoning

2023-03-18

Hao Zhang, Yeo Keat Ee, Basura Fernando


Abstract

Visual abductive reasoning aims to make likely explanations for visual observations. We propose Region Conditioned Adaptation (RCA), a simple yet effective hybrid parameter-efficient fine-tuning method that equips a frozen CLIP model with the ability to infer explanations from local visual cues. We encode "local hints" and "global contexts" into the visual prompts of CLIP separately, at fine- and coarse-grained levels. Adapters are commonly used to fine-tune CLIP for downstream tasks; we design a new attention adapter that directly steers the focus of the attention map with trainable query and key projections added to the frozen CLIP model. Finally, we train the model with a modified contrastive loss that regresses the visual feature simultaneously toward the features of the literal description and the plausible explanation, enabling CLIP to maintain both perception and reasoning abilities. Experiments on the Sherlock visual abductive reasoning benchmark show that RCA significantly outperforms previous state-of-the-art methods, ranking first on the leaderboards (e.g., Human Acc: RCA 31.74 vs. CPT-CLIP 29.58; higher is better). We also show that RCA generalizes to local perception benchmarks such as RefCOCO. We open-source our project at https://github.com/LUNAProject22/RPA.
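The attention adapter described above can be sketched as follows: trainable deltas on the query and key projections reshape the attention map of an otherwise frozen backbone, while the value projection stays fixed. This is a minimal NumPy illustration under assumed names and shapes, not the paper's actual implementation.

```python
import numpy as np

def rca_attention_adapter(x, Wq, Wk, Wv, dWq, dWk):
    """Single-head attention where trainable deltas (dWq, dWk) steer the
    attention map of a frozen model (Wq, Wk, Wv). Shapes: x is (n, d),
    all weight matrices are (d, d). Names here are illustrative."""
    d = Wq.shape[1]
    q = x @ (Wq + dWq)  # frozen query projection + trainable delta
    k = x @ (Wk + dWk)  # frozen key projection + trainable delta
    v = x @ Wv          # value projection stays frozen
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # row-wise softmax
    return attn @ v

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
dWq = np.zeros((d, d))  # zero deltas: adapter reproduces the frozen model
dWk = np.zeros((d, d))
out = rca_attention_adapter(x, Wq, Wk, Wv, dWq, dWk)
print(out.shape)
```

With the deltas initialized to zero, the adapter output coincides with the frozen model's attention output; training then only updates `dWq` and `dWk`, which is what makes the scheme parameter-efficient.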
