Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

2023-01-01 · CVPR 2023 · Code Available

Chuanhao Li, Zhen Li, Chenchen Jing, Yunde Jia, Yuwei Wu

Code

github.com/NeverMoreLCH/SSL2CG
Official · PyTorch · ★ 8

Abstract

Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical for simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. Improving compositional generalization requires understanding the effect of primitives, which include words, image regions, and video frames. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a framework based on self-supervised learning that equips V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving how changes to primitives affect sample semantics and the ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.
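The two characteristics named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the official repository for that); the function names, the squared-distance penalty, and the margin are all assumptions chosen for illustration. The idea shown: if editing a primitive leaves the sample's semantics unchanged, the prediction should stay the same (invariance); if the edit changes the semantics, the prediction should move (equivariance).

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def invariance_loss(pred_original, pred_edited):
    """Hypothetical penalty: the edit preserved semantics,
    so any change in the prediction is penalized."""
    return l2_distance(pred_original, pred_edited) ** 2


def equivariance_loss(pred_original, pred_edited, margin=1.0):
    """Hypothetical penalty: the edit changed semantics, so predictions
    that stay closer than `margin` (an assumed hyperparameter) are penalized."""
    return max(0.0, margin - l2_distance(pred_original, pred_edited)) ** 2


# Toy prediction vectors (illustrative values, not results from the paper).
pred = [0.9, 0.1]
pred_after_harmless_edit = [0.9, 0.1]   # e.g. a non-critical word replaced
pred_after_critical_edit = [0.1, 0.9]   # e.g. a critical word replaced

print(invariance_loss(pred, pred_after_harmless_edit))
print(equivariance_loss(pred, pred_after_critical_edit))
```

In this toy setup both losses are zero when the model behaves as desired: the prediction is unchanged under the harmless edit and shifts by more than the margin under the critical one.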

Tasks

Question Answering, Self-Supervised Learning, Video Grounding, Visual Question Answering
