Visual Spatial Reasoning
Fangyu Liu, Guy Emerson, Nigel Collier
Code
- github.com/cambridgeltl/visual-spatial-reasoning (official, in paper, PyTorch, ★ 140)
- github.com/sohojoe/clip_visual-spatial-reasoning (official, in paper, PyTorch, ★ 9)
- github.com/ziyan-xiaoyu/spatialmqa (PyTorch, ★ 20)
- github.com/MindCode-4/code-1/tree/main/vilt (MindSpore, ★ 0)
Abstract
Spatial relations are a basic part of human cognition. However, they are expressed in natural language in a variety of ways, and previous work has suggested that current vision-and-language models (VLMs) struggle to capture relational information. In this paper, we present Visual Spatial Reasoning (VSR), a dataset containing more than 10k natural text-image pairs with 66 types of spatial relations in English (such as under, in front of, and facing). While using a seemingly simple annotation format, we show how the dataset includes challenging linguistic phenomena, such as varying reference frames. We demonstrate a large gap between human and model performance: the human ceiling is above 95%, while state-of-the-art models only achieve around 70%. We observe that VLMs' by-relation performance correlates poorly with the number of training examples, and that the tested models are in general incapable of recognising relations concerning the orientation of objects.
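The abstract frames VSR as binary classification (does the caption describe the image?) and reports accuracy broken down by relation type. A minimal sketch of that by-relation evaluation is below; the record fields (`caption`, `relation`, `label`) are assumptions based on the paper's description of the annotation format, not the dataset's actual schema.

```python
from collections import defaultdict

def by_relation_accuracy(records, predictions):
    """Overall and per-relation accuracy of binary true/false predictions.

    Each record is assumed to carry a "relation" string and a binary
    "label" (1 = the caption correctly describes the image).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        rel = rec["relation"]
        total[rel] += 1
        if pred == rec["label"]:
            correct[rel] += 1
    per_relation = {rel: correct[rel] / total[rel] for rel in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_relation

# Toy usage with made-up examples (not real VSR data):
records = [
    {"caption": "The cat is under the table.", "relation": "under", "label": 1},
    {"caption": "The dog is facing the camera.", "relation": "facing", "label": 0},
    {"caption": "The cup is in front of the laptop.", "relation": "in front of", "label": 1},
]
preds = [1, 1, 1]
overall, per_relation = by_relation_accuracy(records, preds)
print(overall)                  # 0.666...
print(per_relation["facing"])   # 0.0
```

Grouping accuracy by relation is what surfaces the paper's key finding: aggregate accuracy can look reasonable while orientation-based relations (e.g. facing) are near chance.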
Benchmark Results
| Dataset | Model | Metric | Claimed (%) | Verified | Status |
|---|---|---|---|---|---|
| VSR | LXMERT | accuracy | 70.1 | — | Unverified |
| VSR | ViLT | accuracy | 69.3 | — | Unverified |
| VSR | CLIP (finetuned) | accuracy | 65.1 | — | Unverified |
| VSR | CLIP (frozen) | accuracy | 56.0 | — | Unverified |
| VSR | VisualBERT | accuracy | 55.2 | — | Unverified |