
ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes

2020-08-01 · ECCV 2020 · Code Available

Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, Leonidas Guibas



Abstract

In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a fine-grained object class and the underlying scene contains multiple object instances of that class. Due to the scarcity and unsuitability of existent 3D-oriented linguistic resources for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances leveraging spatial relations with other fine-grained object classes to localize a referred object in a given scene, and ii) Nr3D, which contains 41.5K natural, free-form utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances of either dataset, human listeners can recognize the referred object with high accuracy (>86% and 92%, respectively). By tapping into this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object directly in a 3D scene. Our key technical contribution is designing an approach for combining linguistic and geometric information (in the form of 3D point clouds) and creating multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that language-assisted 3D object identification outperforms language-agnostic object classifiers.
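To make the task setup concrete, the toy sketch below shows how a template-based spatial utterance can single out one object among several same-class instances. It is an illustration only, not the paper's data-generation code or its neural listener: the `SceneObject` type, the template wording, the two relation names, and the example scene are all hypothetical, and the resolver is a hand-written rule rather than a learned model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical stand-in for a segmented object in a 3D scene."""
    obj_id: int
    label: str      # fine-grained class name
    center: tuple   # (x, y, z) centroid, in arbitrary units


def make_utterance(target_label, relation, anchor_label):
    # Assumed template form: "the <target> <relation> the <anchor>".
    return f"the {target_label} {relation} the {anchor_label}"


def resolve(scene, target_label, relation, anchor_label):
    """Rule-based listener: pick the target among same-class distractors."""
    candidates = [o for o in scene if o.label == target_label]
    anchors = [o for o in scene if o.label == anchor_label]
    # The setup is only interesting when distractors of the same class exist
    # and the anchor object is unambiguous.
    if len(candidates) < 2 or len(anchors) != 1:
        raise ValueError("need >= 2 target-class instances and exactly 1 anchor")
    anchor = anchors[0]

    def distance(obj):
        return math.dist(obj.center, anchor.center)

    if relation == "closest to":
        return min(candidates, key=distance)
    if relation == "farthest from":
        return max(candidates, key=distance)
    raise ValueError(f"unsupported relation: {relation!r}")


# Hypothetical scene: two office chairs (target class) and one desk (anchor).
scene = [
    SceneObject(0, "office chair", (0.5, 0.0, 0.0)),
    SceneObject(1, "office chair", (3.0, 1.0, 0.0)),
    SceneObject(2, "desk", (0.0, 0.0, 0.0)),
]

utterance = make_utterance("office chair", "closest to", "desk")
referred = resolve(scene, "office chair", "closest to", "desk")
print(utterance, "->", referred.obj_id)
```

A learned listener would replace `resolve` with a model that scores every object from its point cloud and the utterance text; the rule here only shows what a correct answer looks like for a template of this kind.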
