R-MONet: Region-Based Unsupervised Scene Decomposition and Representation via Consistency of Object Representations

2021-01-01

Shengxin Qian

Abstract

Decomposing a complex scene into multiple objects is a natural instinct of an intelligent vision system. Interest in unsupervised scene representation learning has recently grown, and many previous works tackle the problem by decomposing a scene into object representations, either in the form of segmentation masks or as position and scale latent variables (i.e., bounding boxes). We observe that both types of representation contain geometric information about objects and should therefore be consistent with each other. Inspired by this observation, we propose an unsupervised generative framework called R-MONet that generates geometric object representations in the form of bounding boxes and foreground segmentation masks simultaneously. While the bounding boxes define the regions of interest for generating foreground segmentation masks, the foreground segmentation masks can in turn be used to supervise bounding-box learning via the Multi-Otsu thresholding method. Through experiments on the CLEVR and Multi-dSprites datasets, we show that enforcing consistency between the two types of representation helps the model decompose the scene and learn better geometric object representations.
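
To make the mask-to-box direction concrete, the following is a minimal sketch of how a soft foreground mask could be turned into a bounding-box target with Multi-Otsu thresholding. It is not the paper's implementation: the abstract does not say how the thresholds are converted into boxes, so the choice of three classes, the use of the highest threshold as the foreground cutoff, and the names `multi_otsu_thresholds` and `bbox_from_soft_mask` are all assumptions made for illustration.

```python
from itertools import combinations


def multi_otsu_thresholds(values, classes=3, nbins=32):
    """Exhaustive Multi-Otsu for values in [0, 1].

    Returns the (classes - 1) cut points that maximise the
    between-class variance of the histogram.
    """
    hist = [0] * nbins
    for v in values:
        hist[min(int(v * nbins), nbins - 1)] += 1
    total = float(len(values))
    centers = [(i + 0.5) / nbins for i in range(nbins)]

    best_score, best_cuts = -1.0, None
    for cuts in combinations(range(1, nbins), classes - 1):
        edges = (0,) + cuts + (nbins,)
        score = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            weight = sum(hist[lo:hi])
            if weight == 0:
                continue  # an empty class contributes nothing
            mean = sum(h * c for h, c in zip(hist[lo:hi], centers[lo:hi])) / weight
            # Maximising sum(w_k * mu_k^2) is equivalent to maximising the
            # between-class variance, because the global mean is constant.
            score += (weight / total) * mean * mean
        if score > best_score:
            best_score, best_cuts = score, cuts
    return [c / nbins for c in best_cuts]


def bbox_from_soft_mask(mask, nbins=32):
    """Return (x_min, y_min, x_max, y_max) of the confident foreground, or None.

    Assumption: pixels at or above the highest Multi-Otsu threshold count
    as foreground; the paper may use a different rule.
    """
    values = [p for row in mask for p in row]
    cutoff = multi_otsu_thresholds(values, classes=3, nbins=nbins)[-1]
    rows = [r for r, row in enumerate(mask) if any(p >= cutoff for p in row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] >= cutoff for row in mask)]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# Toy 8x8 mask: background 0.05, uncertain halo 0.5, confident core 0.95.
mask = [[0.05] * 8 for _ in range(8)]
for r in range(1, 7):
    for c in range(1, 7):
        mask[r][c] = 0.5
for r in range(2, 5):
    for c in range(3, 6):
        mask[r][c] = 0.95

box = bbox_from_soft_mask(mask)
print(box)
```

In this toy case the three intensity levels fall into separate classes, so the box covers only the confident core and ignores the halo. A box obtained this way could serve as a target for a bounding-box branch, which is the kind of consistency signal the abstract describes.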
