Focal-WNet: An Architecture Unifying Convolution and Attention for Depth Estimation

2022-07-18 · I2CT 2022 · Code Available

Gouthamaan Manimaran, Swaminathan J

Abstract

Extracting depth information from a single RGB image is a fundamental and challenging task in computer vision with wide-ranging applications. Because only one view is available, the task cannot be solved with traditional multi-view geometry and instead requires learning-based methods. Existing approaches built on convolutional neural networks produce inconsistent and blurry results because they lack long-range dependencies. Building on the recent success of Transformer networks in computer vision, which can process information both locally and globally, we propose a novel architecture named Focal-WNet. The architecture consists of two separate encoders and a single decoder, and aims to learn monocular depth cues such as relative scale, contrast differences, and texture gradients. We adopt focal self-attention instead of vanilla self-attention to reduce the computational complexity of the network. Alongside the focal transformer layers, we use a convolutional encoder to learn depth cues that a transformer alone cannot easily capture, since cues such as occlusion require a local receptive field and are easier for a convolutional network to learn. Extensive experiments show that the proposed Focal-WNet achieves competitive results on two challenging datasets.
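The complexity argument for focal over vanilla self-attention can be sketched with a back-of-the-envelope token count. This is a deliberately simplified model, not the paper's implementation: in actual focal self-attention each query window attends to progressively coarser pooled sub-windows covering a growing surround region, but the key point survives the simplification, since the number of attended tokens per query becomes roughly constant in the window size rather than growing with the full feature-map area. The function names below are illustrative assumptions.

```python
def vanilla_attended_tokens(h, w):
    # Vanilla self-attention: every query attends to all h*w tokens,
    # so cost per query grows with the feature-map area.
    return h * w

def focal_attended_tokens(window, levels):
    # Simplified focal self-attention model (assumption, not the
    # authors' exact scheme): at each of `levels` focal levels the
    # feature map is pooled at a coarser granularity, and each query
    # attends to a fixed window x window grid of pooled tokens.
    return levels * window * window

# For a 56x56 feature map, a 7x7 window, and 3 focal levels:
full = vanilla_attended_tokens(56, 56)    # 3136 tokens per query
focal = focal_attended_tokens(7, 3)       # 147 tokens per query
```

Under these illustrative numbers, each query attends to roughly 20x fewer tokens, which is the kind of saving that makes attention affordable on the high-resolution feature maps needed for dense depth prediction.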
