BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers
Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, Jifeng Dai
Code
- github.com/zhiqi-li/BEVFormer (official, in paper) ★ 23
- github.com/fundamentalvision/BEVFormer (PyTorch) ★ 4,376
- github.com/valeoai/pointbev (PyTorch) ★ 139
Abstract
3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design spatial cross-attention, in which each BEV query extracts spatial features from its regions of interest across camera views. For temporal information, we propose temporal self-attention to recurrently fuse historical BEV information. Our approach achieves a new state of the art of 56.9% NDS on the nuScenes test set, 9.0 points higher than the previous best approach and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and the recall of objects under low-visibility conditions. The code is available at https://github.com/zhiqi-li/BEVFormer.
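The abstract outlines three core ideas: grid-shaped BEV queries, spatial cross-attention over multi-camera features, and temporal self-attention over the history BEV. The sketch below illustrates that query flow in PyTorch. It is a minimal illustration based only on the abstract: the class name, the grid size, and the use of standard multi-head attention in place of the paper's deformable attention are assumptions, not the official implementation (see the repositories above for that).

```python
# Minimal sketch of BEVFormer's query flow, inferred from the abstract.
# Shapes, module names, and plain multi-head attention are illustrative
# assumptions; the paper uses deformable attention with camera projection.
import torch
import torch.nn as nn

class BEVFormerLayerSketch(nn.Module):
    def __init__(self, dim=256, num_heads=8, bev_h=50, bev_w=50):
        super().__init__()
        self.bev_h, self.bev_w = bev_h, bev_w
        # Predefined grid-shaped BEV queries: one learnable vector per BEV cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, dim))
        # Temporal self-attention: queries attend to the history BEV features.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Spatial cross-attention: queries attend to multi-camera image features.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.ReLU(), nn.Linear(dim * 4, dim))

    def forward(self, cam_feats, prev_bev=None):
        # cam_feats: (B, num_cams * tokens, dim) flattened multi-view features.
        B = cam_feats.size(0)
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        # Recurrently fuse history BEV information; on the first frame there is
        # no history, so the queries attend to themselves instead.
        hist = prev_bev if prev_bev is not None else q
        q = q + self.temporal_attn(q, hist, hist)[0]
        # Aggregate spatial information from the camera views.
        q = q + self.spatial_attn(q, cam_feats, cam_feats)[0]
        return q + self.ffn(q)  # (B, bev_h * bev_w, dim) unified BEV representation

bev = BEVFormerLayerSketch()
feats = torch.randn(2, 6 * 100, 256)           # e.g. 6 cameras, 100 tokens each
bev_t0 = bev(feats)                            # first frame: no history BEV
bev_t1 = bev(feats, prev_bev=bev_t0.detach())  # next frame reuses the history BEV
```

The recurrent reuse of `prev_bev` is what lets a single BEV representation carry temporal cues such as object velocity across frames, which the abstract credits for the improved velocity estimation.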
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| DAIR-V2X-I | BEVFormer | AP\|R40 (moderate) | 50.7 | — | Unverified |
| nuScenes | BEVFormer | NDS | 0.57 | — | Unverified |
| nuScenes Camera Only | BEVFormer | NDS (%) | 56.9 | — | Unverified |