Self-supervised Audiovisual Representation Learning for Remote Sensing Data

2021-08-02

Konrad Heidler, Lichao Mou, Di Hu, Pu Jin, Guangyao Li, Chuang Gan, Ji-Rong Wen, Xiao Xiang Zhu

Code

github.com/khdlr/SoundingEarth
Official · In paper · PyTorch · ★ 34

Abstract

Many current deep learning approaches make extensive use of backbone networks pre-trained on large datasets like ImageNet, which are then fine-tuned to perform a certain task. In remote sensing, the lack of comparably large annotated datasets and the wide diversity of sensing platforms impede similar developments. To contribute towards the availability of pre-trained backbone networks in remote sensing, we devise a self-supervised approach for pre-training deep neural networks. By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner, eliminating the need for laborious manual annotation. For this purpose, we introduce the SoundingEarth dataset, which consists of co-located aerial imagery and audio samples from all around the world. Using this dataset, we then pre-train ResNet models to map samples from both modalities into a common embedding space, which encourages the models to understand key properties of a scene that influence both visual and auditory appearance. To validate the usefulness of the proposed approach, we compare the transfer learning performance of the resulting pre-trained weights against that of weights obtained through other means. By fine-tuning the models on a number of commonly used remote sensing datasets, we show that our approach outperforms existing pre-training strategies for remote sensing imagery. The dataset, code and pre-trained model weights will be available at https://github.com/khdlr/SoundingEarth.
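
To make the idea of a common embedding space concrete, here is a minimal sketch of one generic way to pull co-located image and audio embeddings together: a symmetric contrastive (InfoNCE-style) objective. The abstract does not state which loss the authors use, so this is an illustrative stand-in, not their training objective. The function name `symmetric_contrastive_loss`, the `temperature` value, and the toy vectors are all assumptions; in practice the embeddings would come from the image and audio encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_entropy(logits, target):
    """Numerically stable -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def symmetric_contrastive_loss(image_embs, audio_embs, temperature=0.1):
    """Hypothetical loss: pair k of images and audio is the positive,
    every other combination in the batch acts as a negative."""
    imgs = [normalize(v) for v in image_embs]
    auds = [normalize(v) for v in audio_embs]
    n = len(imgs)
    # logits[i][j]: scaled similarity between image i and audio j
    logits = [[dot(i, a) / temperature for a in auds] for i in imgs]
    # Image -> audio direction: each row should peak on the diagonal.
    img_to_aud = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Audio -> image direction: each column should peak on the diagonal.
    columns = [[logits[r][k] for r in range(n)] for k in range(n)]
    aud_to_img = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return 0.5 * (img_to_aud + aud_to_img)


# Toy embeddings (assumed values): aligned pairs vs. deliberately swapped pairs.
aligned_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                          [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

In this toy setup, the loss is small when each image embedding lies closest to its own audio embedding and large when the pairs are swapped. That is the pressure that would push both encoders towards scene properties shared by the two modalities.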

Tasks

Cross-Modal Retrieval, Representation Learning, Transfer Learning

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
SoundingEarth	ResNet-18	Median Rank	565	—	Unverified
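
For readers unfamiliar with the Median Rank metric in the table, the sketch below shows how such a score is commonly computed for cross-modal retrieval: for each image query, all audio candidates are sorted by similarity, the position of the true match is recorded, and the median over all queries is reported (lower is better). The exact evaluation protocol behind the listed number is not given on this page, so the cosine similarity, the image-to-audio direction, and the names `median_rank`, `images`, and `audios` are assumptions for illustration.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def median_rank(image_embs, audio_embs):
    """Median position of the true audio match (index i for image i)
    when audio candidates are sorted by similarity; rank 1 is best."""
    ranks = []
    for i, img in enumerate(image_embs):
        sims = [cosine(img, aud) for aud in audio_embs]
        # Rank = 1 + number of candidates scoring strictly higher than the match.
        rank = 1 + sum(1 for s in sims if s > sims[i])
        ranks.append(rank)
    return median(ranks)


# Toy embeddings (assumed values), three image/audio pairs.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audios = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
print(median_rank(images, audios))
```

A perfect model would score a median rank of 1; a model no better than chance would land near half the number of candidates.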
