Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction

2022-06-01

Jun Chen, Ming Hu, Boyang Li, Mohamed Elhoseiny


Abstract

Self-supervised learning for computer vision has achieved tremendous progress and improved many downstream vision tasks such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global masked reconstruction mechanism is computationally demanding. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that performs masked reconstruction within a small window of 7×7 patches on a simple Transformer encoder, improving the trade-off between efficiency and accuracy compared to global masked reconstruction over the entire image. Extensive experiments show that LoMaR reaches 84.1% top-1 accuracy on ImageNet-1K classification, outperforming MAE by 0.5%. After finetuning the pretrained LoMaR on 384×384 images, it reaches 85.4% top-1 accuracy, surpassing MAE by 0.6%. On MS COCO, LoMaR outperforms MAE by 0.5 AP^box on object detection and 0.5 AP^mask on instance segmentation. LoMaR is especially computation-efficient when pretraining on high-resolution images, e.g., it is 3.1× faster than MAE with 0.2% higher classification accuracy when pretraining on 448×448 images. This local masked reconstruction learning mechanism can be easily integrated into any other generative self-supervised learning approach. Our code is publicly available at https://github.com/junchen14/LoMaR.
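The core idea above, restricting masked reconstruction to a small local window of patches rather than the whole image, can be illustrated with a minimal sketch. The grid size (14×14 patches, i.e., a 224×224 image with 16×16 patches) and the 80% mask ratio below are illustrative assumptions, not necessarily LoMaR's exact hyperparameters; the point is that both masking and reconstruction operate only on the 7×7 window, so attention cost scales with the window size instead of the full image.

```python
import random

def sample_local_window(grid_size=14, window=7, mask_ratio=0.8, seed=0):
    """Sample a window x window block of patch indices from a
    grid_size x grid_size patch grid, then split it at random into
    masked and visible patches.

    Unlike MAE/BEiT's global masking over all grid_size**2 patches,
    only the window**2 local patches enter the encoder, shrinking the
    quadratic attention cost accordingly.
    """
    rng = random.Random(seed)
    # Pick the window's top-left corner so the block stays inside the grid.
    top = rng.randrange(grid_size - window + 1)
    left = rng.randrange(grid_size - window + 1)
    indices = [(top + r) * grid_size + (left + c)
               for r in range(window) for c in range(window)]
    rng.shuffle(indices)
    n_masked = int(len(indices) * mask_ratio)  # 39 of 49 patches at 80%
    masked, visible = indices[:n_masked], indices[n_masked:]
    return sorted(visible), sorted(masked)

visible, masked = sample_local_window()
print(len(visible), len(masked))  # 10 visible, 39 masked
```

In a full pretraining loop, the visible patches would be encoded and the model trained to reconstruct the masked ones, exactly as in global masked reconstruction but confined to the sampled window; several windows can be sampled per image to cover more of it.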
