ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

2023-01-02CVPR 2023Code Available3· sign in to hype

Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie

Code Available — Be the first to reproduce this paper.

Code

github.com/facebookresearch/convnext-v2
OfficialIn paperpytorch★ 1,994
github.com/vishalned/MMEarth-train
pytorch★ 63
github.com/Jacky-Android/convnext-v2-pytorch
pytorch★ 48
github.com/zibbini/convnext-v2_tensorflow
tf★ 13
github.com/MindSpore-paper-code-3/code7/tree/main/ghostnetv2
mindspore★ 0
github.com/pwc-1/Paper-8/tree/main/convnext
mindspore★ 0
github.com/leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/convnext
tf★ 0
github.com/mindspore-courses/External-Attention-MindSpore/blob/main/model/backbone/convnextv2.py
mindspore★ 0
github.com/lyqcom/convnext
mindspore★ 0
github.com/2024-MindSpore-1/Code2/tree/main/model-1/convnextv2
mindspore★ 0

Abstract

Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt, have demonstrated strong performance in various scenarios. While these models were originally designed for supervised learning with ImageNet labels, they can also potentially benefit from self-supervised learning techniques such as masked autoencoders (MAE). However, we found that simply combining these two approaches leads to subpar performance. In this paper, we propose a fully convolutional masked autoencoder framework and a new Global Response Normalization (GRN) layer that can be added to the ConvNeXt architecture to enhance inter-channel feature competition. This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets on various recognition benchmarks, including ImageNet classification, COCO detection, and ADE20K segmentation. We also provide pre-trained ConvNeXt V2 models of various sizes, ranging from an efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a 650M Huge model that achieves a state-of-the-art 88.9% accuracy using only public training data.

Tasks

Object Detection Representation Learning Self-Supervised Learning Semantic Segmentation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ADE20K	ConvNeXt V2-H (FCMAE)	Validation mIoU	55	—	Unverified
ADE20K	Swin V2-H	Validation mIoU	54.2	—	Unverified
ADE20K	ConvNeXt V2-L	Validation mIoU	53.7	—	Unverified
ADE20K	Swin-L	Validation mIoU	53.5	—	Unverified
ADE20K	Swin-B	Validation mIoU	52.8	—	Unverified
ADE20K	ConvNeXt V2-B	Validation mIoU	52.1	—	Unverified
ADE20K	ConvNeXt V2-L (Supervised)	Validation mIoU	51.6	—	Unverified
ADE20K	ConvNeXt V1-L	Validation mIoU	50.5	—	Unverified
ADE20K	ConvNeXt V1-B	Validation mIoU	49.9	—	Unverified

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Code

Abstract

Tasks

Benchmark Results

Reproductions