HRFormer: High-Resolution Transformer for Dense Prediction
Yuhui Yuan, Rao Fu, Lang Huang, WeiHong Lin, Chao Zhang, Xilin Chen, Jingdong Wang
Code: github.com/HRNet/HRFormer (official implementation, in paper, PyTorch, ★ 521)
Abstract
We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, for improving the memory and computation efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks, e.g., HRFormer outperforms Swin transformer by 1.3 AP on COCO pose estimation with 50% fewer parameters and 30% fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
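The two ideas the abstract highlights, self-attention restricted to small non-overlapping windows plus a depth-wise 3×3 convolution inside the FFN so those windows can exchange information, can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the official implementation: the window size, head count, expansion ratio, and the omission of layer norms, residual connections, relative position bias, and the multi-resolution parallel streams are all simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalWindowAttention(nn.Module):
    """Multi-head self-attention inside non-overlapping windows.

    A sketch of local-window self-attention; window size and head
    count are illustrative assumptions, not HRFormer's settings.
    """
    def __init__(self, dim, num_heads=4, window_size=7):
        super().__init__()
        self.ws = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        ws = self.ws
        # Pad so H and W are divisible by the window size.
        pad_h, pad_w = (-H) % ws, (-W) % ws
        x = F.pad(x, (0, pad_w, 0, pad_h))
        Hp, Wp = H + pad_h, W + pad_w
        # Partition the feature map into ws x ws windows.
        x = x.view(B, C, Hp // ws, ws, Wp // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        # Attention is computed independently inside each window.
        x, _ = self.attn(x, x, x)
        # Reverse the window partition.
        x = x.view(B, Hp // ws, Wp // ws, ws, ws, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, Hp, Wp)
        return x[:, :, :H, :W]

class ConvFFN(nn.Module):
    """FFN with a 3x3 depth-wise convolution so information flows
    across the otherwise disconnected windows (sizes assumed)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.net = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.GELU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # depth-wise
            nn.GELU(),
            nn.Conv2d(hidden, dim, 1),
        )

    def forward(self, x):
        return self.net(x)

# One HRFormer-style block: window attention followed by a conv FFN.
block = nn.Sequential(LocalWindowAttention(dim=64), ConvFFN(dim=64))
out = block(torch.randn(1, 64, 32, 48))
print(out.shape)  # torch.Size([1, 64, 32, 48]) -- resolution preserved
```

Because attention never crosses a window boundary, its cost grows linearly with the number of windows rather than quadratically with the full image, which is what keeps high-resolution feature maps affordable; in this sketch the depth-wise convolution is the only place neighboring windows mix.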
Tasks
- Human Pose Estimation
- Semantic Segmentation
- Image Classification (ImageNet)
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| ImageNet | HRFormer-B | Top-1 accuracy (%) | 82.8 | — | Unverified |
| ImageNet | HRFormer-T | Top-1 accuracy (%) | 78.5 | — | Unverified |