Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles

2023-06-01

Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

arXiv PDF

Code

github.com/leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/hiera

Abstract

Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at https://github.com/facebookresearch/hiera.

Tasks

Action Classification, Action Recognition, Action Recognition In Videos, Image Classification, Instance Segmentation, Object Detection, Video Recognition

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
AVA v2.2	Hiera-H (K700 PT+FT)	mAP	43.3	—	Unverified
Something-Something V2	Hiera-L (no extra data)	Top-1 Accuracy	76.5	—	Unverified
