Vote&Mix: Plug-and-Play Token Reduction for Efficient Vision Transformer

2024-08-30

Shuai Peng, Di Fu, Baole Wei, Yong Cao, Liangcai Gao, Zhi Tang

Abstract

Despite the remarkable success of Vision Transformers (ViTs) in various visual tasks, they are often hindered by substantial computational cost. In this work, we introduce Vote&Mix (VoMix), a plug-and-play and parameter-free token reduction method, which can be readily applied to off-the-shelf ViT models without any training. VoMix tackles the computational redundancy of ViTs by identifying tokens with high homogeneity through a layer-wise token similarity voting mechanism. Subsequently, the selected tokens are mixed into the retained set, thereby preserving visual information. Experiments demonstrate that VoMix significantly improves the speed-accuracy tradeoff of ViTs on both images and videos. Without any training, VoMix achieves a 2× increase in throughput of existing ViT-H on ImageNet-1K and a 2.4× increase in throughput of existing ViT-L on the Kinetics-400 video dataset, with a mere 0.3% drop in top-1 accuracy.
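
The vote-then-mix idea in the abstract can be illustrated with a small NumPy sketch. This is only one plausible reading of the abstract, not the authors' implementation: the function name `vote_and_mix`, the parameter `r` (tokens removed per layer), the use of cosine similarity on raw token features, the rule that the most-voted tokens are the ones selected, and plain averaging as the mixing step are all assumptions made for illustration.

```python
import numpy as np


def vote_and_mix(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce (N, D) tokens to (N - r, D) by similarity voting and mixing.

    Hypothetical sketch; names and rules here are assumptions, not the paper's.
    """
    n = tokens.shape[0]
    if not 0 <= r < n:
        raise ValueError("r must satisfy 0 <= r < number of tokens")
    if r == 0:
        return tokens.astype(float)

    # Cosine similarity between all token pairs; self-similarity is masked out.
    unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)

    # Vote: each token casts one vote for its most similar neighbour.
    votes = np.bincount(sim.argmax(axis=1), minlength=n)

    # Assumption: the r most-voted tokens are treated as highly homogeneous
    # and are selected for removal; the rest are retained.
    order = np.argsort(-votes, kind="stable")
    selected = np.sort(order[:r])
    retained = np.sort(order[r:])

    # Mix: average each selected token into its most similar retained token.
    target = sim[np.ix_(selected, retained)].argmax(axis=1)
    sums = tokens[retained].astype(float)
    counts = np.ones(len(retained))
    np.add.at(sums, target, tokens[selected])
    np.add.at(counts, target, 1.0)
    return sums / counts[:, None]
```

Calling `vote_and_mix(x, r)` on an `(N, D)` array returns an `(N - r, D)` array, so applying it once per layer shrinks the sequence progressively. No learned parameters are involved, which is consistent with the training-free, plug-and-play framing of the abstract.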

Tasks

Token Reduction
