ME-ViT: A Single-Load Memory-Efficient FPGA Accelerator for Vision Transformers

2024-02-15

Kyle Marino, Pengmiao Zhang, Viktor Prasanna

Abstract

Vision Transformers (ViTs) have emerged as a state-of-the-art solution for object classification tasks. However, their computational demands and high parameter count make them unsuitable for real-time inference, prompting the need for efficient hardware implementations. Existing hardware accelerators for ViTs suffer from frequent off-chip memory access, leaving achievable throughput limited by memory bandwidth. In devices with a high compute-to-communication ratio (e.g., edge FPGAs with limited bandwidth), off-chip memory access imposes a severe bottleneck on overall throughput. This work proposes ME-ViT, a novel Memory-Efficient FPGA accelerator for ViT inference that minimizes memory traffic. We design ME-ViT around a single-load policy: model parameters are loaded only once, intermediate results are stored on-chip, and all operations are implemented in a single processing element. To achieve this, we design a memory-efficient processing element (ME-PE) that processes multiple key operations of ViT inference on the same architecture through the reuse of multi-purpose buffers. We also integrate the Softmax and LayerNorm functions into the ME-PE, minimizing stalls between matrix multiplications. We evaluate ME-ViT at systolic array sizes of 32 and 16, achieving up to a 9.22× and a 17.89× overall improvement in memory bandwidth, and a 2.16× improvement in throughput per DSP for both designs, over state-of-the-art ViT accelerators on FPGA. ME-ViT achieves a power efficiency improvement of up to 4.00× (1.03×) over a GPU (FPGA) baseline. ME-ViT enables up to 5 ME-PE instantiations on a Xilinx Alveo U200, achieving a 5.10× improvement in throughput over the state-of-the-art FPGA baseline, and a 5.85× (1.51×) improvement in power efficiency over the GPU (FPGA) baseline.
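The core of the single-load policy is a trade of one-time parameter loading against per-inference reloads and activation spills. The toy traffic model below sketches that trade; all layer counts and byte sizes are hypothetical placeholders (not figures from the paper), and the model ignores on-chip capacity limits and tiling.

```python
# Toy off-chip memory traffic model contrasting a naive accelerator
# (weights reloaded and intermediates spilled every inference) with a
# single-load policy (parameters loaded once, intermediates kept on-chip).
# All sizes below are illustrative assumptions, not measurements.

def naive_traffic(num_layers, weight_bytes, act_bytes, num_inferences):
    """Per inference: reload each layer's weights and spill/reload
    its intermediate activations off-chip."""
    per_inference = num_layers * (weight_bytes + 2 * act_bytes)
    return num_inferences * per_inference

def single_load_traffic(num_layers, weight_bytes, act_bytes, num_inferences):
    """Single-load policy: parameters cross the memory interface once;
    per inference only the input/output activations move off-chip."""
    one_time = num_layers * weight_bytes
    per_inference = 2 * act_bytes  # input embeddings in, results out
    return one_time + num_inferences * per_inference

if __name__ == "__main__":
    # Hypothetical ViT-Base-like numbers: 12 encoder layers,
    # ~7 MB of weights per layer, ~0.6 MB of activations per layer.
    L, W, A, N = 12, 7_000_000, 600_000, 100
    ratio = naive_traffic(L, W, A, N) / single_load_traffic(L, W, A, N)
    print(f"traffic reduction: {ratio:.1f}x")
```

Under these placeholder sizes the one-time weight load is amortized across all inferences, so the reduction grows with batch count; the paper's measured bandwidth improvements (9.22× and 17.89×) reflect the real design, not this sketch.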
