
DAPA: Distribution Aware Piecewise Activation Functions for On-Device Transformer Inference and Training

2026-03-19

Maoyang Xiang, Bo Wang


Abstract

Non-linear activation functions play a pivotal role in on-device inference and training: they consume substantial hardware resources and have a significant impact on system performance and energy efficiency. In this work, we propose Distribution-Aware Piecewise Activation (DAPA), a differentiable and hardware-friendly activation function for Transformer architectures that exploits the distribution of pre-activation data. DAPA employs a non-uniform piecewise approximation that allocates finer segments to high-probability regions of the distribution, improving generalizability over prior piecewise linear methods. The resulting approximation is further quantized using a Distribution-Weighted Mean Square Error objective to reduce latency and resource utilization for hardware deployment. Our HLS implementation demonstrates that DAPA speeds up GELU computation by 16× and decreases DSP utilization by 16× while maintaining comparable or better performance across vision Transformers and GPT-2 models.
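The core idea can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes a standard-normal pre-activation distribution, places piecewise-linear breakpoints for GELU at equal-probability quantiles of that distribution (so high-probability regions get finer segments), and compares the distribution-weighted approximation error against uniformly spaced breakpoints. All function names (`make_breakpoints`, `weighted_mse`, etc.) are hypothetical.

```python
import math

def gelu(x):
    # exact GELU via the Gaussian CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_quantile(p):
    # inverse CDF of N(0, 1) via bisection (accurate enough for a sketch)
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def make_breakpoints(n_segments, lo=-6.0, hi=6.0):
    # distribution-aware: interior breakpoints sit at equal-probability
    # quantiles of the assumed N(0, 1) pre-activation distribution,
    # clustering segments where inputs are most likely
    ps = [(i + 1) / n_segments for i in range(n_segments - 1)]
    return [lo] + [gaussian_quantile(p) for p in ps] + [hi]

def piecewise_linear(x, bps, vals):
    # clamp outside the covered range, interpolate linearly inside it
    if x <= bps[0]:
        return vals[0]
    if x >= bps[-1]:
        return vals[-1]
    for i in range(len(bps) - 1):
        if x <= bps[i + 1]:
            t = (x - bps[i]) / (bps[i + 1] - bps[i])
            return vals[i] + t * (vals[i + 1] - vals[i])

def weighted_mse(bps, n_grid=2001):
    # MSE against exact GELU, weighted by the (unnormalised) N(0, 1) pdf;
    # a stand-in for the paper's Distribution-Weighted MSE criterion
    vals = [gelu(b) for b in bps]
    num = den = 0.0
    for k in range(n_grid):
        x = -6.0 + 12.0 * k / (n_grid - 1)
        w = math.exp(-0.5 * x * x)
        e = piecewise_linear(x, bps, vals) - gelu(x)
        num += w * e * e
        den += w
    return num / den

n = 8
uniform_bps = [-6.0 + 12.0 * i / n for i in range(n + 1)]
aware_bps = make_breakpoints(n)
err_uniform = weighted_mse(uniform_bps)
err_aware = weighted_mse(aware_bps)
```

With the same segment budget, the quantile-placed breakpoints concentrate resolution near zero, where GELU is most curved and inputs are most probable, so `err_aware` comes out well below `err_uniform`.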
