
Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

2025-02-03

Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen


Abstract

Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, further reducing the cost of processing long-context sequences. Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, yielding substantially better accuracy than prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized transformers while drastically reducing the computational costs of long-context inference. We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD incurs a performance loss of just 1.78% on GLUE, versus 9.08% for state-of-the-art binarization work, and 2.5% on ImageNet versus 12.14%, all while targeting custom hardware with a 79% area reduction and 87% power reduction relative to its standard attention counterpart.
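The key identity behind the abstract's claim is that for d-dimensional {-1, +1} vectors, the dot product q·k equals d − 2·Hamming(q, k), so attention scores can be computed with XOR and popcount instead of multiply-accumulate. Below is a minimal numpy sketch of that equivalence; the sign-based binarization here is a hypothetical stand-in for the paper's distillation procedure, not the authors' actual training method.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Hypothetical binarization: sign of real-valued projections -> {-1, +1}.
Q = np.sign(rng.standard_normal((4, d)))
K = np.sign(rng.standard_normal((6, d)))
Q[Q == 0] = 1  # guard against exact zeros
K[K == 0] = 1

# Bit representation for XOR-based hardware: +1 -> 1, -1 -> 0.
Qb = Q > 0
Kb = K > 0

# Hamming distance = number of disagreeing positions (XOR then count).
hamming = (Qb[:, None, :] ^ Kb[None, :, :]).sum(axis=-1)

# Recover dot-product attention scores via q . k = d - 2 * Hamming(q, k).
scores_from_hamming = d - 2 * hamming
scores_dot = Q @ K.T

assert np.array_equal(scores_from_hamming, scores_dot)
```

On binary hardware the XOR/popcount path replaces every multiply-accumulate in the score matrix, which is the source of the area and power reductions the abstract reports; this sketch only demonstrates the numerical equivalence, not the hardware mapping.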
