HyenaPixel: Global Image Context with Convolutions

2024-02-29

Julian Spravil, Sebastian Houben, Sven Behnke

Code

github.com/spravil/HyenaPixel
Official · PyTorch · ★ 4

Abstract

In computer vision, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, its quadratic complexity limits its applicability to tasks that benefit from high-resolution input. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to bidirectional data and two-dimensional image space. We scale Hyena's convolution kernels beyond the feature map size, up to 191×191, to maximize ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 84.9% and 85.2%, respectively, with no additional training data, while outperforming other convolutional and large-kernel networks. Combining HyenaPixel with attention further improves accuracy. We attribute the success of bidirectional Hyena to learning the data-dependent geometric arrangement of pixels without a fixed neighborhood definition. Experimental results on downstream tasks suggest that HyenaPixel with large filters and a fixed neighborhood leads to better localization performance.
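
To illustrate how a convolution kernel larger than the feature map can provide global context at sub-quadratic cost, here is a minimal NumPy sketch of a non-causal 2D convolution computed through the FFT. It is not the authors' implementation: the function names, the single-channel setup, and the choice of a (2H-1)×(2W-1) kernel are assumptions made for illustration, and Hyena's gating and learned implicit filters are omitted. With a kernel of that size, every output pixel depends on every input pixel, and the FFT route costs roughly O(N log N) in the number of pixels N, whereas a direct evaluation with a kernel of this size would cost O(N²).

```python
import numpy as np

def global_conv2d_fft(x, k):
    """'Same'-sized, non-causal 2D convolution of x (H, W) with an
    odd-sized kernel k (Kh, Kw), computed via the FFT (illustrative)."""
    H, W = x.shape
    Kh, Kw = k.shape
    # Zero-pad to the full linear-convolution size so the circular
    # convolution implied by the FFT does not wrap around.
    Fh, Fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(Fh, Fw))
    K = np.fft.rfft2(k, s=(Fh, Fw))
    full = np.fft.irfft2(X * K, s=(Fh, Fw))
    # Crop the centered H x W window (kernel center aligned to each pixel).
    top, left = (Kh - 1) // 2, (Kw - 1) // 2
    return full[top:top + H, left:left + W]

def global_conv2d_direct(x, k):
    """Reference implementation with explicit loops, for checking only."""
    H, W = x.shape
    Kh, Kw = k.shape
    ch, cw = (Kh - 1) // 2, (Kw - 1) // 2
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for u in range(Kh):
                for v in range(Kw):
                    ii, jj = i + ch - u, j + cw - v
                    if 0 <= ii < H and 0 <= jj < W:
                        out[i, j] += k[u, v] * x[ii, jj]
    return out

rng = np.random.default_rng(0)
H = W = 8
x = rng.standard_normal((H, W))                   # toy single-channel feature map
k = rng.standard_normal((2 * H - 1, 2 * W - 1))   # kernel larger than the map

y_fft = global_conv2d_fft(x, k)
y_ref = global_conv2d_direct(x, k)
print(np.allclose(y_fft, y_ref))
```

The zero-padding before the transform is the design choice that matters here: without it, the FFT would compute a circular convolution, and content from one image border would leak into the opposite one.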

Tasks

Image Classification Object Detection Semantic Segmentation

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
ImageNet	HyenaPixel-Bidirectional-Former-B36	Top 1 Accuracy	85.2	—	Unverified
ImageNet	HyenaPixel-Former-B36	Top 1 Accuracy	84.9	—	Unverified
ImageNet	HyenaPixel-Attention-Former-S18	Top 1 Accuracy	83.6	—	Unverified
ImageNet	HyenaPixel-Bidirectional-Former-S18	Top 1 Accuracy	83.5	—	Unverified
ImageNet	HyenaPixel-Former-S18	Top 1 Accuracy	83.2	—	Unverified
