SOTAVerified

Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

2026-03-10

Abdulrahman Alswaidan, Jeffrey D. Varner


Abstract

Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We derive a closed-form entropy inflection condition that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law β* ∝ d for random patterns. We validate on five domains (64 to 4,096 dimensions). On MNIST digit images, stochastic attention is 2.6× more novel and 2.0× more diverse than the best learned baseline (a VAE trained on the same patterns), while matching a Metropolis-corrected gold standard. On protein sequences from the Pfam RRM family, the generation regime achieves 6.9× lower amino acid composition divergence than the VAE (KL = 0.060 vs. 0.416) at matched novelty, demonstrating that the training-free score function preserves family-level fidelity that learned models lose. A denoising diffusion baseline (DDPM) fails across all memory sizes tested (K = 100 to 3,500), producing samples indistinguishable from isotropic noise. The approach requires no architectural changes to the underlying attention mechanism.
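The listing does not include code, but the recipe the abstract describes, an energy gradient equal to one attention step plus Langevin noise, can be sketched. The sketch below assumes the standard modern Hopfield energy E(ξ) = −β⁻¹ logsumexp(βXξ) + ½‖ξ‖², whose gradient is ξ minus a softmax-weighted average of the stored patterns; all function names, step sizes, and the unadjusted-Langevin choice are our assumptions, not the authors' implementation.

```python
import numpy as np

def hopfield_energy_grad(xi, X, beta):
    """Gradient of the modern Hopfield energy
    E(xi) = -(1/beta) * logsumexp(beta * X @ xi) + 0.5 * ||xi||^2.
    It equals xi minus one attention retrieval (softmax-weighted
    average of the stored patterns in the rows of X)."""
    scores = beta * (X @ xi)
    scores -= scores.max()              # numerical stability
    p = np.exp(scores)
    p /= p.sum()                        # softmax attention weights
    return xi - X.T @ p                 # gradient = query - retrieved value

def stochastic_attention(X, beta, temperature, n_steps=500, step=0.05, rng=None):
    """Unadjusted Langevin sampling of p(xi) ~ exp(-E(xi)/T).
    Low temperature -> near-exact retrieval of a stored pattern;
    high temperature -> open-ended generation. Training-free:
    only the stored patterns X and two scalars are needed."""
    rng = rng if rng is not None else np.random.default_rng(0)
    xi = rng.standard_normal(X.shape[1])
    for _ in range(n_steps):
        noise = rng.standard_normal(xi.shape)
        xi = xi - step * hopfield_energy_grad(xi, X, beta) \
                + np.sqrt(2.0 * step * temperature) * noise
    return xi
```

At very low temperature the noise term vanishes and the update reduces to repeated attention steps, so the chain converges into the basin of one stored pattern; the Metropolis-corrected gold standard mentioned in the abstract would add an accept/reject step on top of this proposal.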
