Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value Weight Triplet in Self-Attention Transformers
Marko Karbevski, Antonij Mijoski
Abstract
We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that the Query, Key, or Value weights are redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. Applied to the Query or Key weights, this also simplifies optimization: attention logits become linear rather than quadratic in the learned weights. Validating Query weight removal on small decoder-only GPT-style models trained from scratch, we find that, with adjusted attention scaling and weight decay, the reduced models match baseline performance despite having fewer parameters. Training remains stable at over 3× lower weight decay, suggesting that Query weight elimination provides implicit regularization. Our analysis also reveals a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function class at fixed width. These findings motivate investigation across modalities and at scale, where the observed stability and efficiency gains may prove most consequential.
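The Query-weight removal described above can be illustrated with a minimal single-head attention sketch in NumPy. This is an illustrative assumption of the construction, not the paper's code: the function name, shapes, and the standard 1/√d scaling (which the paper says it adjusts) are ours. With W_Q fixed to the identity, the logits X W_K^T X^T / √d are linear in the single learned weight W_K.

```python
import numpy as np

def attention_no_wq(X, W_K, W_V):
    """Single-head self-attention with the Query weight replaced by
    the identity matrix (hypothetical sketch; not the paper's code).

    X:   (n, d) token representations
    W_K: (d, d) learned Key weight
    W_V: (d, d) learned Value weight
    """
    d = X.shape[-1]
    Q = X                # W_Q = I, so queries are the raw inputs
    K = X @ W_K
    # Logits are linear in W_K (standard attention with both W_Q and
    # W_K learned makes them quadratic in the learned weights).
    logits = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    A = np.exp(logits - logits.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)
    return A @ (X @ W_V)
```

The same reduction applied to W_K instead (K = X, Q = X @ W_Q) yields the symmetric variant; either choice removes one of the three d×d matrices, giving the 25% attention-parameter saving.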