Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

2024-06-04

Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov

Code

github.com/ablghtianyi/ICL_Modular_Arithmetic
Official · In paper · PyTorch · ★ 19

Abstract

Large language models can solve tasks that were not present in the training set. This capability is believed to be due to in-context learning and skill composition. In this work, we study the emergence of in-context learning and skill composition in a collection of modular arithmetic tasks. Specifically, we consider a finite collection of linear modular functions $z = a\,x + b\,y \bmod p$ labeled by the vector $(a, b) \in \mathbb{Z}_p^2$. We use some of these tasks for pre-training and the rest for out-of-distribution testing. We empirically show that a GPT-style transformer exhibits a transition from in-distribution to out-of-distribution generalization as the number of pre-training tasks increases. We find that the smallest model capable of out-of-distribution generalization requires two transformer blocks, while for deeper models, the out-of-distribution generalization phase is transient, necessitating early stopping. Finally, we perform an interpretability study of the pre-trained models, revealing highly structured representations in both attention heads and MLPs, and discuss the learned algorithms. Notably, we find an algorithmic shift in deeper models as we go from few to many in-context examples.
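
The task setup described in the abstract can be illustrated with a minimal sketch. It is not taken from the official repository: the modulus `p = 29`, the number of pre-training tasks, the uniform random split, the number of in-context examples, and all function names are assumptions chosen for illustration only.

```python
import random

def make_tasks(p):
    """Enumerate every task vector (a, b) in Z_p^2."""
    return [(a, b) for a in range(p) for b in range(p)]

def split_tasks(tasks, n_pretrain, seed=0):
    """Randomly split tasks into pre-training and held-out (OOD) sets.
    A uniform random split is an assumption of this sketch."""
    rng = random.Random(seed)
    shuffled = list(tasks)
    rng.shuffle(shuffled)
    return shuffled[:n_pretrain], shuffled[n_pretrain:]

def sample_sequence(task, p, n_shots, rng):
    """Draw n_shots in-context examples (x, y, z) with z = a*x + b*y mod p."""
    a, b = task
    seq = []
    for _ in range(n_shots):
        x, y = rng.randrange(p), rng.randrange(p)
        seq.append((x, y, (a * x + b * y) % p))
    return seq

p = 29  # hypothetical modulus
tasks = make_tasks(p)
pretrain_tasks, ood_tasks = split_tasks(tasks, n_pretrain=512)

rng = random.Random(1)
# One in-context sequence for a task the model never saw in pre-training.
seq = sample_sequence(ood_tasks[0], p, n_shots=8, rng=rng)
```

In this sketch the task vector `(a, b)` never appears in the sequence itself, so a model given `seq` would have to infer it from the `(x, y, z)` examples in order to predict `z` for a new pair `(x, y)`.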

Tasks

In-Context Learning, Out-of-Distribution Generalization
