Emergent Linear Representations in World Models of Self-Supervised Sequence Models

2023-09-02

Neel Nanda, Andrew Lee, Martin Wattenberg

Code

github.com/ajyl/mech_int_othellogpt (official, in paper; jax; ★ 10)
github.com/alxndrtl/othello_mamba (pytorch; ★ 49)

Abstract

How do sequence models represent their decision-making process? Prior work suggests that Othello-playing neural networks learn nonlinear models of the board state (Li et al., 2023). In this work, we provide evidence of a closely related linear representation of the board. In particular, we show that probing for "my colour" vs. "opponent's colour" may be a simple yet powerful way to interpret the model's internal state. This precise understanding of the internal representations allows us to control the model's behaviour with simple vector arithmetic. Linear representations enable significant interpretability progress, which we demonstrate by further exploring how the world model is computed.
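
To make the two ideas in the abstract concrete, the following is a minimal sketch on synthetic data, not the paper's code. It fits a linear probe that reads "my colour" vs. "opponent's colour" for a single hypothetical board square, then edits the activations with vector arithmetic so that the probe reads the opposite label. Everything here is an assumption made for illustration: the width `d_model`, the planted direction `true_dir`, the signal strength, the least-squares probe, and the reflection used as the edit. The paper's actual probe training and intervention may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 64      # hypothetical activation width (assumption)
n_samples = 2000  # number of synthetic activation vectors

# Planted unit direction standing in for "mine vs. theirs" on one square.
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)

# Labels: +1 = "my colour", -1 = "opponent's colour".
labels = rng.choice([-1.0, 1.0], size=n_samples)

# Synthetic activations: unit Gaussian noise plus the label written
# along the planted direction with strength 3.0 (assumption).
acts = rng.normal(size=(n_samples, d_model)) + 3.0 * labels[:, None] * true_dir

# Linear probe: least-squares regression from activations to labels.
probe_w, *_ = np.linalg.lstsq(acts, labels, rcond=None)


def probe_predict(x, w):
    """Return the sign of the probe readout for each row of x."""
    return np.sign(x @ w)


def flip_along_probe(x, w):
    """Reflect each row of x across the hyperplane orthogonal to w.

    This negates the component along w and leaves the rest untouched,
    so the probe readout x @ w changes sign.
    """
    unit = w / np.linalg.norm(w)
    return x - 2.0 * (x @ unit)[:, None] * unit


# Fraction of samples where the probe recovers the label.
accuracy = float(np.mean(probe_predict(acts, probe_w) == labels))

# Fraction of samples where the edited activations read as the opposite label.
flipped = flip_along_probe(acts, probe_w)
flip_rate = float(np.mean(probe_predict(flipped, probe_w) == -labels))

print(f"probe accuracy: {accuracy:.3f}")
print(f"flip rate after edit: {flip_rate:.3f}")
```

Because the edit is a reflection along the probe direction, the probe readout is negated exactly, so the flip rate equals the probe accuracy. In a real model the quantity of interest would be whether the model's subsequent move predictions change accordingly, which this synthetic sketch does not test.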

Tasks

Decision Making
