Mechanisms of Introspective Awareness

2026-03-22Code Available0· sign in to hype

Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey

Code Available — Be the first to reproduce this paper.

Code

github.com/safety-research/introspection-mechanisms
OfficialIn paper★ 0

Abstract

Recent work shows that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept, a phenomenon cited as evidence of "introspective awareness." But what mechanisms underlie this capability, and do they reflect genuine introspective circuitry or more shallow heuristics? We investigate these questions in open-source models and establish three main findings. First, introspection is behaviorally robust: detection achieves moderate true positive rates with 0% false positives across diverse prompts. We also find this capability emerges specifically from post-training rather than pretraining. Second, introspection is not reducible to a single linear confound: anomaly detection relies on distributed MLP computation across multiple directions, implemented by evidence carrier and gate features. Third, models possess greater introspective capability than is elicited by default: ablating refusal directions improves detection by 53pp and a trained steering vector by 75pp. Overall, our results suggest that introspective awareness is behaviorally robust, grounded in nontrivial internal anomaly detection, and likely could be substantially improved in future models. Code: https://github.com/safety-research/introspection-mechanisms.

Mechanisms of Introspective Awareness

Code

Abstract

Reproductions