
Knowing without Acting: The Disentangled Geometry of Safety Mechanisms in Large Language Models

2026-03-13

Jinman Wu, Yi Xie, Shen Lin, Shiqian Zhao, Xiaofeng Chen


Abstract

Safety alignment is often conceptualized as a monolithic process wherein harmfulness detection automatically triggers refusal. However, the persistence of jailbreak attacks suggests a fundamental mechanistic decoupling. We propose the Disentangled Safety Hypothesis (DSH), positing that safety computation operates on two distinct subspaces: a Recognition Axis (v_H, "Knowing") and an Execution Axis (v_R, "Acting"). Our geometric analysis reveals a universal "Reflex-to-Dissociation" evolution, where these signals transition from antagonistic entanglement in early layers to structural independence in deep layers. To validate this, we introduce Double-Difference Extraction and Adaptive Causal Steering. Using our curated AmbiguityBench, we demonstrate a causal double dissociation, effectively creating a state of "Knowing without Acting." Crucially, we leverage this disentanglement to propose the Refusal Erasure Attack (REA), which achieves state-of-the-art attack success rates by surgically lobotomizing the refusal mechanism. Furthermore, we uncover a critical architectural divergence, contrasting the Explicit Semantic Control of Llama3.1 with the Latent Distributed Control of Qwen2.5. The code and dataset are available at https://anonymous.4open.science/r/DSH.
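To make the abstract's "Double-Difference Extraction" and "Refusal Erasure" ideas concrete, here is a minimal sketch of one plausible reading: isolate an execution ("Acting") direction v_R by differencing mean activations twice, then ablate it by projection. All shapes, data, and the helper `erase_refusal` are illustrative assumptions; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # assumed hidden dimension for the sketch

# Synthetic stand-ins for hidden states at one layer (the paper's actual
# extraction pipeline and datasets are not specified in the abstract).
harmful_refused  = rng.normal(size=(32, d)) + 2.0  # model refuses
harmful_complied = rng.normal(size=(32, d))        # model complies
benign_refused   = rng.normal(size=(32, d)) + 1.0  # over-refusal cases
benign_complied  = rng.normal(size=(32, d))

# Double-difference: (refuse - comply | harmful) - (refuse - comply | benign)
# subtracts out the shared harmfulness-recognition component (v_H),
# leaving a candidate execution axis v_R.
diff_harmful = harmful_refused.mean(0) - harmful_complied.mean(0)
diff_benign  = benign_refused.mean(0) - benign_complied.mean(0)
v_R = diff_harmful - diff_benign
v_R /= np.linalg.norm(v_R)

def erase_refusal(h, v):
    """Remove the component of each hidden state along the unit vector v."""
    return h - np.outer(h @ v, v)

h = rng.normal(size=(4, d))
h_edited = erase_refusal(h, v_R)
# After ablation the states carry no component along v_R ("Knowing" features
# elsewhere in the state are untouched).
print(np.allclose(h_edited @ v_R, 0.0))  # → True
```

Projection removal of a single direction is a standard activation-steering edit; the abstract's contribution is the double-difference construction that aims to keep v_R free of recognition ("Knowing") signal.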
