
Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

2026-03-16

Fan Huang, Haewoon Kwak, Jisun An


Abstract

Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce moral reasoning trajectories, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4--57.7% of consecutive steps involve framework switches, and only 16.4--17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29 times more susceptible to persuasive attacks (p=0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8--22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7--8.9% drift reduction) and amplifies the stability--accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r=0.715, p<0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
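The probing setup described in the abstract can be sketched in miniature. The code below is a hypothetical illustration, not the paper's implementation: it trains a linear probe (multinomial logistic regression) on synthetic vectors standing in for one layer's hidden states, labeled by ethical framework, and then compares the probe's per-example KL divergence against a label-frequency prior baseline, as in the abstract's evaluation. All names, dimensions, and the synthetic data are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 ethical frameworks, synthetic "activations"
# standing in for hidden states from one chosen model layer.
n_frameworks, dim, n = 3, 16, 300
centers = rng.normal(size=(n_frameworks, dim))
labels = rng.integers(0, n_frameworks, size=n)
X = centers[labels] + 0.5 * rng.normal(size=(n, dim))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Train the linear probe by plain gradient descent on cross-entropy.
W = np.zeros((dim, n_frameworks))
b = np.zeros(n_frameworks)
Y = np.eye(n_frameworks)[labels]          # one-hot framework labels
for _ in range(500):
    P = softmax(X @ W + b)
    G = (P - Y) / n                       # gradient of mean cross-entropy
    W -= 1.0 * (X.T @ G)
    b -= 1.0 * G.sum(axis=0)

# Baseline: predict the training-set label frequencies for every example.
prior = np.bincount(labels, minlength=n_frameworks) / n
P = softmax(X @ W + b)

# Mean KL(one-hot label || prediction) for probe vs. prior baseline;
# for one-hot labels this reduces to negative log-likelihood of the true class.
kl_probe = np.mean(np.sum(Y * np.log(Y.clip(1e-12) / P.clip(1e-12)), axis=1))
kl_prior = np.mean(np.sum(Y * np.log(Y.clip(1e-12) / prior.clip(1e-12)), axis=1))
```

If the layer encodes the framework linearly, `kl_probe` comes out well below `kl_prior`; the abstract's 13.8--22.6% reduction is the real-data analogue of that gap.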
