How Neural Networks Organize Concepts: Introducing Concept Trajectory Analysis for Deep Learning Interpretability
Andrew Smigaj
Abstract
We present Concept Trajectory Analysis (CTA), an interpretability method that tracks how neural networks organize concepts by following their paths through clustered activation spaces across layers. Applying CTA to GPT-2 with 1,228 single-token words revealed that the model organizes language primarily by grammatical function rather than semantic meaning. We found that 48.5% of words converge to grammatical highways where nouns—whether animals, objects, or abstracts—travel together, while maintaining semantic distinctions at finer scales (χ² = 95.90, p < 0.0001). CTA combines geometric clustering with trajectory tracking to quantify how concepts flow through networks. Our method introduces windowed analysis to identify phase transitions (semantic→grammatical in GPT-2) and leverages LLMs to generate interpretable cluster labels. In medical AI, CTA exposed how a heart disease model stratifies patients through risk pathways, revealing demographic biases (male overprediction in Path 4, 83% male composition). By making neural organization visible and quantifiable, CTA provides actionable insights for model debugging, bias detection, and scientific understanding of deep learning. Our open-source implementation enables researchers to apply CTA to any neural network, advancing interpretable AI across domains.
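To make the abstract's core idea concrete, here is a minimal, hypothetical sketch of trajectory tracking: cluster the activations at each layer independently, then read off each word's sequence of cluster assignments as its "trajectory." The tiny k-means routine, the synthetic two-group activations, and all names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Tiny k-means for illustration; assumes well-separated data."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def concept_trajectories(layer_activations, k=2):
    """layer_activations: list of (n_words, d) arrays, one per layer.
    Returns one tuple of per-layer cluster ids ("trajectory") per word."""
    per_layer = [kmeans(acts, k) for acts in layer_activations]
    return list(zip(*per_layer))

# Hypothetical activations: 4 "words" over 3 layers; words 0-1 and 2-3
# form two well-separated groups at every layer.
rng = np.random.default_rng(1)
layers = [np.vstack([rng.normal(0.0, 0.1, (2, 2)),
                     rng.normal(5.0, 0.1, (2, 2))])
          for _ in range(3)]
trajs = concept_trajectories(layers, k=2)
```

Words that share a group at every layer end up with identical trajectories, while the two groups' trajectories differ; with real model activations, shared trajectories are what the paper calls "grammatical highways."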