SOTAVerified

Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models

2026-03-15

Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona

Abstract

Hallucinations in large language models (LLMs) produce fluent continuations that are not supported by the prompt, especially under minimal contextual cues and ambiguity. We introduce Distributional Semantics Tracing (DST), a model-native method that builds layer-wise semantic maps at the answer position by decoding residual-stream states through the unembedding, selecting a compact top-K concept set, and estimating directed concept-to-concept support via lightweight causal tracing. Using these traces, we test a representation-level hypothesis: hallucinations arise from correlation-driven representational drift across depth, where the residual stream is pulled toward a locally coherent but context-inconsistent concept neighborhood reinforced by training co-occurrences. On the Racing Thoughts dataset, DST yields more faithful explanations than attribution, probing, and intervention baselines under an LLM-judge protocol, and the resulting Contextual Alignment Score (CAS) strongly predicts failures, supporting this drift hypothesis.
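The two core ingredients described above, decoding each layer's residual-stream state through the unembedding to get a top-K concept map, and scoring how well those concepts align with the prompt's context, can be sketched as follows. This is a minimal illustrative toy with synthetic weights, not the authors' implementation; the function names, the toy vocabulary, and the simple overlap-based alignment score are all assumptions for exposition.

```python
import numpy as np

def layerwise_concept_map(hidden_states, W_U, vocab, k=3):
    """Logit-lens-style decoding: project each layer's residual-stream
    state at the answer position through the unembedding matrix W_U and
    keep the top-k highest-scoring vocabulary concepts per layer."""
    maps = []
    for h in hidden_states:                 # h has shape (d_model,)
        logits = h @ W_U                    # shape (vocab_size,)
        top = np.argsort(logits)[::-1][:k]  # indices of k largest logits
        maps.append([vocab[i] for i in top])
    return maps

def contextual_alignment_score(concept_maps, context_concepts):
    """Toy stand-in for a CAS-like quantity: the fraction of decoded
    concepts across layers that belong to the prompt's concept set.
    Low values would indicate drift toward context-inconsistent concepts."""
    decoded = [c for layer in concept_maps for c in layer]
    return sum(c in context_concepts for c in decoded) / len(decoded)

# Synthetic example: 4 "layers", 8-dim residual stream, 5-concept vocab.
rng = np.random.default_rng(0)
vocab = ["paris", "france", "rome", "italy", "capital"]
W_U = rng.normal(size=(8, len(vocab)))
hiddens = [rng.normal(size=8) for _ in range(4)]

maps = layerwise_concept_map(hiddens, W_U, vocab)
cas = contextual_alignment_score(maps, context_concepts={"paris", "france", "capital"})
print(maps, cas)
```

In the actual method, the hidden states would come from a real model's residual stream (e.g. per-layer hidden states at the final token), the unembedding would be the model's own, and concept-to-concept support is additionally estimated via causal tracing, which this sketch omits.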
