Distributional Semantics Tracing: A Framework for Explaining Hallucinations in Large Language Models
Gagan Bhatia, Somayajulu G Sripada, Kevin Allan, Jacobo Azcona
Code: github.com/gagan3012/ltr
Abstract
Hallucinations in large language models (LLMs) produce fluent continuations that are not supported by the prompt, especially under minimal contextual cues and ambiguity. We introduce Distributional Semantics Tracing (DST), a model-native method that builds layer-wise semantic maps at the answer position by decoding residual-stream states through the unembedding, selecting a compact top-K concept set, and estimating directed concept-to-concept support via lightweight causal tracing. Using these traces, we test a representation-level hypothesis: hallucinations arise from correlation-driven representational drift across depth, where the residual stream is pulled toward a locally coherent but context-inconsistent concept neighborhood reinforced by training co-occurrences. On Racing Thoughts dataset, DST yields more faithful explanations than attribution, probing, and intervention baselines under an LLM-judge protocol, and the resulting Contextual Alignment Score (CAS) strongly predicts failures, supporting this drift hypothesis.