The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR→LLM Pipelines?
Jayadev Billa
Abstract
Speech LLMs are widely assumed to outperform ASR→LLM cascades because they have direct access to the audio rather than only the transcript. In this paper, we present an evaluation methodology and a mechanistic interpretation of the observed behavior of speech LLMs. First, we introduce matched-backbone testing, which separates the behavior of the speech LLM from the reasoning capabilities of the underlying LLM. Second, we analyze speech LLMs mechanistically using logit lens and LEACE, showing that the literal transcript emerges from the LLM's hidden states and that text representations are causally necessary. We also show that in most deployed use cases, current speech LLMs are expensive cascades, and under noise they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
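For readers unfamiliar with the logit lens mentioned above, the following is a minimal toy sketch of the general idea, not the paper's implementation: each layer's hidden state is projected through the model's output (unembedding) matrix and the top-scoring token is read off. The vocabulary, matrix, hidden states, and function names here are hypothetical values chosen for illustration, and the final layer normalization that practical logit-lens code usually applies before unembedding is omitted.

```python
# Toy logit-lens sketch. All weights, sizes, and the vocabulary are
# made-up illustrative values, not taken from any real speech LLM.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def logit_lens(hidden_states, unembedding, vocab):
    """Decode each layer's hidden state with the unembedding matrix.

    Returns the top-scoring vocabulary token for every layer.
    """
    decoded = []
    for h in hidden_states:
        logits = matvec(unembedding, h)
        best = max(range(len(logits)), key=logits.__getitem__)
        decoded.append(vocab[best])
    return decoded


# Hypothetical 3-token vocabulary and identity unembedding matrix.
VOCAB = ["<noise>", "hello", "world"]
UNEMBEDDING = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical hidden state at one audio position, one vector per layer.
HIDDEN_BY_LAYER = [
    [0.9, 0.1, 0.0],  # early layer: no transcript token is dominant yet
    [0.4, 0.5, 0.1],  # middle layer: "hello" starts to win
    [0.1, 0.8, 0.1],  # late layer: "hello" is clearly decodable
]

tokens_by_layer = logit_lens(HIDDEN_BY_LAYER, UNEMBEDDING, VOCAB)
print(tokens_by_layer)
```

In this toy setup, a transcript token becoming the top decoded token at later layers is the kind of pattern a logit-lens analysis looks for. Establishing causal necessity would need a separate intervention, such as the concept erasure that LEACE performs, which this sketch does not cover.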