Evaluating the Reliability of Self-Explanations in Large Language Models

2024-07-19

Korbinian Randl, John Pavlopoulos, Aron Henriksson, Tony Lindgren


Code

github.com/k-randl/self-explaining_llms
Official · In paper · PyTorch · ★ 1

Abstract

This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (extractive and counterfactual) using three state-of-the-art LLMs (2B to 8B parameters) on two different classification tasks (objective and subjective). Our findings reveal that, while these self-explanations can correlate with human judgement, they do not fully and accurately follow the model's decision process, indicating a gap between perceived and actual model reasoning. We show that this gap can be bridged because prompting LLMs for counterfactual explanations can produce faithful, informative, and easy-to-verify results. These counterfactuals offer a promising alternative to traditional explainability methods (e.g. SHAP, LIME), provided that prompts are tailored to specific tasks and checked for validity.
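
The abstract describes counterfactual self-explanations as easy to verify. One plausible reading is that a counterfactual counts as valid when the edited text actually changes the model's predicted label. The sketch below illustrates that check under stated assumptions: `is_valid_counterfactual`, `word_edit_fraction`, and `toy_classify` are hypothetical names introduced here, the keyword classifier merely stands in for an LLM, and none of this is taken from the paper's repository.

```python
from typing import Callable


def is_valid_counterfactual(
    classify: Callable[[str], str],
    original: str,
    counterfactual: str,
) -> bool:
    """Return True if the edited text receives a different label than the original."""
    return classify(original) != classify(counterfactual)


def word_edit_fraction(original: str, counterfactual: str) -> float:
    """Fraction of word positions that differ (a rough, assumed proxy for edit size)."""
    a, b = original.split(), counterfactual.split()
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    same = sum(x == y for x, y in zip(a, b))
    return 1.0 - same / longest


def toy_classify(text: str) -> str:
    """Toy stand-in for an LLM classifier (assumption, for illustration only)."""
    return "negative" if "terrible" in text.lower() else "positive"


original = "The food was terrible"
counterfactual = "The food was great"  # imagined model-proposed counterfactual

valid = is_valid_counterfactual(toy_classify, original, counterfactual)
edit_fraction = word_edit_fraction(original, counterfactual)
print(valid, edit_fraction)
```

In practice `classify` would wrap a call to the same LLM that produced the counterfactual, so the check tests the explanation against the model's own behaviour rather than against human expectations.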

Tasks

counterfactual
