Do large language models and humans have similar behaviors in causal inference with script knowledge?

2023-11-13

Xudong Hong, Margarita Ryzhova, Daniel Adrian Biondi, Vera Demberg

Abstract

Recently, large pre-trained language models (LLMs) have demonstrated superior language understanding abilities, including zero-shot causal reasoning. However, it is unclear to what extent their capabilities are similar to human ones. We here study the processing of an event B in a script-based story, which causally depends on a previous event A. In our manipulation, event A is stated, negated, or omitted in an earlier section of the text. We first conducted a self-paced reading experiment, which showed that humans exhibit significantly longer reading times when causal conflicts exist (¬A → B) than under logical conditions (A → B). However, reading times remain similar when cause A is not explicitly mentioned, indicating that humans can easily infer event B from their script knowledge. We then tested a variety of LLMs on the same data to check to what extent the models replicate human behavior. Our experiments show that 1) only recent LLMs, like GPT-3 or Vicuna, correlate with human behavior in the ¬A → B condition. 2) Despite this correlation, all models still fail to predict that nil → B is less surprising than ¬A → B, indicating that LLMs still have difficulties integrating script knowledge. Our code and collected data set are available at https://github.com/tony-hong/causal-script.
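
The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that model "surprise" at event B is measured as surprisal, the summed negative log2 probability of B's tokens given the preceding story; the abstract does not spell out the exact measure. The function names, condition labels, and per-token probabilities below are hypothetical and chosen only for illustration. They are not results from the paper or code from the authors' repository.

```python
import math
from typing import Dict, List


def surprisal_bits(token_probs: List[float]) -> float:
    """Total surprisal in bits: sum of -log2 p(token | context)."""
    return sum(-math.log2(p) for p in token_probs)


def matches_human_pattern(s: Dict[str, float]) -> bool:
    """Check the ordering the abstract reports for human readers:
    the causal conflict (not-A -> B) is harder than the logical case (A -> B),
    and omitting A (nil -> B) is less surprising than negating it."""
    return s["not-A -> B"] > s["A -> B"] and s["nil -> B"] < s["not-A -> B"]


# Hypothetical per-token probabilities for the event-B sentence in each
# condition. Invented toy values, NOT measurements from any model.
toy_probs = {
    "A -> B": [0.5, 0.25, 0.5],
    "not-A -> B": [0.25, 0.125, 0.25],
    "nil -> B": [0.5, 0.25, 0.25],
}

toy_surprisal = {cond: surprisal_bits(p) for cond, p in toy_probs.items()}

for cond, value in toy_surprisal.items():
    print(f"{cond}: {value:.1f} bits")
print("human-like ordering:", matches_human_pattern(toy_surprisal))
```

In a real replication, the toy probabilities would be replaced by token probabilities obtained from a language model for the same event-B text under each of the three story conditions.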
