DateLogicQA: Benchmarking Temporal Biases in Large Language Models

2024-12-17

Gagan Bhatia, MingZe Tang, Cristina Mahanta, Madiha Kazi


Code

github.com/gagan3012/eais-temporal-bias (official)

Abstract

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately.
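To make "diverse date formats" concrete, the sketch below renders a single calendar date in several common surface forms, the kind of variation a model's tokenizer would have to handle consistently. The format list and the `render_variants` helper are hypothetical and chosen only for illustration; they are not taken from the benchmark or its repository, and the sketch does not implement the Semantic Integrity Metric.

```python
from datetime import date

# Hypothetical format list for illustration only; not the benchmark's own set.
FORMATS = {
    "iso": "%Y-%m-%d",
    "us_slash": "%m/%d/%Y",
    "eu_slash": "%d/%m/%Y",
    "compact": "%Y%m%d",
    "long": "%B %d, %Y",  # month name depends on the active locale
}


def render_variants(d: date) -> dict[str, str]:
    """Render one calendar date in every format listed in FORMATS."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


variants = render_variants(date(2024, 12, 17))
for name, text in variants.items():
    print(f"{name:>9}: {text}")
```

Each string refers to the same day, so any difference in how a model splits or reasons about them comes from the surface form alone.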

Tasks

Benchmarking
