PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

2024-02-17

Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

Code

github.com/zli12321/qa_metrics
Official, in paper, ★ 61

Abstract

Question answering (QA) can only make progress if we know whether an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient and interpretable QA evaluation that is more stable than exact match and neural methods (BERTScore).
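
To illustrate why exact match struggles with verbose answers, here is a minimal sketch of two common string-based baselines: normalized exact match and token-level F1. This is not the PEDANTS method and not the API of the linked repository; the function names (`normalize`, `exact_match`, `token_f1`) and the normalization rules (lowercasing, stripping punctuation and English articles) are assumptions chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace.

    These normalization rules are an illustrative assumption, not the paper's.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(candidate: str, reference: str) -> bool:
    """True only if the normalized token sequences are identical."""
    return normalize(candidate) == normalize(reference)


def token_f1(candidate: str, reference: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    cand, ref = normalize(candidate), normalize(reference)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Eiffel Tower"
short_answer = "The Eiffel Tower"
verbose_answer = "It is the Eiffel Tower in Paris."

print(exact_match(short_answer, reference))    # short answer matches
print(exact_match(verbose_answer, reference))  # verbose answer does not
print(token_f1(verbose_answer, reference))     # partial credit only
```

The verbose answer contains the correct entity, yet exact match rejects it and token F1 penalizes the extra words, which is the failure mode on free-form LLM output that the abstract describes.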

Tasks

Benchmarking, Form, Open-Domain Question Answering, Question Answering, Text Generation
