Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions
Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells
Code: github.com/webis-de/arxiv24-ranking-generated-answers
Abstract
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements, and we apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's τ = 0.64). It is also consistent with previous findings in that the quality of generated answers improves with model size and more sophisticated prompting strategies.
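To illustrate the core evaluation idea, the following minimal sketch scores a set of generated answers with a ranking model and compares the resulting ordering to a human expert's preference ranks using Kendall's τ (via SciPy). The scores, answer labels, and ranks are hypothetical placeholders, not the paper's actual pipeline or data.

```python
from scipy.stats import kendalltau

# Hypothetical example: answers to one consumer health question, each scored
# by a ranking model trained on an annotated document collection
# (higher score = judged more relevant by the ranking model).
model_scores = {
    "llm_small_zero_shot": 0.42,
    "llm_small_few_shot": 0.55,
    "llm_large_zero_shot": 0.61,
    "llm_large_few_shot": 0.78,
}

# Hypothetical human expert preference ranks for the same answers (1 = best).
human_ranks = {
    "llm_small_zero_shot": 4,
    "llm_small_few_shot": 3,
    "llm_large_zero_shot": 2,
    "llm_large_few_shot": 1,
}

answers = list(model_scores)
# A higher ranking-model score should correspond to a better (lower) human
# rank, so negate the scores to compare two "smaller is better" orderings.
tau, p_value = kendalltau(
    [-model_scores[a] for a in answers],
    [human_ranks[a] for a in answers],
)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```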