Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions
Sebastian Heineking, Jonas Probst, Daniel Steinbach, Martin Potthast, Harrisen Scells
Code: github.com/webis-de/arxiv24-ranking-generated-answers
Abstract
Evaluating the output of generative large language models (LLMs) is challenging and difficult to scale. Many evaluations of LLMs focus on tasks such as single-choice question-answering or text classification. These tasks are not suitable for assessing open-ended question-answering capabilities, which are critical in domains where expertise is required. One such domain is health, where misleading or incorrect answers can have a negative impact on a user's well-being. Using human experts to evaluate the quality of LLM answers is generally considered the gold standard, but expert annotation is costly and slow. We present a method for evaluating LLM answers that uses ranking models trained on annotated document collections as a substitute for explicit relevance judgements, and we apply it to the CLEF 2021 eHealth dataset. In a user study, our method correlates with the preferences of a human expert (Kendall's τ = 0.64). It is also consistent with previous findings in that the quality of generated answers improves with model size and more sophisticated prompting strategies.
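To illustrate the core evaluation idea, the following minimal sketch scores a set of generated answers with a ranking model and compares the resulting ordering to a human expert's preference ranks using Kendall's τ (via SciPy). The scores, answer labels, and ranks are hypothetical placeholders, not the paper's actual pipeline or data.

```python
from scipy.stats import kendalltau

# Hypothetical example: answers to one consumer health question, each scored
# by a ranking model trained on an annotated document collection
# (higher score = judged more relevant by the ranking model).
model_scores = {
    "llm_small_zero_shot": 0.42,
    "llm_small_few_shot": 0.55,
    "llm_large_zero_shot": 0.61,
    "llm_large_few_shot": 0.78,
}

# Hypothetical human expert preference ranks for the same answers (1 = best).
human_ranks = {
    "llm_small_zero_shot": 4,
    "llm_small_few_shot": 3,
    "llm_large_zero_shot": 2,
    "llm_large_few_shot": 1,
}

answers = list(model_scores)
# A higher ranking-model score should correspond to a better (lower) human
# rank, so negate the scores to compare two "smaller is better" orderings.
tau, p_value = kendalltau(
    [-model_scores[a] for a in answers],
    [human_ranks[a] for a in answers],
)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```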