Evaluating Large Language Models with fmeval

2024-07-15

Pola Schwöbel, Luca Franceschi, Muhammad Bilal Zafar, Keerthan Vasist, Aman Malhotra, Tomer Shenhar, Pinal Tailor, Pinar Yilmaz, Michael Diamond, Michele Donini

Code

github.com/aws/fmeval
Official · In paper · ★ 278

Abstract

fmeval is an open-source library for evaluating large language models (LLMs) across a range of tasks. It helps practitioners evaluate their models both for task performance and along multiple responsible AI dimensions. This paper presents the library and lays out its underlying design principles: simplicity, coverage, extensibility, and performance. We then describe how these principles are reflected in the scientific and engineering choices made while developing fmeval. A case study demonstrates a typical use case for the library: picking a suitable model for a question answering task. We close by discussing limitations and future work in the development of the library. fmeval can be found at https://github.com/aws/fmeval.
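
To make the case study's idea of picking a model for question answering concrete, here is a minimal sketch of comparing two models on a tiny question answering set using exact match and token-level F1. It deliberately does not use fmeval's API: the function names, the two models, and the toy data are all hypothetical, and the whitespace-based normalization is a simplifying assumption.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    # Simplifying assumption: lowercase and split on whitespace only.
    return text.lower().split()


def exact_match(prediction: str, target: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(normalize(prediction) == normalize(target))


def token_f1(prediction: str, target: str) -> float:
    # Harmonic mean of token-level precision and recall.
    pred, gold = normalize(prediction), normalize(target)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def mean_scores(predictions: list[str], targets: list[str]) -> dict[str, float]:
    # Average each metric over the dataset.
    n = len(targets)
    pairs = list(zip(predictions, targets))
    return {
        "exact_match": sum(exact_match(p, t) for p, t in pairs) / n,
        "f1": sum(token_f1(p, t) for p, t in pairs) / n,
    }


# Hypothetical reference answers and model outputs (illustrative only).
targets = ["Paris", "William Shakespeare"]
outputs = {
    "model_a": ["Paris", "Shakespeare"],
    "model_b": ["London", "William Shakespeare wrote it"],
}

scores = {name: mean_scores(preds, targets) for name, preds in outputs.items()}
best = max(scores, key=lambda name: scores[name]["f1"])
print(scores)
print("Best model by mean F1:", best)
```

Ranking by mean F1 is just one possible selection rule; in practice the choice would also weigh the responsible AI dimensions the abstract mentions.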

Tasks

Question Answering
