
Why We Need New Evaluation Metrics for NLG

2017-07-21 · EMNLP 2017 · Code Available

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, Verena Rieser


Abstract

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: We investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at system-level and can support system development by finding cases where a system performs poorly.
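To make the abstract's comparison concrete, the sketch below computes a simplified sentence-level BLEU-like score (modified n-gram precision up to bigrams with a brevity penalty, without the multi-reference handling or smoothing of standard BLEU) and correlates it against hypothetical human ratings using a hand-rolled Spearman coefficient. The sentences, ratings, and function names are illustrative assumptions, not the paper's actual data or implementation:

```python
# Minimal sketch: correlate a simplified BLEU-like metric with
# hypothetical human ratings (not the paper's data or code).
from collections import Counter
import math

def ngrams(tokens, n):
    # All contiguous n-grams of a token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_like(hyp, ref, max_n=2):
    # Geometric mean of modified n-gram precisions, with brevity penalty.
    precisions = []
    for n in range(1, max_n + 1):
        h, r = Counter(ngrams(hyp, n)), Counter(ngrams(ref, n))
        overlap = sum(min(c, r[g]) for g, c in h.items())
        total = max(1, sum(h.values()))
        precisions.append(max(overlap, 1e-9) / total)  # floor avoids log(0)
    bp = min(1.0, math.exp(1 - len(ref) / max(1, len(hyp))))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

def spearman(xs, ys):
    # Rank correlation; assumes no ties, which suffices for a sketch.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

if __name__ == "__main__":
    # Hypothetical system outputs, references, and 1-6 human ratings.
    hyps = ["the restaurant serves italian food",
            "cheap pub near the riverside",
            "food italian the serves"]
    refs = ["the restaurant serves italian food",
            "a cheap pub in the riverside area",
            "the restaurant serves italian food"]
    human = [5.8, 4.1, 2.0]
    scores = [bleu_like(h.split(), r.split()) for h, r in zip(hyps, refs)]
    print("metric scores:", scores)
    print("spearman rho vs. human:", spearman(scores, human))
```

Even this toy setup illustrates the paper's point: sentence-level metric scores and human ratings can be compared only through rank correlation, and with real system outputs that correlation is often weak.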
