Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts

2019-07-01 · ACL 2019

Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith

Abstract

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover's similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover's similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.
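To make the idea of comparing texts in a continuous space more concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes each sentence is embedded as the average of its word vectors, that sentences are weighted by their word counts, and that a distance is mapped to a similarity with exp(-distance). To stay dependency-free, it swaps in the relaxed lower bound on the mover's distance (each sentence sends all of its mass to the nearest sentence in the other text) instead of solving the full transport problem. The names `toy_vectors`, `sentence_embedding`, and `relaxed_mover_similarity` are hypothetical, and the two-dimensional vectors are made up for illustration.

```python
import math

# Hypothetical toy word vectors; a real system would load pretrained embeddings.
toy_vectors = {
    "cat": [1.0, 0.0], "dog": [0.9, 0.1], "sat": [0.5, 0.5],
    "car": [0.0, 1.0], "drove": [0.1, 0.9],
}

def sentence_embedding(sentence, word_vectors):
    """Return (mean word vector, number of known words) for one sentence.

    Assumes the sentence contains at least one word present in word_vectors.
    """
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    dim = len(vecs[0])
    mean = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return mean, len(vecs)

def relaxed_mover_similarity(doc_a, doc_b, word_vectors):
    """Similarity in (0, 1] from a relaxed mover's distance between two texts.

    Each text is a list of sentences. Sentence weights are word counts
    normalized to sum to 1 (an assumption made for this sketch).
    """
    emb_a = [sentence_embedding(s, word_vectors) for s in doc_a]
    emb_b = [sentence_embedding(s, word_vectors) for s in doc_b]
    total_a = sum(n for _, n in emb_a)
    total_b = sum(n for _, n in emb_b)
    # One-sided relaxations: every sentence moves all its mass to its nearest
    # neighbor in the other text. The larger of the two is the tighter bound.
    a_to_b = sum((n / total_a) * min(math.dist(u, v) for v, _ in emb_b)
                 for u, n in emb_a)
    b_to_a = sum((n / total_b) * min(math.dist(v, u) for u, _ in emb_a)
                 for v, n in emb_b)
    return math.exp(-max(a_to_b, b_to_a))

summary = ["cat sat", "dog sat"]
reference = ["car drove"]
same = relaxed_mover_similarity(summary, summary, toy_vectors)
different = relaxed_mover_similarity(summary, reference, toy_vectors)
```

Under these assumptions, a text compared with itself has distance zero and therefore similarity 1, while texts whose sentence embeddings lie farther apart score lower. Unlike exact word matching, "cat sat" and "dog sat" end up close because their vectors are close, not because they share every token.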
