Sentence Mover's Similarity: Automatic Evaluation for Multi-Sentence Texts

2019-07-01 · ACL 2019

Elizabeth Clark, Asli Celikyilmaz, Noah A. Smith


Abstract

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover's similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover's similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.
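To make the idea of comparing texts in a continuous embedding space concrete, here is a minimal sketch, not the authors' implementation. It assumes each sentence is represented by the average of its word vectors and weighted by its word count, and it swaps in a relaxed lower bound (each sentence's mass moves to its nearest counterpart) for the exact transport problem a full mover's distance would solve. The toy vectors, the period-based sentence splitting, the exp(-distance) similarity mapping, and all function names are hypothetical choices made for illustration.

```python
import math

# Hypothetical 2-D word vectors, for illustration only; real use would load
# pretrained word embeddings instead.
TOY_VECTORS = {
    "cat": (1.0, 0.0),
    "dog": (0.9, 0.1),
    "sat": (0.0, 1.0),
    "ran": (0.1, 0.9),
    "mat": (0.5, 0.5),
}


def embed_sentence(sentence):
    """Average the vectors of known words; return (vector, word_count)."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        raise ValueError(f"no known words in sentence: {sentence!r}")
    dims = len(TOY_VECTORS[words[0]])
    vector = tuple(
        sum(TOY_VECTORS[w][d] for w in words) / len(words) for d in range(dims)
    )
    return vector, len(words)


def to_bag(text):
    """Turn a text into a list of (sentence_vector, normalized_weight)."""
    # Assumption: sentences are separated by periods.
    embedded = [embed_sentence(s) for s in text.split(".") if s.strip()]
    total = sum(count for _, count in embedded)
    return [(vector, count / total) for vector, count in embedded]


def relaxed_distance(source, target):
    """Move each source sentence's mass to its nearest target sentence.

    This is a lower bound on the exact transport cost, not the exact value.
    """
    return sum(
        weight * min(math.dist(vector, other) for other, _ in target)
        for vector, weight in source
    )


def relaxed_similarity(text_a, text_b):
    """Map the larger of the two one-directional bounds to a similarity in (0, 1]."""
    bag_a, bag_b = to_bag(text_a), to_bag(text_b)
    distance = max(relaxed_distance(bag_a, bag_b), relaxed_distance(bag_b, bag_a))
    return math.exp(-distance)
```

Calling `relaxed_similarity("cat sat.", "dog sat.")` gives a higher score than `relaxed_similarity("cat sat.", "ran mat.")`, because the first pair's sentence vectors lie closer together even though neither pair matches word for word. An exact version would replace `relaxed_distance` with a linear program over the full transport plan.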

Tasks

Reinforcement Learning, Semantic Similarity, Semantic Textual Similarity, Sentence, Sentence Embeddings
