Understanding the Impact of Experiment Design for Evaluating Dialogue System Output
2020-07-01 · WS 2020
Sashank Santhanam, Samira Shaikh
Abstract
Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as no prior experience of participating in similar studies of rating dialogue system output
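To make the notion of rating consistency concrete, the sketch below computes Krippendorff's alpha for interval data, a standard inter-rater agreement measure, over a complete raters-by-items matrix. This is only an illustrative example with made-up ratings, not the authors' analysis code; the function name, the hypothetical continuous (0-100 slider) and Likert (1-5) matrices, and the choice of alpha as the consistency metric are assumptions for demonstration.

```python
# Minimal sketch (not the paper's code): Krippendorff's alpha for interval
# data as one way to quantify how consistent crowdworker ratings are.
# Assumes a complete raters-by-items matrix with no missing ratings.
import numpy as np

def krippendorff_alpha_interval(ratings: np.ndarray) -> float:
    """ratings: array of shape (n_raters, n_items) with numeric scores."""
    ratings = np.asarray(ratings, dtype=float)
    n_raters, n_items = ratings.shape

    # Observed disagreement: mean squared difference between every pair of
    # raters' scores on the same item.
    d_o = 0.0
    for item in range(n_items):
        col = ratings[:, item]
        diffs = col[:, None] - col[None, :]
        d_o += np.sum(diffs ** 2)
    d_o /= n_items * n_raters * (n_raters - 1)

    # Expected disagreement: mean squared difference between every pair of
    # scores pooled over all items and raters.
    pooled = ratings.ravel()
    n = pooled.size
    diffs = pooled[:, None] - pooled[None, :]
    d_e = np.sum(diffs ** 2) / (n * (n - 1))

    return 1.0 - d_o / d_e

# Hypothetical example: 5 raters scoring 4 dialogue responses, once on a
# continuous 0-100 slider and once on a 1-5 Likert scale.
continuous = np.array([
    [72, 65, 30, 88],
    [70, 60, 35, 90],
    [75, 62, 28, 85],
    [68, 66, 33, 91],
    [74, 63, 31, 87],
])
likert = np.array([
    [4, 3, 2, 5],
    [3, 4, 1, 4],
    [5, 3, 2, 5],
    [4, 2, 3, 4],
    [4, 4, 1, 5],
])
print("alpha (continuous):", round(krippendorff_alpha_interval(continuous), 3))
print("alpha (Likert):    ", round(krippendorff_alpha_interval(likert), 3))
```

Higher alpha indicates more consistent ratings across workers; a comparison of this kind across experimental conditions is one way the relative consistency of continuous, Likert, and ranking-based designs could be assessed.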