SumPubMed: Summarization Dataset of PubMed Scientific Articles

2021-08-01 · ACL 2021 · Code Available

Vivek Gupta, Prerna Bharti, Pegah Nokhiz, Harish Karnick

Abstract

Most earlier work on text summarization has been carried out on news article datasets. In these datasets, the summary is naturally located at the beginning of the text, so a model can exploit this spurious correlation to generate summaries instead of truly learning to summarize. To address this issue, we constructed a new dataset, SumPubMed, using scientific articles from the PubMed archive. We conducted a human analysis of summary coverage, redundancy, readability, coherence, and informativeness on SumPubMed. SumPubMed is challenging because (a) the summary content is distributed throughout the text (not localized at the top), and (b) it contains rare domain-specific scientific terms. We observe that seq2seq models that adequately summarize news articles struggle to summarize SumPubMed. Thus, SumPubMed opens new avenues for the future improvement of models as well as the development of new evaluation metrics.
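The positional bias described in the abstract can be illustrated with a small sketch. The code below is not the authors' analysis: the function names, the unigram-overlap heuristic, and the toy documents are all assumptions made for illustration. It reports where in a document the sentence that best matches the summary sits, with 0.0 meaning the first sentence and 1.0 the last.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_overlap(sentence, summary):
    """Fraction of the sentence's distinct tokens that also occur in the summary."""
    sent_tokens = set(tokenize(sentence))
    if not sent_tokens:
        return 0.0
    return len(sent_tokens & set(tokenize(summary))) / len(sent_tokens)


def best_match_position(doc_sentences, summary):
    """Relative position (0.0 = first, 1.0 = last) of the sentence that
    overlaps most with the summary. Ties go to the earliest sentence."""
    if not doc_sentences:
        raise ValueError("document has no sentences")
    scores = [sentence_overlap(s, summary) for s in doc_sentences]
    best = max(range(len(scores)), key=scores.__getitem__)
    if len(doc_sentences) == 1:
        return 0.0
    return best / (len(doc_sentences) - 1)


# Toy, hand-written examples (hypothetical, not taken from any dataset).
news_doc = [
    "The mayor opened a new bridge on Monday.",
    "Crowds gathered early.",
    "Traffic was light.",
]
news_summary = "Mayor opens new bridge."

paper_doc = [
    "Protein folding is widely studied.",
    "We collected samples from patients.",
    "The kinase inhibitor reduced tumor growth.",
]
paper_summary = "A kinase inhibitor reduced tumor growth in patients."

print(best_match_position(news_doc, news_summary))
print(best_match_position(paper_doc, paper_summary))
```

In the toy news document the best-matching sentence is the lead sentence, while in the toy scientific document it is the final one. Averaging such positions over a real corpus would be one rough way to check how strongly summaries are localized at the top; a proper study would use a stronger matching measure than raw unigram overlap.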
