Training Dynamics for Text Summarization Models

2021-11-16 · ACL ARR November 2021

Anonymous


Abstract

Pre-trained language models (e.g., BART) have shown impressive results when fine-tuned on large summarization datasets. However, little is understood about this fine-tuning process, including what knowledge is retained from pre-training and how content selection and generation strategies are learnt across iterations. In this work, we analyze the training dynamics of generation models, focusing on news summarization. Across different datasets (CNN/DM, XSum, MediaSum) and model behaviors (content selection, abstractiveness, hallucination), we study what the model learns at different stages of its fine-tuning process. We find that properties such as copy behavior and content selection are learnt earlier in the training process, and that these observations are robust across domains. In contrast, factual errors, such as hallucination of unsupported facts, are learnt in the later stages, and this behavior varies more across domains. Based on these observations, we demonstrate two techniques for modifying training: first, disregarding high-loss tokens that are challenging to learn; second, disregarding low-loss tokens that are learnt very quickly. We show that these simple modifications can help achieve different goals, such as improving factuality or improving abstractiveness.
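
The two training modifications named in the abstract can be pictured as masking tokens out of the loss by their per-token loss value. The following is only a minimal sketch under stated assumptions: the function name `masked_mean_loss`, the fixed thresholds `low` and `high`, and the toy loss values are all hypothetical. The abstract does not say how the authors select tokens, so this is not their implementation.

```python
from typing import List, Optional


def masked_mean_loss(
    token_losses: List[float],
    low: Optional[float] = None,
    high: Optional[float] = None,
) -> float:
    """Average per-token losses, ignoring tokens outside [low, high].

    `low` and `high` are hypothetical fixed thresholds chosen for
    illustration; the paper's actual selection rule is not given
    in the abstract.
    """
    kept = [
        loss
        for loss in token_losses
        if (low is None or loss >= low) and (high is None or loss <= high)
    ]
    # If every token is masked, this example contributes no loss.
    return sum(kept) / len(kept) if kept else 0.0


# Toy per-token losses for one reference summary (made-up values).
losses = [0.5, 4.0, 1.5, 0.25]

# Disregard high-loss tokens (hard to learn).
drop_high = masked_mean_loss(losses, high=2.0)

# Disregard low-loss tokens (learnt very quickly).
drop_low = masked_mean_loss(losses, low=1.0)
```

In a real fine-tuning loop the same mask would presumably be applied to the token-level cross-entropy before averaging over the batch; that detail is likewise an assumption.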

Tasks

Hallucination, News Summarization, Text Summarization
