IndicBART: A Pre-trained Model for Indic Natural Language Generation

2021-11-16, ACL ARR November 2021

Anonymous

Abstract

We study pre-trained sequence-to-sequence models for a specific language family, with a focus on Indic languages. We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English. IndicBART exploits the orthographic similarity between Indic scripts to improve transfer learning between similar Indic languages. We evaluate IndicBART on two NLG tasks: Neural Machine Translation (NMT) and extreme summarization. Our experiments on both tasks show that a language-family-specific model like IndicBART is competitive with large pre-trained models like mBART50 despite being significantly smaller. It also performs well in very low-resource translation scenarios, namely on languages not included in pre-training or fine-tuning. Script sharing, multilingual training, and better utilization of limited model capacity contribute to the good performance of the compact IndicBART model.
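
The abstract does not say how script sharing is implemented. One plausible way to exploit the orthographic similarity it mentions is to map every Indic script onto a single script before tokenization, relying on the fact that the major Indic Unicode blocks share a parallel layout. The sketch below illustrates that idea only; the function name `to_devanagari`, the script table, and the choice of Devanagari as the target are assumptions made for illustration, not details taken from the abstract.

```python
# Illustrative sketch (assumption): unify Indic scripts by Unicode block offset.
# Each script below occupies a 128-code-point block whose layout parallels
# the Devanagari block, so corresponding letters sit at the same offset.
SCRIPT_BLOCK_START = {
    "devanagari": 0x0900,
    "bengali": 0x0980,
    "gurmukhi": 0x0A00,
    "gujarati": 0x0A80,
    "oriya": 0x0B00,
    "tamil": 0x0B80,
    "telugu": 0x0C00,
    "kannada": 0x0C80,
    "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80
TARGET_START = SCRIPT_BLOCK_START["devanagari"]


def to_devanagari(text: str, script: str) -> str:
    """Map characters of `script` to the same offset in the Devanagari block.

    Characters outside the source script's block (spaces, Latin letters,
    punctuation) are passed through unchanged.
    """
    start = SCRIPT_BLOCK_START[script]
    out = []
    for ch in text:
        offset = ord(ch) - start
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(TARGET_START + offset))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # Bengali KA (U+0995) and Tamil KA (U+0B95) both map to Devanagari KA (U+0915).
    print(to_devanagari("\u0995", "bengali") == "\u0915")
    print(to_devanagari("\u0b95", "tamil") == "\u0915")
```

With a mapping of this kind, cognate words in related languages share subword units in a single vocabulary, which is one way a compact model could transfer between similar languages. The offset mapping is approximate: some code points exist in one script but not in another, so a real pipeline would need to handle those cases explicitly.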
