EMMA-500: Enhancing Massively Multilingual Adaptation of Large Language Models

2024-09-26

Shaoxiong Ji, Zihao Li, Indraneil Paul, Jaakko Paavola, Peiqin Lin, Pinzhen Chen, Dayyán O'Brien, Hengyu Luo, Hinrich Schütze, Jörg Tiedemann, Barry Haddow

arXiv PDF

Code

github.com/MaLA-LM/emma-500
Official · ★ 4

Abstract

In this work, we introduce EMMA-500, a large-scale multilingual language model continually pre-trained on texts in 546 languages. It is designed for enhanced multilingual performance, with a focus on improving coverage of low-resource languages. To facilitate continual pre-training, we compile the MaLA corpus, a comprehensive multilingual dataset enriched with curated datasets across diverse domains. Leveraging this corpus, we conduct extensive continual pre-training of the Llama 2 7B model, resulting in EMMA-500, which demonstrates robust performance across a wide collection of benchmarks, including a comprehensive set of multilingual tasks and PolyWrite, an open-ended generation benchmark developed in this study. Our results highlight the effectiveness of continual pre-training in expanding large language models' language capacity, particularly for underrepresented languages, demonstrating significant gains in cross-lingual transfer, task generalization, and language adaptability.

Tasks

Cross-Lingual Transfer, Language Modeling
