An Expanded Massive Multilingual Dataset for High-Performance Language Technologies

2025-03-13

Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, Mariia Fedorova, Liane Guillou, Barry Haddow, Jan Hajič, Jindřich Helcl, Erik Henriksson, Mateusz Klimaszewski, Ville Komulainen, Andrey Kutuzov, Joona Kytöniemi, Veronika Laippala, Petter Mæhlum, Bhavitvya Malik, Farrokh Mehryary, Vladislav Mikhailov, Nikita Moghe, Amanda Myntti, Dayyán O'Brien, Stephan Oepen, Proyag Pal, Jousia Piha, Sampo Pyysalo, Gema Ramírez-Sánchez, David Samuel, Pavel Stepachev, Jörg Tiedemann, Dušan Variš, Tereza Vojtěchová, Jaume Zaragoza-Bernabeu

Abstract

Training state-of-the-art large language models requires vast amounts of clean and diverse textual data. However, building suitable multilingual datasets remains a challenge. In this work, we present HPLT v2, a collection of high-quality multilingual monolingual and parallel corpora. The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. We document the entire data pipeline and release the code to reproduce it. We provide extensive analysis of the quality and characteristics of our data. Finally, we evaluate the performance of language models and machine translation systems trained on HPLT v2, demonstrating its value.
