Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models
2024-06-19
Stefan Pasch, Dimitirios Petridis, Jannic Cutura
Abstract
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare two approaches: a two-step method that translates text to English and then embeds it with mpnet, and a multilingual embedding model (distiluse) applied directly. The two-step approach achieved a higher F1 score (82% vs. 60%), particularly for less widely used languages; adding expert rules based on domain knowledge raises this score to as much as 89%. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
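To make the comparison concrete, the following is a minimal sketch of embedding-based duplicate detection and F1 scoring. It is not the authors' implementation. The toy vectors stand in for embeddings that would come from a model such as mpnet or distiluse, the thresholds and the helper names (`cosine`, `find_duplicates`, `f1_score`) are assumptions made for illustration, and the brute-force all-pairs comparison replaces the scalable similarity search named in the title.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_duplicates(embeddings, threshold=0.9):
    """Return index pairs (i, j) with i < j whose similarity meets the threshold.

    Brute force over all pairs: fine for a sketch, quadratic in the number of records.
    """
    return {
        (i, j)
        for i, j in combinations(range(len(embeddings)), 2)
        if cosine(embeddings[i], embeddings[j]) >= threshold
    }


def f1_score(predicted, actual):
    """F1 of predicted duplicate pairs against labelled duplicate pairs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy vectors standing in for model embeddings of four records.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.05, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.4],
]
# Hypothetical ground-truth labels: records 0/1 and 2/3 are duplicates.
labelled_pairs = {(0, 1), (2, 3)}

loose_pairs = find_duplicates(toy_embeddings, threshold=0.9)
strict_pairs = find_duplicates(toy_embeddings, threshold=0.95)
loose_f1 = f1_score(loose_pairs, labelled_pairs)
strict_f1 = f1_score(strict_pairs, labelled_pairs)
```

In this toy setup the stricter threshold misses one labelled pair, which lowers recall and therefore F1. Under the two-step approach described in the abstract, the only change would be that each record is translated to English before it is embedded.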