Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models

2024-06-19

Stefan Pasch, Dimitirios Petridis, Jannic Cutura

Abstract

This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare two approaches: a two-step method that translates text to English and then embeds it with mpnet, and direct embedding with a multilingual model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly for less widely used languages, and its F1 score can be raised to 89% by adding expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
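
As a rough illustration of the comparison described in the abstract, the sketch below flags duplicate pairs by cosine similarity between embedding vectors and scores the result with F1 against labelled pairs. It is not the authors' pipeline: the toy vectors stand in for the output of either embedding route (translation plus mpnet, or distiluse), the 0.9 threshold is an arbitrary assumption, the names `cosine`, `duplicate_pairs`, and `f1_score` are hypothetical, and the brute-force pairwise loop replaces the scalable similarity search mentioned in the title.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def duplicate_pairs(embeddings, threshold=0.9):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold.

    Brute force over all pairs: fine for a toy example, not for large corpora.
    """
    return {
        (i, j)
        for (i, u), (j, v) in combinations(enumerate(embeddings), 2)
        if cosine(u, v) >= threshold
    }


def f1_score(predicted, actual):
    """F1 of predicted duplicate pairs against labelled duplicate pairs."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy vectors (assumed data) standing in for real model embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.98, 0.1, 0.0],   # near-duplicate of record 0
    [0.0, 1.0, 0.0],
    [0.0, 0.95, 0.2],   # near-duplicate of record 2
]
labelled = {(0, 1), (2, 3)}  # assumed ground-truth duplicate pairs

predicted = duplicate_pairs(toy_embeddings, threshold=0.9)
print(sorted(predicted), f1_score(predicted, labelled))
```

Running the same scoring function over the pairs produced by each embedding route is one way to arrive at a side-by-side F1 comparison like the one the abstract reports; the threshold would normally be tuned on labelled data rather than fixed in advance.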
