Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction

2021-08-01 · ACL 2021 · Code Available

Adrien Barbaresi

Abstract

An essential operation in web corpus construction consists in retaining the desired content while discarding the rest. Another challenge is finding one's way through websites. This article introduces a text discovery and extraction tool published under an open-source license. Its installation and use are straightforward, notably from Python and on the command line. The software allows for main text, comment, and metadata extraction, while also providing building blocks for web crawling tasks. A comparative evaluation on real-world data demonstrates its usefulness and reports the performance of other available solutions. The contributions of this paper are threefold: it references the software, features a benchmark, and provides a meaningful baseline for similar tasks. The tool performs significantly better than other open-source solutions in this evaluation and in external benchmarks.
