Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset

2024-05-03

Hsuvas Borkakoty, Luis Espinosa-Anke


Code

github.com/hsuvas/hoaxpedia_dataset
Official, in paper, PyTorch

Abstract

Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of the similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists) together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results from experiments with several language models, hoax-to-legit ratios, and amounts of text the classifiers are exposed to (the full article vs. the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that looking at this feature yields better classification results than context.
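
The two experimental knobs mentioned in the abstract, the hoax-to-legit ratio and the amount of text shown to the classifier, can be illustrated with a minimal sketch. This is not the authors' code: the function name build_split, its parameters, and the first-sentence approximation of an article's "definition" are all assumptions made for illustration only.

```python
import random


def build_split(hoaxes, legits, legit_per_hoax, definition_only=False, seed=0):
    """Return shuffled (text, label) pairs; label 1 = hoax, 0 = legitimate.

    Hypothetical helper, not taken from the Hoaxpedia repository.
    """
    rng = random.Random(seed)
    # Sample legitimate articles to reach the requested ratio (capped by availability).
    n_legit = min(len(legits), legit_per_hoax * len(hoaxes))
    sampled = rng.sample(legits, n_legit)

    def view(text):
        if not definition_only:
            return text
        # Assumption: the "definition" is approximated by the first sentence.
        return text.split(". ")[0].rstrip(".") + "."

    rows = [(view(t), 1) for t in hoaxes] + [(view(t), 0) for t in sampled]
    rng.shuffle(rows)
    return rows
```

Calling build_split(hoaxes, legits, legit_per_hoax=2, definition_only=True) would give a 1:2 hoax-to-legit set restricted to first sentences, which could then be passed to any binary text classifier.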

Tasks

Articles, Binary Classification, Binary text classification, Text Classification
