Inferring the source of official texts: can SVM beat ULMFiT?

2020-03-02 · International Conference on Computational Processing of the Portuguese Language 2020

Pedro Henrique Luz de Araujo, Teófilo Emidio de Campos, Marcelo Magalhães Silva de Sousa

Code

github.com/peluz/kneedle-exploration

Abstract

Official Gazettes are a rich source of information relevant to the public. Careful examination of them may lead to the detection of fraud and irregularities, which may help prevent the mismanagement of public funds. This paper presents a dataset composed of documents from the Official Gazette of the Federal District, containing both samples annotated with their document source and unlabeled samples. We train, evaluate, and compare a transfer-learning model based on ULMFiT against traditional bag-of-words models that use SVM and Naive Bayes as classifiers. We find the SVM to be competitive: it performs marginally worse than ULMFiT while training and running inference much faster and at lower computational cost. Finally, we conduct an ablation analysis to assess how each part of ULMFiT affects performance.
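
To make the bag-of-words side of the comparison concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over word counts, written in plain Python. It is not the authors' implementation: the whitespace tokenizer, the add-one smoothing, the class name NaiveBayesBoW, and the toy texts and source labels are all assumptions made for illustration, and none of them come from the dataset or the linked repository.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokenization (an assumption, not the paper's preprocessing).
    return text.lower().split()


class NaiveBayesBoW:
    """Multinomial Naive Bayes over bag-of-words counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / self.total_docs)
            total = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t]
                score += math.log((count + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data standing in for gazette documents and their sources.
train_texts = [
    "hospital nurses appointed to the public health unit",
    "vaccination campaign schedule for health centers",
    "school teachers appointed for the academic year",
    "enrollment calendar for public schools",
]
train_labels = ["health", "health", "education", "education"]

model = NaiveBayesBoW().fit(train_texts, train_labels)
prediction = model.predict("new vaccination schedule for the hospital")
print(prediction)  # health
```

A model of this kind has only per-class word counts to estimate, which is consistent with the abstract's point that the traditional baselines are much cheaper to train and run than a fine-tuned language model.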

Tasks

Text Classification, Transfer Learning
