
An Empirical Study on Pre-trained Embeddings and Language Models for Bot Detection

2019-08-01 · WS 2019 · Code Available

Andres Garcia-Silva, Cristian Berrio, José Manuel Gómez-Pérez


Abstract

Fine-tuning pre-trained language models has significantly advanced the state of the art in a wide range of downstream NLP tasks. Usually, such language models are learned from large, well-formed text corpora such as encyclopedic resources, books, or news articles. However, a significant amount of the text to be analyzed nowadays is Web data, often from social media. In this paper we consider the research question: How well do standard pre-trained language models generalize to, and capture the peculiarities of, the rather short, informal, and frequently automatically generated text found in social media? To answer this question, we focus on bot detection in Twitter as our evaluation task and compare the performance of fine-tuned language models against popular neural architectures such as LSTMs and CNNs combined with pre-trained and contextualized embeddings. Our results show strong performance variations among the different language model approaches, which suggests directions for further research.
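To make the evaluation task concrete: bot detection here is binary text classification over tweets. The sketch below is NOT the paper's method (the paper fine-tunes pre-trained language models and trains LSTM/CNN models over pre-trained embeddings); it is a deliberately minimal bag-of-words logistic-regression baseline on made-up example tweets, included only to illustrate the task setup of mapping short, informal text to a bot/human label.

```python
# Illustrative toy baseline for the bot-vs-human tweet classification task.
# All tweets and labels below are invented for demonstration purposes; they
# are not from any dataset used in the paper.
import math
import re


def tokenize(text):
    """Lowercase a tweet and split it into word-like tokens."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


def train(tweets, labels, epochs=200, lr=0.5):
    """Fit per-token logistic-regression weights with plain SGD on log loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, y in zip(tweets, labels):
            tokens = tokenize(text)
            z = bias + sum(weights.get(t, 0.0) for t in tokens)
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid probability of "bot"
            g = p - y                         # gradient of the log loss w.r.t. z
            bias -= lr * g
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - lr * g
    return weights, bias


def predict(weights, bias, text):
    """Return 1 for a predicted bot tweet, 0 for a predicted human tweet."""
    z = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return 1 if z >= 0 else 0


# Tiny invented training set: spam-like tweets labeled as bots (1),
# conversational tweets labeled as humans (0).
tweets = [
    "WIN a FREE iphone now click here",
    "click here free followers now",
    "had a lovely walk in the park today",
    "great dinner with friends tonight",
]
labels = [1, 1, 0, 0]
w, b = train(tweets, labels)
```

A real system along the lines studied in the paper would replace the bag-of-words features with pre-trained or contextualized embeddings fed into an LSTM or CNN, or fine-tune a pre-trained language model end to end on the same labels.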
