Broad Twitter Corpus: A Diverse Named Entity Recognition Resource

2016-12-01 · COLING 2016

Leon Derczynski, Kalina Bontcheva, Ian Roberts


Abstract

One of the main obstacles hampering method development and comparative evaluation of named entity recognition in social media is the lack of a sizeable, diverse, high-quality annotated corpus analogous to the CoNLL'2003 news dataset. For instance, the biggest Ritter tweet corpus is only 45,000 tokens, a mere 15% of the size of CoNLL'2003. Another major shortcoming is the lack of temporal, geographic, and author diversity. This paper introduces the Broad Twitter Corpus (BTC), which is not only significantly bigger, but also sampled across different regions, temporal periods, and types of Twitter users. The gold-standard named entity annotations are made by a combination of NLP experts and crowd workers, which enables us to harness crowd recall while maintaining high quality. We also measure the entity drift observed in our dataset (i.e. how entity representation varies over time), and compare it to newswire. The corpus is released openly, including source text and intermediate annotations.

Tasks

Diversity, Named Entity Recognition (NER)
