A Large Self-Annotated Corpus for Sarcasm

2017-04-19 · LREC 2018 · Code Available

Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli

Abstract

We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating systems for sarcasm detection. The corpus has 1.3 million sarcastic statements -- 10 times more than any previous dataset -- and many times more instances of non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is furthermore self-annotated -- sarcasm is labeled by the author, not an independent annotator -- and provided with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.

Benchmark Results

Dataset            Model            Metric     Claimed   Verified   Status
SARC (all-bal)     Bag-of-Bigrams   Accuracy   75.8      -          Unverified
SARC (pol-bal)     Bag-of-Bigrams   Accuracy   76.5      -          Unverified
SARC (pol-unbal)   Bag-of-Words     Avg F1     27        -          Unverified
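
For readers unfamiliar with the feature sets and metrics named in the results above, here is a minimal sketch of bag-of-n-gram counting together with accuracy and F1. It is an illustration only, not the authors' code: the function names, the lowercase whitespace tokenization, and the toy labels are assumptions, and the averaging scheme behind "Avg F1" is not stated on this page, so the sketch computes plain F1 for a single positive class.

```python
from collections import Counter


def bag_of_ngrams(text, n=2):
    """Count word n-grams (n=1: bag-of-words, n=2: bag-of-bigrams).

    Assumption: lowercase + whitespace tokenization, chosen for brevity.
    """
    tokens = text.lower().split()
    # zip over n shifted copies of the token list yields consecutive n-grams
    return Counter(zip(*(tokens[i:] for i in range(n))))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class (here: 1 = sarcastic, a hypothetical encoding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only
features = bag_of_ngrams("yeah right yeah right")
labels = [1, 0, 1, 0]
predictions = [1, 0, 0, 0]
acc = accuracy(labels, predictions)
f1 = f1_score(labels, predictions)
```

Accuracy is informative when the classes are balanced; with unbalanced labels a classifier can score high accuracy by always predicting the majority class, which is why F1 is the more common choice in that regime.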