A Large Self-Annotated Corpus for Sarcasm
Mikhail Khodak, Nikunj Saunshi, Kiran Vodrahalli
Code available.
Abstract
We introduce the Self-Annotated Reddit Corpus (SARC), a large corpus for sarcasm research and for training and evaluating sarcasm detection systems. The corpus has 1.3 million sarcastic statements, 10 times more than any previous dataset, and many times more non-sarcastic statements, allowing for learning in both balanced and unbalanced label regimes. Each statement is self-annotated (sarcasm is labeled by the author, not by an independent annotator) and comes with user, topic, and conversation context. We evaluate the corpus for accuracy, construct benchmarks for sarcasm detection, and evaluate baseline methods.
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| SARC (all-bal) | Bag-of-Bigrams | Accuracy | 75.8 | — | Unverified |
| SARC (pol-bal) | Bag-of-Bigrams | Accuracy | 76.5 | — | Unverified |
| SARC (pol-unbal) | Bag-of-Words | Avg F1 | 27 | — | Unverified |
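
The table reports accuracy for the balanced settings and F1 for the unbalanced one. The sketch below is a toy illustration, not the authors' code, of why the metric changes with the label regime: when sarcastic statements are rare, a classifier that never predicts sarcasm still gets high accuracy but an F1 of zero. It also shows one simple way to build a bag of bigrams from a statement. The function names, the whitespace tokenizer, the toy labels, and the choice to compute F1 on the sarcastic class are all assumptions made for this example; the page does not specify how "Avg F1" is averaged.

```python
from collections import Counter


def bigrams(text):
    """Return a bag (multiset) of adjacent lowercased word pairs.

    Assumption: plain whitespace tokenization, chosen only for illustration.
    """
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (here: sarcastic) class; 0.0 if it has no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy unbalanced labels (hypothetical): 1 sarcastic statement among 10.
y_true = [1] + [0] * 9
always_negative = [0] * 10  # a classifier that never predicts sarcasm

print(bigrams("Yeah right sure"))
print(accuracy(y_true, always_negative))  # 0.9
print(f1_score(y_true, always_negative))  # 0.0
```

On a balanced sample the same trivial classifier would score only 50% accuracy, so accuracy remains informative there.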