
TURINGBENCH: A Benchmark Environment for Turing Test in the Age of Neural Text Generation

2021-09-27 · Findings of EMNLP 2021

Adaku Uchendu, Zeyu Ma, Thai Le, Rui Zhang, Dongwon Lee


Abstract

Recent progress in generative language models has enabled machines to generate astonishingly realistic texts. While there are many legitimate applications of such models, there is also a rising need to distinguish machine-generated texts from human-written ones (e.g., fake news detection). However, to the best of our knowledge, there is currently no benchmark environment with datasets and tasks to systematically study the so-called "Turing Test" problem for neural text generation methods. In this work, we present the TuringBench benchmark environment, which comprises (1) a dataset with 200K human- or machine-generated samples across 20 labels: Human, GPT-1, GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3, GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large, FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, and PPLM_gpt2; (2) two benchmark tasks, i.e., Turing Test (TT) and Authorship Attribution (AA); and (3) a website with leaderboards. Our preliminary experimental results using TuringBench show that FAIR_wmt20 and GPT-3 are the current winners, among all language models tested, in generating the most human-like, indistinguishable texts, yielding the lowest F1 scores from five state-of-the-art TT detection models. TuringBench is available at: https://turingbench.ist.psu.edu/
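Both benchmark tasks reduce to text classification: TT is binary (human vs. machine), while AA assigns one of the 20 labels above. As a minimal sketch of the TT setup (this is a generic TF-IDF baseline on toy data, not one of the paper's five detectors or the actual TuringBench corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in samples; the real TT task uses the 200K-sample
# TuringBench dataset with binary human/machine labels.
texts = [
    "The senator announced the new bill on Tuesday.",        # human
    "In a statement statement, officials said said today.",  # machine (toy)
    "Local volunteers rebuilt the flooded town library.",    # human
    "The the quarterly report report indicates growth.",     # machine (toy)
]
labels = ["human", "machine", "human", "machine"]

# Word/bigram TF-IDF features plus logistic regression: a common
# weak baseline for machine-generated-text detection, evaluated
# with F1 score as on the TuringBench leaderboard.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["Officials said said the report report shows growth."])[0]
print(pred)
```

Stronger detectors such as the fine-tuned RoBERTa model in the results table below follow the same train/predict pattern, only with transformer features in place of TF-IDF.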

Tasks

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| TURINGBENCH (Turing Test, FAIR_wmt20) | RoBERTa | F1 score | 0.45 | — | Unverified |
| TURINGBENCH (Turing Test, GPT-3) | RoBERTa | F1 score | 0.52 | — | Unverified |

Reproductions