GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Code
- github.com/ofa-sys/ofa (PyTorch, ★ 2,557)
- github.com/alibaba/EasyNLP (JAX, ★ 2,181)
- github.com/jsalt18-sentence-repl/jiant (in paper; PyTorch, ★ 1,674)
- github.com/benzakenelad/BitFit (PyTorch, ★ 143)
- github.com/ashi-ta/speechglue (★ 13)
- github.com/smallbenchnlp/benchmark (★ 11)
- github.com/colinzhaoust/intrinsic_fewshot_hardness (★ 4)
- github.com/nyu-mll/GLUE-baselines (in paper; PyTorch, ★ 0)
- github.com/kainoj/run_glue (PyTorch, ★ 0)
- github.com/nvshrao/Pytorch-GLUE (PyTorch, ★ 0)
Abstract
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
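The abstract compares multi-task models against "the aggregate performance of training a separate model per task." A minimal sketch of how such an aggregate can be computed is a macro-average: each task contributes equally, and a task with multiple metrics contributes the mean of those metrics. The task names and scores below are illustrative placeholders (only MultiNLI's 72.2 matched accuracy comes from the results table), and `glue_score` is a hypothetical helper, not the official scoring code.

```python
def glue_score(task_metrics):
    """Macro-average over tasks; multi-metric tasks are averaged first."""
    per_task = [sum(metrics) / len(metrics) for metrics in task_metrics.values()]
    return sum(per_task) / len(per_task)

# Placeholder scores for illustration only.
scores = {
    "CoLA": [30.0],        # Matthews correlation
    "SST-2": [90.0],       # accuracy
    "MRPC": [80.0, 85.0],  # accuracy and F1, averaged within the task
    "MultiNLI": [72.2],    # matched accuracy
}
print(glue_score(scores))
```

Because the average is over tasks rather than examples, small-data tasks weigh as much as large ones, which is what gives models an incentive to share knowledge across tasks.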
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MultiNLI | Multi-task BiLSTM + Attn | Matched | 72.2 | — | Unverified |