GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman
Code
- github.com/ofa-sys/ofa (PyTorch, ★ 2,557)
- github.com/alibaba/EasyNLP (JAX, ★ 2,181)
- github.com/jsalt18-sentence-repl/jiant (in paper; PyTorch, ★ 1,674)
- github.com/benzakenelad/BitFit (PyTorch, ★ 143)
- github.com/ashi-ta/speechglue (★ 13)
- github.com/smallbenchnlp/benchmark (★ 11)
- github.com/colinzhaoust/intrinsic_fewshot_hardness (★ 4)
- github.com/nyu-mll/GLUE-baselines (in paper; PyTorch, ★ 0)
- github.com/kainoj/run_glue (PyTorch, ★ 0)
- github.com/nvshrao/Pytorch-GLUE (PyTorch, ★ 0)
Abstract
For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.
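The abstract compares multi-task models against "the aggregate performance of training a separate model per task." A minimal sketch of how such an aggregate can be computed is a macro-average: each task contributes equally, and a task with multiple metrics contributes the mean of those metrics. The task names and scores below are illustrative placeholders (only MultiNLI's 72.2 matched accuracy comes from the results table), and `glue_score` is a hypothetical helper, not the official scoring code.

```python
def glue_score(task_metrics):
    """Macro-average over tasks; multi-metric tasks are averaged first."""
    per_task = [sum(metrics) / len(metrics) for metrics in task_metrics.values()]
    return sum(per_task) / len(per_task)

# Placeholder scores for illustration only.
scores = {
    "CoLA": [30.0],        # Matthews correlation
    "SST-2": [90.0],       # accuracy
    "MRPC": [80.0, 85.0],  # accuracy and F1, averaged within the task
    "MultiNLI": [72.2],    # matched accuracy
}
print(glue_score(scores))
```

Because the average is over tasks rather than examples, small-data tasks weigh as much as large ones, which is what gives models an incentive to share knowledge across tasks.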
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| MultiNLI | Multi-task BiLSTM + Attn | Matched | 72.2 | — | Unverified |