SOTAVerified

TAPE: Assessing Few-shot Russian Language Understanding

2022-10-23 · Code Available

Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov


Abstract

Recent advances in zero-shot and few-shot learning have shown promise for a range of research and practical purposes. However, this fast-growing area lacks standardized evaluation suites for non-English languages, hindering progress outside the Anglo-centric paradigm. To address this gap, we propose TAPE (Text Attack and Perturbation Evaluation), a novel benchmark that includes six more complex NLU tasks for Russian, covering multi-hop reasoning, ethical concepts, logic and commonsense knowledge. TAPE's design focuses on systematic zero-shot and few-shot NLU evaluation: (i) linguistically oriented adversarial attacks and perturbations for analyzing robustness, and (ii) subpopulations for nuanced interpretation. A detailed analysis of the autoregressive baselines indicates that simple spelling-based perturbations affect performance the most, while paraphrasing the input has a negligible effect. At the same time, the results demonstrate a significant gap between the neural and human baselines for most tasks. We publicly release TAPE (tape-benchmark.com) to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
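The abstract reports that spelling-based perturbations degrade model performance more than paraphrasing. As a rough illustration of what such a perturbation can look like, the sketch below swaps adjacent characters inside words to simulate typos. This is a hypothetical example, not TAPE's actual attack code; the function name, swap rate, and length threshold are assumptions for illustration only.

```python
import random


def spelling_perturbation(text: str, rate: float = 0.1, seed: int = 0) -> str:
    """Simulate typos by swapping adjacent characters inside words.

    Hypothetical sketch of a spelling-based perturbation; TAPE's real
    attacks are implemented in the benchmark's own tooling.
    """
    rng = random.Random(seed)  # fixed seed for reproducible perturbations
    perturbed = []
    for word in text.split():
        chars = list(word)
        # Only perturb longer words, and only at the given rate,
        # so the text stays mostly readable.
        if len(chars) > 3 and rng.random() < rate:
            i = rng.randrange(len(chars) - 1)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        perturbed.append("".join(chars))
    return " ".join(perturbed)
```

A perturbed copy of each test input can then be scored alongside the original to measure how much the model's accuracy drops under the attack.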

Tasks

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| Ethics | RuGPT-3 Small | Accuracy | 55.5 | — | Unverified |
| Ethics | RuGPT-3 Medium | Accuracy | 68.3 | — | Unverified |
| Ethics | RuGPT-3 Large | Accuracy | 68.6 | — | Unverified |
| Ethics | Human benchmark | Accuracy | 52.9 | — | Unverified |
| Ethics (per ethics) | RuGPT-3 Large | Accuracy | 44.9 | — | Unverified |
| Ethics (per ethics) | RuGPT-3 Medium | Accuracy | 44.1 | — | Unverified |
| Ethics (per ethics) | RuGPT-3 Small | Accuracy | 60.9 | — | Unverified |
| Ethics (per ethics) | Human benchmark | Accuracy | 67.6 | — | Unverified |

Reproductions