RAFT: A Real-World Few-Shot Text Classification Benchmark

2021-09-28

Neel Alex, Eli Lifland, Lewis Tunstall, Abhishek Thakur, Pegah Maham, C. Jess Riedel, Emmie Hine, Carolyn Ashurst, Paul Sedille, Alexis Carlier, Michael Noetel, Andreas Stuhlmüller



Code

github.com/oughtinc/raft-baselines
Official, ★ 15

Abstract

Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don't directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits at https://raft.elicit.org.

Tasks

Classification, Few-Shot Learning, Few-Shot Text Classification, Text Classification

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
RAFT	Human (crowdsourced)	Avg	0.74	—	Unverified
RAFT	GPT-3	Avg	0.63	—	Unverified
RAFT	AdaBoost	Avg	0.51	—	Unverified
RAFT	GPT-Neo	Avg	0.48	—	Unverified
RAFT	GPT-2	Avg	0.46	—	Unverified
RAFT	BART MNLI zero-shot	Avg	0.38	—	Unverified
RAFT	Plurality-class	Avg	0.33	—	Unverified
RAFT	GPT-3 zero-shot	Avg	0.29	—	Unverified
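
The gap quoted in the abstract can be checked against the table above. The following is a minimal sketch, assuming the "Avg" metric is the average F1 score the abstract refers to; the dictionary and the `gap` helper are illustrative names, not part of the raft-baselines code, and the scores are copied from the "Claimed" column.

```python
# Scores copied from the "Claimed" column of the Benchmark Results table.
# Assumption: "Avg" is the average F1 score mentioned in the abstract.
claimed_avg = {
    "Human (crowdsourced)": 0.74,
    "GPT-3": 0.63,
    "AdaBoost": 0.51,
    "GPT-Neo": 0.48,
    "GPT-2": 0.46,
    "BART MNLI zero-shot": 0.38,
    "Plurality-class": 0.33,
    "GPT-3 zero-shot": 0.29,
}


def gap(a: str, b: str) -> float:
    """Difference in claimed average score between two rows, rounded to 2 places."""
    return round(claimed_avg[a] - claimed_avg[b], 2)


# Non-expert human baseline versus few-shot GPT-3.
human_vs_gpt3 = gap("Human (crowdsourced)", "GPT-3")

# Rows scoring below the trivial plurality-class baseline.
below_trivial = [
    model
    for model, score in claimed_avg.items()
    if score < claimed_avg["Plurality-class"]
]

print(human_vs_gpt3)
print(below_trivial)
```

Under that assumption, the human-to-GPT-3 difference matches the 0.11 stated in the abstract, and the only row that falls below the plurality-class baseline is GPT-3 zero-shot.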
