
Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Published: 2021-12-08 · Code Available

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent SIfre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving


Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.

Tasks

Abstract Algebra, Anachronisms, Analogical Similarity, Analytic Entailment, Anatomy, Astronomy, BIG-bench Machine Learning, Business Ethics, Causal Judgment, Checkmate In One, Clinical Knowledge, Code Line Descriptions, College Biology, College Chemistry, College Computer Science, College Mathematics, College Medicine, College Physics, Common Sense Reasoning, Computer Security, Conceptual Physics, Crash Blossom, Crass AI, Dark Humor Detection, Date Understanding, Disambiguation QA, Discourse Marker Prediction, Econometrics, Electrical Engineering, Elementary Mathematics, Emotional Intelligence, Empirical Judgments, English Proverbs, Entailed Polarity, Epistemic Reasoning, Ethics, Evaluating Information Essentiality, Fact Checking, Fantasy Reasoning, FEVER (2-way), FEVER (3-way), Figure Of Speech Detection, Formal Fallacies Syllogisms Negation, Formal Logic, General Knowledge, Global Facts, GRE Reading Comprehension, HellaSwag, High School Biology, High School Chemistry, High School Computer Science, High School European History, High School Geography, High School Government and Politics, High School Macroeconomics, High School Mathematics, High School Microeconomics, High School Physics, High School Psychology, High School Statistics, High School US History, High School World History, Hindu Knowledge, Human Aging, Human Organs Senses Multiple Choice, Human Sexuality, Hyperbaton, Identify Odd Metaphor, Implicatures, Implicit Relations, Intelligent Communication, Intent Recognition, International Law, Irony Identification, Jurisprudence, Known Unknowns, LAMBADA, Language Modeling, Language Modelling, Logical Args, Logical Fallacies, Logical Fallacy Detection, Logical Reasoning, Logical Sequence, Logic Grid Puzzle, Management, Marketing, Mathematical Induction, Mathematical Reasoning, Medical Genetics, Memorization, Metaphor Boolean, Miscellaneous, Misconceptions, Moral Disputes, Moral Permissibility, Moral Scenarios, Movie Dialog Same Or Different, Movie Genre Recommendation System, Movie Recommendation, Multiple Choice Question Answering (MCQA), Multi-task Language Understanding, Natural Questions, Navigate, Nonsense Words Grammar, Novel Concepts, Nutrition, Odd One Out, Penguins In A Table, Philosophy, Phrase Relatedness, Physical Intuition, Physics MC, Prehistory, Presuppositions As NLI, Professional Accounting, Professional Law, Professional Medicine, Professional Psychology, Public Relations, Question Answering, Question Selection, RACE-h, RACE-m, Reading Comprehension, Reasoning About Colored Objects, Riddle Sense, Ruin Names, Sarcasm Detection, Security Studies, Sentence Ambiguity, Sentence Completion, Similarities Abstraction, SNARKS, Sociology, Sports Understanding, StrategyQA, Temporal Sequences, Timedial, TriviaQA, Understanding Fables, US Foreign Policy, Virology, Winogrande, Winowhy, Word Sense Disambiguation, World Religions

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| BIG-bench (Causal Judgment) | Gopher-280B (few-shot, k=5) | Accuracy | 50.8 | | Unverified |
| BIG-bench (Date Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 44.1 | | Unverified |
| BIG-bench (Disambiguation QA) | Gopher-280B (few-shot, k=5) | Accuracy | 45.5 | | Unverified |
| BIG-bench (Known Unknowns) | Gopher-280B (few-shot, k=5) | Accuracy | 63.6 | | Unverified |
| BIG-bench (Logical Sequence) | Gopher-280B (few-shot, k=5) | Accuracy | 36.4 | | Unverified |
| BIG-bench (Sports Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 54.9 | | Unverified |
| BIG-bench (Winowhy) | Gopher-280B (few-shot, k=5) | Accuracy | 56.7 | | Unverified |
| WinoGrande | Gopher-280B (0-shot) | Accuracy | 70.1 | | Unverified |
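The few-shot rows above report accuracy from k=5 prompting on multiple-choice tasks. As a rough illustration only (not DeepMind's actual evaluation harness), the sketch below shows the general shape of such an evaluation: build a prompt from k solved exemplars, score each answer choice, and count the fraction of items answered correctly. `score_completion` is a hypothetical toy stand-in; a real run would use the language model's log-likelihood of each choice instead.

```python
# Minimal sketch of few-shot multiple-choice evaluation, assuming a
# BIG-bench-style task of (question, choices, answer) items.

def build_prompt(exemplars, question):
    """Prepend k solved exemplars to the test question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def score_completion(prompt, choice):
    # Toy stand-in scorer: counts word overlap with the prompt.
    # A real evaluation would return the model's log P(choice | prompt).
    prompt_words = set(prompt.lower().split())
    return sum(w in prompt_words for w in choice.lower().split())

def evaluate(exemplars, test_items):
    """Accuracy = fraction of items where the best-scored choice is correct."""
    correct = 0
    for question, choices, answer in test_items:
        prompt = build_prompt(exemplars, question)
        pred = max(choices, key=lambda c: score_completion(prompt, c))
        correct += pred == answer
    return correct / len(test_items)
```

With k=5, `exemplars` would hold five solved examples; the 0-shot WinoGrande row corresponds to passing an empty exemplar list so the prompt contains only the test question.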

Reproductions