
Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Published: 2021-12-08 · Code Available

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks, Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu, Erich Elsen, Siddhant Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen Simonyan, Michela Paganini, Laurent SIfre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro, Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d'Autume, Yujia Li, Tayfun Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones, James Bradbury, Matthew Johnson, Blake Hechtman, Laura Weidinger, Iason Gabriel, William Isaac, Ed Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub, Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, Geoffrey Irving


Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally, we discuss the application of language models to AI safety and the mitigation of downstream harms.

Tasks

Abstract Algebra, Anachronisms, Analogical Similarity, Analytic Entailment, Anatomy, Astronomy, BIG-bench Machine Learning, Business Ethics, Causal Judgment, Checkmate In One, Clinical Knowledge, Code Line Descriptions, College Biology, College Chemistry, College Computer Science, College Mathematics, College Medicine, College Physics, Common Sense Reasoning, Computer Security, Conceptual Physics, Crash Blossom, Crass AI, Dark Humor Detection, Date Understanding, Disambiguation QA, Discourse Marker Prediction, Econometrics, Electrical Engineering, Elementary Mathematics, Emotional Intelligence, Empirical Judgments, English Proverbs, Entailed Polarity, Epistemic Reasoning, Ethics, Evaluating Information Essentiality, Fact Checking, Fantasy Reasoning, FEVER (2-way), FEVER (3-way), Figure Of Speech Detection, Formal Fallacies Syllogisms Negation, Formal Logic, General Knowledge, Global Facts, GRE Reading Comprehension, HellaSwag, High School Biology, High School Chemistry, High School Computer Science, High School European History, High School Geography, High School Government and Politics, High School Macroeconomics, High School Mathematics, High School Microeconomics, High School Physics, High School Psychology, High School Statistics, High School US History, High School World History, Hindu Knowledge, Human Aging, Human Organs Senses Multiple Choice, Human Sexuality, Hyperbaton, Identify Odd Metaphor, Implicatures, Implicit Relations, Intelligent Communication, Intent Recognition, International Law, Irony Identification, Jurisprudence, Known Unknowns, LAMBADA, Language Modeling, Language Modelling, Logical Args, Logical Fallacies, Logical Fallacy Detection, Logical Reasoning, Logical Sequence, Logic Grid Puzzle, Management, Marketing, Mathematical Induction, Mathematical Reasoning, Medical Genetics, Memorization, Metaphor Boolean, Miscellaneous, Misconceptions, Moral Disputes, Moral Permissibility, Moral Scenarios, Movie Dialog Same Or Different, Movie Genre Recommendation System, Movie Recommendation, Multiple Choice Question Answering (MCQA), Multi-task Language Understanding, Natural Questions, Navigate, Nonsense Words Grammar, Novel Concepts, Nutrition, Odd One Out, Penguins In A Table, Philosophy, Phrase Relatedness, Physical Intuition, Physics MC, Prehistory, Presuppositions As NLI, Professional Accounting, Professional Law, Professional Medicine, Professional Psychology, Public Relations, Question Answering, Question Selection, RACE-h, RACE-m, Reading Comprehension, Reasoning About Colored Objects, Riddle Sense, Ruin Names, Sarcasm Detection, Security Studies, Sentence Ambiguity, Sentence Completion, Similarities Abstraction, SNARKS, Sociology, Sports Understanding, StrategyQA, Temporal Sequences, Timedial, TriviaQA, Understanding Fables, US Foreign Policy, Virology, Winogrande, Winowhy, Word Sense Disambiguation, World Religions

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| --- | --- | --- | --- | --- | --- |
| BIG-bench (Causal Judgment) | Gopher-280B (few-shot, k=5) | Accuracy | 50.8 | | Unverified |
| BIG-bench (Date Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 44.1 | | Unverified |
| BIG-bench (Disambiguation QA) | Gopher-280B (few-shot, k=5) | Accuracy | 45.5 | | Unverified |
| BIG-bench (Known Unknowns) | Gopher-280B (few-shot, k=5) | Accuracy | 63.6 | | Unverified |
| BIG-bench (Logical Sequence) | Gopher-280B (few-shot, k=5) | Accuracy | 36.4 | | Unverified |
| BIG-bench (Sports Understanding) | Gopher-280B (few-shot, k=5) | Accuracy | 54.9 | | Unverified |
| BIG-bench (Winowhy) | Gopher-280B (few-shot, k=5) | Accuracy | 56.7 | | Unverified |
| WinoGrande | Gopher-280B (0-shot) | Accuracy | 70.1 | | Unverified |
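The few-shot rows above report accuracy from k=5 prompting on multiple-choice tasks. As a rough illustration only (not DeepMind's actual evaluation harness), the sketch below shows the general shape of such an evaluation: build a prompt from k solved exemplars, score each answer choice, and count the fraction of items answered correctly. `score_completion` is a hypothetical toy stand-in; a real run would use the language model's log-likelihood of each choice instead.

```python
# Minimal sketch of few-shot multiple-choice evaluation, assuming a
# BIG-bench-style task of (question, choices, answer) items.

def build_prompt(exemplars, question):
    """Prepend k solved exemplars to the test question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"{shots}\n\nQ: {question}\nA:"

def score_completion(prompt, choice):
    # Toy stand-in scorer: counts word overlap with the prompt.
    # A real evaluation would return the model's log P(choice | prompt).
    prompt_words = set(prompt.lower().split())
    return sum(w in prompt_words for w in choice.lower().split())

def evaluate(exemplars, test_items):
    """Accuracy = fraction of items where the best-scored choice is correct."""
    correct = 0
    for question, choices, answer in test_items:
        prompt = build_prompt(exemplars, question)
        pred = max(choices, key=lambda c: score_completion(prompt, c))
        correct += pred == answer
    return correct / len(test_items)
```

With k=5, `exemplars` would hold five solved examples; the 0-shot WinoGrande row corresponds to passing an empty exemplar list so the prompt contains only the test question.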

Reproductions