Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Rae, Jack W.; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; Rutherford, Eliza; Hennigan, Tom; Menick, Jacob; Cassirer, Albin; Powell, Richard; Driessche, George van den; Hendricks, Lisa Anne; Rauh, Maribeth; Huang, Po-Sen; Glaese, Amelia; Welbl, Johannes; Dathathri, Sumanth; Huang, Saffron; Uesato, Jonathan; Mellor, John; Higgins, Irina; Creswell, Antonia; McAleese, Nat; Wu, Amy; Elsen, Erich; Jayakumar, Siddhant; Buchatskaya, Elena; Budden, David; Sutherland, Esme; Simonyan, Karen; Paganini, Michela; Sifre, Laurent; Martens, Lena; Li, Xiang Lorraine; Kuncoro, Adhiguna; Nematzadeh, Aida; Gribovskaya, Elena; Donato, Domenic; Lazaridou, Angeliki; Mensch, Arthur; Lespiau, Jean-Baptiste; Tsimpoukelli, Maria; Grigorev, Nikolai; Fritz, Doug; Sottiaux, Thibault; Pajarskas, Mantas; Pohlen, Toby; Gong, Zhitao; Toyama, Daniel; d'Autume, Cyprien de Masson; Li, Yujia; Terzi, Tayfun; Mikulik, Vladimir; Babuschkin, Igor; Clark, Aidan; Casas, Diego de Las; Guy, Aurelia; Jones, Chris; Bradbury, James; Johnson, Matthew; Hechtman, Blake; Weidinger, Laura; Gabriel, Iason; Isaac, William; Lockhart, Ed; Osindero, Simon; Rimell, Laura; Dyer, Chris; Vinyals, Oriol; Ayoub, Kareem; Stanway, Jeff; Bennett, Lorrayne; Hassabis, Demis; Kavukcuoglu, Koray; Irving, Geoffrey

arXiv:2112.11446 (cs)

[Submitted on 8 Dec 2021 (v1), last revised 21 Jan 2022 (this version, v2)]

Abstract: Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

Comments:	120 pages
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2112.11446 [cs.CL]
	(or arXiv:2112.11446v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2112.11446

Submission history

From: Jack Rae
[v1] Wed, 8 Dec 2021 19:41:47 UTC (7,844 KB)
[v2] Fri, 21 Jan 2022 18:39:38 UTC (7,941 KB)
