
Language Modeling

Kyle Lo – Akshita Bhagia – Nathan Lambert


Allen Institute for AI
[email protected]

Neural Information Processing Systems (NeurIPS)


10 December 2024

1
Open models vs. closed API models, a brief timeline*:

ELMo (Feb 2018): "Self-supervised LM helps downstream"
BERT (Oct 2018): "Pretrain & fine-tune"
GPT-2 (Feb 2019): Generative
GPT-3 (June 2020): "In-context learning"
Chinchilla (March 2022): "Data size is as important as parameter count"
ChatGPT (Nov 2022)
GPT-4 (March 2023): "Multimodal"

*List not exhaustive 2


AI is here today due to open scientific
practices and fully open models

3
Are we done with scientific research
on LMs?

4
Goal of this tutorial is to build foundational
understanding for LM research.
Outline:
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

Lo, Bhagia, Lambert – Language Modeling Tutorial 5


Minimal LM basics

6
Prerequisites
We’re assuming you are comfortable with:

● Training ML models
○ e.g., “learning rate schedulers”, “AdamW”, “batch size”, “transformers”
● Core LM concepts
○ e.g. “next word prediction”, “tokenization”, “sequence length”
● PyTorch*

*Treat our code snippets like pseudocode, no guarantees they will run!

Lo, Bhagia, Lambert – Language Modeling Tutorial 7


Input and output tensors

Model

Lo, Bhagia, Lambert – Language Modeling Tutorial 8


Closer look at inputs

Model

Lo, Bhagia, Lambert – Language Modeling Tutorial 9


Closer look at inputs

Lo, Bhagia, Lambert – Language Modeling Tutorial 10


Closer look at inputs

Tokenizer

Lo, Bhagia, Lambert – Language Modeling Tutorial 11


Closer look at inputs

Tokenizer

Input
embeddings

Lo, Bhagia, Lambert – Language Modeling Tutorial 12


Input and output tensors

● Tokenizer
● Becomes tensor of dimension
(batch_size, seq_len, embedding_dim)

Model

Lo, Bhagia, Lambert – Language Modeling Tutorial 13


Closer look at outputs

● Tokenizer
● Becomes tensor of dimension
(batch_size, seq_len, embedding_dim)

Model

Lo, Bhagia, Lambert – Language Modeling Tutorial 14


Closer look at outputs
Logits

Lo, Bhagia, Lambert – Language Modeling Tutorial 15


Closer look at outputs
Logits

argmax

Lo, Bhagia, Lambert – Language Modeling Tutorial 16


Closer look at outputs
Logits

argmax

Tokenizer

Lo, Bhagia, Lambert – Language Modeling Tutorial 17


Input and output tensors

● Tokenizer
● Becomes tensor of dimension
(batch_size, seq_len, …)

Model
● Becomes tensor of dimension
(batch_size, seq_len, …)
● Tokenizer

Lo, Bhagia, Lambert – Language Modeling Tutorial 18
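
To make those shapes concrete, here is a minimal pseudocode-style sketch (per the caveat earlier, no guarantees it covers a real model) of the round trip from token ids to logits and back, with a toy vocabulary and toy dimensions; all numbers are illustrative.

import torch
import torch.nn as nn

vocab_size, embedding_dim, batch_size, seq_len = 32_000, 512, 4, 128

token_ids = torch.randint(vocab_size, (batch_size, seq_len))   # tokenizer output: (batch_size, seq_len)
embed = nn.Embedding(vocab_size, embedding_dim)
hidden = embed(token_ids)                                       # (batch_size, seq_len, embedding_dim)

lm_head = nn.Linear(embedding_dim, vocab_size)
logits = lm_head(hidden)                                        # (batch_size, seq_len, vocab_size)

next_token_ids = logits.argmax(dim=-1)                          # argmax over the vocab, then back through the tokenizer
print(token_ids.shape, hidden.shape, logits.shape, next_token_ids.shape)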


Training should otherwise look familiar…

Data tensor
shapes

Data tensor
shapes

Lo, Bhagia, Lambert – Language Modeling Tutorial 19
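
As a sketch of what "familiar" means here, a minimal next-token-prediction training step might look like the following; `model` stands in for any module mapping token ids to per-position logits and is an assumption of this example.

import torch.nn.functional as F

def train_step(model, batch_token_ids, optimizer):
    inputs = batch_token_ids[:, :-1]                 # (batch_size, seq_len - 1)
    labels = batch_token_ids[:, 1:]                  # the next token at every position
    logits = model(inputs)                           # (batch_size, seq_len - 1, vocab_size)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()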


Speaker: Kyle Lo

Base – Data
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

20
What is LM data?

21
Looking at the data

Lo, Bhagia, Lambert – Language Modeling Tutorial 22


Organized
hierarchically

Unstructured
text

Lo, Bhagia, Lambert – Language Modeling Tutorial 23


Even
structured
tasks as one
long string

Lo, Bhagia, Lambert – Language Modeling Tutorial 24


The data curation loop

acquire data

transform the data (data intervention)

run experiment (pretrain LM)

(Repeat per source)

Lo, Bhagia, Lambert – Language Modeling Tutorial 25


001 What is “good” data?

002 Data acquisition

003 Data transformation

004 Making good decisions

005 Takeaways

26
What is “good” data?

27
(figure: Scale, "Quality" (non-IID data), and Constraints (access))

Lo, Bhagia, Lambert – Language Modeling Tutorial 28


Data acquisition

29
The data curation loop

acquire data

transform the data


(data intervention)

run experiment
(pretrain LM)

Lo, Bhagia, Lambert – Language Modeling Tutorial 30


Public APIs & Bulk Dumps
Dataset (date) | Example LMs | Tokens | Sources
OSCAR (Jul 2019) | BLOOM (via ROOTS) | 1B | Common Crawl
C4 (Oct 2019) | T5, FLAN-T5 | 175B | Common Crawl
Pile (Dec 2020) | GPT-J, GPT-NeoX, Pythia | 387B | Common Crawl, arXiv, PubMed, Books3, Gutenberg, Wikipedia, etc.
The Stack v1 (Nov 2022) | StarCoder | 200B | Software Heritage
RedPajama v1 (Apr 2023) | INCITE | 1.2T | Common Crawl, C4, Github, arXiv, Gutenberg, Books3, Wikipedia, Internet Archive (Stack Exchange)
RefinedWeb (Jun 2023) | Falcon | 580B* | Common Crawl
Dolma (Aug 2023) | OLMo | 3.1T | Common Crawl, C4, Semantic Scholar, Pushshift Reddit, Gutenberg, the Stack, Wikipedia, Wikibooks
OpenWebMath (Oct 2023) | Llemma | 15B | Common Crawl
RedPajama v2 (Oct 2023) | - | 30T | Common Crawl
Amber (Dec 2023) | Amber | 1.3T | C4, RefinedWeb, the Stack, RedPajama v1
Dolma 1.7 (Apr 2024) | OLMo 0424 | 2.3T | Dolma, RefinedWeb, RP's StackExchange, Flan, OpenWebMath, …
FineWeb (May 2024) | - | 15T | Common Crawl
Matrix (May 2024) | MAP-Neo | 4.7T | RedPajama v2, Dolma, CulturaX, Amber, SlimPajama, Falcon, crawled Chinese web
DCLM-Baseline (Jun 2024) | DCLM | 4T | Common Crawl

Lo, Bhagia, Lambert – Language Modeling Tutorial 31
Largest contributors of data?
(same dataset table as above)

Lo, Bhagia, Lambert – Language Modeling Tutorial 32
Other major contributors of data?
(same dataset table as above)

Lo, Bhagia, Lambert – Language Modeling Tutorial 33
Data providers breakdown
● Web scrapers (80-100% of the data)
○ Internet Archive (1996), Common Crawl (2007), PushShift (2015),
Software Heritage (2016)

● User-provided content (<1%)


○ Wikipedia, arXiv

● Open publishers (<5%)


○ PubMed, Project Gutenberg, Semantic Scholar

Lo, Bhagia, Lambert – Language Modeling Tutorial 34


Datasets also reuse prior datasets
(same dataset table as above)

Lo, Bhagia, Lambert – Language Modeling Tutorial 35
What about crawling the data yourself?
(same dataset table as above)

Lo, Bhagia, Lambert – Language Modeling Tutorial 36
Crawling & Scraping

37
Broad & wide crawls are easiest to scale: Common Crawl, Internet Archive, Software Heritage, big tech company.

Domain-specific crawls are easiest to ensure quality: math exercises, code notebooks, Q&A forum posts, Stack Exchange.

38
hps://www.reddit.com/r/cats/comments/10dpv9p/yall_werent_kidding_when_you_said_cats_love_churus/ hps://www.reddit.com/r/cats/comments/ytcv0n/got_my_cat_a_gravity_feeder_why/
Lo, Bhagia, Lambert – Language Modeling Tutorial
How to get the content?
<p>My Title</p>.

<a onclick="postMyContent()">Click Me.</a>

<div id="contentDiv"></div>

“My Title. Click Me. Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua. Ut enim ad m…”

Lo, Bhagia, Lambert – Language Modeling Tutorial 39


Coaxing content from JS
requires site-specific logic

40
What websites to target?
Quality Volume Diiculty Coverage

example.com Highly curated. ~ 100,000 pages Full comment expansion


Substantial non-text information* 1,800 words/page makes slower. ~ 100 words /
sec **

example.org Highly curated. ~100,000 pages, 742 w/p ~ 186 words / sec

example.net Subjectively poorer. ~200,000 pages, 1,280 w/p ~ 256 words/sec

example.ai Walled; evidence of generated content Walled Non-web modality Walled

example.cat Highly curated. Reports 3MM books. Free range crawl Maybe not very crawlable?
Substantial non-text information, PDF

example.gov High variance (generally curated) Reports 2.5MM sites over 92 Free range crawl Single URLs vs. Linked Sites;
languages. High variance on highly parallelizable
words/doc.

example.xyz High variance (generally curated) 12,000 English Urls reported. Free range crawl Single URLs vs. Linked Sites;
High variance. highly parallelizable
Lo, Bhagia, Lambert – Language Modeling Tutorial 41
Broad & wide crawls are easiest to scale: Common Crawl, Internet Archive, Software Heritage, big tech company.

Domain-specific crawls are easiest to ensure quality: math exercises, code notebooks, Q&A forum posts, Stack Exchange.

42
hps://www.reddit.com/r/cats/comments/10dpv9p/yall_werent_kidding_when_you_said_cats_love_churus/ hps://www.reddit.com/r/cats/comments/ytcv0n/got_my_cat_a_gravity_feeder_why/
Lo, Bhagia, Lambert – Language Modeling Tutorial
Harder to get data via crawling

Longpre et al. 2024. Consent in Crisis: The Rapid Decline of the AI Data Commons. Data Provenance Initiative.
Lo, Bhagia, Lambert – Language Modeling Tutorial 43
Widening inequality in data access

Lo, Bhagia, Lambert – Language Modeling Tutorial 44


Data transformation

45
The data curation loop

acquire data

transform the data


(data intervention)

run experiment
(pretrain LM)

Lo, Bhagia, Lambert – Language Modeling Tutorial 46


Linearization

47
What does language model data look like?

hp://clizbeats.com/warner-music-group-strengthens-global-technology-data-expertise-9821/ Lo, Bhagia, Lambert – Language Modeling Tutorial 48


What does language model data look like?

hp://clizbeats.com/warner-music-group-strengthens-global-technology-data-expertise-9821/ Lo, Bhagia, Lambert – Language Modeling Tutorial 49


What does language model data look like?

hp://clizbeats.com/warner-music-group-strengthens-global-technology-data-expertise-9821/ Lo, Bhagia, Lambert – Language Modeling Tutorial 50


What does language model data look like?

hp://clizbeats.com/warner-music-group-strengthens-global-technology-data-expertise-9821/ Lo, Bhagia, Lambert – Language Modeling Tutorial 51


Poor linearization can be irrecoverable
Before After

Choppy: sentences split across many newlines. A lot of undesirable website content.
Before
.\" @(#)arithmetic.6 8.1 (Berkeley) 5/31/93 .\" $FreeBSD:
src/games/arithmetic/arithmetic.6,v 1.3 1999/08/27 23:28:52 peter Exp $ metadata
.\" $DragonFly: src/games/arithmetic/arithmetic.6,v 1.2 2003/06/17
04:25:22 dillon Exp $ .\" .TH ARITHMETIC 6 "May 31, 1993" .UC 4 .SH NAME
arithmetic \- quiz on simple arithmetic .SH SYNOPSIS .B arithmetic .B [ \-o
+\-x/ .B ] .B [ \-r range .B ] .SH DESCRIPTION .I Arithmetic asks you to solve
problems in simple arithmetic. Each question must be answered correctly
before going on to the next. After every 20 problems, it prints the score so
far and the time taken. You can quit at any time by typing the interrupt or
end-of-file character. .PP The options are as follows: .TP \-o By default, .I
arithmetic asks questions on addition of numbers from 0 to 10, and
corresponding subtraction. By supplying one or more of the characters .BR extra chars
+\-x/ , you can ask for problems in addition, subtraction, multiplication, and
division, respectively. If you give one of these characters more than once,
that kind of problem will be asked correspondingly more often. .TP \-r If a .I
range is supplied, .I arithmetic selects the numbers in its problems in the
following way. For addition and multiplication, the numbers to be added or
multiplied are between 0 and .IR range , inclusive. For subtraction and missing word
division, both the required result and the number to divide by or subtract
will be between 0 and .IR range . (Of course, .I arithmetic will not ask you to
divide by 0.) The default .I range is 10. .PP When you get a problem wrong, .I
arithmetic will remember the numbers involved, and will tend to select
those numbers more often than others, in problems of the same sort.
Eventually it will forgive and forget. .PP .I Arithmetic cannot be persuaded
to tell you the right answer. You must work it out for yourself. .SH
DIAGNOSTICS ``What?'' if you get a question wrong. ``Right!'' if you get it
Before: (the raw troff source shown above)

After:
The arithmetic command provides a quiz on simple arithmetic. Each question must be answered correctly before proceeding to the next. After every 20 problems, it displays the score and the time taken. You can quit at any time by typing the interrupt or end-of-file character.

The options are as follows:

- \-o: By default, arithmetic asks questions on addition of numbers from 0 to 10, and corresponding subtraction. By supplying one or more of the characters +\-x/, you can ask for problems in addition, subtraction, multiplication, and division, respectively. If you give one of these characters more than once, that kind of problem will be asked correspondingly more often.
- \-r: If a range is supplied, arithmetic selects the numbers in its problems in the following way. For addition and multiplication, the numbers to be added or multiplied are between 0 and the range, inclusive. For subtraction and division, both the required result and the number to divide by or subtract will be between 0 and the range. (Of course, arithmetic will not ask you to divide by 0.) The default range is 10.

When you get a problem wrong, arithmetic will remember the numbers involved and will tend to select those numbers more often than others in problems of the same sort. Eventually, it will forgive and forget.

Arithmetic cannot be persuaded to tell you the right answer. You must work it out for yourself.
What about PDFs & Scanned Docs?
Old Scanned Docs Using Classical OCR Pipeline

Christians behaving themselves like Ma borne- a . t >.


dans.3 ."5/0-

4. The natives soon had reason to suspect the viceroy,


viceroy’s sincerity in his expressions of regret
at the proceedings of which they complained. &»"«■'
For about this time the Dominican friars, under
pretence of building a. convent, erected a fortress
on the island of Sol or, which, as soon as
finished, the viceroy garrisoned with a strong
force. The natives' very naturally felt indig-
S nant at this additional encroachment, and took
every opportunity to aack the garrison. The
monks, forgetful/ of their peaceable profession,
took an active part in these skirmishes, and
many of tbg.tr fell sword in hand.

The i'lfinomedan faith has been appropriately


entitled., The religion of the sword,; and with
equal propriety may we so designate the re-
.■. i'gv.m of these belligerent friars. The Portu-
gu writers give an account of one of their
missionaries, Fernando Vinagre, who was as
prompt in the field of bale as at the baptismal
font. This man, though a secular priest, undertook
the command of a squadron that was
I sent to the assistance of the rajah of Tidore,4 on
which occasion he is said to have acted in the
twofold capacity of a great commander, and a
great apostle, at one time appearing in armour,
; at another in a surplice; and even occasionally,
baptizing the converts of his sword without puing
o his armour, but covering it with his
ecclesiastical vest. In this crusade5 he had two
3 Geddea History, &c., pp. 24—27.
Pudet hsec opprobria nobis
Vel dici potuisse.
* Called T a d u ra or D a c o , an island in the Indian Ocean,
one of the Moluccas
5 ‘ These a la D ra g o o n conversions.’ Geddes' History, p. 27.
Old Scanned Docs Using Classical OCR Pipeline

Christians behaving themselves like Ma borne- a . t >.


dans.3 ."5/0-

4. The natives soon had reason to suspect the viceroy,


viceroy’s sincerity in his expressions of regret
Random symbols
at the proceedings of which they complained. &»"«■'
For about this time the Dominican friars, under
pretence of building a. convent, erected a fortress
on the island of Sol or, which, as soon as
finished, the viceroy garrisoned with a strong
force. The natives' very naturally felt indig-
S nant at this additional encroachment, and took
every opportunity to aack the garrison. The
monks, forgetful/ of their peaceable profession,
took an active part in these skirmishes, and
many of tbg.tr fell sword in hand.

The i'lfinomedan faith has been appropriately Bogus words


entitled., The religion of the sword,; and with
equal propriety may we so designate the re-
.■. i'gv.m of these belligerent friars. The Portu-
gu writers give an account of one of their
missionaries, Fernando Vinagre, who was as
prompt in the field of bale as at the baptismal
font. This man, though a secular priest, undertook
the command of a squadron that was
I sent to the assistance of the rajah of Tidore,4 on
which occasion he is said to have acted in the
twofold capacity of a great commander, and a
great apostle, at one time appearing in armour,
; at another in a surplice; and even occasionally,
baptizing the converts of his sword without puing
o his armour, but covering it with his
ecclesiastical vest. In this crusade5 he had two Scrambled footnotes
3 Geddea History, &c., pp. 24—27.
Pudet hsec opprobria nobis
Vel dici potuisse.
* Called T a d u ra or D a c o , an island in the Indian Ocean,
one of the Moluccas
5 ‘ These a la D ra g o o n conversions.’ Geddes' History, p. 27.
Old Scanned Docs What we would like instead
Christians behaving themselves like Mahomedans.³ No more weird symbols
The natives soon had reason to suspect the viceroy’s sincerity in his
expressions of regret at the proceedings of which they complained.
For about this time the Dominican friars, under pretence of building a
convent, erected a fortress on the island of Solor, which, as soon as
finished, the viceroy garrisoned with a strong force. The natives very
naturally felt indignant at this additional encroachment, and took
every opportunity to aack the garrison. The monks, forgetful of their
peaceable profession, took an active part in these skirmishes, and
many of them fell sword in hand.

The Mahomedan faith has been appropriately entitled, *The religion of Proper OCR
the sword*; and with equal propriety may we so designate the religion
of these belligerent friars. The Portuguese writers give an account of despite bad
one of their missionaries, Fernando Vinagre, who was as prompt in the lighting
field of bale as at the baptismal font. This man, though a secular
priest, undertook the command of a squadron that was sent to the
assistance of the rajah of Tidore,⁴ on which occasion he is said to have
acted in the twofold capacity of a great commander, and a great
apostle, at one time appearing in armour, at another in a surplice; and
even occasionally, baptizing the converts of his sword without putting
off his armour, but covering it with his ecclesiastical vest. In this
crusade⁵ he had two

---
³ Geddes History, &c., pp. 24—27. Proper handling of footnotes
Pudet haec opprobria nobis
Vel dici potuisse.
⁴ Called *Tadure* or *Daco*, an island in the Indian Ocean, one of the
Moluccas
⁵ 'These a la Dragoon conversions.' Geddes' History, p. 27.
Is there a “best” linearization?
Filtering

62
Filter low-quality content

bufvc.ac.uk/allbufvc/search.php/item?q=Discussion
Lo, Bhagia, Lambert – Language Modeling Tutorial
hps://finance.yahoo.com 63
Filter low-quality content

Lo, Bhagia, Lambert – Language Modeling Tutorial 64


Filter undesirable content
Toxic / NSFW Personally identifiable information

hps://i.redd.it/r76y8e47qrvb1.jpg hps://www.reddit.com/r/interestingasfuck/comments/10slutr/the_cats_that_sailed_on_ships_until_the_mid20th/
Lo, Bhagia, Lambert – Language Modeling Tutorial 65
Filter duplicate data

Lo, Bhagia, Lambert – Language Modeling Tutorial 66


How much filtering?

CommonCrawl (175 TB) ≈ 65 × Dolma 1.7 (2.7 TB)

CommonCrawl (240T tokens) ≈ 65 × DCLM (3.8T tokens)

Lo, Bhagia, Lambert – Language Modeling Tutorial 67


The data curation loop

acquire data

transform the data


(data intervention)

run experiment
(pretrain LM)

Lo, Bhagia, Lambert – Language Modeling Tutorial 68


The data curation loop

acquire data

transform the data (data intervention):
Language filtering
Quality filtering
Safety filtering
Deduplication

run experiment (pretrain LM)

Lo, Bhagia, Lambert – Language Modeling Tutorial 69


Use small text classifiers for everything

Lo, Bhagia, Lambert – Language Modeling Tutorial 70


Use small text classifiers for everything

fastText
● 2,000 docs per second per CPU
● $0.04/hr ($8.5/hr for c7i instance, 192 cores)

BERT-Base
● 1,600 docs per second per H100
● $2.50/hr

Lo, Bhagia, Lambert – Language Modeling Tutorial 71


Use small text classifiers for everything

Lo, Bhagia, Lambert – Language Modeling Tutorial 72


Two text classifier ideologies

“Give me more like this”: Positive = Llama-labeled “Edu” content; Negative = Llama-labeled “non-Edu” content.
“Give me less like this”: Positive = Diverse set of “High Quality” docs; Negative = Randomly sampled Common Crawl.

Lo, Bhagia, Lambert – Language Modeling Tutorial 73
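
A minimal sketch of how such a fastText classifier might be trained and applied, assuming documents have already been written to a train.txt file with __label__hq / __label__lq prefixes; the file name, label strings, and hyperparameters are illustrative, not from a specific pipeline.

import fasttext

# train.txt lines look like: "__label__hq <document text>" or "__label__lq <document text>"
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)

def quality_score(document: str) -> float:
    # probability that the document looks like the positive ("hq") class
    labels, probs = model.predict(document.replace("\n", " "), k=2)
    return dict(zip(labels, probs)).get("__label__hq", 0.0)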


Label whole documents with small models

Lo, Bhagia, Lambert – Language Modeling Tutorial 74


Common format for all labeled output

Lo, Bhagia, Lambert – Language Modeling Tutorial 75


Common format for all labeled output
fastText comes with trained weights.
Dolma trained fastText classifier on Jigsaw.
Beer than cld2 and cld3

C4, Gopher used rules. Dolma, FineWeb, DCLM all use some
commonsense text heuristics.
FineWeb used Llama 70B to generate labels.
Distill into fastText.

DCLM trained fastText on OpenHermes + ELI5.


Lo, Bhagia, Lambert – Language Modeling Tutorial 76
Label passages with small models

Lo, Bhagia, Lambert – Language Modeling Tutorial 77


Common format for all labeled output

Lo, Bhagia, Lambert – Language Modeling Tutorial 78


Finding duplicates is also tagging

Bloom filter, MinHash, exact match, …

Lo, Bhagia, Lambert – Language Modeling Tutorial 79
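
A minimal sketch of what this tagging can look like, using MD5 for exact-match hashing and a crude MinHash-style signature over word 5-grams for near-duplicates; the signature size and shingle length are illustrative choices, not those of any particular pipeline.

import hashlib

def exact_hash(text: str) -> str:
    # hash of whitespace-normalized, lowercased text for exact-match dedup
    return hashlib.md5(" ".join(text.split()).lower().encode()).hexdigest()

def _h(seed: int, shingle: str) -> int:
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def minhash_signature(text: str, num_hashes: int = 64) -> list[int]:
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 5]) for i in range(max(1, len(words) - 4))}
    # one min-hash per seed; near-duplicate documents share many of these minima
    return [min(_h(seed, s) for s in shingles) for seed in range(num_hashes)]

def jaccard_estimate(sig_a: list[int], sig_b: list[int]) -> float:
    # fraction of matching signature positions approximates Jaccard similarity
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)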


Assembling the final dataset

“remove all documents with English score < 0.5 and >70% of lines are ungrammatical or are duplicated more than 20 times and…”

Lo, Bhagia, Lambert – Language Modeling Tutorial 80
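
A minimal sketch of applying such rules to tagged documents; the attribute names (english_score, bad_line_fraction, dup_count) and file names are hypothetical, not a specific pipeline's schema.

import json

def keep(doc: dict) -> bool:
    if doc["english_score"] < 0.5:
        return False
    if doc["bad_line_fraction"] > 0.70:   # e.g. fraction of ungrammatical lines
        return False
    if doc["dup_count"] > 20:
        return False
    return True

with open("tagged.jsonl") as src, open("final.jsonl", "w") as dst:
    for line in src:
        if keep(json.loads(line)):
            dst.write(line)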


Sneaky data issues

81
Side-eects of filters (e.g. “quality”)

Lo, Bhagia, Lambert – Language Modeling Tutorial 82


Side-eects of filters (e.g. “quality”)

Lo, Bhagia, Lambert – Language Modeling Tutorial 83


Side-eects of filters (e.g. deduplication)

hps://nationalstocksign.com/terms.php
hps://www.aximtrd.com/term-conditions

hps://www.lawinsider.com/clause/definitions
hps://www.picturesofengland.com/agreements/ExtendedLicence

Lo, Bhagia, Lambert – Language Modeling Tutorial 84


Trade-o between filter speed vs quality

20 docs/sec 5000 docs/sec

Subramani et al. 2023. Detecting Personal Information in Training Corpora: an Analysis. TrustNLP workshop at ACL.
Lo, Bhagia, Lambert – Language Modeling Tutorial 85
Coarse text is faster but can hide issues
Scientific in English

NSFW in Chinese

Lo, Bhagia, Lambert – Language Modeling Tutorial 86


Curation steps to Data Pipelines

87
Each data source requires its own pipeline

GitHub Code
Lo, Bhagia, Lambert – Language Modeling Tutorial 88
Each data source requires its own pipeline

Common Crawl Web Documents


Lo, Bhagia, Lambert – Language Modeling Tutorial 89
Making good decisions

90
The data curation loop

acquire data

transform the data


(data intervention)

run experiment
(pretrain LM)

Lo, Bhagia, Lambert – Language Modeling Tutorial 91


“Data ablations”

1B LMs / 150B tokens ~1 day on 64 A100 (40GB) w/ 400 Gbps

Lo, Bhagia, Lambert – Language Modeling Tutorial 92


Easier said than done…

93
Some data interventions hard to “test”
deduplication

No eect from
deduplication?

Lo, Bhagia, Lambert – Language Modeling Tutorial 94


Small models are cheap!

So lile noise!

Lo, Bhagia, Lambert – Language Modeling Tutorial 95


But not all configurations work for every task

Omg

Lo, Bhagia, Lambert – Language Modeling Tutorial 96


Emerging threads in data

97
Data curriculum actually works?

Stage 1: 4T tokens on this…

Lo, Bhagia, Lambert – Language Modeling Tutorial 98


Data curriculum actually works? Stage 2: 300B tokens on this…

Stage 1: 4T tokens on this…

Lo, Bhagia, Lambert – Language Modeling Tutorial 99


Data curriculum actually works? Stage 2: 300B tokens on this…

● Instruction data

Lo, Bhagia, Lambert – Language Modeling Tutorial 100


Data curriculum actually works? Stage 2: 300B tokens on this…

● Instruction data

● Subset of Stage 1 data

Lo, Bhagia, Lambert – Language Modeling Tutorial 101


Data curriculum actually works? Stage 2: 300B tokens on this…

● Instruction data

● Subset of Stage 1 data

● New data sources you


didn’t have during Stage 1

Lo, Bhagia, Lambert – Language Modeling Tutorial 102


Synthetic data actually works? Stage 2: 300B tokens on this…

● Instruction data

● Subset of Stage 1 data

● New data sources you


didn’t have during Stage 1

● “Synthetic” data

Lo, Bhagia, Lambert – Language Modeling Tutorial 103


Takeaways
1. What is “good” data?
a. Scale & Quality

Lo, Bhagia, Lambert – Language Modeling Tutorial 104


Takeaways
1. What is “good” data?
a. Scale & Quality

2. Data acquisition
a. Crawling is hard.
i. Broad → Scale VS Domain-specific → Quality. Scales with people, compute, time and $$$.
b. Use public bulk APIs. Support their eorts!

Lo, Bhagia, Lambert – Language Modeling Tutorial 105


Takeaways
1. What is “good” data?
a. Scale & Quality

2. Data acquisition
a. Crawling is hard.
i. Broad → Scale VS Domain-specific → Quality. Scales with people, compute, time and $$$.
b. Use public bulk APIs. Support their eorts!

3. Data transformation
a. Filtering, filtering, filtering.
b. Don’t forget about linearization and choice of text units.
c. Manually inspect your data often!

Lo, Bhagia, Lambert – Language Modeling Tutorial 106


Takeaways
1. What is “good” data?
a. Scale & Quality

2. Data acquisition
a. Crawling is hard.
i. Broad → Scale VS Domain-specific → Quality. Scales with people, compute, time and $$$.
b. Use public bulk APIs. Support their eorts!

3. Data transformation
a. Filtering, filtering, filtering.
b. Don’t forget about linearization and choice of text units.
c. Manually inspect your data often!

4. Good engineering maers!


a. Methods need to scale eiciently and generalize (and/or hire a lot of people).

Lo, Bhagia, Lambert – Language Modeling Tutorial 107


Takeaways
1. What is “good” data?
a. Scale & Quality

2. Data acquisition
a. Crawling is hard.
i. Broad → Scale VS Domain-specific → Quality. Scales with people, compute, time and $$$.
b. Use public bulk APIs. Support their eorts!

3. Data transformation
a. Filtering, filtering, filtering.
b. Don’t forget about linearization and choice of text units.
c. Manually inspect your data often!

4. Good engineering maers!


a. Methods need to scale eiciently and generalize (and/or hire a lot of people).

5. Making good decisions


a. Don’t rely on intuition. Experiment, experiment, experiment.
b. Pretraining matters. If my experiment “failed” … WHY?!
Lo, Bhagia, Lambert – Language Modeling Tutorial 108
Break
(Or catching up if behind)
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

109
Speaker: Akshita Bhagia

Pretraining
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

110
Pretraining

Goal: Equip the language model with general language capabilities through
self-supervised training on large amounts of unstructured text.

Lo, Bhagia, Lambert – Language Modeling Tutorial 111


Pretraining
As measured by
standard
evaluation
benchmarks

Goal: Equip the language model with general language capabilities through
self-supervised training on large amounts of unstructured text.

To be good at next word prediction. No specific input / output format.

Lo, Bhagia, Lambert – Language Modeling Tutorial 112


001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

113
001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

114
Transformer

Lo, Bhagia, Lambert – Language Modeling Tutorial 115


Transformer

Dubey, Abhimanyu et al. “The Llama 3 Herd of Models.” ArXiv abs/2407.21783 (2024).

Lo, Bhagia, Lambert – Language Modeling Tutorial 116


How do you configure a
transformer model?

117
Training Configurations
Config | A | B | C
d_model | 4096 | 4096 | 4544
n_heads | 32 | 32 | 71
n_layers | 32 | 32 | 32
mlp_ratio | 5.375 | ~6 | ??
ln type | RMSNorm | parametric | parametric
pos embeddings | rope | rope | rope
attention_ln (qk layernorm) | FALSE | FALSE | FALSE
multi query attention | FALSE | FALSE | TRUE
parallel blocks | FALSE | FALSE | TRUE
affine in layer norm | TRUE | TRUE | TRUE
bias in layer norm | FALSE | TRUE | TRUE
activation | swiglu | swiglu | GELU
sequence length | 4000 | 2048 | 2048
batch size - instances | 1024 | 2048 | 2304
batch size warmup | n/a | No | linear (30B tokens)
weight tying | FALSE | FALSE | FALSE
optimizer | adamw | adamw | adamw
init | megatron_full_init | mitch | (probably closer to megatron full init)
warmup | 2000 | 2000 | 4B tokens
peak lr | 3.00E-04 | 3.00E-04 | 6.00E-04
min lr | 3.00E-05 | 3.00E-05 | 1.20E-05
wd | 0.1 | 0.1 | 0.1
beta1 | 0.9 | 0.9 | 0.999
beta2 | 0.95 | 0.95 | 0.999
eps | 1.00E-05 | 1.00E-05 | 1.00E-05
schedule | cosine | cosine | cosine
grad clip | global 1 | global 1 | global 1
reduce | fp32 | fp32 | bf16
optimizer state | n/a | fp32 | fp32
z-loss | n/a | No | 1.00E-04

118
Training Configurations
(same configuration table as above, highlighting the size-and-shape rows)

119
Training Configurations
(same configuration table as above, highlighting the size-and-shape and input-representation rows)

120
Training Configurations
(same configuration table as above, highlighting size and shape, input representation, and how to optimize loss)

121
Models don’t always agree on best configs
(same configuration table as above)

122
Some “standard” choices
(same configuration table as above)

123
A mistake in pretraining can
cost up to millions of dollars…

124
Pre-training runs are costly

125
“Standard” practices

126
Size of the model
Given a fixed compute budget C, what model size do you train?

Lo, Bhagia, Lambert – Language Modeling Tutorial 127


Size of the model
Given a fixed compute budget C, what model size do you train?
● Estimate “optimal” model size (and the number of training tokens) using scaling laws

C ≈ 6ND, D ≈ 20N
“Performance depends strongly on scale, weakly on model shape”
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.”

The ratio of width to depth may depend on the domain


Henighan, Tom et al. “Scaling Laws for Autoregressive Generative Modeling.”

Lo, Bhagia, Lambert – Language Modeling Tutorial 128


Size of the model
Given a fixed compute budget C, what model size do you train?
● Estimate “optimal” model size (and the number of training tokens) using scaling laws

C ≈ 6ND, D ≈ 20N
“Performance depends strongly on scale, weakly on model shape”
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.”

The ratio of width to depth may depend on the domain


Henighan, Tom et al. “Scaling Laws for Autoregressive Generative Modeling.”

● Focus on improving inference-optimality


De Vries, Harm. "Go smol or go home." (2023).

Lo, Bhagia, Lambert – Language Modeling Tutorial 129
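
A worked example of the rule of thumb above: with C ≈ 6ND and D ≈ 20N, a fixed FLOP budget C gives N ≈ sqrt(C / 120). The budget plugged in below is just an illustration.

import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    # from C = 6 * N * (20 * N) = 120 * N^2
    n_params = math.sqrt(compute_flops / 120.0)
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n/1e9:.1f}B parameters, D ≈ {d/1e9:.0f}B tokens")   # ≈ 28.9B parameters, 577B tokens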


Aention variants

OLMo-7B Llama 3 - 8B Falcon-7B

Ainslie, Joshua et al. “GQA: Training Generalized Multi-Query


Transformer Models from Multi-Head Checkpoints.” Lo, Bhagia, Lambert – Language Modeling Tutorial 130
Rotary Position Embeddings (RoPE)
● Can capture longer sequences
● Higher theta values make the model more sensitive to positional
changes

Su, Jianlin et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” ArXiv abs/2104.09864 (2021): n. Pag.

Xiong, Wenhan et al. “Eective Long-Context Scaling of Foundation Models.” North American Chapter of the Association for
Computational Linguistics (2023).

Lo, Bhagia, Lambert – Language Modeling Tutorial 131
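
A minimal sketch of applying RoPE to a tensor of attention heads, following the RoFormer formulation; it assumes an even head dimension, and theta is the base frequency.

import torch

def rope(x: torch.Tensor, theta: float = 10_000.0) -> torch.Tensor:
    # x: (batch, seq_len, n_heads, head_dim), head_dim assumed even
    b, s, h, d = x.shape
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))    # (d/2,)
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]      # (s, d/2)
    cos = angles.cos()[None, :, None, :]                                         # broadcast over batch, heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]                                          # even/odd feature pairs
    # rotate each (x1, x2) pair by its position-dependent angle
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)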


SwiGLU activation

Swish activation + Gated Linear Unit (GLU)

Swish improves upon ReLU by providing a smoother transition around 0, which helps in optimization.
GLU is effective at capturing long-range dependencies while avoiding vanishing gradients.

Shazeer, Noam M.. “GLU Variants Improve Transformer.” ArXiv abs/2002.05202 (2020): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 132
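
A minimal sketch of a SwiGLU feed-forward block in PyTorch; the hidden size is left as a parameter (in practice it is usually chosen so the parameter count roughly matches a standard MLP).

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, hidden, bias=False)   # gate branch
        self.w_up = nn.Linear(d_model, hidden, bias=False)     # value branch
        self.w_down = nn.Linear(hidden, d_model, bias=False)   # project back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish (SiLU) gate multiplied elementwise with the linear branch
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))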


Other “standard” choices
● AdamW optimizer

● RMSNorm

● Cosine learning rate schedule

Lo, Bhagia, Lambert – Language Modeling Tutorial 133


Summary
● Search space of architecture choices and hyperparameters is large.

● Starting from common choices can save compute.

● These are not set in stone, still active areas of research.

Lo, Bhagia, Lambert – Language Modeling Tutorial 134


001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

135
What to look for?
How to determine if your model is training well?

● Loss convergence

Lo, Bhagia, Lambert – Language Modeling Tutorial 136


What to look for?
How to determine if your model is training well?

● Loss convergence

In-loop perplexity
● Language modeling fit (potentially on specific domains) evaluations

Magnusson, Ian et al. “Paloma: A Benchmark for Evaluating Language Model Fit.” ArXiv abs/2312.10523 (2023): n. Pag.
Note: Poster at NeurIPS on Friday at 4:30 pm!

Lo, Bhagia, Lambert – Language Modeling Tutorial 137


What to look for?
How to determine if your model is training well?

● Loss convergence

● Language modeling fit (potentially on specific domains)

Use a standard set of


● Downstream task performance benchmarks

Lo, Bhagia, Lambert – Language Modeling Tutorial 138


What to look for?
How to determine if your model is training well?

● Loss convergence

● Language modeling fit (potentially on specific domains)

Use a standard set of


● Downstream task performance benchmarks

Lo, Bhagia, Lambert – Language Modeling Tutorial 139


What to look for?
How to determine if your model is training well?

● Loss convergence

Is this enough?
● Language modeling fit (potentially on specific domains)

● Downstream task performance

Lo, Bhagia, Lambert – Language Modeling Tutorial 140


Consider this run …

Downstream performance
looks fine

hps://wandb.ai/ai2-llm/OLMo-7B/reports

Lo, Bhagia, Lambert – Language Modeling Tutorial 141


Will it continue to improve?

But gradient norm is spiky


… and steadily increasing

hps://wandb.ai/ai2-llm/OLMo-7B/reports
Takase, Sho et al. “Spike No More: Stabilizing the Pre-training of
Large Language Models.” (2023). Lo, Bhagia, Lambert – Language Modeling Tutorial 142
Spikes can indicate eventual divergence

For larger
models, spikes
can be an early
indicator of
model
divergence

Takase, Sho et al. “Spike No More: Stabilizing the Pre-training of Large Language Models.” ArXiv abs/2312.16903 (2023): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 143


Dierent types of spikes

Fast spike Slow spike

Lo, Bhagia, Lambert – Language Modeling Tutorial 144


Spikes

Fast spike Slow spike

Look at your data!

Lo, Bhagia, Lambert – Language Modeling Tutorial 145


Slow spikes
You want these to occur earlier in your training, so that you can intervene.

● Use higher learning rate at smaller model sizes

Wortsman, Mitchell et al. “Small-scale proxies for large-scale Transformer training instabilities.” ArXiv abs/2309.14322 (2023): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 146


Stability fix: Initialization
Make sure that your initialization has the following properties (without these, learning can become unstable):
● The scale of activations and gradients should remain roughly the same from layer to layer.
● The scale of activations and gradients should scale with the model width.

Use normal initialization!

Cowsik, Aditya et al. “Geometric Dynamics of Signal Propagation Predict Trainability of Transformers.” ArXiv abs/2403.02579 (2024): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 147


Stability fix: Layer Norms
Use RMSNorm (already standard practice)

Additionally,

● Use QK-Norm
● Change the order of the layer norm

Use these together!

Zhang, Biao and Rico Sennrich. “Root Mean Square Layer Normalization.” ArXiv abs/1910.07467 (2019): n. Pag.

Team, Chameleon. “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” ArXiv abs/2405.09818 (2024): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 148


Stability fix: No weight decay on embeddings
Guard against very small token embeddings

Do not decay embedding


weights

Lo, Bhagia, Lambert – Language Modeling Tutorial 149
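
A minimal sketch of how this is typically done with AdamW parameter groups, keeping embedding (and, commonly, norm and bias) parameters out of the weight-decay group; the grouping heuristic and hyperparameters below are illustrative.

import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, lr: float = 3e-4, wd: float = 0.1):
    decay, no_decay = [], []
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad:
                continue
            # keep embeddings, layer norms, and biases out of the weight-decay group
            if isinstance(module, (nn.Embedding, nn.LayerNorm)) or name.endswith("bias"):
                no_decay.append(param)
            else:
                decay.append(param)
    groups = [{"params": decay, "weight_decay": wd},
              {"params": no_decay, "weight_decay": 0.0}]
    return torch.optim.AdamW(groups, lr=lr, betas=(0.9, 0.95), eps=1e-5)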


Stability fix: AdamW epsilon

Be careful of defaults in your


training library!

Lo, Bhagia, Lambert – Language Modeling Tutorial 150


001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

151
Do more with less compute

Before the After the Use eicient


pretraining run pretraining run architectures

Lo, Bhagia, Lambert – Language Modeling Tutorial 152


Do more with less compute

Before the After the Use eicient


pretraining run pretraining run architectures

Lo, Bhagia, Lambert – Language Modeling Tutorial 153


Before the pretraining run

Run your experiments on smaller models first

Lo, Bhagia, Lambert – Language Modeling Tutorial 154


Run experiments on smaller models first
Find optimal hyperparameters using Maximal Update Parametrization (µP)

Yang, Greg et al. “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.” (2022)

Lo, Bhagia, Lambert – Language Modeling Tutorial 155


Run experiments on smaller models first
Predict the performance of larger models on downstream tasks

Take decisions about


data ablations

Gadre, Samir Yitzhak et al. “Language models scale reliably with over-training and on downstream tasks.” ArXiv abs/2403.08540 (2024): n.
pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 156


Run experiments on smaller models first
Predict the performance of larger models on downstream tasks

Bhagia, Akshita et al. “Establishing Task Scaling Laws via Compute-Eicient Model Ladders.” (2024).

Lo, Bhagia, Lambert – Language Modeling Tutorial 157


Before the pretraining run
Run your experiments on smaller models first

How to ensure that small model behavior will match the large model?

● Scaling laws for optimal batch size and learning rate

Porian, Tomer et al. “Resolving Discrepancies in Compute-Optimal Scaling of Language Models.” ArXiv abs/2406.19146 (2024): n. pag.

Lo, Bhagia, Lambert – Language Modeling Tutorial 158


Do more with less compute

Before the After the Use eicient


pretraining run pretraining run architectures

Lo, Bhagia, Lambert – Language Modeling Tutorial 159


Annealing the learning rate

(figure: learning-rate schedule over training)
● Warmup to the peak learning rate (3e-4) over the first <10B tokens
● Cosine decay from 3e-4 to 5e-5 over trillions of tokens
● Linear decay to 0 over the final ~50B tokens
LR → 0 is all you need?

Lo, Bhagia, Lambert – Language Modeling Tutorial 163
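
A minimal sketch of the schedule drawn above: linear warmup to the peak, cosine decay to a floor, then a final linear anneal to zero; the token counts below are illustrative defaults, not prescriptions.

import math

def lr_at(tokens: float, peak: float = 3e-4, floor: float = 5e-5,
          warmup: float = 10e9, total: float = 2e12, anneal: float = 50e9) -> float:
    if tokens < warmup:                       # linear warmup to the peak
        return peak * tokens / warmup
    if tokens < total - anneal:               # cosine decay from peak to floor
        frac = (tokens - warmup) / (total - anneal - warmup)
        return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * frac))
    # final linear anneal from the floor down to zero
    return floor * max(0.0, (total - tokens) / anneal)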


Annealing + Curriculum
● Curriculum training: inject new knowledge / capabilities
● Assess data quality

Llama 3
Do more with less compute

Before the After the Use eicient


pretraining run pretraining run architectures

Lo, Bhagia, Lambert – Language Modeling Tutorial 165


Mixture of experts (MoE)

Lo, Bhagia, Lambert – Language Modeling Tutorial 166


Mixture of experts (MoE)

(figure: MoE vs. dense models)

Note: Learn more about OLMoE at the ESNLP workshop at NeurIPS on Saturday!

Lo, Bhagia, Lambert – Language Modeling Tutorial 167


001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

168
Using hardware eectively
Goal: maximize the number of tokens processed per second (TPS) without loss of
model performance

Faster training enables more experimentation, since it eectively increases the


size of your cluster.

Lo, Bhagia, Lambert – Language Modeling Tutorial 169


Training parallelism
● Data parallelism

● Model parallelism

● Tensor parallelism

● Pipeline parallelism

Lo, Bhagia, Lambert – Language Modeling Tutorial 170


Training parallelism
● Data parallelism

● Model parallelism
In practice, use FSDP …

● Tensor parallelism

● Pipeline parallelism

Lo, Bhagia, Lambert – Language Modeling Tutorial 171


Training parallelism
● Data parallelism

● Model parallelism
In practice, use FSDP …

● Tensor parallelism

… but ensure that your global


● Pipeline parallelism batch size is not too large

Lo, Bhagia, Lambert – Language Modeling Tutorial 172
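
A minimal sketch of wrapping a model with PyTorch FSDP, assuming the job is launched with torchrun so a process group can be initialized and one GPU is available per rank; real setups also configure sharding and auto-wrap policies.

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap_model(model: torch.nn.Module) -> FSDP:
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # shard parameters, gradients, and optimizer state across data-parallel ranks
    return FSDP(model.cuda())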


Use available optimizations
● FlashAention

● torch.compile Ensure that your model code is


simple

Flash aention library: hps://github.com/Dao-AILab/flash-aention


torch.compile manual

Lo, Bhagia, Lambert – Language Modeling Tutorial 173


Training Parallelism
(figure) 750 TPS → 5,000 TPS with improved data parallelism → 10,000+ TPS with improved pipeline parallelism
174
Garbage collection

Manual GC!

Lo, Bhagia, Lambert – Language Modeling Tutorial 175


Asynchronous bookkeeping
The training loop does other things besides learning model weights.

1. Monitoring the health of the run requires logging a lot of metrics.

This can cause slow-downs in the distributed seing.

2. Saving checkpoints

As the size of the model increases, checkpointing can become a boleneck.

Solution: Use a separate backend for such bookkeeping tasks

Lo, Bhagia, Lambert – Language Modeling Tutorial 176


OLMo-core library makes these options easily configurable
hps://github.com/allenai/OLMo-core

Lo, Bhagia, Lambert – Language Modeling Tutorial 177


001 Architecture choices

002 The health of the pretraining run

003 Do more with less compute

004 Using hardware eectively

005 Takeaways

178
Takeaways
1. Minimize the things you need to worry about

2. Hardware bugs can impact your model

3. Throughput maers

4. Data and pretraining are intertwined

Lo, Bhagia, Lambert – Language Modeling Tutorial 179


“We oer no explanation as to
why these architectures seem
to work; we aribute their
success, as all else, to divine
benevolence.” - SwiGLU paper

180
Break
(Or catching up if behind)
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

181
Speaker: Nathan Lambert

Adaptation
(Post-training)
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)

182
Language model adaptation
The raw pre-trained LMs are neither safe nor robust for public use and interactions,
thus require “alignment” between AI and humans.

Follow natural language instructions

Be aware of harmful behaviors

Respond according to human


preference

Improve core skills

183
Initial approaches to modern post-training
ChatGPT blog post:

We trained this model using


Reinforcement Learning from Human
Feedback (RLHF), using the same
methods as InstructGPT, but with
slight dierences in the data collection
setup.

Ouyang et al. 2022. InstructGPT.


Lo, Bhagia, Lambert – Language Modeling Tutorial 184
Initial approaches to modern post-training

From: hps://www.interconnects.ai/p/frontier-model-post-training
Lo, Bhagia, Lambert – Language Modeling Tutorial 185
Initial approaches to modern post-training
Three stage approach:
1. Instruction tune base model.
2. Collect preference data & train reward model.
3. Fine-tune with RL.

Focus on general chat capabilities (this was new at the time!)

Lo, Bhagia, Lambert – Language Modeling Tutorial 186


Current frontier model post-training
Complex process for:
● Addressing many capabilities and evaluations.
● Leveraging synthetic data and scaled human data
pipelines.

Lo, Bhagia, Lambert – Language Modeling Tutorial 187


Current frontier model post-training

Dubey, Abhimanyu, et al. 2024. Llama 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 188
Current frontier model post-training

Adler, Bo, et al. 2024. Nemotron-4 340B.


Lo, Bhagia, Lambert – Language Modeling Tutorial 189
Current frontier model post-training

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 190
Two eras of adaptation pipelines

From: hps://www.interconnects.ai/p/frontier-model-post-training
Lo, Bhagia, Lambert – Language Modeling Tutorial 191
Current frontier model post-training
Three training objectives are most popular:
1. Supervised Finetuning – teach formaing and for base of
instruction following abilities.
2. Preference Finetuning – align to human preferences (and
smaller bump in capabilities).
3. Reinforcement Finetuning – final stage to boost
performance on verifiable tasks.

Lo, Bhagia, Lambert – Language Modeling Tutorial 192


Adaptation
Outline
1. Background & History
2. Prompts & Skill Selection
3. Supervised Finetuning (SFT) /
Instruction Finetuning (IFT)
4. Preference Finetuning (PreFT)
5. RL and advanced tuning
6. Open questions

193
Geing the ingredients to start
post-training
Successful adaptation starts with:
1. Meaningful evaluations for targeted skills, and
2. Prompts of representative queries for said skills.

Lo, Bhagia, Lambert – Language Modeling Tutorial 194


Geing the ingredients to start
post-training: Evaluation
Post-training with modern
language models can target:
● Specialized models (0-3
skills): e.g. Math / Code
models
● General models (many
skills): e.g. Instruct models
Example evaluation set from Tülu 3 general adapted models.
Unseen evaluations used to test generalization.

Lambert, Nathan et al. 2024. Tülu 3. Lo, Bhagia, Lambert – Language Modeling Tutorial 195
Geing the ingredients to start
post-training: Prompts
All post-training stages require prompts in distribution of tasks.
Example prompt budget:

● Supervised Finetuning – ~1 million.


● Preference Finetuning – ~1 million, partial overlap with SFT can be useful.
● Reinforcement Finetuning – ~10 - 100 thousand (data less available)
● Large variance on these numbers is possible.

Lo, Bhagia, Lambert – Language Modeling Tutorial 196


Adaptation
Outline
1. Background & History
2. Prompts & Skill Selection
3. Supervised Finetuning (SFT) /
Instruction Finetuning (IFT)
4. Preference Finetuning (PreFT)
5. RL and advanced tuning
6. Open questions

197
The role of instruction tuning
Accomplishes two primary tasks:
1. Adapt base model to specific style of input for chat interactions.
2. Ability to include system prompts, multi-turn dialogues, and other chat
templates.
A very large proportion of post-training gains come from the
SFT stage.

Lo, Bhagia, Lambert – Language Modeling Tutorial 198


The role of instruction tuning

Accomplishes two primary tasks:


1. Adapt base model to specific style of input for chat interactions.
2. Ability to include system prompts, multi-turn dialogues, and other
chat templates.

<|system|>
You’re a helpful agent          ← system prompt
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}

(<|system|>, <|user|>, <|assistant|>, and <|end|> are special tokens)

Lo, Bhagia, Lambert – Language Modeling Tutorial 199
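To make the template above concrete, a minimal sketch using the Hugging Face tokenizer's apply_chat_template. The checkpoint name is a placeholder, and the exact special tokens rendered depend on that model's own chat template.

```python
from transformers import AutoTokenizer

# Placeholder checkpoint name -- any chat-tuned model with a chat template works here.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")

messages = [
    {"role": "system", "content": "You're a helpful agent"},
    {"role": "user", "content": "What makes a transformer a transformer?"},
]

# Renders the special-token template (system/user/assistant markers) and appends
# the assistant header so the model knows where to start generating.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```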


Example instruction

Stack Overflow: “What makes a transformer a transformer?”, nbro 2021

Lo, Bhagia, Lambert – Language Modeling Tutorial 200


Key idea: Self-instruct / synthetic data
Start: N high-quality (often human) prompts

1. Ask a strong LM: create a modified version of these instructions.
2. Generate completions with another (or the same) strong LM.

End: easily 10x more (synthetic) training data!

(synthetic data = text generated by another LLM)

Wang et al. 2022. Self-Instruct.
Taori et al. 2023. Alpaca.


Lo, Bhagia, Lambert – Language Modeling Tutorial 201
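A minimal sketch of the self-instruct loop described above. The generate helper, the base checkpoint, the seed prompts, and the prompt wording are illustrative assumptions; the actual Self-Instruct and Alpaca pipelines also include filtering and deduplication steps omitted here.

```python
from transformers import pipeline

# Placeholder generator; in practice this would be a much stronger instruction-tuned LM.
generator = pipeline("text-generation", model="gpt2")

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    out = generator(prompt, max_new_tokens=max_new_tokens, do_sample=True)
    # Strip the prompt from the returned text to keep only the completion.
    return out[0]["generated_text"][len(prompt):]

# Seed prompts: a small set of high-quality (often human-written) instructions.
seed_prompts = [
    "Explain the difference between supervised and unsupervised learning.",
    "Write a short poem about attention heads.",
]

synthetic_data = []
for seed in seed_prompts:
    # 1) Ask the LM to produce a modified version of the instruction.
    new_instruction = generate(
        f"Create a modified version of this instruction:\n{seed}\nNew instruction:"
    )
    # 2) Generate a completion for the new instruction (same or another strong LM).
    completion = generate(new_instruction)
    synthetic_data.append({"instruction": new_instruction, "output": completion})
```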
First open chat tuned models
Alpaca – 13 Mar. 2023
● 52k self-instruct style data distilled from text-davinci-003
● Model weight diff. to LLaMA 7B
https://crfm.stanford.edu/2023/03/13/alpaca.html

Vicuna (lmsys/vicuna-7b-delta-v0) – 30 Mar. 2023
● Fine-tunes ChatGPT data from ShareGPT
● LLaMA 7B and 13B diff’s
● Introduces LLM-as-a-judge
https://lmsys.org/blog/2023-03-30-vicuna/

Koala – 3 Apr. 2023
● Diverse dataset (Alpaca, Anthropic HH, ShareGPT, WebGPT…)
● Human evaluation
● LLaMA 7B diff.
https://bair.berkeley.edu/blog/2023/04/03/koala/

Dolly – 12 Apr. 2023
● 15k human written data
● Trained on Pythia 12b
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

Lo, Bhagia, Lambert – Language Modeling Tutorial 202


SFT design process
Two repeated and parallelizable tracks:
1. Data mixing: Take existing datasets, combine them with existing mix,
observe performance.
a. Substantial eort in trying to remove data and maintain
performance.
b. Start fully with mixing before curation.
2. Data curation: Take evaluations you are behind on and create new
data.

Lo, Bhagia, Lambert – Language Modeling Tutorial 203


Building SFT data
The simpler part of SFT data is the “quality” of responses:

● Synthetic completions are used extensively. Strong models (GPT-4o, Llama 3.1 405B, etc.) are becoming more useful for generating completions to most instructions.
● Human data is needed for out-of-distribution or new tasks.
● [Optionally] Filter responses based on quality or correctness.

Largely undocumented is how to control “style” during SFT.

Lo, Bhagia, Lambert – Language Modeling Tutorial 204


Adaptation
Outline
1. Background & History
2. Prompts & Skill Selection
3. Supervised Finetuning (SFT) /
Instruction Finetuning (IFT)
4. Preference Finetuning (PreFT)
5. RL and advanced tuning
6. Open questions

205
The role of preference finetuning (PreFT)
Aligning to human preferences gives:
● Stronger training influence on style and chat evaluations
(e.g. ChatBotArena).
● Continued building of the skills from SFT, but with a lower
absolute magnitude of improvements.

Lo, Bhagia, Lambert – Language Modeling Tutorial 206


RLHF objective
π: LLM policy
πθ: base LLM
x: prompt
y: completion

▲ Optimize “reward” inspired by human preferences
▲ Constrain the model to not trust the reward too much (preferences are hard to model)

Primary questions:
1. How to implement reward: r(x,y)
2. How to optimize reward

Lo, Bhagia, Lambert – Language Modeling Tutorial 209
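For reference, a standard written form of the RLHF objective using the slide's notation; β, the KL penalty weight, is an assumed symbol not defined on the slide.

```latex
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}
\big[\, r(x, y) \,\big]
\;-\;
\beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\, \pi(y \mid x) \,\|\, \pi_{\theta}(y \mid x) \,\big]
```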


Preference (reward) modeling
Can we just use supervised learning on scores?

● Assigning a scalar reward of how good a response is did not work
● Pairwise preferences are easy to collect and worked!

[Figure: prompt → chosen completion vs. rejected completion → score from the optimal reward model]

Key idea: probability ∝ reward
Bradley-Terry model: estimate the probability that a given pairwise preference is true

Lo, Bhagia, Lambert – Language Modeling Tutorial 210
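Written out, the Bradley-Terry preference probability and the resulting pairwise reward-model loss (a standard formulation; σ denotes the logistic sigmoid, not defined on the slide):

```latex
p\big(y_{\mathrm{chosen}} \succ y_{\mathrm{rejected}} \mid x\big)
  = \sigma\!\big(r(x, y_{\mathrm{chosen}}) - r(x, y_{\mathrm{rejected}})\big)
\qquad
\mathcal{L}_{\mathrm{RM}}
  = -\log \sigma\!\big(r(x, y_{\mathrm{chosen}}) - r(x, y_{\mathrm{rejected}})\big)
```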


What if we just use gradient ascent on this equation?

The answer, with some math, is:


Direct Preference Optimization (DPO)

Released on May 29th 2023

Rafailov, Sharma, Mitchell et al. 2023

Lo, Bhagia, Lambert – Language Modeling Tutorial 212


DPO core facts
1. Extremely simple to implement.
2. Scales nicely with existing distributed training libraries.
3. Trains an implicit reward function.

DPO is easier to iterate on, but slightly underperforms online RL methods on absolute potential.

[Figure: example code from the paper]
Rafailov, Sharma, Mitchell et al. 2023

Lo, Bhagia, Lambert – Language Modeling Tutorial 213
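As a rough illustration of point 1, a minimal PyTorch sketch of the DPO loss. This is not the paper's reference code; the β default and the per-sequence log-probability inputs are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # summed log-probs of chosen completions under the policy
    policy_rejected_logps: torch.Tensor,  # same for rejected completions
    ref_chosen_logps: torch.Tensor,       # log-probs under the frozen reference (SFT) model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # assumed default; tune per setup
) -> torch.Tensor:
    # Implicit rewards are beta-scaled log-ratios between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected under a Bradley-Terry model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```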


DPO model proliferation
DPO gave us:
1. A faster, easier to implement baseline for iterating on
post-training.
2. A massive proliferation of preference finetuning
models and methods in the open.
Both continue through today.

Lo, Bhagia, Lambert – Language Modeling Tutorial 214


DPO model proliferation
HuggingFace H4’s Zephyr Beta
● First model to make a splash with DPO!
● Fine-tune of Mistral 7B with UltraFeedback dataset.
● Discovered the surprisingly low learning rates that are now standard (~5E-7).

UltraFeedback: https://arxiv.org/abs/2310.01377
Model: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta Lo, Bhagia, Lambert – Language Modeling Tutorial 215
DPO model proliferation
Allen AI’s Tülu 2 70B
● First to scale DPO to 70B
parameters.
● State-of-the-art open model on
external benchmarks.
● Open models began to match and
surpass GPT-4.

Ivison et al. 2023. Tulu 2.


Lo, Bhagia, Lambert – Language Modeling Tutorial 216
DPO vs RL (PPO, REINFORCE, …)
PPO consistently outperforms
DPO, but at the cost of:
● Implementation complexity
● Memory usage, and
● Throughput

Normally can get a ~1% improvement from switching from DPO to PPO

Lo, Bhagia, Lambert – Language Modeling Tutorial 218


DPO vs RL (PPO, REINFORCE, …)
● DPO and PPO are very different optimizers.
● The distinction is learning directly from preferences vs. using RL update rules.
● It is also not really online vs. offline RL, but that distinction is more muddled.

More discussion:
https://twitter.com/srush_nlp/status/1729896568956895370,
https://www.interconnects.ai/p/the-dpo-debate,
https://www.youtube.com/watch?v=YJMCSVLRUNs

Lo, Bhagia, Lambert – Language Modeling Tutorial 219


Human preferences vs LLM-as-a-judge
Both sources of preference data are used extensively.
In frontier labs:
● Human data used extensively as foundation of PreFT.
● Synthetic data used to enhance behaviors (e.g. Constitutional AI).

In open research:
● Synthetic data dominates due to price.
○ One LLM-as-a-judge label costs <1 cent.
○ One human datapoint costs $5-20.

Lo, Bhagia, Lambert – Language Modeling Tutorial 220


Human preferences vs LLM-as-a-judge

Human data: High noise, low bias.


Synthetic data: Low noise, high bias.

Lo, Bhagia, Lambert – Language Modeling Tutorial 221


Leading synthetic preference data
method: UltraFeedback
Key aspects:
● Diverse model pool for
completions.
● Diverse prompt pool.
● Extra performance comes from using on-policy generations from SFT checkpoint(s)

Cui, Ganqu, et al. 2024. UltraFeedback.


Lo, Bhagia, Lambert – Language Modeling Tutorial 222
Adaptation
Outline
1. Background & History
2. Prompts & Skill Selection
3. Supervised Finetuning (SFT) /
Instruction Finetuning (IFT)
4. Preference Finetuning (PreFT)
5. RL and advanced tuning
6. Open questions

223
RL finetuning
Reinforcement learning as a training objective for more than just human preferences:
● OpenAI’s o1 and related models trained with "large-scale RL" for reasoning
● Finetuning based on verifiable outputs:
○ Tülu 3’s Reinforcement Learning with Verifiable Rewards (RLVR) or
○ OpenAI’s Reinforcement Finetuning API
○ Extensive research in specific domains: Code verification, VinePPO for
math, Quiet STaR, etc.

Lo, Bhagia, Lambert – Language Modeling Tutorial 224


Implementing RL finetuning

Lo, Bhagia, Lambert – Language Modeling Tutorial 226


Start with:
Standard RLHF

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 227
Standard RLHF (with RL details)

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 228
Adding verifiable rewards
Can we reward the policy for
being correct?

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 229
RL with verifiable rewards details
1. We do not use a reward model! Just “environment” reward.
2. Value model initialized from an SFT / reward model checkpoint, not a random init.

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 230
Ground Truth RL

Lambert, Nathan et al. 2024. Tülu 3.


Lo, Bhagia, Lambert – Language Modeling Tutorial 231
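To make “environment reward” concrete, a minimal sketch of a verifiable reward function for math-style prompts. The answer-extraction pattern and reward values are illustrative assumptions, not the exact Tülu 3 / RLVR implementation.

```python
import re

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 only if the completion's final answer matches the ground truth.

    The 'answer is' extraction pattern below is an illustrative assumption.
    """
    match = re.search(r"answer is:?\s*(-?\d+(?:\.\d+)?)", completion.lower())
    if match is None:
        return 0.0
    return 1.0 if match.group(1) == ground_truth.strip() else 0.0
```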
RL finetuning vs. classical RL
RL with verifiable rewards, and all forms of RL finetuning, are extremely close to the
classical definition of RL: reward is given for correct behavior.
The main innovation is in the stability of the implementation, without degradation
of out-of-domain capabilities.

Lo, Bhagia, Lambert – Language Modeling Tutorial 232


The big picture of RL finetuning
● In many ways this is very aligned with Yann LeCun’s cake metaphor.
○ Foundation / pretraining is unsupervised learning.
○ SFT / PreFT is supervised learning.
○ RL is cherry on top to finalize behavior.
● The boundary between post-training and other behaviors in
language models is changing

Lo, Bhagia, Lambert – Language Modeling Tutorial 233


Adaptation
Outline
1. Background & History
2. Prompts & Skill Selection
3. Supervised Finetuning (SFT) /
Instruction Finetuning (IFT)
4. Preference Finetuning (PreFT)
5. RL and advanced tuning
6. Open questions

234
Open questions in post-training
Methods and practices common in frontier laboratories but understudied in
academic research:

1. How to use methods like rejection sampling.
2. How to train a good reward model.
3. The importance of human preference data vs. LLM-as-a-judge.
4. Style/character training of LMs.

Lo, Bhagia, Lambert – Language Modeling Tutorial 238


Conclusions

239
Research Still Needed
● Science of LMs
● Extend LMs Beyond Text
● Use LMs in Real World
● Improve LMs
● LMs for Science
● LM Agents
● Build Next Generation of LMs
● LMs for Health
● Planning
● Test-time Inference
● Mitigate LMs Risk and Biases
● Efficient Models
240
Training language models is a TON of
details!
Stand on the shoulders of past research.
hps://github.com/allenai/awesome-open-lms

Lo, Bhagia, Lambert – Language Modeling Tutorial 241


[email protected]

… and many more (ordered arbitrarily)


Questions?
Try models built on this: https://playground.allenai.org/
Resources: https://github.com/allenai/awesome-open-lms

243
Extra, hidden RLHF
slides

244
State of open recipes for fine-tuning
No models in the top 60 of
LMSYS ChatBotArena with
open fine-tuning data.
We can change this!
(As of Dec. 9th, 2024)

Lo, Bhagia, Lambert – Language Modeling Tutorial 245


How we are improving MATH
Creating high-quality, general math questions:
● Prompt GPT-4o with “synthetic personas” to come up with
advanced math problems
● Filter top responses with Qwen 2 72B Math RM

PersonaHub:
hps://github.com/tencent-ailab/persona-hub

Lo, Bhagia, Lambert – Language Modeling Tutorial 246
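A minimal sketch of the filtering step, assuming a reward_model callable that scores (prompt, response) pairs; this is illustrative, not the actual Qwen 2 72B Math RM interface.

```python
from typing import Callable, List

def filter_top_responses(
    prompt: str,
    responses: List[str],
    reward_model: Callable[[str, str], float],  # placeholder scoring function, e.g. a math RM
    keep: int = 1,
) -> List[str]:
    # Rank candidate responses by reward-model score and keep the best ones.
    ranked = sorted(responses, key=lambda r: reward_model(prompt, r), reverse=True)
    return ranked[:keep]
```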


Example Persona prompt

Lo, Bhagia, Lambert – Language Modeling Tutorial 247


Be careful with implementation!
● By now most people know there was a bug in HuggingFace gradient accumulation
● We had found that avoiding this bug improved our models by >= 1 point on average

hps://unsloth.ai/blog/gradient

Lo, Bhagia, Lambert – Language Modeling Tutorial 248


What training looks like
● Very, very low KL distance relative to standard RLHF
● Easy to saturate models on evals before RL stage → unclear how to sequence
training

Lo, Bhagia, Lambert – Language Modeling Tutorial 249


Incremental post-training vs. distillation
Reality of open fine-tuning:
● Distilling from a stronger model, such as GPT-4 or
Llama 405B is a substantial shortcut in
post-training.
● We have almost no progress on creating powerful
models without distillation.
Hope to address this soon.

Lo, Bhagia, Lambert – Language Modeling Tutorial 250


Tulu 1, Jun. 2023: Best open instruction tuning data
● Mixing and analyzing open datasets for stronger instruction-tuned models
  ○ Human + synthetic data is best!
● Understanding limitations of open instruction-tuned models

[Callout: best recipe for instruction data]

Lo, Bhagia, Lambert – Language Modeling Tutorial 251


Eective SFT dataset examples

252
RLHF phase: SteerLM & Starling
Still plenty of models showing that PPO (and RL methods) outperforms DPO!

● SteerLM: Aribute conditioned fine-tuning


● Starling: Introduced new preference dataset, Nectar, and k-wise reward model loss function (i.e. moving beyond
pairwise preferences)
○ MT Bench 7B: 8.09 (beat every model except GPT-4 at the time)

SteerLM: https://huggingface.co/nvidia/SteerLM-llama2-13B
Starling: https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha Lo, Bhagia, Lambert – Language Modeling Tutorial 253
Inference with a language model

[Diagram: text to tensor → tokenizer & model → tensor to text]

Lo, Bhagia, Lambert – Language Modeling Tutorial 254
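A minimal end-to-end sketch of this flow with the Hugging Face transformers API; the checkpoint name is a placeholder and any causal LM on the Hub follows the same steps.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in any causal LM.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Text to tensor
inputs = tokenizer("Language models are", return_tensors="pt")

# Model forward pass / generation
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=20)

# Tensor to text
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```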


Lo, Bhagia, Lambert – Language Modeling Tutorial 255
