(10 December 2024, NeurIPS) Tutorial On Language Modeling
Open Models vs. Closed API Models
Are we done with scientific research on LMs?
The goal of this tutorial is to build foundational understanding for LM research.
Outline:
1. Introduction (~5min)
2. Data (~40min)
3. Break (~5min)
4. Pretraining (~40min)
5. Break (~5min)
6. Post-training (~40min)
7. Conclusions & Q/A (~15min)
Prerequisites
We’re assuming you are comfortable with:
● Training ML models
○ e.g., “learning rate schedulers”, “AdamW”, “batch size”, “transformers”
● Core LM concepts
○ e.g., “next word prediction”, “tokenization”, “sequence length”
● PyTorch*
*Treat our code snippets like pseudocode; no guarantees they will run!
Data tensor shapes
● Tokenizer: text becomes a tensor of dimension (batch_size, seq_len, embedding_dim) via the input embeddings
● Model: output becomes a tensor of dimension (batch_size, seq_len, …)
● argmax over the last dimension picks token IDs, which the tokenizer decodes back into text
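To make the shapes concrete, here is a toy PyTorch sketch (tiny illustrative dimensions, not a real model; treat it like the other snippets here):

```python
import torch
import torch.nn as nn

# Toy sizes for illustration only; real models are far larger.
batch_size, seq_len, vocab_size, embedding_dim = 2, 8, 1000, 64

token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))  # from the tokenizer
embed = nn.Embedding(vocab_size, embedding_dim)
x = embed(token_ids)               # (batch_size, seq_len, embedding_dim)

# Stand-in for the transformer stack: project to logits over the vocabulary.
lm_head = nn.Linear(embedding_dim, vocab_size)
logits = lm_head(x)                # (batch_size, seq_len, vocab_size)

next_ids = logits.argmax(dim=-1)   # greedy choice; the tokenizer decodes these back to text
```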
Base – Data
What is LM data?
Looking at the data: unstructured text
(The data curation loop: acquire data → run experiment / pretrain LM)
What is “good” data?
What makes data “good”? It is non-IID, and curation trades off scale, “quality”, access, and constraints.
The data curation loop: acquire data → run experiment (pretrain LM)
Broad & wide crawls are easiest to scale; domain-specific crawls are easiest to ensure quality.
Image sources: https://www.reddit.com/r/cats/comments/10dpv9p/yall_werent_kidding_when_you_said_cats_love_churus/ and https://www.reddit.com/r/cats/comments/ytcv0n/got_my_cat_a_gravity_feeder_why/
How to get the content?
<p>My Title</p>.
<div id="contentDiv"></div>
“My Title. Click Me. Lorem ipsum dolor sit amet, consectetur
adipiscing elit, sed do eiusmod tempor incididunt ut labore et
dolore magna aliqua. Ut enim ad m…”
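As a sketch of this extraction step, here is one way to flatten HTML to text with BeautifulSoup (one tool among many; the tutorial does not prescribe an extractor, and the HTML below is the toy snippet from the slide):

```python
from bs4 import BeautifulSoup

html = """
<p>My Title</p>
<a href="#">Click Me</a>
<div id="contentDiv">Lorem ipsum dolor sit amet, consectetur adipiscing elit...</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Drop non-content tags, then flatten whatever text remains.
for tag in soup(["script", "style"]):
    tag.decompose()
text = " ".join(soup.get_text(separator=" ").split())
print(text)  # "My Title Click Me Lorem ipsum dolor sit amet, ..."
```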
What websites to target?

Site | Quality | Volume | Difficulty | Coverage
example.org | Highly curated. | ~100,000 pages, 742 words/page | ~186 words/sec | —
example.cat | Highly curated. | Reports 3MM books. | Free-range crawl | Maybe not very crawlable? Substantial non-text information, PDF
example.gov | High variance (generally curated) | Reports 2.5MM sites over 92 languages. High variance on words/doc. | Free-range crawl | Single URLs vs. linked sites; highly parallelizable
example.xyz | High variance (generally curated) | 12,000 English URLs reported. High variance. | Free-range crawl | Single URLs vs. linked sites; highly parallelizable
Harder to get data via crawling
Longpre et al. 2024. Consent in Crisis: The Rapid Decline of the AI Data Commons. Data Provenance Initiative.
Widening inequality in data access
What does language model data look like?
DIAGNOSTICS ``What?'' if you get a question wrong. ``Right!'' if you get it for yourself.
What about PDFs & Scanned Docs?
Old scanned docs using a classical OCR pipeline. (Annotations: proper OCR despite bad lighting; proper handling of footnotes.)

“The Mahomedan faith has been appropriately entitled, *The religion of the sword*; and with equal propriety may we so designate the religion of these belligerent friars. The Portuguese writers give an account of one of their missionaries, Fernando Vinagre, who was as prompt in the field of battle as at the baptismal font. This man, though a secular priest, undertook the command of a squadron that was sent to the assistance of the rajah of Tidore,⁴ on which occasion he is said to have acted in the twofold capacity of a great commander, and a great apostle, at one time appearing in armour, at another in a surplice; and even occasionally, baptizing the converts of his sword without putting off his armour, but covering it with his ecclesiastical vest. In this crusade⁵ he had two
---
³ Geddes History, &c., pp. 24—27.
Pudet haec opprobria nobis
Vel dici potuisse.
⁴ Called *Tadure* or *Daco*, an island in the Indian Ocean, one of the Moluccas
⁵ 'These a la Dragoon conversions.' Geddes' History, p. 27.”
Is there a “best” linearization?
Filtering
Filter low-quality content
Examples: bufvc.ac.uk/allbufvc/search.php/item?q=Discussion, https://finance.yahoo.com, https://i.redd.it/r76y8e47qrvb1.jpg, https://www.reddit.com/r/interestingasfuck/comments/10slutr/the_cats_that_sailed_on_ships_until_the_mid20th/
Filter duplicate data
The data curation loop, expanded: acquire data → transform the data (data interventions: language filtering, quality filtering, safety filtering, deduplication) → run experiment (pretrain LM)
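For the deduplication step, a minimal sketch of exact deduplication via hashing (real pipelines add fuzzy methods such as MinHash, which this omits):

```python
import hashlib

# Drop documents whose whitespace/case-normalized text hashes collide.
def dedupe(docs):
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello  world", "hello world", "Different doc"]
print(dedupe(docs))  # ['Hello  world', 'Different doc']
```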
fastText
● 2,000 docs per second per CPU
● $0.04/hr ($8.5/hr for c7i instance, 192 cores)
BERT-Base
● 1,600 docs per second per H100
● $2.50/hr
Two example training setups for quality classifiers:
● Positive: Llama-labeled “Edu” content; Negative: Llama-labeled “non-Edu” content
● Positive: diverse set of “High Quality” docs; Negative: randomly sampled Common Crawl
C4 and Gopher used rules. Dolma, FineWeb, and DCLM all use some commonsense text heuristics.
FineWeb used Llama 70B to generate labels, then distilled them into fastText.
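A minimal sketch of the distill-into-fastText recipe, assuming a hypothetical train.txt in fastText's supervised format and a hypothetical 0.5 keep threshold:

```python
import fasttext

# train.txt lines look like: "__label__hq <document text>"
# (labels distilled from an LLM judge, per the slide).
model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

labels, probs = model.predict("My cat loves gravity feeders.", k=1)
keep = labels[0] == "__label__hq" and probs[0] > 0.5  # illustrative threshold
```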
Side-effects of filters (e.g. “quality”)
Examples: https://nationalstocksign.com/terms.php, https://www.aximtrd.com/term-conditions, https://www.lawinsider.com/clause/definitions, https://www.picturesofengland.com/agreements/ExtendedLicence
Subramani et al. 2023. Detecting Personal Information in Training Corpora: an Analysis. TrustNLP workshop at ACL.
Coarse text filtering is faster but can hide issues
e.g., scientific content in English alongside NSFW content in Chinese.
Each data source requires its own pipeline (e.g., GitHub code).
Some data interventions are hard to “test”
● Deduplication: no effect from deduplication? So little noise!
Data curriculum actually works?
● Instruction data
● “Synthetic” data
2. Data acquisition
a. Crawling is hard.
i. Broad → scale vs. domain-specific → quality. Scales with people, compute, time, and $$$.
b. Use public bulk APIs. Support their efforts!
3. Data transformation
a. Filtering, filtering, filtering.
b. Don’t forget about linearization and choice of text units.
c. Manually inspect your data often!
Speaker: Akshita Bhagia
Pretraining
Goal: Equip the language model with general language capabilities through self-supervised training on large amounts of unstructured text.
Architecture choices
Transformer
Dubey, Abhimanyu et al. “The Llama 3 Herd of Models.” ArXiv abs/2407.21783 (2024).
Training Configurations
Size, shape, and input representation:
Config | A | B | C
d_model | 4096 | 4096 | 4544
n_heads | 32 | 32 | 71
n_layers | 32 | 32 | 32
mlp_ratio | 5.375 | ~6 | ??
ln type | RMSNorm | parametric | parametric
affine in layer norm | TRUE | TRUE | TRUE
bias in layer norm | FALSE | TRUE | TRUE
pos embeddings | RoPE | RoPE | RoPE
attention_ln (qk layernorm) | FALSE | FALSE | FALSE
multi-query attention | FALSE | FALSE | TRUE
parallel blocks | FALSE | FALSE | TRUE
activation | SwiGLU | SwiGLU | GELU
sequence length | 4000 | 2048 | 2048
batch size (instances) | 1024 | 2048 | 2304
batch size warmup | n/a | no | linear (30B tokens)
weight tying | FALSE | FALSE | FALSE

How to optimize the loss:
Config | A | B | C
optimizer | AdamW | AdamW | AdamW
init | megatron_full_init | mitchell | (probably closer to megatron full init)
warmup | 2000 | 2000 | 4B tokens
peak lr | 3.00E-04 | 3.00E-04 | 6.00E-04
min lr | 3.00E-05 | 3.00E-05 | 1.20E-05
schedule | cosine | cosine | cosine
wd | 0.1 | 0.1 | 0.1
beta1 | 0.9 | 0.9 | 0.999
beta2 | 0.95 | 0.95 | 0.999
eps | 1.00E-05 | 1.00E-05 | 1.00E-05
grad clip | global 1 | global 1 | global 1
reduce | fp32 | fp32 | bf16
optimizer state | n/a | fp32 | fp32
z-loss | n/a | no | 1.00E-04
Models don’t always agree on best configs
Some “standard” choices
A mistake in pretraining can cost up to millions of dollars…
Pre-training runs are costly
“Standard” practices
Size of the model
Given a fixed compute budget C, what model size do you train?
C ≈ 6ND, D ≈ 20N, where N is the number of parameters and D the number of training tokens.
“Performance depends strongly on scale, weakly on model shape”
Kaplan, Jared et al. “Scaling Laws for Neural Language Models.”
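A toy calculation under these rules of thumb (the budget value is illustrative, not from the tutorial):

```python
# Solve 6 * N * D = C with D = 20 N  =>  N = sqrt(C / 120).
C = 1e23                      # FLOPs budget (illustrative)
N = (C / 120) ** 0.5          # ≈ 2.9e10 parameters (~29B)
D = 20 * N                    # ≈ 5.8e11 tokens (~577B)
print(f"N ≈ {N:.2e} params, D ≈ {D:.2e} tokens")
```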
● RoPE positional embeddings
● SwiGLU activation
● RMSNorm
Su, Jianlin et al. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” ArXiv abs/2104.09864 (2021): n. pag.
Xiong, Wenhan et al. “Effective Long-Context Scaling of Foundation Models.” North American Chapter of the Association for Computational Linguistics (2023).
Shazeer, Noam M. “GLU Variants Improve Transformer.” ArXiv abs/2002.05202 (2020): n. pag.
What to look for?
How do you determine if your model is training well?
● Loss convergence
● In-loop perplexity: language modeling fit (potentially on specific domains) evaluations
Magnusson, Ian et al. “Paloma: A Benchmark for Evaluating Language Model Fit.” ArXiv abs/2312.10523 (2023): n. pag.
Note: Poster at NeurIPS on Friday at 4:30 pm!
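For reference, in-loop perplexity is just the exponentiated per-token cross-entropy; a minimal sketch:

```python
import torch
import torch.nn.functional as F

# ppl = exp(mean negative log-likelihood per token) — a standard identity.
def perplexity(logits, targets):
    # logits: (batch, seq_len, vocab), targets: (batch, seq_len)
    loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return torch.exp(loss)
```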
● Loss convergence — is this enough?
● Language modeling fit (potentially on specific domains)
● Downstream performance looks fine
https://wandb.ai/ai2-llm/OLMo-7B/reports
Spikes can indicate eventual divergence
For larger models, spikes can be an early indicator of model divergence.
Takase, Sho et al. “Spike No More: Stabilizing the Pre-training of Large Language Models.” ArXiv abs/2312.16903 (2023): n. pag.
Wortsman, Mitchell et al. “Small-scale proxies for large-scale Transformer training instabilities.” ArXiv abs/2309.14322 (2023): n. pag.
Use normal initialization!
Cowsik, Aditya et al. “Geometric Dynamics of Signal Propagation Predict Trainability of Transformers.” ArXiv abs/2403.02579 (2024): n. pag.
Additionally:
Zhang, Biao and Rico Sennrich. “Root Mean Square Layer Normalization.” ArXiv abs/1910.07467 (2019): n. pag.
Team, Chameleon. “Chameleon: Mixed-Modal Early-Fusion Foundation Models.” ArXiv abs/2405.09818 (2024): n. pag.
Do more with less compute
How do you ensure that small model behavior will match the large model?
Yang, Greg et al. “Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer.” (2022).
Gadre, Samir Yitzhak et al. “Language models scale reliably with over-training and on downstream tasks.” ArXiv abs/2403.08540 (2024): n. pag.
Bhagia, Akshita et al. “Establishing Task Scaling Laws via Compute-Efficient Model Ladders.” (2024).
Porian, Tomer et al. “Resolving Discrepancies in Compute-Optimal Scaling of Language Models.” ArXiv abs/2406.19146 (2024): n. pag.
Annealing the learning rate: warm up over the first <10B tokens to a peak LR (e.g. 3e-4), decay over trillions of tokens to ~5e-5, then anneal linearly to 0 over the final ~50B tokens.
LR → 0 is all you need?
Llama 3
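A sketch of the pictured schedule: linear warmup to a peak, cosine decay to a minimum, then a linear anneal to zero at the end (all constants illustrative):

```python
import math

def lr(step, total=100_000, warmup=2_000, anneal=5_000,
       peak=3e-4, minimum=5e-5):
    if step < warmup:                            # linear warmup
        return peak * step / warmup
    decay_end = total - anneal
    if step < decay_end:                         # cosine decay to minimum
        t = (step - warmup) / (decay_end - warmup)
        return minimum + 0.5 * (peak - minimum) * (1 + math.cos(math.pi * t))
    return minimum * (total - step) / anneal     # linear anneal to 0
```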
Do more with less compute: MoE vs. dense models.
Note: Learn more about OLMoE at the ESNLP workshop at NeurIPS on Saturday!
Using hardware effectively
Goal: maximize the number of tokens processed per second (TPS) without loss of model performance.
● Model parallelism
● Tensor parallelism
● Pipeline parallelism
In practice, use FSDP… (a minimal sketch below)
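A minimal FSDP wrap, assuming torch.distributed is already initialized (e.g. via torchrun); real setups also configure a wrapping policy, mixed precision, etc.:

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # stand-in model
model = FSDP(model.cuda())      # shards params, grads, and optimizer state
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
```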
Throughput: 5,000 TPS vs. 750 TPS. Manual GC!
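A sketch of the manual-GC trick (the interval and training-step function are illustrative): disable Python's automatic collector and collect on a fixed schedule, so pauses hit all ranks at the same step:

```python
import gc

gc.disable()                       # no surprise collections mid-step
total_steps = 100_000              # illustrative
for step in range(total_steps):
    train_step()                   # hypothetical training-step function
    if step % 1000 == 0:
        gc.collect()               # collect at a predictable cadence
```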
2. Saving checkpoints
Takeaways
1. Minimize the things you need to worry about
3. Throughput matters
Break
(Or catching up if behind)
Speaker: Nathan Lambert
Adaptation
(Post-training)
Language model adaptation
Raw pre-trained LMs are neither safe nor robust for public use and interaction; they require “alignment” between AI and humans.
Initial approaches to modern post-training
ChatGPT blog post:
From: https://www.interconnects.ai/p/frontier-model-post-training
Initial approaches to modern post-training
Three stage approach:
1. Instruction tune base model.
2. Collect preference data & train reward model.
3. Fine-tune with RL.
From: https://www.interconnects.ai/p/frontier-model-post-training
Current frontier model post-training
Three training objectives are most popular:
1. Supervised Finetuning – teach formatting and form the base of instruction-following abilities.
2. Preference Finetuning – align to human preferences (with a smaller bump in capabilities).
3. Reinforcement Finetuning – final stage to boost performance on verifiable tasks.
Getting the ingredients to start post-training
Successful adaptation starts with:
1. Meaningful evaluations for targeted skills, and
2. Prompts of representative queries for said skills.
Lambert, Nathan et al. 2024. Tülu 3.
Getting the ingredients to start post-training: Prompts
All post-training stages require prompts in the distribution of target tasks.
Example prompt budget:
The role of instruction tuning
Accomplishes two primary tasks:
1. Adapt base model to specific style of input for chat interactions.
2. Ability to include system prompts, multi-turn dialogues, and other chat
templates.
A very large proportion of post-training gains come from the SFT stage.
<|system|>
You’re a helpful agent            ← system prompt
<|end|>
<|user|>
{query}
<|end|>
<|assistant|>{Answer goes here}
(Special tokens: <|system|>, <|user|>, <|assistant|>, <|end|>)
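A sketch of rendering this template in Python (the special tokens are the ones above, not any particular model's):

```python
def render_chat(query: str, system: str = "You're a helpful agent") -> str:
    # The model is trained to continue generating after <|assistant|>.
    return (
        f"<|system|>\n{system}\n<|end|>\n"
        f"<|user|>\n{query}\n<|end|>\n"
        f"<|assistant|>"
    )

prompt = render_chat("What is SFT?")
```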
Koala (3 Apr. 2023)
● Diverse dataset (Alpaca, Anthropic HH, ShareGPT, WebGPT…)
● Human evaluation
● LLaMA 7B diff.
https://bair.berkeley.edu/blog/2023/04/03/koala/

Dolly (12 Apr. 2023)
● 15k human-written data
● Trained on Pythia 12b
https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
The role of preference finetuning (PreFT)
Aligning to human preferences gives:
● Stronger training influence for style and chat evaluations
(e.g. ChatBotArena).
● Continue building capabilities of skills from SFT, but lower
absolute magnitude of improvements.
Bradley–Terry model: estimate the probability that a given pairwise preference (chosen over rejected completion) is true, with probability ∝ reward.
UltraFeedback: https://arxiv.org/abs/2310.01377
Model: https://huggingface.co/HuggingFaceH4/zephyr-7b-beta
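A minimal Bradley–Terry reward-model loss in PyTorch (a standard formulation, not code from Zephyr or Tülu):

```python
import torch
import torch.nn.functional as F

# p(chosen ≻ rejected) = sigmoid(r_chosen - r_rejected);
# training maximizes its log-likelihood.
def bt_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # r_chosen / r_rejected: scalar rewards per pair from a reward model.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```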
DPO model proliferation
Allen AI’s Tülu 2 70B
● First to scale DPO to 70B
parameters.
● State-of-the-art open model on
external benchmarks.
● Open models began to match and
surpass GPT-4.
More discussion:
https://twitter.com/srush_nlp/status/1729896568956895370,
https://www.interconnects.ai/p/the-dpo-debate,
https://www.youtube.com/watch?v=YJMCSVLRUNs
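For reference, a compact sketch of the DPO objective (Rafailov et al. 2023), assuming precomputed summed log-probs per response under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Implicit reward margin between chosen and rejected responses,
    # measured relative to the reference model; beta is the usual temperature.
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()
```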
In open research:
● Synthetic data dominates due to price.
○ One LLM-as-a-judge label costs <1 cent.
○ One human datapoint costs $5-20.
RL finetuning
Reinforcement learning as a training objective other than just for
human preferences:
● OpenAI’s o1 and related models trained with "large-scale RL" for reasoning
● Finetuning based on verifiable outputs:
○ Tülu 3’s Reinforcement Learning with Verifiable Rewards (RLVR; sketched below) or
○ OpenAI’s Reinforcement Finetuning API
○ Extensive research in specific domains: Code verification, VinePPO for
math, Quiet STaR, etc.
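A toy verifiable-reward function in the spirit of RLVR (the answer-extraction pattern is hypothetical, not Tülu 3's actual implementation):

```python
import re

def verifiable_reward(completion: str, gold: str) -> float:
    # Reward 1 if the extracted final answer matches ground truth, else 0.
    match = re.search(r"answer is\s*(.+)", completion, re.IGNORECASE)
    pred = match.group(1).strip().rstrip(".") if match else completion.strip()
    return 1.0 if pred == gold else 0.0

print(verifiable_reward("... so the answer is 42.", "42"))  # 1.0
```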
Open questions in post-training
Methods and practices common in frontier laboratories but understudied in
academic research:
Research Still Needed
● Science of LMs
● Improve LMs
● Build the next generation of LMs
● Extend LMs beyond text
● LMs for science
● LMs for health
● Use LMs in the real world
● LM agents
● Planning
Extra, hidden RLHF slides
State of open recipes for fine-tuning
No models in the top 60 of LMSYS ChatBotArena with open fine-tuning data. We can change this! (As of Dec. 9th, 2024)
PersonaHub: https://github.com/tencent-ailab/persona-hub
https://unsloth.ai/blog/gradient
RLHF phase: SteerLM & Starling
Still plenty of models showing that PPO (and other RL methods) can outperform DPO!
SteerLM: https://huggingface.co/nvidia/SteerLM-llama2-13B
Starling: https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha
Inference with a language model
Tokenizer & model: text → tensor, tensor → text.
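A minimal end-to-end sketch with Hugging Face transformers (gpt2 as a stand-in checkpoint; decoding settings illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tokenizer("Language models are", return_tensors="pt")   # text → tensor
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # tensor → text
```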