Lecture 4
Content
• Word prediction task
• Language modeling (N-grams)
– N-gram introduction
– The chain rule
– Model evaluation
– Smoothing
Word Prediction
Applications
• It turns out that being able to predict the next word
(or any linguistic unit) in a sequence is an
extremely useful thing to be able to do.
• As we’ll see, it lies at the core of the following
applications
– Automatic speech recognition
– Handwriting and character recognition
– Spelling correction
– Machine translation
– And many more.
Counting
• Simple counting lies at the core of any
probabilistic approach. So let’s first take a
look at what we’re counting.
– He stepped out into the hall, was delighted to
encounter a water brother.
• 13 tokens, 15 if we include “,” and “.” as separate
tokens.
• Assuming we include the comma and period, how
many bigrams are there?
Counting
• Not always that simple
– I do uh main- mainly business data processing
• Spoken language poses various challenges.
– Should we count “uh” and other fillers as tokens?
– What about the repetition of “mainly”? Should such do-overs
count twice or just once?
– The answers depend on the application.
• If we’re focusing on something like ASR to support indexing for
search, then “uh” isn’t helpful (it’s not likely to occur as a query).
• But filled pauses are very useful in dialog management, so we might
want them there.
Counting: Types and Tokens
• How about
– They picnicked by the pool, then lay back on
the grass and looked at the stars.
• 18 tokens (again counting punctuation)
• But we might also note that “the” is used 3
times, so there are only 16 unique types (as
opposed to tokens).
• Going forward, we’ll have occasion to
focus on counting both types and tokens of
both words and N-grams.
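To make the type/token distinction concrete, here is a minimal counting sketch in Python (the regex tokenizer that splits off punctuation is just one reasonable choice, not something prescribed by the lecture):

    import re

    sentence = ("They picnicked by the pool, then lay back on "
                "the grass and looked at the stars.")
    # Split off punctuation marks as separate tokens, as in the count above.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    types = set(tokens)

    print(len(tokens))  # 18 tokens (including "," and ".")
    print(len(types))   # 16 types ("the" occurs 3 times)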
Counting: Wordforms
• Should “cats” and “cat” count as the same
when we’re counting?
• How about “geese” and “goose”?
• Some terminology:
– Lemma: a set of lexical forms having the same
stem, major part of speech, and rough word
sense
– Wordform: fully inflected surface form
• Again, we’ll have occasion to count both
lemmas and wordforms
Counting: Corpora
• So what happens when we look at large bodies of text
instead of single utterances?
• Brown et al. (1992): a large corpus of English text
– 583 million wordform tokens
– 293,181 wordform types
• Google
– Crawl of 1,024,908,267,229 English tokens
– 13,588,391 wordform types
• That seems like a lot of types... After all, even large dictionaries of English have
only around 500k types. Why so many here?
• Numbers
• Misspellings
• Names
• Acronyms
• etc
Language Modeling
• Back to word prediction
• We can model the word prediction task as
the ability to assess the conditional
probability of a word given the previous
words in the sequence
– P(wn|w1,w2…wn-1)
• We’ll call a statistical model that can assess
this a Language Model
Language Modeling
• How might we go about calculating such a
conditional probability?
– One way is to use the definition of conditional
probabilities and look for counts. So to get
– P(the | its water is so transparent that)
• By definition that’s
P(its water is so transparent that the) / P(its water is so transparent that)
We can get each of those from counts in a large
corpus.
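A sketch of that counting idea, assuming the corpus is available as one big whitespace-normalized string (the function name is just illustrative, and raw substring matching is admittedly crude):

    def conditional_estimate(corpus, history, word):
        # Estimate P(word | history) as count(history word) / count(history),
        # using raw substring counts (crude: ignores word boundaries).
        denom = corpus.count(history)
        numer = corpus.count(history + " " + word)
        return numer / denom if denom else 0.0

    # usage: conditional_estimate(corpus, "its water is so transparent that", "the")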
Very Easy Estimate
• How to estimate?
– P(the | its water is so transparent that)
Very Easy Estimate
• According to Google, those counts give an estimate of 5/9.
– Unfortunately... 2 of those hits were from these
slides... So maybe it’s really
– 3/7
– In any case, that’s not terribly convincing due
to the small numbers involved.
Language Modeling
• Unfortunately, for most sequences and for
most text collections we won’t get good
estimates from this method.
– What we’re likely to get is 0. Or, worse, 0/0.
• Clearly, we’ll have to be a little more clever.
– Let’s use the chain rule of probability
– And a particularly useful independence
assumption.
The Chain Rule
P(A ∧ B) = P(A | B) P(B)
• For sequences...
– P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
• In general
– P(x1,x2,x3,…xn) = P(x1)P(x2|x1)P(x3|x1,x2)…P(xn|x1…xn-1)
The Chain Rule
• Applying it to our running example:
– P(its water is so transparent) =
P(its) × P(water | its) × P(is | its water) ×
P(so | its water is) × P(transparent | its water is so)
Unfortunately
• The chain rule alone doesn’t help much:
– The later terms condition on nearly the whole
sentence, so we’d still need counts for long
prefixes we’re unlikely to ever see.
Independence Assumption
• Make the simplifying assumption
– P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | a)
• Or maybe
– P(lizard | the,other,day,I,was,walking,along,and,saw,a) = P(lizard | saw,a)
• That is, we assume the probability in question
depends only on the most recent word or two, not
on the rest of the earlier history.
Independence Assumption
• This kind of independence assumption is called a
Markov assumption, after the Russian
mathematician Andrey Markov.
Markov Assumption
P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
Bigram version
P(wn | w1…wn-1) ≈ P(wn | wn-1)
Estimating Bigram Probabilities
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An Example
• <s> I am Sam </s>
• <s> Sam I am </s>
• <s> I do not like green eggs and ham </s>
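A small sketch of the maximum-likelihood bigram estimates for this toy corpus (plain Python; the helper names are just for illustration):

    from collections import Counter

    corpus = [
        "<s> I am Sam </s>",
        "<s> Sam I am </s>",
        "<s> I do not like green eggs and ham </s>",
    ]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    def p(word, prev):
        # MLE: count(prev, word) / count(prev)
        return bigrams[(prev, word)] / unigrams[prev]

    print(p("I", "<s>"))     # 2/3
    print(p("am", "I"))      # 2/3
    print(p("Sam", "am"))    # 1/2
    print(p("</s>", "Sam"))  # 1/2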
Maximum Likelihood Estimates
• The maximum likelihood estimate of some parameter of a
model M from a training set T
– Is the estimate that maximizes the likelihood of the training set T given
the model M
• Suppose the word Chinese occurs 400 times in a corpus of a
million words (Brown corpus)
• What is the probability that a random word from some other
text from the same distribution will be “Chinese”?
• The MLE estimate is 400/1,000,000 = .0004
– This may be a bad estimate for some other corpus
• But it is the estimate that makes it most likely that “Chinese”
will occur 400 times in a million-word corpus.
Berkeley Restaurant Project Sentences
Bigram Counts
• Out of 9222 sentences
– E.g., “I want” occurred 827 times
Bigram Probabilities
• Divide bigram counts by prefix unigram
counts to get probabilities.
Bigram Estimates of Sentence Probabilities
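As a rough sketch of the computation: under a bigram model, a sentence’s probability is the product of its bigram probabilities, usually accumulated in log space to avoid underflow (the bigram_prob argument is assumed to be an estimator like the p function sketched above):

    import math

    def sentence_logprob(tokens, bigram_prob):
        # tokens should include the <s> and </s> boundary markers.
        logp = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            prob = bigram_prob(word, prev)
            if prob == 0.0:
                return float("-inf")  # an unseen bigram zeroes the whole sentence
            logp += math.log(prob)
        return logp

    # usage: sentence_logprob("<s> i want english food </s>".split(), bigram_prob)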
Kinds of Knowledge
As crude as they are, N-gram probabilities capture
a range of interesting facts about language.
• P(english|want) = .0011
• P(chinese|want) = .0065
• P(to|want) = .66
• P(eat | to) = .28
• P(food | to) = 0
• P(want | spend) = 0
• P (i | <s>) = .25
Evaluation
• How do we know if our models are any
good?
– And in particular, how do we know if one
model is better than another?
• Well, Shannon’s game gives us an intuition.
– The generated texts from the higher order
models sure look better. That is, they sound
more like the text the model was obtained from.
– But what does that mean? Can we make that
notion operational?
Evaluation
• Standard method
– Train the parameters of our model on a training set.
– Look at the model’s performance on some new data
• This is exactly what happens in the real world; we want to know
how our model performs on data we haven’t seen
– So use a test set: a dataset that is different from our
training set but drawn from the same source
– Then we need an evaluation metric to tell us how well
our model is doing on the test set.
• One such metric is perplexity
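Perplexity is the inverse probability of the test set, normalized by the number of words: PP(W) = P(w1 w2 … wN)^(-1/N), so lower perplexity means a better model. A minimal sketch, reusing the sentence_logprob helper from the earlier example (the word-counting convention here is one common choice, not the only one):

    import math

    def perplexity(test_sentences, bigram_prob):
        # PP(W) = P(w1 ... wN)^(-1/N), computed via log probabilities.
        total_logprob = 0.0
        total_words = 0
        for tokens in test_sentences:
            total_logprob += sentence_logprob(tokens, bigram_prob)
            total_words += len(tokens) - 1  # count transitions, not the leading <s>
        return math.exp(-total_logprob / total_words)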
Unknown Words
• But once we start looking at test data, we’ll
run into words that we haven’t seen before
(pretty much regardless of how much
training data we have).
• With an Open Vocabulary task
– Create an unknown word token <UNK>
– Training of <UNK> probabilities
• Create a fixed lexicon L, of size V
– From a dictionary or
– A subset of terms from the training set
• At the text normalization phase, any training word not in L is
changed to <UNK>
• Now we count that like a normal word
– At test time
• Use UNK counts for any word not in training
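A minimal sketch of that normalization step, assuming the lexicon L is taken to be the most frequent training wordforms (the vocabulary-size cutoff is just illustrative):

    from collections import Counter

    def build_lexicon(training_tokens, vocab_size):
        # Fixed lexicon L: the vocab_size most frequent training wordforms.
        counts = Counter(training_tokens)
        return {word for word, _ in counts.most_common(vocab_size)}

    def normalize(tokens, lexicon):
        # Map anything outside L to <UNK>; applied to the training data
        # (so <UNK> gets counts) and again to the test data.
        return [tok if tok in lexicon else "<UNK>" for tok in tokens]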
Zero Counts
• Back to Shakespeare
– Recall that Shakespeare produced 300,000 bigram
types out of V² = 844 million possible bigrams...
– So, 99.96% of the possible bigrams were never seen
(have zero entries in the table)
– Does that mean that any sentence that contains one of
those bigrams should have a probability of 0?
Laplace-Smoothed Bigram Counts
Laplace-Smoothed Bigram Probabilities
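The recipe behind these two slides is add-one (Laplace) smoothing: pretend every bigram occurred once more than it actually did, giving P(wi | wi-1) = (count(wi-1, wi) + 1) / (count(wi-1) + V), where V is the vocabulary size. A minimal sketch, reusing the counters from the toy example above:

    def p_laplace(word, prev, bigrams, unigrams):
        # Add-one smoothing: every bigram gets a pseudo-count of 1,
        # so the denominator grows by the vocabulary size V.
        V = len(unigrams)
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

    # e.g. p_laplace("Sam", "am", bigrams, unigrams)
    #      -> (1 + 1) / (2 + 12) with the toy corpus above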
Backoff and Interpolation
• Another really useful source of knowledge
• If we are estimating:
– trigram p(z|x,y)
– but count(xyz) is zero
• Use info from:
– Bigram p(z|y)
• Or even:
– Unigram p(z)
• How to combine this trigram, bigram,
unigram info in a valid fashion?
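One standard answer is simple linear interpolation (backoff, which falls back to the shorter context only when the longer one has a zero count, is the other): mix the three estimates with weights that sum to 1, P_interp(z|x,y) = λ1·P(z|x,y) + λ2·P(z|y) + λ3·P(z), with λ1+λ2+λ3 = 1. A minimal sketch (the λ values shown are placeholders; in practice they’re tuned on held-out data):

    def p_interp(z, x, y, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
        # Linear interpolation of trigram, bigram, and unigram estimates.
        # The lambda weights must sum to 1; these particular values are
        # placeholders, not tuned.
        l1, l2, l3 = lambdas
        return l1 * p_tri(z, x, y) + l2 * p_bi(z, y) + l3 * p_uni(z)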