
Statistical Inference

Statistical Estimators
Similarity Measures
D1 → {w1, w2, w3, w4}
D2 → {w1, w4, w5}
D3 → {w5, w6, w7}
“Shannon Game” (Shannon, 1951)
✔ “I am going to make a collect …”
✔ Predict the next word given the n-1 previous words.
✔ Past behavior is a good guide to what will happen in the future as there is
regularity in language.
✔ Determine the probability of different sequences from a training corpus.

Language Modeling
✔ A statistical model of word/character sequences
✔ Used to predict the next character/word given the previous ones
Applications:
✔ Speech recognition
✔ Optical character recognition / Handwriting recognition
✔ Statistical Machine Translation
✔ Spelling correction
✔ He is trying to fine out.
✔ Hopefully, all with continue smoothly in my absence
✔…
1st approximation
✔Each word has an equal probability of following any other
✔with 100,000 words, the probability of each of them at any given point is .00001
✔But some words are more frequent than others…
✔in Brown corpus:
“the” appears 69,971 times
“rabbit” appears 11 times
Frequency of frequencies
N-grams
✔Take into account the frequency of the word in some training corpus
✔at any given point, “the” is more probable than “rabbit”
✔but this is a bag-of-words approach…
✔“Just then, the white …”
✔So the probability of a word also depends on the previous words (the history):
P(wn | w1 w2 … wn-1)
Problems with n-grams
✔“the large green ______ .”
✔“mountain”? “tree”?
✔“Sue swallowed the large green ______ .”
✔“pill”? “broccoli”?
✔Knowing that Sue “swallowed” helps narrow down possibilities
✔But, how far back do we look?
Bins: Forming Equivalence Classes
Reliability vs. Discrimination
✔ larger n:
• more information about the context of the specific instance
• greater discrimination
• But:
• too many parameters to estimate (too costly)
• ex: for a vocabulary of 20,000 words (quick check below):
• number of bigrams = 400 million (20,000²)
• number of trigrams = 8 trillion (20,000³)
• number of four-grams = 1.6 x 10¹⁷ (20,000⁴)
• too many chances that the history has never been seen before (data sparseness)
✔ smaller n:
• less precision
• But:
• more instances in training data, better statistical estimates
• more reliability
--> Markov approximation: take only the most recent history
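
A quick check of those parameter counts, as a throwaway Python snippet (V = 20,000 is the vocabulary size assumed in the example above):

V = 20_000                    # vocabulary size from the example above
print(f"{V**2:,}")            # 400,000,000 possible bigrams
print(f"{V**3:,}")            # 8,000,000,000,000 possible trigrams (8 trillion)
print(f"{V**4:.1e}")          # 1.6e+17 possible four-grams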
Markov (Independence) assumption

✔ Markov Assumption:
• we can predict the probability of some future item on the basis of a short history

• if (history = last n-1 words), i.e.

P(wk | w1 … wk-1) ≈ P(wk | wk-n+1 … wk-1)

--> (n-1)th order Markov model
(or)
--> n-gram model

✔ Most widely used:


• unigram (n=1)
• bigram (n=2)
• trigram (n=3)
Bigrams
✔ First-order Markov models
P(wn|wn-1)

✔ N-by-N matrix of probabilities/frequencies


✔ N = size of the vocabulary we are modeling
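
As a concrete illustration, here is a minimal bigram-model sketch in Python; the toy corpus, the <s>/</s> markers, and the function names are illustrative assumptions, not part of the slides:

# Build a bigram table (an N-by-N matrix in spirit, stored sparsely)
# and estimate P(w_n | w_n-1) by relative frequency.
from collections import defaultdict

corpus = ["<s> I am Sam </s>", "<s> Sam I am </s>"]   # assumed toy data

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for w in tokens:
        unigram_counts[w] += 1
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1

def p(w2, w1):
    # P(w2 | w1) = C(w1 w2) / C(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

print(p("am", "I"))    # C(I am)=2, C(I)=2 -> 1.0
print(p("Sam", "am"))  # C(am Sam)=1, C(am)=2 -> 0.5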
Why use only bi- or tri-grams?
✔ Markov approximation is still costly with a 20,000-word vocabulary:
• a bigram model needs to store 400 million parameters
• a trigram model needs to store 8 trillion parameters
• using a language model of higher order than a trigram is impractical

✔ To reduce the number of parameters, we can:


• do stemming (use stems instead of word types)
• stemming removes the suffix from a word and reduces it to its root form
• group words into semantic classes
• treat words seen only once the same as unseen words
• ...
Building n-gram Models
✔ Data preparation:
1. Decide training corpus
2. Clean and tokenize
3. How do we deal with sentence boundaries?

✔ Use statistical estimators:


• To derive good probability estimates from the training data.
Statistical Estimators
✔Maximum Likelihood Estimation (MLE)
✔Smoothing
• Add-one -- Laplace
• Add-delta -- Lidstone’s & Jeffreys-Perks’ Laws (ELE)
• Validation:
Held Out Estimation
Cross Validation
• Witten-Bell smoothing
• Good-Turing smoothing
✔Combining Estimators
• Simple Linear Interpolation
• General Linear Interpolation
• Katz’s Backoff
Maximum Likelihood Estimation
✔Choose the parameter values which give the highest probability to the training corpus

✔Let C(w1,..,wn) be the frequency of the n-gram w1,..,wn; then

PMLE(wn | w1,..,wn-1) = C(w1,..,wn) / C(w1,..,wn-1)
Example 1: P(event)
✔ In a training corpus, we have 10 instances of “come across”
• 8 times, followed by “as”
• 1 time, followed by “more”
• 1 time, followed by “a”
✔ With MLE, we have:
• P(as | come across) = 0.8
• P(more | come across) = 0.1
• P(a | come across) = 0.1
• P(X | come across) = 0 where X ≠ “as”, “more”, “a”
Example 2: P(sequence of events)

P(I want to eat British food)


= P(I|<s>) x P(want|I) x P(to|want) x P(eat|to) x P(British|eat) x P(food|British)
= .25 x .32 x .65 x .26 x .001 x .6
= .000008
Example
PROBLEM (1)
SASTRA UNIVERSITY is a well known institution.
SASTRA UNIVERSITY is located at Thanjavur.
SASTRA UNIVERSITY is a category one institution.
SASTRA management is focussing on quality education and placement.
B1: SASTRA UNIVERSITY
B2: SASTRA MANAGEMENT
B3: SASTRA INSTITUTION

PROBLEM(2)
Natural language processing is an interesting subject. Three faculty members are handling Natural
language processing. Students are speaking natural language in the campus.
Some adjustments
✔ a product of probabilities leads to numerical underflow for long sentences
✔ so instead of multiplying the probabilities, we add their logs

P(I want to eat British food)


= log(P(I|<s>)) + log(P(want|I)) + log(P(to|want)) + log(P(eat|to)) + log(P(British|eat)) + log(P(food|British))
= log(.25) + log(.32) + log(.65) + log(.26) + log(.001) + log(.6)
≈ -5.09 (base-10 logs), and 10^-5.09 ≈ .000008
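
A minimal sketch of the same computation in log space; the six probabilities are the ones used in the example, and log base 10 is an arbitrary choice:

import math

# P(I|<s>), P(want|I), P(to|want), P(eat|to), P(British|eat), P(food|British)
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.6]

# Summing logs avoids the numerical underflow that multiplying
# many small probabilities would eventually cause.
log_p = sum(math.log10(p) for p in probs)
print(log_p)         # about -5.09
print(10 ** log_p)   # about 8.1e-06, matching the direct product .000008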
Problem with MLE: data sparseness
✔ What if a sequence never appears in training corpus? P(X)=0
• “come across the men” --> prob = 0
• “come across some men” --> prob = 0
• “come across 3 men” --> prob = 0

✔ MLE assigns a probability of zero to unseen events …


✔ probability of an n-gram involving unseen words will be zero!

✔ but… most words are rare (Zipf’s Law).


✔ so n-grams involving rare words are even more rare… data
sparseness
Problem with MLE: data sparseness
✔ in (Bahl et al., 1983)
– training with 1.5 million words
– 23% of the trigrams from another part of the same corpus were previously unseen

✔ in Shakespeare’s work
– out of 844 million possible bigrams
– 99.96% were never used

✔ So MLE alone is not a good enough estimator

✔ Solution: smoothing
– decrease the probability of previously seen events
– so that there is a little bit of probability mass left over for previously unseen events
– also called discounting
Discounting or Smoothing
✔ MLE is usually unsuitable for NLP because of the sparseness of the data
✔ We need to allow for possibility of seeing events not seen in training
✔ Must use a Discounting or Smoothing technique
✔ Decrease the probability of previously seen events to leave a little bit of
probability for previously unseen events
Many smoothing techniques
• Add-one
• Add-delta
• Witten-Bell smoothing
• Good-Turing smoothing
• Church-Gale smoothing
• Absolute-discounting
• Kneser-Ney smoothing
• ...
Add-one Smoothing (Laplace’s law)
✔ Pretend we have seen every n-gram at least once
✔ Intuitively:
• new_count(n-gram) = old_count(n-gram) + 1
✔ The idea is to give a little bit of the probability space to unseen
events
Add-one: Example
Add-one, more formally

N: number of n-grams in the training corpus starting with w1…wn-1
V: size of the vocabulary

Padd-one(wn | w1…wn-1) = (C(w1…wn) + 1) / (N + V)
The example
MLE Example
+ I am Sam -
+ Sam I am -
+ I do not like green eggs -
Question : Find the probability of the sentence :
" + I am Sam green -"
LAPLACE LAW
+ I am Sam -
+ Sam I am -
+ I do not like green eggs -

N: number of n-grams in the training corpus starting with w1…wn-1
V: size of the vocabulary

Question : Find the probability of the sentence:
" + I am Sam green - "
Problem with add-one smoothing
✔ every previously unseen n-gram is given a low probability
✔ but there are so many of them that too much probability mass is given to unseen
events
✔ adding 1 to a frequent bigram does not change it much
✔ but adding 1 to low-frequency bigrams (including unseen ones) boosts them too much!
✔ In NLP applications that are very sparse, Laplace’s Law actually gives far too much of
the probability space to unseen events.
Problem with add-one smoothing
Data from the AP from (Church and Gale, 1991)
• Corpus of 22,000,000 bigrams
• Vocabulary of 273,266 words (i.e. 74,674,306,760 possible bigrams -
or bins)
• 74,671,100,000 bigrams were unseen
• And each unseen bigram was given a frequency of 0.000295

Add-one smoothing: expected frequencies

fMLE (freq. from    fempirical (freq. from    fadd-one (add-one
training data)      held-out data)            smoothed freq.)
0                   0.000027                  0.000295   <- too high
1                   0.448                     0.000274   <- too low
2                   1.25                      0.000411
3                   2.24                      0.000548
4                   3.23                      0.000685
5                   4.21                      0.000822

■ Total probability mass given to unseen bigrams
≈ (74,671,100,000 x 0.000295) / 22,000,000 ≈ 1
i.e. add-one gives about 99.96% of the probability space to unseen events!
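
For reference, the arithmetic behind that figure as a short Python check (numbers taken from the slide above; the snippet itself is illustrative):

N = 22_000_000            # bigram tokens in the AP corpus
B = 74_674_306_760        # possible bigrams ("bins")
unseen = 74_671_100_000   # bigram types never observed

p_one_unseen = 1 / (N + B)      # add-one probability of a single unseen bigram
print(N * p_one_unseen)         # expected frequency ~ 0.000295
print(unseen * p_one_unseen)    # ~0.9996-0.9997: nearly all of the probability space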
Add-delta smoothing (Lidstone’s law)
• instead of adding 1, add some other (smaller) positive value λ:

PLid(wn | w1…wn-1) = (C(w1…wn) + λ) / (N + λV)

• better than add-one
• most widely used value: λ = 0.5
• if λ = 0.5, Lidstone’s Law is called:
• the Expected Likelihood Estimation (ELE)
• or the Jeffreys-Perks Law
Lidstone’s Law
+ I am Sam -
+ Sam I am -
+ I do not like green eggs -
Question : Find the probability of the sentence : " + I am Sam green -"
Jeffreys-Perks Law (λ = 0.5)
+ I am Sam -
+ Sam I am -
+ I do not like green eggs -

Question : Find the probability of the sentence : " + I am Sam green -"
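
A sketch covering both exercises above with a single add-λ function (λ = 1 gives Laplace, λ = 0.5 gives the ELE / Jeffreys-Perks estimate; V is again assumed to include the boundary markers):

from collections import Counter

sentences = ["+ I am Sam -", "+ Sam I am -", "+ I do not like green eggs -"]
unigrams, bigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))
V = len(unigrams)

def p_add_lambda(w1, w2, lam):
    # Lidstone: (C(w1 w2) + lambda) / (C(w1) + lambda * V)
    return (bigrams[(w1, w2)] + lam) / (unigrams[w1] + lam * V)

test = "+ I am Sam green -".split()
for lam in (1.0, 0.5):
    prob = 1.0
    for w1, w2 in zip(test, test[1:]):
        prob *= p_add_lambda(w1, w2, lam)
    print(f"lambda={lam}: P = {prob:.3e}")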
Validation / Held-out Estimation
✔ How do we know how much of the probability space to “hold out” for unseen events?
✔ i.e. we need a good way to guess λ in advance
✔ Held-out data:
• We can divide the training data into two parts:
• the training set: used to build initial estimates by counting
• the held out data: used to refine the initial estimates (i.e. see how often the bigrams that
appeared r times in the training text occur in the held-out text)
Held Out Estimation
✔ For each n-gram w1...wn we compute:
• Ctr(w1...wn) the frequency of w1...wn in the training data
• Cho(w1...wn) the frequency of w1...wn in the held out data
✔ Let:
• r = the frequency of an n-gram in the training data
• Nr = the number of different n-grams with frequency r in the training data
• Tr = the sum of the counts of all n-grams in the held-out data that appeared r times in the
training data
• T = total number of n-grams in the held-out data
✔ So, for an n-gram w1...wn with training frequency r:

Pho(w1...wn) = Tr / (Nr x T)
Problem
Bigrams

• Possible Bigrams: AA, AB, BA, AC, CA, BB, BC, CB, CC
• Bigrams (Training data): AB, BC, CA, AB, BA, AA
• Bigrams (Heldout data): AB, BC, CA, AC
Seen Bigrams (Training data): AB, BC, CA, BA, AA
Unseen Bigrams (Training data): AC, BB, CB, CC

T = 4 (held-out bigrams: AB, BC, CA, AC)

r    Bigrams             Nr    Tr
2    AB                  1     1+0+0+0 = 1
1    BC, CA, BA, AA      4     1+1+0+0 = 2
0    AC, BB, CB, CC      4     1+0+0+0 = 1
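
A small sketch that reproduces the Nr and Tr values above and turns them into held-out estimates Pho = Tr / (Nr x T); the variable names are illustrative:

from collections import Counter

possible = ["AA", "AB", "AC", "BA", "BB", "BC", "CA", "CB", "CC"]
train    = ["AB", "BC", "CA", "AB", "BA", "AA"]
heldout  = ["AB", "BC", "CA", "AC"]

c_tr, c_ho = Counter(train), Counter(heldout)
T = len(heldout)   # 4

N_r, T_r = Counter(), Counter()
for bg in possible:
    r = c_tr[bg]          # training frequency of this bigram
    N_r[r] += 1           # how many bigram types have frequency r
    T_r[r] += c_ho[bg]    # their total count in the held-out data

for bg in ["AB", "BA", "AC"]:
    r = c_tr[bg]
    print(bg, T_r[r] / (N_r[r] * T))   # AB: 0.25, BA: 0.125, AC: 0.0625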


Problem

Bigrams (Training data): SU, UG, GS, SU, UL, LT


Bigrams (Heldout data): SU, UT, TP, PG
T=4

r    Bigrams             Nr    Tr
2    SU                  1     1
1    UG, GS, UL, LT      4     0


Some explanation…

Tr / T is the probability, in the held-out data, of all n-grams appearing r times in the training data. Since we have Nr different n-grams in the training data that occurred r times, we share this probability mass equally among them.

■ ex: assume
❑ if r = 5 and 10 different n-grams (types) occur 5 times in training
❑ --> N5 = 10
❑ if all the n-grams (types) that occurred 5 times in training occurred in total (n-gram tokens) 20 times in the held-out data
❑ --> T5 = 20
❑ assume the held-out data contains 2000 n-grams (tokens), i.e. T = 2000
❑ --> Pho = T5 / (N5 x T) = 20 / (10 x 2000) = 0.001 for each of these n-grams
Dividing the corpus
✔ Training:
• Training data (80% of total data)
• To build initial estimates (frequency counts)
• Held out data (10% of total data)
• To refine initial estimates (smoothed estimates)
✔ Testing:
• Development test data (5% of total data)
• To test while developing
• Final test data (5% of total data)
• To test at the end
✔ But how do we divide?
• Randomly select data (ex. sentences, n-grams)
• Advantage: Test data is very similar to training data
• Cut large chunks of consecutive data
• Advantage: Results are lower, but more realistic
Pots of Data for Developing and Testing Models
•Training data (80% of total data)
•Held Out data (10% of total data)
•Development Data (5% of total data)
•Test Data (5% of total data)
•Write an algorithm, train it, test it, note things it does wrong,
revise it and repeat many times.
•Keep development test data and final test data separate, since development
data is “seen” by the system during repeated testing.
•Only then, evaluate and publish results
•Give final results by testing on n smaller samples of the test data
and averaging.
Good-Turing Estimator
✔ Based on the assumption that words have a binomial distribution
✔ Works well in practice (with large corpora)
Idea:
• Re-estimate the probability mass of n-grams with zero (or low) counts by
looking at the number of n-grams with higher counts
• Ex: the adjusted count is
c* = (c + 1) x (No. of n-grams that occur c+1 times) / (No. of n-grams that occur c times)
   = (c + 1) x Nc+1 / Nc

✔ In practice c* is not used for all counts c


✔ large counts (> a threshold k) are assumed to be reliable

✔ If c > k (usually k = 5)
✔ c* = c
✔ If c <= k, the re-estimated (discounted) count c* is used instead
Problem
Sam I am I am Sam I do not eat

N3 = 1, N2 = 2, N1 = 3
•Unigram:
•I – 3
• Sam – 2
• am – 2
• do – 1
• not – 1
• eat – 1
• N =10
Good-Turing Estimator

N10 =1, N3 = 1, N2 = 1, N1 = 3
Good-Turing Estimator
• "SASTRA UNIVERSITY GOOD SASTRA UNIVERSITY GOOD SASTRA
MANAGEMENT SASTRA UNIVERSITY ". Apply good turing estimation method in
this corpus to find the probability of the sentence "SASTRA MANAGEMENT".
• Bigrams: SU, UG, GS, SU, UG, GS, SM, MS, SU
• C=1, SM, MS N1=2
• C=2, UG, GS N2=2
• C=3, SU N3=1
Good-Turing Estimator
• Corpus: ABCABCADAB N=10
• P(AA)?, P(AD)? Use Good turing estimator
• Seen Bigrams:
• AB, BC, CA, AB, BC, CA, AD, DA, AB
• Unseen Bigrams:
• AA, AC, BA, BB, CB, CC, BD, CD, DD, DB, DC
• C=0, AA, AC, BA, BB, CB, CC, BD, CD, DD, DB, DC N0=11
• C=1, AD, DA N1=2
• C=2, BC,CA N2=2
• C=3, AB N3=1
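
A sketch of the full Good-Turing computation for this corpus; no threshold-k correction is applied here, so it should only be read for the low counts the question asks about:

from collections import Counter

corpus = "ABCABCADAB"
alphabet = sorted(set(corpus))                      # A, B, C, D
bigrams = [corpus[i:i + 2] for i in range(len(corpus) - 1)]
counts = Counter(bigrams)
N_tokens = len(bigrams)                             # 9 bigram tokens

# N_c: number of bigram types seen exactly c times (c = 0 covers unseen ones)
all_types = [a + b for a in alphabet for b in alphabet]
N_c = Counter(counts[t] for t in all_types)         # {0: 11, 1: 2, 2: 2, 3: 1}

def p_good_turing(bigram):
    c = counts[bigram]
    c_star = (c + 1) * N_c[c + 1] / N_c[c]          # re-estimated count
    return c_star / N_tokens

print(p_good_turing("AA"))   # unseen: (1 * N1/N0) / 9 = (2/11) / 9 ~ 0.020
print(p_good_turing("AD"))   # c = 1:  (2 * N2/N1) / 9 = 2 / 9      ~ 0.222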
