Statistical Inference
Statistical Estimators
Similarity Measures
D1 → {w1, w2, w3, w4}
D2 → {w1, w4, w5}
D3 → {w5, w6, w7}
“Shannon Game” (Shannon, 1951)
✔ “I am going to make a collect …”
✔ Predict the next word given the n-1 previous words.
✔ Past behavior is a good guide to what will happen in the future as there is
regularity in language.
✔ Determine the probability of different sequences from a training corpus.
Language Modeling
✔ A statistical model of word/character sequences
✔ Used to predict the next character/word given the previous ones
Applications:
✔ Speech recognition
✔ Optical character recognition / Handwriting recognition
✔ Statistical Machine Translation
✔ Spelling correction
✔ He is trying to fine out.
✔ Hopefully, all with continue smoothly in my absence
✔…
1st approximation
✔ Each word has an equal probability to follow any other
✔ With 100,000 words, the probability of each of them at any given point is 0.00001
✔ But some words are more frequent than others…
✔ In the Brown corpus:
“the” appears 69,971 times
“rabbit” appears 11 times
Frequency of frequencies
N-grams
✔ Take into account the frequency of the word in some training corpus
✔ At any given point, “the” is more probable than “rabbit”
✔ But this is still a bag-of-words approach: the surrounding words are ignored…
✔ “Just then, the white …”
✔ So the probability of a word also depends on the previous words (the history):
$P(w_n \mid w_1 w_2 \dots w_{n-1})$
Problems with n-grams
✔“the large green ______ .”
✔“mountain”? “tree”?
✔“Sue swallowed the large green ______ .”
✔“pill”? “broccoli”?
✔Knowing that Sue “swallowed” helps narrow down possibilities
✔But, how far back do we look?
Bins: Forming Equivalence Classes
Reliability vs. Discrimination
✔ larger n:
• more information about the context of the specific instance
• greater discrimination
• But:
• too costly to compute and store
• ex: for a vocabulary of 20,000 words:
• number of bigrams = 400 million ($20{,}000^2$)
• number of trigrams = 8 trillion ($20{,}000^3$)
• number of four-grams = $1.6 \times 10^{17}$ ($20{,}000^4$)
• too many chances that the history has never been seen before (data sparseness)
✔ smaller n:
• less precision
• But:
• more instances in training data, better statistical estimates
• more reliability
--> Markov approximation: take only the most recent history
Markov (Independence) assumption
✔ Markov Assumption:
• we can predict the probability of some future item on the basis of a short history
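Written as a formula, the Markov approximation truncates the history to the most recent words. The symbol $N$ for the n-gram order is an assumption introduced here for illustration; it is not used elsewhere on these slides.

```latex
P(w_n \mid w_1 \dots w_{n-1}) \approx P(w_n \mid w_{n-N+1} \dots w_{n-1})
% bigram  (N = 2): P(w_n \mid w_{n-1})
% trigram (N = 3): P(w_n \mid w_{n-2} w_{n-1})
```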
Example 1: P(event)
✔ In a training corpus, we have 10 instances of “come across”
• 8 times, followed by “as”
• 1 time, followed by “more”
• 1 time, followed by “a”
✔ With MLE, we have:
• P(as | come across) = 0.8
• P(more | come across) = 0.1
• P(a | come across) = 0.1
• P(X | come across) = 0 where X ≠ “as”, “more”, “a”
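The MLE figures above are plain relative frequencies. A minimal Python sketch, assuming the follower counts from the example are stored in a `Counter` (the name `mle_prob` is hypothetical):

```python
from collections import Counter

# Counts of the word that follows "come across" in the training corpus
# (taken from the example above).
followers = Counter({"as": 8, "more": 1, "a": 1})

def mle_prob(word, followers):
    """Relative-frequency (MLE) estimate of P(word | history)."""
    total = sum(followers.values())
    # A Counter returns 0 for unseen words, so unseen events get probability 0.
    return followers[word] / total

print(mle_prob("as", followers))
print(mle_prob("the", followers))
```

Any word never seen after "come across" gets probability 0, which is the weakness that smoothing addresses.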
Example 2: P(sequence of events)
PROBLEM(2)
Natural language processing is an interesting subject. Three faculty members are handling Natural
language processing. Students are speaking natural language in the campus.
Some adjustments
✔ product of probabilities… numerical underflow for long sentences
✔ So instead of multiplying the probs, we add the log of the probs
✔ in Shakespeare’s work
– out of 844 million possible bigrams
– 99.96% were not used
✔ Solution: smoothing
– decrease the probability of previously seen events
– so that there is a little bit of probability mass left over for previously unseen events
– also called discounting
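The first adjustment above (adding log probabilities instead of multiplying probabilities) can be sketched as follows. The per-word probabilities are made-up illustrative values, not estimates from any corpus, and `sentence_logprob` is a hypothetical helper name.

```python
import math

def sentence_logprob(probs):
    """Sum of log probabilities; equal to the log of their product."""
    return sum(math.log(p) for p in probs)

# Hypothetical per-word probabilities for a 100-word sequence.
probs = [1e-5] * 100

product = math.prod(probs)         # underflows to 0.0 in double precision
log_sum = sentence_logprob(probs)  # stays a finite negative number

print(product, log_sum)
```

Note that `math.log(0)` raises an error, so a single zero-probability n-gram still breaks the computation; this is one more reason unseen events need smoothing.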
Discounting or Smoothing
✔ MLE is usually unsuitable for NLP because of the sparseness of the data
✔ We need to allow for possibility of seeing events not seen in training
✔ Must use a Discounting or Smoothing technique
✔ Decrease the probability of previously seen events to leave a little bit of
probability for previously unseen events
Many smoothing techniques
• Add-one
• Add-delta
• Witten-Bell smoothing
• Good-Turing smoothing
• Church-Gale smoothing
• Absolute-discounting
• Kneser-Ney smoothing
• ...
Add-one Smoothing (Laplace’s law)
✔ Pretend we have seen every n-gram at least once
✔ Intuitively:
• new_count(n-gram) = old_count(n-gram) + 1
✔ The idea is to give a little bit of the probability space to unseen
events
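A minimal sketch of add-one smoothing for bigrams, assuming the usual formulation P(w | prev) = (C(prev w) + 1) / (C(prev) + V), where V is the vocabulary size. The function name and the toy corpus are chosen here for illustration only.

```python
from collections import Counter

def add_one_bigram_prob(prev, word, tokens):
    """Add-one (Laplace) estimate of P(word | prev) from a token list."""
    vocab = set(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count of `prev` as a history, i.e. as the first word of a bigram.
    history = Counter(tokens[:-1])
    return (bigrams[(prev, word)] + 1) / (history[prev] + len(vocab))

tokens = "Sam I am I am Sam I do not eat".split()

print(add_one_bigram_prob("I", "am", tokens))   # seen bigram
print(add_one_bigram_prob("I", "eat", tokens))  # unseen bigram, still non-zero
```

Adding V to the denominator keeps the estimates for a given history summing to 1 over the vocabulary.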
Add-one: Example
Add-one, more formally
Add-one
f_MLE                 f_empirical            f_add-one
(freq. from           (freq. from            (smoothed freq.)
training data)        held-out data)
0                     0.000027               0.000295   (too high)
1                     0.448                  0.000274   (too low)
2                     1.25                   0.000411   (too low)
3                     2.24                   0.000548   (too low)
4                     3.23                   0.000685   (too low)
5                     4.21                   0.000822   (too low)
Question: Find the probability of the sentence "+ I am Sam green -".
Validation / Held-out Estimation
✔ How do we know how much of the probability space to “hold out” for unseen events?
✔ i.e., we need a good way to guess λ in advance
✔ Held-out data:
• We can divide the training data into two parts:
• the training set: used to build initial estimates by counting
• the held out data: used to refine the initial estimates (i.e. see how often the bigrams that
appeared r times in the training text occur in the held-out text)
Held Out Estimation
✔ For each n-gram w1...wn we compute:
• Ctr(w1...wn) the frequency of w1...wn in the training data
• Cho(w1...wn) the frequency of w1...wn in the held out data
✔ Let:
• r = the frequency of an n-gram in the training data
• Nr = the number of different n-grams with frequency r in the training data
• Tr = the sum of the counts of all n-grams in the held-out data that appeared r times in the
training data
• T = total number of n-gram in the held out data
✔ So: $P_{ho}(w_1 \dots w_n) = \dfrac{T_r}{N_r \, T}$, where $r = C_{tr}(w_1 \dots w_n)$
Problem
Bigrams
• Possible Bigrams: AA, AB, BA, AC, CA, BB, BC, CB, CC
• Bigrams (Training data): AB, BC, CA, AB, BA, AA
• Bigrams (Heldout data): AB, BC, CA, AC
Seen Bigrams (Training data): AB, BC, CA, BA, AA
Unseen Bigrams (Training data): AC, BB, CB, CC
r Bigrams Nr Tr
2 AB 1 1+0+0+0=1
r Bigrams Nr Tr
2 SU 1 1
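The A/B/C bigram problem above can be worked through in code. This is a sketch under the assumption that the held-out estimate is P_ho = T_r / (N_r × T), built from the quantities r, Nr, Tr and T defined earlier; `held_out_probs` is a hypothetical name.

```python
from collections import Counter

def held_out_probs(train, held_out, possible):
    """Held-out estimate P_ho = T_r / (N_r * T) for each training count r."""
    c_tr = Counter(train)     # frequency in the training data
    c_ho = Counter(held_out)  # frequency in the held-out data
    T = len(held_out)         # total n-gram tokens in the held-out data
    n_r = Counter()           # N_r: number of types with training count r
    t_r = Counter()           # T_r: held-out tokens of those types
    for gram in possible:
        r = c_tr[gram]
        n_r[r] += 1
        t_r[r] += c_ho[gram]
    return {r: t_r[r] / (n_r[r] * T) for r in n_r}

possible = ["AA", "AB", "BA", "AC", "CA", "BB", "BC", "CB", "CC"]
train = ["AB", "BC", "CA", "AB", "BA", "AA"]
held_out = ["AB", "BC", "CA", "AC"]

probs = held_out_probs(train, held_out, possible)
for r in sorted(probs, reverse=True):
    print(r, probs[r])
```

Each value is the probability of one bigram type with training count r. Bigrams unseen in training (r = 0) receive a non-zero share because AC shows up in the held-out data.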
■ ex: assume
❑ r = 5 and 10 different n-grams (types) occur 5 times in training
❑ --> $N_5 = 10$
❑ all the n-grams (types) that occurred 5 times in training occurred 20 times in total
(n-gram tokens) in the held-out data
❑ --> $T_5 = 20$
❑ the held-out data contains 2000 n-grams (tokens), so $T = 2000$
❑ --> probability of each such n-gram: $T_5 / (N_5 \times T) = 20 / (10 \times 2000) = 0.001$
Dividing the corpus
✔ Training:
• Training data (80% of total data)
• To build initial estimates (frequency counts)
• Held out data (10% of total data)
• To refine initial estimates (smoothed estimates)
✔ Testing:
• Development test data (5% of total data)
• To test while developing
• Final test data (5% of total data)
• To test at the end
✔ But how do we divide?
• Randomly select data (ex. sentences, n-grams)
• Advantage: Test data is very similar to training data
• Cut large chunks of consecutive data
• Advantage: Results are lower, but more realistic
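A minimal sketch of the 80/10/5/5 division, supporting both strategies listed above. The function name, the `shuffle` flag and the placeholder sentences are assumptions made for illustration.

```python
import random

def split_corpus(sentences, shuffle=False, seed=0):
    """Split into training / held-out / dev-test / final-test (80/10/5/5)."""
    data = list(sentences)
    if shuffle:
        # Random selection: test data ends up very similar to training data.
        random.Random(seed).shuffle(data)
    # Otherwise the corpus is cut into large chunks of consecutive data.
    n = len(data)
    a, b, c = n * 80 // 100, n * 90 // 100, n * 95 // 100
    return data[:a], data[a:b], data[b:c], data[c:]

# Placeholder corpus of 100 "sentences".
sentences = [f"sentence {i}" for i in range(100)]
train, held_out, dev_test, final_test = split_corpus(sentences)
print(len(train), len(held_out), len(dev_test), len(final_test))
```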
Pots of Data for Developing and Testing Models
•Training data (80% of total data)
•Held Out data (10% of total data)
•Development Data (5% of total data)
•Test Data (5% of total data)
•Write an algorithm, train it, test it, note things it does wrong,
revise it and repeat many times.
•Keep development test data separate from final test data, because development
data is “seen” by the system during repeated testing.
•Only then, evaluate and publish results
•Give final results by testing on n smaller samples of the test data
and averaging.
Good-Turing Estimator
✔ Based on the assumption that words have a binomial distribution
✔ Works well in practice (with large corpora)
Idea:
• Re-estimate the probability mass of n-grams with zero (or low) counts by
looking at the number of n-grams with higher counts
• Ex: $c^* = (c+1) \dfrac{N_{c+1}}{N_c}$, where
$N_{c+1}$ = no. of n-grams that occur c+1 times
$N_c$ = no. of n-grams that occur c times
✔ If c > k (usually k = 5)
✔ c* = c
✔ If c <= k
Problem
Sam I am I am Sam I do not eat
N3 = 1, N2 = 2, N1 = 3
•Unigram:
•I – 3
• Sam – 2
• am – 2
• do – 1
• not – 1
• eat – 1
• N =10
Good-Turing Estimator
N10 =1, N3 = 1, N2 = 1, N1 = 3
Good-Turing Estimator
• "SASTRA UNIVERSITY GOOD SASTRA UNIVERSITY GOOD SASTRA
MANAGEMENT SASTRA UNIVERSITY". Apply the Good-Turing estimation method to
this corpus to find the probability of the sentence "SASTRA MANAGEMENT".
• Bigrams: SU, UG, GS, SU, UG, GS, SM, MS, SU
• C=1, SM, MS N1=2
• C=2, UG, GS N2=2
• C=3, SU N3=1
Good-Turing Estimator
• Corpus: ABCABCADAB N=10
• P(AA)? P(AD)? Use the Good-Turing estimator.
• Seen Bigrams:
• AB, BC, CA, AB, BC, CA, AD, DA, AB
• Unseen Bigrams:
• AA, AC, BA, BB, CB, CC, BD, CD, DD, DB, DC
• C=0, AA, AC, BA, BB, CB, CC, BD, CD, DD, DB, DC N0=11
• C=1, AD, DA N1=2
• C=2, BC,CA N2=2
• C=3, AB N3=1
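The counts above can be reproduced, and the adjusted counts c* = (c+1) N(c+1) / N(c) computed, with a short sketch. Two assumptions are made here and are not taken from the slides: when N(c+1) is 0 the raw count is kept, and probabilities are normalised by the 9 bigram tokens in the corpus (the slide lists N = 10, so the normaliser is kept as a separate variable). The threshold k from the earlier slide is not applied.

```python
from collections import Counter
from itertools import product

def good_turing_counts(tokens, vocab):
    """Return (c_star, n_c, counts) for bigrams over the given vocabulary."""
    counts = Counter(zip(tokens, tokens[1:]))
    all_bigrams = list(product(vocab, repeat=2))
    # N_c: number of bigram types that occur exactly c times (c = 0 included).
    n_c = Counter(counts[b] for b in all_bigrams)
    c_star = {}
    for c in n_c:
        if n_c[c + 1] > 0:
            c_star[c] = (c + 1) * n_c[c + 1] / n_c[c]
        else:
            # Assumption: fall back to the raw count when N_{c+1} = 0.
            c_star[c] = float(c)
    return c_star, n_c, counts

tokens = list("ABCABCADAB")
c_star, n_c, counts = good_turing_counts(tokens, "ABCD")

total = len(tokens) - 1  # assumption: normalise by the number of bigram tokens
p_AA = c_star[counts[("A", "A")]] / total
p_AD = c_star[counts[("A", "D")]] / total
print(dict(n_c), p_AA, p_AD)
```

Changing `total` to 10 gives the figures under the slide's N; the adjusted counts c* are unaffected.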