L3 Language Models
Introduction to NLP
123. Probabilities
Probabilities in NLP
• Very important for language processing
– “Let’s meet in the conference …”
• Example in speech recognition:
– “recognize speech” vs “wreck a nice beach”
• Example in machine translation:
– “l’avocat general”: “the attorney general” vs. “the general avocado”
• Example in information retrieval:
– If a document includes three occurrences of “stir” and one of “rice”, what is the
probability that it is a recipe?
• Probabilities make it possible to combine evidence from multiple
sources in a systematic way
Probabilities
• Probability theory
– predicting how likely it is that something will happen
• Experiment (trial)
– e.g., throwing a coin
• Possible outcomes
– heads or tails
• Sample spaces
– discrete (number of occurrences of “rice”) or continuous (e.g., temperature)
• Events
– Ω is the certain event
– ∅ is the impossible event
– event space - all possible events
Sample Space
• Random experiment: an experiment with uncertain outcome
– e.g., flipping a coin, picking a word from text
• Sample space: all possible outcomes, e.g.,
– Tossing 2 fair coins, Ω = {HH, HT, TH, TT}
Events
• Event: a subspace of the sample space
– E ⊆ Ω; E happens iff the outcome is in E, e.g.,
• E={HH} (all heads)
• E={HH,TT} (same face)
– Impossible event (∅)
– Certain event (Ω)
• Probability of an event E: 0 ≤ P(E) ≤ 1, s.t.
– P(Ω) = 1 (the outcome is always in Ω)
– P(A ∪ B) = P(A) + P(B), if A ∩ B = ∅ (e.g., A = same face, B = different
face)
Example: Tossing a Die
• Sample space: Ω = {1,2,3,4,5,6}
• Fair die:
– p(1) = p(2) = p(3) = p(4) = p(5) = p(6) = 1/6
• Unfair die example: p(1) = 0.3, p(2) = 0.2, ...
• N-dimensional die:
– Ω = {1, 2, 3, 4, …, N}
• Example in modeling text:
– Toss a die to decide which word to write in the next position
– Ω = {cat, dog, tiger, …} (see the sampling sketch below)
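A minimal sketch of this “word die” view of text generation; the vocabulary and probabilities below are made up purely for illustration:

```python
import random

# Hypothetical vocabulary and unigram probabilities (made-up numbers that sum to 1)
vocab = ["cat", "dog", "tiger"]
probs = [0.5, 0.3, 0.2]

def toss_word_die(n=10):
    """Generate n words by repeatedly tossing the weighted 'word die'."""
    return random.choices(vocab, weights=probs, k=n)

print(" ".join(toss_word_die()))
```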
Example: Flipping a Coin
• Ω = {Head, Tail}
• Fair coin:
– p(H) = 0.5, p(T) = 0.5
• Unfair coin, e.g.:
– p(H) = 0.3, p(T) = 0.7
• Flipping two fair coins:
– Sample space: {HH, HT, TH, TT}
• Example in modeling text:
– Flip a coin to decide whether or not to include a word in a document
– Sample space = {appear, absence}
Probabilities
• Probabilities
– numbers between 0 and 1
• Probability distribution
– distributes a probability mass of 1 throughout the sample space Ω.
• Example:
– A fair coin is tossed three times.
– What is the probability of 3 heads?
– What is the probability of exactly 2 heads? (worked out below)
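Worked out (standard counting over the 8 equally likely outcomes; not shown on the slide):

```latex
P(\text{3 heads}) = \left(\tfrac{1}{2}\right)^{3} = \tfrac{1}{8},
\qquad
P(\text{exactly 2 heads}) = \binom{3}{2}\left(\tfrac{1}{2}\right)^{3} = \tfrac{3}{8}
```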
Meaning of probabilities
• Frequentist
– I threw the coin 10 times and it turned up heads 5 times
• Subjective
– I am willing to bet 50 cents on heads
Probabilities
• Joint probability: P(A ∩ B), also written as P(A, B)
• Conditional Probability:
P(B|A) = P(A ∩ B)/P(A)
P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
P(A|B) = P(B|A)P(A)/P(B) (Bayes’ Rule)
For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)
• Total probability:
If A1, …, An form a partition of S, then
P(B) = P(B ∩ S) = P(B, A1) + … + P(B, An) (why?)
So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) = P(B|Ai)P(Ai)/[P(B|A1)P(A1)+…+P(B|An)P(An)]
This allows us to compute P(Ai|B) based on P(B|Ai)
Conditional Probability
• Prior and posterior probability
• Conditional probability P(A|B)
P(A|B) = P(A ∩ B) / P(B)
[Venn diagram: events A and B, with the overlap A ∩ B]
Properties of Probabilities
• p(∅) = 0
• P(certain event) = 1
• p(X) ≤ p(Y), if X ⊆ Y
• p(X ∪ Y) = p(X) + p(Y), if X ∩ Y = ∅
Conditional Probability
125. Bayes Theorem
Bayes Theorem
• Formula for joint probability
p(A,B) = p(B|A)p(A)
p(A,B) = p(A|B)p(B)
• Therefore
p(B|A) = p(A|B)p(B)/p(A)
• Bayes’ theorem is used to calculate P(A|B) given
P(B|A)
Example
• Diagnostic test
• Test accuracy
p(positive | ¬disease) = 0.05 – false positive rate
p(negative | disease) = 0.05 – false negative rate
So: p(positive | disease) = 1 − 0.05 = 0.95
and likewise p(negative | ¬disease) = 1 − 0.05 = 0.95
In general the rates of false positives and false negatives
can be different
Example
P(A|B), where A = TEST result (columns) and B = DISEASE (rows)

B = DISEASE      A = Positive    A = Negative
Yes              0.95            0.05
No               0.05            0.95
Example
• What is p(disease | positive)?
– P(disease|positive) = P(positive|disease)*P(disease)/P(positive)
– P(¬disease|positive) = P(positive|¬disease)*P(¬disease)/P(positive)
– P(disease|positive)/P(¬disease|positive) = ? (see the numeric sketch below)
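A small numeric sketch of this calculation via Bayes’ rule and total probability. The prior P(disease) is not given on the slides, so the 0.01 below is a purely hypothetical value:

```python
# Bayes' rule for the diagnostic test, using the rates from the table above.
p_pos_given_dis = 0.95      # P(positive | disease), true positive rate
p_pos_given_nodis = 0.05    # P(positive | not disease), false positive rate
p_dis = 0.01                # P(disease): hypothetical prior, not from the slides

# Total probability: P(positive) = P(pos|dis)P(dis) + P(pos|not dis)P(not dis)
p_pos = p_pos_given_dis * p_dis + p_pos_given_nodis * (1 - p_dis)

p_dis_given_pos = p_pos_given_dis * p_dis / p_pos
print(f"P(disease | positive) = {p_dis_given_pos:.3f}")  # about 0.161 with this prior
```

Even with an accurate test, a rare disease yields a modest posterior, which is why the prior matters.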
211. Language Models
Probabilistic Language Models
• Assign a probability to a sentence
– P(S) = P(w1,w2,w3,...,wn)
• The sum of the probabilities of all possible sentences must be 1.
• Predicting the next word
– Let’s meet in Times …
– General Electric has lost some market …
• Formula
– P(wn|w1,w2,...,wn-1) (the chain rule connecting this to P(S) is recalled below)
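For completeness, the chain rule (standard probability, not shown as extracted) connects P(S) to these next-word conditionals; n-gram models then approximate each conditional using only the last few words:

```latex
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1})
```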
Predicting the Next Word
• What word follows “your”? http://norvig.com/ngrams/count_2w.txt
• Unigram example:
– The word “pizza” appears 700 times in a corpus of 10,000,000 words.
– Therefore the MLE for its probability is P’(“pizza”) = 700/10,000,000 = 0.00007
• Bigram example:
– The word “with” appears 1,000 times in the corpus.
– The phrase “with spinach” appears 6 times
– Therefore the MLE is P’(spinach|with) = 6/1,000 = 0.006
• These estimates may not be good for corpora from other genres (a counting sketch follows below)
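A minimal counting sketch of these maximum likelihood estimates; the toy corpus below is invented just to show the two formulas c(w)/N and c(w_prev, w)/c(w_prev):

```python
from collections import Counter

def mle_estimates(tokens):
    """Unigram and bigram MLEs from a list of tokens (a minimal sketch)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    p_unigram = {w: c / n for w, c in unigrams.items()}
    p_bigram = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}
    return p_unigram, p_bigram

# Tiny invented corpus; real estimates need far more data.
tokens = "we eat pizza with spinach and we eat pizza with cheese".split()
p_uni, p_bi = mle_estimates(tokens)
print(p_uni["pizza"])             # c(pizza)/N
print(p_bi[("with", "spinach")])  # c(with, spinach)/c(with)
```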
Probability of a Sentence
P(“<S> I will see you on Monday</S>”) =
P(I|<S>)
x P(will|I)
x P(see|will)
x P(you|see)
x P(on|you)
x P(Monday|on)
x P(</S>|Monday)
Probability of a Sentence
Example from Jane Austen
• P(“Elizabeth looked at Darcy”)
• Use maximum likelihood estimates for the n-gram probabilities
– unigram: P(wi) = c(wi)/N, where N is the total number of tokens (617,091 here)
– bigram: P(wi|wi-1) = c(wi-1,wi)/c(wi-1)
• Values
– P(“Elizabeth”) = 474/617091 = .000768120
– P(“looked|Elizabeth”) = 5/474 = .010548523
– P(“at|looked”) = 74/337 = .219584569
– P(“Darcy|at”) = 3/4055 = .000739827
• Bigram probability
– P(“Elizabeth looked at Darcy”) = .000000001316 = 1.3 × 10^-9
• Unigram probability
– P(“Elizabeth looked at Darcy”) = 474/617091 × 337/617091 × 4055/617091 × 304/617091 =
.000000000001357 ≈ 1.4 × 10^-12
• Unigram:
– generate a word, then generate the next one, until you generate </S>.
W1 W2 W3 … Wn </S>
• Bigram:
– generate <S>, generate a word, then generate the next one based on
the previous one, etc., until you generate </S> (a minimal scoring sketch follows below).
<S> W1 W2 W3 … Wn </S>
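A minimal scoring sketch of the bigram chain shown above. The probabilities in the table are placeholders, not estimates from any corpus; summing log probabilities avoids numeric underflow when sentences get long:

```python
import math

def sentence_logprob(sentence, bigram_p):
    """Log probability of a sentence under a bigram model with <S> and </S> markers."""
    words = ["<S>"] + sentence.split() + ["</S>"]
    logp = 0.0
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram_p[(prev, cur)])  # raises KeyError on unseen bigrams -> smoothing
    return logp

# Placeholder bigram probabilities, just to make the chain of conditionals concrete.
bigram_p = {("<S>", "I"): 0.2, ("I", "will"): 0.1, ("will", "see"): 0.05,
            ("see", "you"): 0.3, ("you", "on"): 0.1, ("on", "Monday"): 0.02,
            ("Monday", "</S>"): 0.4}
print(math.exp(sentence_logprob("I will see you on Monday", bigram_p)))
```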
Engineering Trick
• http://www.speech.cs.cmu.edu/SLM_info.html
• http://www.speech.sri.com/projects/srilm/
• https://kheafield.com/code/kenlm/
• http://htk.eng.cam.ac.uk/
212. Smoothing and Interpolation
Smoothing
• If the vocabulary size is |V| = 1M
– There are too many parameters to estimate even for a unigram model, let alone bigram or trigram models
– MLE assigns probabilities of 0 to unseen (yet not impossible) data
• Smoothing (regularization)
– Reassigning some probability mass to unseen data
Smoothing
• How to model novel words?
– Or novel bigrams?
– Distributing some of the probability mass to allow for novel events
• Add-one (Laplace) smoothing:
– Bigrams: P(wi|wi-1) = (c(wi-1,wi)+1)/(c(wi-1)+V)
– This method reassigns too much probability mass to unseen events
• It is possible to do add-k instead of add-one
– Neither works well in practice (a sketch of the add-one estimate follows below)
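A sketch of the add-one (Laplace) estimate from the slide, written as add-k so both variants are visible; the toy counts are invented:

```python
from collections import Counter

def addk_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """Add-k estimate (k = 1 gives add-one/Laplace):
    P(w | w_prev) = (c(w_prev, w) + k) / (c(w_prev) + k * V)"""
    return (bigram_counts[(w_prev, w)] + k) / (unigram_counts[w_prev] + k * vocab_size)

tokens = "the cat sat on the mat".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
V = len(uni)
print(addk_bigram_prob("the", "cat", bi, uni, V))  # seen bigram
print(addk_bigram_prob("the", "dog", bi, uni, V))  # unseen bigram still gets probability > 0
```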
Dealing with Sparse Data
213. Evaluation of Language Models
Evaluation of LM
• Extrinsic
– Use in an application
• Intrinsic
– Cheaper
– Based on information theory
• Correlate the two for validation purposes
Information Theory
• Information need
– H(X) = 0 means that we already have all the information we need
– H(X) = 1 means that we need one more bit of information, etc. (the standard definition of entropy H(X) is recalled below)
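The slides reference H(X) without showing its formula; for reference, the standard definition of entropy (in bits) is:

```latex
H(X) = -\sum_{x} P(x)\,\log_2 P(x)
```

For a fair coin H(X) = 1 bit, and for a certain outcome H(X) = 0.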
Entropy Example
• Sample 8-character language: A E I O U F G H
• P(C,V)
Surprise in Syllables
• Example
Polynesian Syllables (cont’d)
Polynesian Syllables (cont’d)
Pointwise Mutual Information
• Measured between two points (outcomes), not two distributions (the standard definition is recalled below)
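The PMI formula itself did not survive extraction; for reference, the standard definition is:

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```

PMI is positive when x and y co-occur more often than independence would predict, and zero when they are independent.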
Perplexity
• Example:
– A sentence consisting of N equiprobable words: p(wi) = 1/k
Per = P(w1w2...wN)^(-1/N)   (the N-th root of 1/P(w1w2...wN))
– Per = ((1/k)^N)^(-1/N) = k
• Perplexity is like a (weighted) branching factor
• Logarithmic version
– the exponent is the average number of bits needed to encode each word (a computation sketch follows below)
Per = 2^(-(1/N) Σi log2 P(wi))
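A small computation sketch of the logarithmic formula above, applied to the equiprobable-words example:

```python
import math

def perplexity(word_probs):
    """Per = 2^(-(1/N) * sum_i log2 P(w_i)), given the per-word probabilities."""
    n = len(word_probs)
    return 2 ** (-sum(math.log2(p) for p in word_probs) / n)

# Sanity check from the slide: words that are all equiprobable with p = 1/k give perplexity k.
k = 10
print(perplexity([1.0 / k] * 5))  # approximately 10.0
```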
The Shannon Game
• Consider the Shannon game:
– Connecticut governor Ned Lamont said ...
• What is the perplexity of guessing a digit if all digits are equally
likely? Do the math.
– 10
• How about a letter?
– 26
214. The Noisy Channel Model
The Noisy Channel Model
• Example:
– Input: Written English (X)
– Encoder: garbles the input (X->Y)
– Output: Spoken English (Y)
• More examples:
– Grammatical English to English with mistakes
– English to bitmaps (characters)
• P(X,Y) = P(X)P(Y|X)
Encoding and Decoding
• Given f, guess e

  e → encoder (E→F) → f → decoder (F→E) → e’

  e’ = argmax_e P(e|f) = argmax_e P(f|e) P(e)

Candidate e        P(f|e)   P(e)
cat rat piano
house white the
the house white
the red house
the small cat
the white house
Example
Candidate e        P(f|e)   P(e)
cat rat piano        -        -
house white the      +        -
the house white
the red house
the small cat
the white house
Example
Candidate e        P(f|e)   P(e)
cat rat piano        -        -
house white the      +        -
the house white      +        -
the red house        -        +
the small cat        -        +
the white house      +        +
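A sketch of the decoding rule e’ = argmax_e P(f|e) P(e) over these candidates. The +/- judgments from the table are replaced by made-up numeric scores purely to make the argmax concrete:

```python
# Hypothetical scores: (P(f|e), P(e)) for each candidate translation e.
candidates = {
    "cat rat piano":   (1e-6, 1e-6),
    "house white the": (1e-2, 1e-6),
    "the house white": (1e-2, 1e-6),
    "the red house":   (1e-6, 1e-3),
    "the small cat":   (1e-6, 1e-3),
    "the white house": (1e-2, 1e-3),
}

# Noisy-channel decoding: pick the e maximizing channel model * language model.
best = max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)  # "the white house" wins: plausible under both P(f|e) and P(e)
```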
Uses of the Noisy Channel Model
• Handwriting recognition
• Text generation
• Text summarization
• Machine translation
• Spelling correction
– See separate lecture on text similarity and edit distance
Spelling Correction
215. Part of Speech Tagging
The POS task
• Example
– Bahrainis vote in second round of parliamentary election
• Jabberwocky (by Lewis Carroll, 1872)
’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
Penn Treebank tagset (1/2)
Tag Description Example
CC coordinating conjunction and
CD cardinal number 1
DT determiner the
EX existential there there is
FW foreign word d‘oeuvre
IN preposition/subordinating conjunction in, of, like
JJ adjective green
JJR adjective, comparative greener
JJS adjective, superlative greenest
LS list marker 1)
MD modal could, will
NN noun, singular or mass table
NNS noun plural tables
NNP proper noun, singular John
NNPS proper noun, plural Vikings
PDT predeterminer both the boys
POS possessive ending friend's
Penn Treebank tagset (2/2)
Tag Description Example
PRP personal pronoun I, he, it
PRP$ possessive pronoun my, his
RB adverb however, usually, naturally, here, good
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to to go, to him
UH interjection uhhuhhuhh
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3d take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when
Universal POS
http://universaldependencies.org/u/pos/
Universal Features
http://universaldependencies.org/u/feat/
Some Observations
• Ambiguity
– count (noun) vs. count (verb)
– 11% of all types but 40% of all tokens in the Brown
corpus are ambiguous.
– Examples
• like can be tagged as ADP VERB ADJ ADV NOUN
• present can be tagged as ADJ NOUN VERB ADV
POS Ambiguity