
How to Use Probabilities

The Crash Course


Jason Eisner

Goals of this lecture

• Probability notation like p(X | Y):


– What does this expression mean?
– How can I manipulate it?
– How can I estimate its value in practice?
• Probability models:
– What is one?
– Can we build one for language ID?
– How do I know if my model is any good?



3 Kinds of Statistics

• descriptive: mean Hopkins SAT (or median)

• confirmatory: statistically significant?

• predictive: wanna bet?   ← this course – why?


Notation for Greenhorns

[diagram: a “probability model” maps the query about “Paul Revere” to the number 0.9]

p(Paul Revere wins | weather’s clear) = 0.9



What does that really mean?
p(Paul Revere wins | weather’s clear) = 0.9

• Past performance?
– Revere’s won 90% of races with clear weather
• Hypothetical performance?
– If he ran the race in many parallel universes …
• Subjective strength of belief?
– Would pay up to 90 cents for chance to win $1
• Output of some computable formula?
– Ok, but then which formulas should we trust?  p(X | Y) versus q(X | Y)
p is a function on event sets
p(win | clear) ≡ p(win, clear) / p(clear)

[Venn diagram: All Events (races), with an oval for “weather’s clear” overlapping an oval for “Paul Revere wins”]
p is a function on event sets
p(win | clear) ≡ p(win, clear) / p(clear)
– the left-hand side is syntactic sugar for the ratio on the right
– the comma denotes logical conjunction of predicates
– “clear” is a predicate selecting the races where the weather’s clear

p measures the total probability of a set of events.
[Venn diagram as above: “weather’s clear” and “Paul Revere wins” within All Events (races)]
Required Properties of p (axioms)

• p(∅) = 0        p(all events) = 1
• p(X) ≤ p(Y) for any X ⊆ Y
• p(X) + p(Y) = p(X ∪ Y) provided X ∩ Y = ∅
  e.g., p(win & clear) + p(win & ¬clear) = p(win)

[Venn diagram as above: p measures the total probability of a set of events]
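To make “p measures the total probability of a set of events” concrete, here is a minimal Python sketch; the ten races and their outcomes are made up purely for illustration, not taken from the slides.

```python
from fractions import Fraction

# Hypothetical toy universe of 10 equally likely past races (invented data).
races = [
    {"clear": True,  "win": True},  {"clear": True,  "win": True},
    {"clear": True,  "win": True},  {"clear": True,  "win": False},
    {"clear": False, "win": False}, {"clear": False, "win": True},
    {"clear": False, "win": False}, {"clear": True,  "win": True},
    {"clear": True,  "win": False}, {"clear": False, "win": False},
]

def p(predicate):
    """p of a predicate = total probability of the set of events satisfying it."""
    return Fraction(sum(1 for e in races if predicate(e)), len(races))

clear = lambda e: e["clear"]                       # predicate selecting races with clear weather
win_and_clear = lambda e: e["win"] and e["clear"]  # the comma = logical conjunction

# Axioms: p(empty set) = 0 and p(all events) = 1
assert p(lambda e: False) == 0 and p(lambda e: True) == 1

# Conditional probability as syntactic sugar: p(win | clear) = p(win, clear) / p(clear)
print(p(win_and_clear) / p(clear))   # 2/3 on this toy data
```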
Commas denote conjunction
p(Paul Revere wins, Valentine places, Epitaph
shows | weather’s clear)
what happens as we add conjuncts to left of bar ?
• probability can only decrease
• numerator of historical estimate likely to go to zero:
(# times Revere wins AND Val places … AND weather’s clear) / (# times weather’s clear)


Commas denote conjunction
p(Paul Revere wins, Valentine places, Epitaph
shows | weather’s clear)
p(Paul Revere wins | weather’s clear, ground is
dry, jockey getting over sprain, Epitaph also in race, Epitaph
was recently bought by Gonzalez, race is on May 17, … )
what happens as we add conjuncts to right of bar ?
• probability could increase or decrease
• probability gets more relevant to our case (less bias)
• probability estimate gets less reliable (more variance)
(# times Revere wins AND weather clear AND … it’s May 17) / (# times weather clear AND … it’s May 17)
Simplifying Right Side: Backing Off

p(Paul Revere wins | weather’s clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, … )
Back off by dropping most of the conditions to the right of the bar, e.g. down to p(Paul Revere wins | weather’s clear): not exactly what we want, but at least we can get a reasonable estimate of it! (i.e., more bias but less variance)
Try to keep the conditions that we suspect will have the most influence on whether Paul Revere wins.
Simplifying Left Side: Backing Off
p(Paul Revere wins, Valentine places, Epitaph shows | weather’s clear)
Crossing off conjuncts to the left of the bar is NOT ALLOWED!
But we can do something similar to help …


Factoring Left Side: The Chain Rule
p(Revere, Valentine, Epitaph | weather’s clear)          RVEW/W
  = p(Revere | Valentine, Epitaph, weather’s clear)      = RVEW/VEW
  * p(Valentine | Epitaph, weather’s clear)              * VEW/EW
  * p(Epitaph | weather’s clear)                         * EW/W
True because numerators cancel against denominators
Makes perfect sense when read from bottom to top
Moves material to right of bar so it can be ignored

If this prob is unchanged by backoff, we say Revere was CONDITIONALLY INDEPENDENT of Valentine and Epitaph (conditioned on the weather’s being clear). Often we just ASSUME conditional independence to get the nice product above.
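To spell out “numerators cancel against denominators”: reading RVEW as the number of races in which Revere wins, Valentine places, Epitaph shows, and the weather’s clear (and similarly for the shorter abbreviations), the product telescopes:

$$
\frac{RVEW}{VEW}\cdot\frac{VEW}{EW}\cdot\frac{EW}{W} \;=\; \frac{RVEW}{W},
$$

which is exactly the historical estimate of p(Revere, Valentine, Epitaph | weather’s clear).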
Remember Language ID?
• “Horses and Lukasiewicz are on the curriculum.”

• Is this English or Polish or what?


• We had some notion of using n-gram models …

• Is it “good” (= likely) English?


• Is it “good” (= likely) Polish?

• Space of events will not be races but character sequences (x1, x2, x3, …) where xn = EOS


Remember Language ID?

• Let p(X) = probability of text X in English


• Let q(X) = probability of text X in Polish
• Which probability is higher?
– (we’d also like bias toward English since it’s
more likely a priori – ignore that for now)

“Horses and Lukasiewicz are on the curriculum.”

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)



Apply the Chain Rule

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  = p(x1=h)                                  4470/52108
  * p(x2=o | x1=h)                           395/4470
  * p(x3=r | x1=h, x2=o)                     5/395
  * p(x4=s | x1=h, x2=o, x3=r)               3/5
  * p(x5=e | x1=h, x2=o, x3=r, x4=s)         3/3
  * p(x6=s | x1=h, x2=o, x3=r, x4=s, x5=e)   0/3
  * …  = 0
  (counts from Brown corpus)
Back Off On Right Side

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(x1=h)                  4470/52108
  * p(x2=o | x1=h)           395/4470
  * p(x3=r | x1=h, x2=o)     5/395
  * p(x4=s | x2=o, x3=r)     12/919
  * p(x5=e | x3=r, x4=s)     12/126
  * p(x6=s | x4=s, x5=e)     3/485
  * …  = 7.3e-10 * …
  (counts from Brown corpus)
Change the Notation

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(x1=h)                            4470/52108
  * p(x2=o | x1=h)                     395/4470
  * p(xi=r | xi-2=h, xi-1=o, i=3)      5/395
  * p(xi=s | xi-2=o, xi-1=r, i=4)      12/919
  * p(xi=e | xi-2=r, xi-1=s, i=5)      12/126
  * p(xi=s | xi-2=s, xi-1=e, i=6)      3/485
  * …  = 7.3e-10 * …
  (counts from Brown corpus)
Another Independence Assumption

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(x1=h)                       4470/52108
  * p(x2=o | x1=h)                395/4470
  * p(xi=r | xi-2=h, xi-1=o)      1417/14765
  * p(xi=s | xi-2=o, xi-1=r)      1573/26412
  * p(xi=e | xi-2=r, xi-1=s)      1610/12253
  * p(xi=s | xi-2=s, xi-1=e)      2044/21250
  * …  = 5.4e-7 * …
  (counts from Brown corpus)
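To double-check the arithmetic, a small Python sketch that just multiplies out the count ratios quoted above: first with the backed-off trigram conditioning that still fixes the position i, then after the extra independence assumption that the position doesn’t matter.

```python
from math import prod

# Brown-corpus count ratios quoted on the slides above.
with_position = [(4470, 52108), (395, 4470), (5, 395),
                 (12, 919), (12, 126), (3, 485)]            # conditions still mention i
without_position = [(4470, 52108), (395, 4470), (1417, 14765),
                    (1573, 26412), (1610, 12253), (2044, 21250)]

for name, ratios in [("with position i", with_position),
                     ("ignoring position i", without_position)]:
    print(name, "%.2e" % prod(n / d for n, d in ratios))
# prints 7.38e-10 and 5.48e-07 – the 7.3e-10 and 5.4e-7 quoted above, up to rounding
```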
Simplify the Notation

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(x1=h)           4470/52108
  * p(x2=o | x1=h)    395/4470
  * p(r | h, o)       1417/14765
  * p(s | o, r)       1573/26412
  * p(e | r, s)       1610/12253
  * p(s | s, e)       2044/21250
  * …
  (counts from Brown corpus)
Simplify the Notation

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ p(h | BOS, BOS)   4470/52108
  * p(o | BOS, h)     395/4470
  * p(r | h, o)       1417/14765
  * p(s | o, r)       1573/26412
  * p(e | r, s)       1610/12253
  * p(s | s, e)       2044/21250
  * …
  (counts from Brown corpus)
These are the parameters of our old trigram generator – the same assumptions about language. The numbers are the values of those parameters, as naively estimated from the Brown corpus. These basic probabilities are used to define p(horses).
Simplify the Notation

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ t BOS,BOS,h   4470/52108
  * t BOS,h,o     395/4470
  * t h,o,r       1417/14765
  * t o,r,s       1573/26412
  * t r,s,e       1610/12253
  * t s,e,s       2044/21250
  * …
  (counts from Brown corpus)
These are the parameters of our old trigram generator, with the same assumptions about language; the numbers are the values of those parameters, as naively estimated from the Brown corpus.
This notation emphasizes that they’re just real variables whose value must be estimated.
Definition: Probability Model

[diagram: parameter values plugged into the Trigram Model definition of p (defined in terms of parameters like t h,o,r and t o,r,s); the resulting model can generate random text and find event probabilities]
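A minimal Python sketch of what that diagram describes: an object holding parameter values (a table t of trigram parameters) plus the model definition, which can then find event probabilities and generate random text. The parameter table below is a made-up placeholder, not real Brown-corpus estimates, and it is assumed to cover every context it is asked about.

```python
import random

class TrigramModel:
    """A probability model: parameter values + a definition of p in terms of them."""

    def __init__(self, t):
        # t maps (x_{i-2}, x_{i-1}, x_i) to t_{x_{i-2}, x_{i-1}, x_i} = p(x_i | x_{i-2}, x_{i-1})
        self.t = t

    def prob(self, text):
        """Find the probability of an event: a character sequence padded with BOS/EOS."""
        chars = ["BOS", "BOS"] + list(text) + ["EOS"]
        result = 1.0
        for i in range(2, len(chars)):
            result *= self.t.get((chars[i - 2], chars[i - 1], chars[i]), 0.0)
        return result

    def generate(self):
        """Generate random text, sampling each character given the previous two."""
        prev2, prev1, out = "BOS", "BOS", []
        while True:
            nxt = {c: v for (a, b, c), v in self.t.items() if (a, b) == (prev2, prev1)}
            c = random.choices(list(nxt), weights=list(nxt.values()))[0]
            if c == "EOS":
                return "".join(out)
            out.append(c)
            prev2, prev1 = prev1, c

# Toy parameter values (placeholders): this model can only say "hi".
t = {("BOS", "BOS", "h"): 1.0, ("BOS", "h", "i"): 1.0, ("h", "i", "EOS"): 1.0}
model = TrigramModel(t)
print(model.prob("hi"), model.generate())   # 1.0 hi
```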
English vs. Polish

[diagram: the same Trigram Model definition (in terms of parameters like t h,o,r and t o,r,s), combined with English parameter values, gives the definition of p, used to compute p(X); combined with Polish parameter values, it gives the definition of q, used to compute q(X)]
What is “X” in p(X)?
• Element of some implicit “event space”
  • e.g., race
  • e.g., sentence
• What if event is a whole text?
  • p(text)
    = p(sentence 1, sentence 2, …)
    = p(sentence 1)
    * p(sentence 2 | sentence 1)
    * …
What is “X” in “p(X)”?
• Element of some implicit “event space”
• e.g., race, sentence, text …
• Suppose an event is a sequence of letters:
p(horses)

• But we rewrote p(horses) as
  p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  p(x1=h) * p(x2=o | x1=h) * …
• What does this variable=value notation mean?


Random Variables:
What is “variable” in “p(variable=value)”?
Answer: variable is really a function of Event
• p(x1=h) * p(x2=o | x1=h) * …
• Event is a sequence of letters
• x2 is the second letter in the sequence
• p(number of heads=2) or just p(H=2)
• Event is a sequence of 3 coin flips
• H is the number of heads
• p(weather’s clear=true) or just p(weather’s clear)
  • Event is a race
  • weather’s clear is true or false
Random Variables:
What is “variable” in “p(variable=value)”?
Answer: variable is really a function of Event
• p(x1=h) * p(x2=o | x1=h) * …
• Event is a sequence of letters
• x2(Event) is the second letter in the sequence
• p(number of heads=2) or just p(H=2)
• Event is a sequence of 3 coin flips
• H(Event) is the number of heads
• p(weather’s clear=true) or just p(weather’s clear)
  • Event is a race
  • weather’s clear (Event) is true or false
Random Variables:
What is “variable” in “p(variable=value)”?

• p(number of heads=2) or just p(H=2)


• Event is a sequence of 3 coin flips
• H is the number of heads in the event
• So p(H=2)
= p(H(Event)=2) picks out the set of events with 2 heads
= p({HHT,HTH,THH})
= p(HHT)+p(HTH)+p(THH)
(All Events: TTT, TTH, THT, THH, HTT, HTH, HHT, HHH)
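The same computation in a few lines of Python, treating each equally likely 3-flip sequence as an event and H as a function of the event:

```python
from itertools import product

events = ["".join(flips) for flips in product("HT", repeat=3)]  # the 8 events, equally likely
H = lambda event: event.count("H")                              # H is a function of the event

# p(H=2) picks out the set of events with exactly 2 heads: {HHT, HTH, THH}
print(sum(1 for e in events if H(e) == 2) / len(events))        # 3/8 = 0.375
```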


Random Variables:
What is “variable” in “p(variable=value)”?

• p(weather’s clear)
• Event is a race
• weather’s clear is true or false of the event

• So p(weather’s clear)
= p(weather’s clear(Event)=true)
picks out the set of events with clear weather
[Venn diagram as before: “weather’s clear” and “Paul Revere wins” within All Events (races)]
p(win | clear) ≡ p(win, clear) / p(clear)
Random Variables:
What is “variable” in “p(variable=value)”?

• p(x1=h) * p(x2=o | x1=h) * …


• Event is a sequence of letters
• x2 is the second letter in the sequence
• So p(x2=o)
= p(x2(Event)=o) picks out the set of events with …
= Σ p(Event) over all events whose second letter …
= p(horses) + p(boffo) + p(xoyzkklp) + …



Back to trigram model of p(horses)

p(x1=h, x2=o, x3=r, x4=s, x5=e, x6=s, …)
  ≈ t BOS,BOS,h   4470/52108
  * t BOS,h,o     395/4470
  * t h,o,r       1417/14765
  * t o,r,s       1573/26412
  * t r,s,e       1610/12253
  * t s,e,s       2044/21250
  * …
  (counts from Brown corpus)
These are the parameters of our old trigram generator, with the same assumptions about language; the numbers are the values of those parameters, as naively estimated from the Brown corpus.
This notation emphasizes that they’re just real variables whose value must be estimated.
A Different Model
• Exploit fact that horses is a common word

p(W1 = horses)
where word vector W is a function of the event (the sentence) just as
character vector X is.
= p(Wi = horses | i=1)
≈ p(Wi = horses) = 7.2e-5
independence assumption says that sentence-initial words w1 are just like
all other words wi (gives us more data to use)

Much larger than previous estimate of 5.4e-7 – why?


Advantages, disadvantages?



Improving the New Model:
Weaken the Indep. Assumption
• Don’t totally cross off i=1 since it’s not irrelevant:
– Yes, horses is common, but less so at start of sentence since most
sentences start with determiners.
p(W1 = horses) = Σt p(W1=horses, T1=t)
  = Σt p(W1=horses | T1=t) * p(T1=t)
  = Σt p(Wi=horses | Ti=t, i=1) * p(T1=t)
  ≈ Σt p(Wi=horses | Ti=t) * p(T1=t)
  = p(Wi=horses | Ti=PlNoun) * p(T1=PlNoun)
    (if first factor is 0 for any other part of speech)
  ≈ (72 / 55912) * (977 / 52108)
  = 2.4e-5
Which Model is Better?

• Model 1 – predict each letter Xi from previous 2 letters Xi-2, Xi-1
• Model 2 – predict each word Wi by its part
of speech Ti, having predicted Ti from i

• Models make different independence assumptions that reflect different intuitions
• Which intuition is better???
Measure Performance!
• Which model does better on language ID?
– Administer test where you know the right answers
– Seal up test data until the test happens
• Simulates real-world conditions where new data comes along that
you didn’t have access to when choosing or training model
– In practice, split off a test set as soon as you obtain the
data, and never look at it
– Need enough test data to get statistical significance
• For a different task (e.g., speech transcription instead
of language ID), use that task to evaluate the models

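A minimal sketch of the “seal up the test data” discipline; the helper below is hypothetical, shown only to illustrate splitting once (with a fixed seed) as soon as the data arrive and touching the held-out part only at evaluation time.

```python
import random

def split_off_test_set(examples, test_fraction=0.1, seed=0):
    """Split once, up front; train/tune on the first part, evaluate on the second at the very end."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (training data, sealed test data)
```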


Bayes’ Theorem
• p(A | B) = p(B | A) * p(A) / p(B)

• Easy to check by removing syntactic sugar


• Use 1: Converts p(B | A) to p(A | B)
• Use 2: Updates p(A) to p(A | B)

• Stare at it so you’ll recognize it later

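The check by removing the syntactic sugar, i.e. using the definition p(A | B) ≡ p(A, B) / p(B):

$$
p(A \mid B) \;=\; \frac{p(A,B)}{p(B)} \;=\; \frac{p(A,B)}{p(A)}\cdot\frac{p(A)}{p(B)} \;=\; \frac{p(B \mid A)\,p(A)}{p(B)} .
$$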


Language ID
• Given a sentence x, I suggested comparing its prob in
different languages:
– p(SENT=x | LANG=english)   (i.e., p_english(SENT=x))
– p(SENT=x | LANG=polish)    (i.e., p_polish(SENT=x))
– p(SENT=x | LANG=xhosa)     (i.e., p_xhosa(SENT=x))

• But surely for language ID we should compare


– p(LANG=english | SENT=x)
– p(LANG=polish | SENT=x)
– p(LANG=xhosa | SENT=x)



Language ID
• For language ID we should compare
– p(LANG=english | SENT=x)
– p(LANG=polish | SENT=x) a posteriori
– p(LANG=xhosa | SENT=x)
• For ease, multiply by p(SENT=x) and compare
– p(LANG=english, SENT=x)
– p(LANG=polish, SENT=x)
– p(LANG=xhosa, SENT=x)
(the sum of these is a way to find p(SENT=x); we can divide back by that to get the posterior probs)
• Must know prior probabilities; then rewrite as
– p(LANG=english) * p(SENT=x | LANG=english)
– p(LANG=polish) * p(SENT=x | LANG=polish)
– p(LANG=xhosa) * p(SENT=x | LANG=xhosa)
(a priori) × (likelihood – what we had before)
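A hedged Python sketch of this comparison; the priors and likelihoods below are invented stand-ins, whereas in the lecture’s setup the likelihoods p(SENT=x | LANG) would come from per-language trigram models such as p and q above.

```python
def language_id(sentence, likelihoods, priors):
    """Compare p(LANG) * p(SENT=x | LANG) across languages; return the winner and posteriors."""
    joint = {lang: priors[lang] * likelihoods[lang](sentence) for lang in likelihoods}
    p_sentence = sum(joint.values())                      # summing the joints gives p(SENT=x)
    posterior = {lang: j / p_sentence for lang, j in joint.items()}
    return max(posterior, key=posterior.get), posterior

# Invented numbers, for illustration only:
likelihoods = {"english": lambda x: 5.4e-7, "polish": lambda x: 1.0e-9}
priors = {"english": 0.7, "polish": 0.3}
print(language_id("Horses and Lukasiewicz are on the curriculum.", likelihoods, priors))
# ('english', {'english': 0.999..., 'polish': 0.000...})
```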
General Case (“noisy channel”)
“noisy channel”:  a → [mess up a into b] → b,  with p(A=a) and p(B=b | A=a)
  examples:  language → text,  text → speech,  spelled → misspelled,  English → French
“decoder”: most likely reconstruction of a:
  maximize p(A=a | B=b) = p(A=a) p(B=b | A=a) / p(B=b)
                        = p(A=a) p(B=b | A=a) / Σa' p(A=a') p(B=b | A=a')
Language ID
• For language ID we should compare
– p(LANG=english | SENT=x)
– p(LANG=polish | SENT=x) a posteriori
– p(LANG=xhosa | SENT=x)
• For ease, multiply by p(SENT=x) and compare
– p(LANG=english, SENT=x)
– p(LANG=polish, SENT=x)
– p(LANG=xhosa, SENT=x)
• Must know prior probabilities; then rewrite as
– p(LANG=english) * p(SENT=x | LANG=english)
– p(LANG=polish) * p(SENT=x | LANG=polish)
– p(LANG=xhosa) * p(SENT=x | LANG=xhosa)
(a priori) × (likelihood)
General Case (“noisy channel”)
• Want most likely A to have generated evidence B
– p(A = a1 | B = b)
– p(A = a2 | B = b) a posteriori
– p(A = a3 | B = b)
• For ease, multiply by p(B=b) and compare
– p(A = a1, B = b)
– p(A = a2, B = b)
– p(A = a3, B = b)
• Must know prior probabilities; then rewrite as
– p(A = a1) * p(B = b | A = a1)
– p(A = a2) * p(B = b | A = a2)
– p(A = a3) * p(B = b | A = a3)
(a priori) × (likelihood)
Speech Recognition
• For baby speech recognition we should compare
– p(MEANING=gimme | SOUND=uhh)
– p(MEANING=changeme | SOUND=uhh) a posteriori
– p(MEANING=loveme | SOUND=uhh)
• For ease, multiply by p(SOUND=uhh) & compare
– p(MEANING=gimme, SOUND=uhh)
– p(MEANING=changeme, SOUND=uhh)
– p(MEANING=loveme, SOUND=uhh)
• Must know prior probabilities; then rewrite as
– p(MEAN=gimme) * p(SOUND=uhh | MEAN=gimme)
– p(MEAN=changeme) * p(SOUND=uhh | MEAN=changeme)
– p(MEAN=loveme) * p(SOUND=uhh | MEAN=loveme)
(a priori) × (likelihood)
Life or Death!
• p(diseased) = 0.001, so p(¬diseased) = 0.999
• p(positive test | ¬diseased) = 0.05   “false pos”
• p(negative test | diseased) = x ≈ 0   “false neg”
  so p(positive test | diseased) = 1−x ≈ 1

• What is p(diseased | positive test)?


– don’t panic – still very small:  at most about 1/51, whatever x is

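Working the last slide out with Bayes’ theorem, in the worst case x = 0 (so p(positive | diseased) = 1):

$$
p(\text{diseased}\mid\text{positive})
= \frac{p(\text{positive}\mid\text{diseased})\,p(\text{diseased})}
       {p(\text{positive}\mid\text{diseased})\,p(\text{diseased}) + p(\text{positive}\mid\neg\text{diseased})\,p(\neg\text{diseased})}
= \frac{1\cdot 0.001}{1\cdot 0.001 + 0.05\cdot 0.999} \approx 0.0196,
$$

about 1 in 51: the positive test raises the probability of disease from 0.1% to roughly 2%, which is why there is no need to panic.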
