USC Linguistics

p(α, β) = p(α|β) p(β) = p(α) p(β|α)    (7)

p(α|β) = p(α) p(β|α) / p(β)    (8)    : Basic Bayes

Table 1: n = counts
∗ Thanks to David Wilczynski and USC’s CSCI 561 slides for the general gist of this brief introduction. Also to Grenager’s Stanford lecture notes (http://www-nlp.stanford.edu/~grenager/cs121/handouts/cs121_lecture06_4pp.pdf), and particularly John A. Carroll’s Sussex notes (http://www.informatics.susx.ac.uk/courses/nlp/lecturenotes/corpus2.pdf) for the tip on feature products; also Wikipedia for its clear presentation of multiple variables.
Extending to more variables:
p(fi|ℓ) = c(fi, ℓ) / Σ_j c(fj, ℓ)    (10)

p(ℓ) = c(ℓ) / Σ_i c(ℓi)    (11)
A new event is assigned the label which maximizes the following product.
p(ℓ) Π_i p(fi|ℓ)    (12)
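As a sketch of this decision rule (the tiny spam/ham dataset and all names below are invented for illustration), the count estimates (10)–(11) plug directly into the product (12):

```python
from collections import Counter

# Hypothetical training data: (label, features) pairs.
data = [
    ("spam", ["buy", "now"]),
    ("spam", ["buy", "cheap"]),
    ("ham",  ["meeting", "now"]),
]

label_counts = Counter(lbl for lbl, _ in data)                   # c(ℓ)
feat_counts = Counter((lbl, f) for lbl, fs in data for f in fs)  # c(f, ℓ)
feats_per_label = Counter()                                      # Σ_j c(f_j, ℓ)
for (lbl, f), n in feat_counts.items():
    feats_per_label[lbl] += n

def score(label, features):
    # p(ℓ) · Π_i p(f_i|ℓ), using the estimates (10) and (11)
    p = label_counts[label] / sum(label_counts.values())
    for f in features:
        p *= feat_counts[(label, f)] / feats_per_label[label]
    return p

def classify(features):
    return max(label_counts, key=lambda lbl: score(lbl, features))

print(classify(["buy", "now"]))  # → spam
```

Note that a single unseen feature zeroes the whole product for that label, which is the motivation for the smoothing schemes in section 2.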
1.1 Problems
(p(α, β|γ) = p(α|γ)p(β|γ)) ↔ (p(α|β, γ) = p(α|γ)) (16)
Much thanks to Greg Lawler, of the University of Chicago, who, in a fortuitous meeting on a flight, provided this elegant counterexample:
• β: green die = 1
• γ: red die = 6
But in any case:
p(ℓ|f0, ..., fn) = p(ℓ) Π_{i=0}^{n} p(fi|ℓ) / Π_{i=0}^{n} p(fi)    (23)
Also notice how this equation can give p>1:
p(ℓ) = c(ℓ) / c(L) ;  p(f) = c(f) / c(F)    (25)

p(ℓ|f) = [c(ℓ)/c(L)] [c(ℓ, f)/c(ℓ)] / [c(f)/c(F)] = [c(ℓ, f)/c(L)] / [c(f)/c(F)] = c(ℓ, f) / c(f)    (26)
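A concrete instance of the p > 1 problem (all counts invented): with two perfectly correlated features, equation (23) evaluates to a "probability" of 2.

```python
# Hypothetical counts over N = 10 events: label ℓ occurs 5 times, and
# features f0 and f1 each occur exactly on those same 5 events.
N = 10
p_l = 5 / N            # p(ℓ)
p_f0_given_l = 1.0     # f0 always accompanies ℓ
p_f1_given_l = 1.0     # f1 always accompanies ℓ
p_f0 = 5 / N
p_f1 = 5 / N

# Equation (23): treat the features as independent in the denominator.
posterior = p_l * p_f0_given_l * p_f1_given_l / (p_f0 * p_f1)
print(posterior)  # → 2.0, i.e. "p" > 1
```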
2 Smoothing
2.1 Linear Interpolation
α controls the weight given to the non-conditioned estimate; tune α on held-out data.
p(x|y) = αp̂(x|y) + (1 − α)p̂(x) (27)
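A minimal sketch of (27), assuming the two estimates p̂(x|y) and p̂(x) have already been computed (the values below are made up):

```python
def interpolate(p_cond, p_marg, alpha):
    # Equation (27): blend the conditioned and non-conditioned estimates.
    return alpha * p_cond + (1 - alpha) * p_marg

# A pair unseen in the conditioned counts (p̂(x|y) = 0) still gets mass
# from the marginal estimate:
print(interpolate(0.0, 0.1, 0.9))  # ≈ 0.01
```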
2.2 Laplace
p(x) = (c(x) + k) / Σ_x [c(x) + k] = (c(x) + k) / (N + k|X|)    (28)
p(x|y) = (c(x, y) + k) / (c(y) + k|X|)    (29)
If k = 1, we are pretending we saw everything once more than we actually did, even things that we never saw!
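A sketch of add-k smoothing as in (28); the counts and vocabulary size here are hypothetical:

```python
from collections import Counter

def laplace(counts, x, k=1, vocab_size=None):
    # Equation (28): add k to every count, including for unseen x,
    # so the denominator grows by k per vocabulary item: N + k|X|.
    if vocab_size is None:
        vocab_size = len(counts)
    n = sum(counts.values())
    return (counts[x] + k) / (n + k * vocab_size)

counts = Counter({"a": 3, "b": 1})
print(laplace(counts, "a", k=1, vocab_size=3))  # (3+1)/(4+3) = 4/7
print(laplace(counts, "c", k=1, vocab_size=3))  # unseen: (0+1)/(4+3) = 1/7
```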
2.3 Another caveat?
p(ℓ) = (c(ℓ) + |F|) / (N + |F L|)    (30)

p(f|ℓ) = (c(f, ℓ) + 1) / (c(ℓ) + |F|)    (31)

p(f) = (c(f) + |L|) / (N + |F L|)    (32)
|F | is the number of feature types, |L| is the number of label types,
and |F L| is their product.
∴ p(ℓ|f) = (c(ℓ, f) + 1) / (c(f) + |L|)    (33)
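A quick numeric check (all counts invented) that the smoothed estimates (30)–(32) combine, via Bayes, back into (33):

```python
# Hypothetical counts: N events, |F| feature types, |L| label types.
N, F, L = 100, 4, 2
c_l, c_f, c_lf = 30, 25, 10   # c(ℓ), c(f), c(ℓ, f)

p_l = (c_l + F) / (N + F * L)         # (30)
p_f_given_l = (c_lf + 1) / (c_l + F)  # (31)
p_f = (c_f + L) / (N + F * L)         # (32)

lhs = p_l * p_f_given_l / p_f         # Bayes, as in (8)
rhs = (c_lf + 1) / (c_f + L)          # (33)
print(abs(lhs - rhs) < 1e-12)         # → True
```

The N + |F L| denominators of (30) and (32) cancel, which is what makes the simple form of (33) drop out.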