
Basic Bayes∗

USC Linguistics

December 20, 2007


        α     ¬α
 β     n11   n01
¬β     n10   n00

Table 1: n = counts

N = n11 + n01 + n10 + n00 (1)


p(α, β) = n11/N (2)

p(α) = (n11 + n10)/N (3)

p(β|α) = n11/(n11 + n10) = p(α, β)/p(α) (4)

p(β) = (n11 + n01)/N (5)

p(α|β) = n11/(n11 + n01) = p(α, β)/p(β) (6)

p(α, β) = p(α|β)p(β) = p(α)p(β|α) (7)


“Bayes’ Theorem”:

p(α|β) = p(α)p(β|α)/p(β) (8)
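As a quick numeric check of (1)-(8), here is a minimal Python sketch; the counts n11, n01, n10, n00 are hypothetical, not taken from any real data:

    # Hypothetical counts for the 2x2 table (Table 1).
    n11, n01, n10, n00 = 30, 10, 20, 40
    N = n11 + n01 + n10 + n00                        # eq (1)

    p_ab  = n11 / N                                  # p(alpha, beta), eq (2)
    p_a   = (n11 + n10) / N                          # p(alpha), eq (3)
    p_b_a = n11 / (n11 + n10)                        # p(beta | alpha), eq (4)
    p_b   = (n11 + n01) / N                          # p(beta), eq (5)
    p_a_b = n11 / (n11 + n01)                        # p(alpha | beta), eq (6)

    assert abs(p_ab - p_a_b * p_b) < 1e-12           # eq (7): p(a,b) = p(a|b)p(b)
    assert abs(p_ab - p_a * p_b_a) < 1e-12           # eq (7): p(a,b) = p(a)p(b|a)
    assert abs(p_a_b - p_a * p_b_a / p_b) < 1e-12    # Bayes' theorem, eq (8)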

∗ Thanks to David Wilczynski and USC's CSCI 561 slides for the general gist of this brief
introduction. Also to Grenager's Stanford lecture notes
(http://www-nlp.stanford.edu/~grenager/cs121/handouts/cs121_lecture06_4pp.pdf), and particularly
John A. Carroll's Sussex notes (http://www.informatics.susx.ac.uk/courses/nlp/lecturenotes/corpus2.pdf)
for the tip on feature products; also Wikipedia for its clear presentation of multiple variables.

Extending to more variables:

p(α|β, γ) = p(α, β, γ)/p(β, γ)
          = p(α, β, γ)/[p(β)p(γ|β)]
          = p(α, β)p(γ|α, β)/[p(β)p(γ|β)]
          = p(α)p(β|α)p(γ|α, β)/[p(β)p(γ|β)] (9)
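A small sanity check of (9), here over an arbitrary made-up joint distribution on three binary variables:

    import itertools, random

    random.seed(0)
    # Hypothetical joint distribution p(a, b, c) over three binary variables.
    raw = {abc: random.random() for abc in itertools.product([0, 1], repeat=3)}
    Z = sum(raw.values())
    joint = {abc: w / Z for abc, w in raw.items()}

    def p(a=None, b=None, c=None):
        """Marginal/joint probability, with unspecified variables summed out."""
        return sum(pr for (x, y, z), pr in joint.items()
                   if (a is None or x == a) and (b is None or y == b) and (c is None or z == c))

    # Left side of (9): p(a=1 | b=1, c=1)
    lhs = p(1, 1, 1) / p(b=1, c=1)
    # Right side of (9): p(a)p(b|a)p(c|a,b) / [p(b)p(c|b)]
    rhs = (p(a=1) * (p(a=1, b=1) / p(a=1)) * (p(1, 1, 1) / p(a=1, b=1))) \
          / (p(b=1) * (p(b=1, c=1) / p(b=1)))

    assert abs(lhs - rhs) < 1e-12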

1 The Naive Approach

For ℓ, a label, and f, the features of the event1:

p(fi|ℓ) = c(fi, ℓ) / Σj c(fj, ℓ) (10)

p(ℓ) = c(ℓ) / Σi c(ℓi) (11)

A new event is assigned the label which maximizes the following product.

p(ℓ) Πi p(fi|ℓ) (12)
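A minimal sketch of (10)-(12) in Python; the labels, features, and training pairs below are hypothetical:

    from collections import Counter

    # Hypothetical training events: (label, features) pairs.
    train = [("spam", ["buy", "now"]), ("spam", ["buy", "cheap"]),
             ("ham",  ["meeting", "now"]), ("ham", ["lunch", "meeting"])]

    c_label = Counter(lbl for lbl, _ in train)                    # c(l)
    c_lf    = Counter((lbl, f) for lbl, fs in train for f in fs)  # c(f_i, l)
    c_feats = Counter(lbl for lbl, fs in train for _ in fs)       # sum_j c(f_j, l)

    def p_f_given_l(f, lbl):                   # eq (10)
        return c_lf[(lbl, f)] / c_feats[lbl]

    def p_l(lbl):                              # eq (11)
        return c_label[lbl] / sum(c_label.values())

    def classify(features):                    # label maximizing the product in (12)
        def score(lbl):
            s = p_l(lbl)
            for f in features:
                s *= p_f_given_l(f, lbl)
            return s
        return max(c_label, key=score)

    print(classify(["buy", "now"]))            # -> "spam"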

1.1 Problems

If α and β are independent:

p(α|β) = p(α) (13)
p(α, β) = p(α)p(β) (14)

these DO NOT imply:

p(α, β|γ) = p(α|γ)p(β|γ) (15)
1 cf. John A. Carroll

(p(α, β|γ) = p(α|γ)p(β|γ)) ↔ (p(α|β, γ) = p(α|γ)) (16)

Suppose: p(α, β|γ) = p(α|γ)p(β|γ) (17)


∴ p(α, β, γ) = p(α|γ)p(β, γ) (18)
∴ p(α|β, γ) = p(α|γ) (19)

Many thanks to Greg Lawler, of the University of Chicago, who, in a fortuitous meeting on a flight,
provided this elegant counterexample:

• α: green die + red die = 7

• β: green die=1

• γ: red die=6

p(α|β) = p(α|γ) = p(α) = 1/6 (20)


but,
p(α|β, γ) = 1 (21)

∴ p(α|β, γ) ≠ p(α|γ) ∴ p(α, β|γ) ≠ p(α|γ)p(β|γ) (22)
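The dice counterexample is easy to verify by enumerating the 36 equally likely (green, red) outcomes; a sketch:

    # All 36 equally likely (green, red) rolls of two fair dice.
    outcomes = [(g, r) for g in range(1, 7) for r in range(1, 7)]

    def p(event, given=lambda g, r: True):
        """P(event | given), computed by counting outcomes."""
        cond = [o for o in outcomes if given(*o)]
        return sum(1 for o in cond if event(*o)) / len(cond)

    alpha = lambda g, r: g + r == 7        # green + red = 7
    beta  = lambda g, r: g == 1            # green die = 1
    gamma = lambda g, r: r == 6            # red die = 6

    print(p(alpha), p(alpha, given=beta), p(alpha, given=gamma))    # all 1/6, eq (20)
    print(p(alpha, given=lambda g, r: beta(g, r) and gamma(g, r)))  # 1.0, eq (21)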

But anyway:
p(ℓ|f0, ..., fn) = [p(ℓ) Πi=0..n p(fi|ℓ)] / [Πi=0..n p(fi)] (23)

Also notice how this equation can give p > 1:

Assume 3 events: (α,β), (α,γ), (δ,δ)

• p(α) = 2/3
• p(β) = 1/3
• p(γ) = 1/3
• p(β|α) = 1/2
• p(γ|α) = 1/2

p(α|β, γ) = p(α)p(β|α)p(γ|α) / [p(β)p(γ)] = (2/3 · 1/2 · 1/2) / (1/3 · 1/3) = 3/2 (24)
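The same 3/2 comes out if the three hypothetical events are pushed through the counts directly; a sketch treating α as the label and β, γ as the features:

    from collections import Counter

    # The three events: (alpha, beta), (alpha, gamma), (delta, delta).
    events = [("a", "b"), ("a", "g"), ("d", "d")]
    n = len(events)

    c_label = Counter(l for l, _ in events)    # counts of the label slot
    c_feat  = Counter(f for _, f in events)    # counts of the feature slot
    c_pair  = Counter(events)                  # joint counts

    p_a   = c_label["a"] / n                   # 2/3
    p_b   = c_feat["b"] / n                    # 1/3
    p_g   = c_feat["g"] / n                    # 1/3
    p_b_a = c_pair[("a", "b")] / c_label["a"]  # 1/2
    p_g_a = c_pair[("a", "g")] / c_label["a"]  # 1/2

    print(p_a * p_b_a * p_g_a / (p_b * p_g))   # 1.5 -- a "probability" greater than 1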
Also, be sure that c(L) = c(F) (the total label and feature counts), so the factors cancel in (26):

p(ℓ) = c(ℓ)/c(L); p(f) = c(f)/c(F) (25)

p(ℓ|f) = [c(ℓ)/c(L)] · [c(ℓ, f)/c(ℓ)] / [c(f)/c(F)] = [c(ℓ, f)/c(L)] / [c(f)/c(F)] = c(ℓ, f)/c(f) (26)

2 Smoothing
2.1 Linear Interpolation
Control the significance of the non-conditioned estimate; tune α on reserved data.
p(x|y) = αp̂(x|y) + (1 − α)p̂(x) (27)
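A one-function sketch of (27); the two estimates fed in below are made up, and in practice α would be tuned on held-out data:

    def interpolate(p_cond, p_marg, alpha=0.7):
        """Linearly interpolate a conditional and a marginal estimate, eq (27)."""
        return alpha * p_cond + (1 - alpha) * p_marg

    # Hypothetical estimates: p^(x|y) = 0.0 (pair never seen), p^(x) = 0.05.
    print(interpolate(0.0, 0.05))   # 0.015 -- the smoothed estimate is no longer zero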

2.2 Laplace

k is “the strength of the prior”; tune k on reserved data.

p(x) = (c(x) + k) / Σx [c(x) + k] = (c(x) + k) / (N + k|X|) (28)

p(x|y) = (c(x, y) + k) / (c(y) + k|X|) (29)
If k = 1, we are pretending we saw everything once more than we actually did, even
things that we never saw!
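A minimal sketch of (28) with hypothetical observations; |X| is the number of possible x values, including ones never seen:

    from collections import Counter

    xs = ["a", "a", "b"]          # hypothetical observations; "c" is never seen
    X  = ["a", "b", "c"]          # all possible values of x
    k  = 1.0                      # strength of the prior

    counts, N = Counter(xs), len(xs)

    def p_laplace(x):             # eq (28)
        return (counts[x] + k) / (N + k * len(X))

    print([round(p_laplace(x), 3) for x in X])   # [0.5, 0.333, 0.167]; unseen "c" > 0
    assert abs(sum(p_laplace(x) for x in X) - 1.0) < 1e-12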

2.3 Another caveat?
p(ℓ) = (c(ℓ) + |F|) / (N + |F L|) (30)

p(f|ℓ) = (c(f, ℓ) + 1) / (c(ℓ) + |F|) (31)

p(f) = (c(f) + |L|) / (N + |F L|) (32)
|F| is the number of feature types, |L| is the number of label types,
and |F L| is their product.

∴ p(ℓ|f) = (c(ℓ, f) + 1) / (c(f) + |L|) (33)
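Putting (30)-(33) together in a sketch with hypothetical counts; note that with these particular smoothed estimates the Bayes inversion of Section 1 stays consistent:

    from collections import Counter

    # Hypothetical (label, feature) observations.
    obs      = [("spam", "buy"), ("spam", "now"), ("ham", "meeting")]
    labels   = ["spam", "ham"]
    features = ["buy", "now", "meeting", "lunch"]

    N    = len(obs)
    c_l  = Counter(l for l, _ in obs)     # c(l)
    c_f  = Counter(f for _, f in obs)     # c(f)
    c_lf = Counter(obs)                   # c(l, f)
    F, L = len(features), len(labels)

    p_l   = lambda l:    (c_l[l] + F)       / (N + F * L)    # eq (30)
    p_f_l = lambda f, l: (c_lf[(l, f)] + 1) / (c_l[l] + F)   # eq (31)
    p_f   = lambda f:    (c_f[f] + L)       / (N + F * L)    # eq (32)
    p_l_f = lambda l, f: (c_lf[(l, f)] + 1) / (c_f[f] + L)   # eq (33)

    # p(l|f) from (33) agrees with p(l)p(f|l)/p(f) built from (30)-(32).
    for l in labels:
        for f in features:
            assert abs(p_l_f(l, f) - p_l(l) * p_f_l(f, l) / p_f(f)) < 1e-12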
