
Maximum Likelihood

USC Linguistics

August 8, 2007

Ratnaparkhi (1998), p. 8 (slightly modified):


\[
p(a \mid b) = \frac{\prod_j w_j^{f_j(a,b)}}{\sum_{a'} \prod_j w_j^{f_j(a',b)}} \tag{1}
\]
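As a quick illustration of (1), here is a minimal Python sketch (the names `p`, `weights`, `f`, and `labels` are mine, not from Ratnaparkhi; `f(a, b)` is assumed to return the vector of feature counts and `labels` the set of possible outcomes):

```python
# Sketch of equation (1): p(a|b) as a normalized product of feature weights.
def p(a, b, weights, f, labels):
    def score(x):
        # Product over j of w_j ** f_j(x, b).
        s = 1.0
        for w, fj in zip(weights, f(x, b)):
            s *= w ** fj
        return s
    return score(a) / sum(score(x) for x in labels)
```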

training:

\[
L(p) = \sum_{a,b} \tilde p(a,b)\, \log p(a \mid b) \tag{2}
\]
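The objective in (2) is then a weighted sum of log-probabilities. A matching sketch, taking the `p` above as an argument and an illustrative dict `p_tilde` of empirical probabilities keyed by events `(a, b)`:

```python
import math

# Sketch of equation (2): log-likelihood of the model under p~(a, b).
def log_likelihood(p_tilde, p, weights, f, labels):
    return sum(pt * math.log(p(a, b, weights, f, labels))
               for (a, b), pt in p_tilde.items())
```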

Generalized Iterative Scaling (GIS) (Ratnaparkhi (1998) p. 14, citing Darroch and Ratcliff (1972)) requires that the feature counts sum to the same constant for every event:

\[
\sum_j f_j(a,b) = C \tag{3}
\]

if necessary:

\[
C = \max_{a,b} \sum_j f_j(a,b) \; ; \qquad f_{n+1}(a,b) = C - \sum_j f_j(a,b) \tag{4}
\]
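In code, (4) is a short transformation of the collected feature vectors. A sketch, with `F` (my name) a list of per-event feature-count rows:

```python
# Sketch of equation (4): pick C and append a correction feature so that
# every event's feature counts sum to exactly C.
def add_correction_feature(F):
    C = max(sum(row) for row in F)
    return C, [row + [C - sum(row)] for row in F]
```

In the worked example below, the correction feature is f_c and C = 2.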
\[
w_{j:0} = 1 \; ; \qquad w_{j:i+1} = w_{j:i} \left( \frac{E_{\tilde p} f_j}{E_{p_i} f_j} \right)^{\frac{1}{C}} \tag{5}
\]

\[
E_{p_i} f_j = \sum_{a,b} \tilde p(b)\, f_j(a,b)\, p_i(a \mid b) \tag{6}
\]

\[
p_i(a \mid b) = \frac{\prod_j w_{j:i}^{f_j(a,b)}}{\sum_{a'} \prod_j w_{j:i}^{f_j(a',b)}} \tag{7}
\]

\[
E_{\tilde p} f_j = \sum_{a,b} \tilde p(a,b)\, f_j(a,b) = \frac{1}{N} \sum_n f_j(a_n, b_n) \tag{8}
\]
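Equations (5) through (8) are the whole algorithm. A minimal Python sketch of one GIS pass (all names are mine: `F` maps an event `(a, b)` to its feature vector, `p_tilde_b` holds the empirical context marginals from (6), and `E_tilde` the empirical expectations from (8)):

```python
def gis_step(weights, F, labels, contexts, p_tilde_b, E_tilde, C):
    """One pass of GIS: equations (5)-(7)."""
    def p_i(a, b):
        # Equation (7): conditional probability under the current weights.
        def score(x):
            s = 1.0
            for w, fj in zip(weights, F[(x, b)]):
                s *= w ** fj
            return s
        return score(a) / sum(score(x) for x in labels)

    # Equation (6): model expectation of each feature.
    E_model = [0.0] * len(weights)
    for b in contexts:
        for a in labels:
            pab = p_i(a, b)
            for j, fj in enumerate(F[(a, b)]):
                E_model[j] += p_tilde_b[b] * fj * pab

    # Equation (5): multiplicative update with exponent 1/C.
    return [w * (Et / Em) ** (1 / C)
            for w, Et, Em in zip(weights, E_tilde, E_model)]
```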

c(b,a)   cp1(b)   cp2(b)   a    f1   f2   f3   f4   fc
  20       T        T      T     1    1    0    0    0
   0       T        T      F     0    0    0    0    2
   0       T        F      T     1    0    0    0    1
  20       T        F      F     0    0    0    1    1
   0       F        T      T     0    1    0    0    1
  20       F        T      F     0    0    1    0    1
   0       F        F      T     0    0    0    0    2
  20       F        F      F     0    0    1    1    0
  80 (N)                        20   20   40   40   40   (feature counts)
                               .25  .25   .5   .5   .5   (E_p̃ f_j)
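To reproduce the numbers below, the table can be encoded in Python as follows (the layout and names are mine); the last two rows of the table, the feature counts and the empirical expectations of (8), then fall out directly:

```python
# Counts c(b, a) and feature vectors [f1, f2, f3, f4, fc], keyed by
# (cp1, cp2, a). The correction feature fc is already included.
DATA = {
    ("T", "T", "T"): (20, [1, 1, 0, 0, 0]),
    ("T", "T", "F"): ( 0, [0, 0, 0, 0, 2]),
    ("T", "F", "T"): ( 0, [1, 0, 0, 0, 1]),
    ("T", "F", "F"): (20, [0, 0, 0, 1, 1]),
    ("F", "T", "T"): ( 0, [0, 1, 0, 0, 1]),
    ("F", "T", "F"): (20, [0, 0, 1, 0, 1]),
    ("F", "F", "T"): ( 0, [0, 0, 0, 0, 2]),
    ("F", "F", "F"): (20, [0, 0, 1, 1, 0]),
}
N = sum(c for c, _ in DATA.values())                       # 80
counts = [sum(c * f[j] for c, f in DATA.values()) for j in range(5)]
E_tilde = [k / N for k in counts]                          # [.25, .25, .5, .5, .5]
```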

\[
C = 2 \tag{9}
\]

With $C = 2$, the update (5) and the model (7) become

\[
w_{j:i+1} = w_{j:i} \left[ \frac{\frac{1}{N} \sum_n f_j(a_n, b_n)}{\sum_{a,b} \tilde p(b)\, f_j(a,b)\, p_i(a \mid b)} \right]^{\frac{1}{2}} \tag{10}
\]

\[
p_i(a \mid b) = \frac{\prod_j w_{j:i}^{f_j(a,b)}}{\sum_{a'} \prod_j w_{j:i}^{f_j(a',b)}} \tag{11}
\]

\[
E_{\tilde p} f_{1,2} = 20/80 = .25 \; ; \qquad E_{\tilde p} f_{3,4,c} = 40/80 = .5 \tag{12}
\]

Weights after each iteration:

         f1    f2    f3    f4    fc
w_x:0     1     1     1     1     1
w_x:1     1     1   1.41  1.41   .71
w_x:2    .96   .96  1.71  1.71   .57
w_x:3    .91   .91  1.94  1.94   .47

Conditional probabilities p_i(a | (cp1, cp2)) after each iteration:

       T|TT  F|TT  T|TF  F|TF  T|FT  F|FT  T|FF  F|FF
p_0     .5    .5    .5    .5    .5    .5    .5    .5
p_1    .66   .33   .41   .59   .41   .59   .2    .8
p_2    .74   .26   .36   .64   .36   .64   .1    .9
p_3    .79   .21   .32   .68   .32   .68   .06   .94

Model expectations E_{p_i} f_j (target values in (12)):

        f1    f2    f3    f4    fc
E_p0   .25   .25   .25   .25    1
E_p1   .27   .27   .34   .34   .77
E_p2   .28   .28   .39   .39   .73
E_p3   .28   .28   .41   .41   .73
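Wiring the two sketches above together (`gis_step` from after equation (8), and `DATA`, `N`, `E_tilde` from after the feature table) reproduces these rows. Since the handout rounds to two digits at every step, the printed values can differ from the tables in the last digit:

```python
LABELS = ("T", "F")
CONTEXTS = [("T", "T"), ("T", "F"), ("F", "T"), ("F", "F")]
F = {(a, b): DATA[b + (a,)][1] for b in CONTEXTS for a in LABELS}
p_tilde_b = {b: sum(DATA[b + (a,)][0] for a in LABELS) / N for b in CONTEXTS}

w = [1.0] * 5                                   # w_{j:0} = 1
for i in range(3):
    w = gis_step(w, F, LABELS, CONTEXTS, p_tilde_b, E_tilde, C=2)
    print(f"w_x:{i + 1}", [round(wj, 2) for wj in w])
```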

\[
p_0(T \mid (T,T)) = \frac{1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} = .5 \tag{13}
\]

\[
p_0(F \mid (T,T)) = \frac{1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2}{(1^1 \cdot 1^1 \cdot 1^0 \cdot 1^0 \cdot 1^0) + (1^0 \cdot 1^0 \cdot 1^0 \cdot 1^0 \cdot 1^2)} = .5 \tag{14}
\]

\[
E_{p_0} f_1 = (.25 \cdot 1 \cdot p_0(T \mid (T,T))) + (.25 \cdot 1 \cdot p_0(T \mid (T,F))) = .25 \tag{15}
\]

\[
E_{p_0} f_3 = (.25 \cdot 1 \cdot p_0(F \mid (F,T))) + (.25 \cdot 1 \cdot p_0(F \mid (F,F))) = .25 \tag{16}
\]

\[
\begin{aligned}
E_{p_0} f_c = {} & (.25 \cdot 2 \cdot p_0(F \mid (T,T))) \\
& + (.25 \cdot 1 \cdot p_0(T \mid (T,F))) + (.25 \cdot 1 \cdot p_0(F \mid (T,F))) \\
& + (.25 \cdot 1 \cdot p_0(T \mid (F,T))) + (.25 \cdot 1 \cdot p_0(F \mid (F,T))) \\
& + (.25 \cdot 2 \cdot p_0(T \mid (F,F))) = 1
\end{aligned} \tag{17}
\]

\[
w_{1:1} = w_{2:1} = 1 \; ; \quad w_{3:1} = w_{4:1} = \sqrt{2} \; ; \quad w_{c:1} = \sqrt{.5} \tag{18}
\]
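For instance, the $\sqrt{2}$ in (18) comes from a single application of (5) to $f_3$, using (12) and (16):

\[
w_{3:1} = w_{3:0} \left( \frac{E_{\tilde p} f_3}{E_{p_0} f_3} \right)^{\frac{1}{2}} = 1 \cdot \left( \frac{.5}{.25} \right)^{\frac{1}{2}} = \sqrt{2} \approx 1.41
\]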

\[
p_1(T \mid (T,T)) = \frac{1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} = .66 \tag{19}
\]

\[
p_1(F \mid (T,T)) = \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2}{(1^1 \cdot 1^1 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^0) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^2)} = .33 \tag{20}
\]

\[
p_1(T \mid (T,F)) = \frac{1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} = .41 \tag{21}
\]

\[
p_1(F \mid (T,F)) = \frac{1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1}{(1^1 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^0 \cdot .71^1) + (1^0 \cdot 1^0 \cdot 1.41^0 \cdot 1.41^1 \cdot .71^1)} = .59 \tag{22}
\]

\[
p_1(T \mid (F,F)) = \frac{.71^2}{.71^2 + (1.41 \cdot 1.41)} = .2 \tag{23}
\]

\[
p_1(F \mid (F,F)) = \frac{1.41 \cdot 1.41}{.71^2 + (1.41 \cdot 1.41)} = .8 \tag{24}
\]

\[
E_{p_1} f_1 = (.25 \cdot 1 \cdot .66) + (.25 \cdot 1 \cdot .41) = .27 \tag{25}
\]

\[
E_{p_1} f_3 = (.25 \cdot 1 \cdot .59) + (.25 \cdot 1 \cdot .8) = .34 \tag{26}
\]

\[
w_{1:2} = w_{2:2} = .96 \; ; \quad w_{3:2} = w_{4:2} = 1.71 \; ; \quad w_{c:2} = .57 \tag{27}
\]
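Likewise, (27) follows from (5) with the rounded values of (12), (25), and (26), e.g.

\[
w_{1:2} = 1 \cdot \left( \frac{.25}{.27} \right)^{\frac{1}{2}} \approx .96 \; ; \qquad
w_{3:2} = 1.41 \cdot \left( \frac{.5}{.34} \right)^{\frac{1}{2}} \approx 1.71
\]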

References
Berger, Adam L., Della Pietra, Stephen A., and Della Pietra, Vincent J. (1996) "A Maximum Entropy Approach to Natural Language Processing," Computational Linguistics, 22(1), 39–71.

Darroch, J. N. and Ratcliff, D. (1972) "Generalized iterative scaling for log-linear models," Annals of Mathematical Statistics, 43, 1470–1480.

Ratnaparkhi, A. (1998) "Maximum Entropy Models for Natural Language Ambiguity Resolution," Ph.D. thesis, University of Pennsylvania.
