Decision Trees: USC Linguistics July 26, 2007

This document discusses using decision trees to classify events based on their temporal orientation. It presents a table categorizing example events using features like "-ed", "-s", and whether they refer to past, present or future tense. The document also discusses using information gain and gain ratio to determine the most informative features for classification. It provides equations for calculating information gain, gain ratio, and the chi-squared test for measuring feature independence.

Decision Trees

USC Linguistics

July 26, 2007

Suppose we wish to classify events according to their temporal orientation[1]: whether the event time t(e) precedes (<), overlaps (∩), or follows (>) the speech time ST.

Sentence            -ed  -s  will  be  [sing]  t(e)

He read.             F    F    F    F    T     t(e)<ST
He was hungry.       F    F    F    T    T     t(e)<ST
He ran.              F    F    F    F    T     t(e)<ST
They liked it.       T    F    F    F    F     t(e)<ST
He reads.            F    T    F    F    T     t(e)∩ST
He is happy.         F    F    F    T    T     t(e)∩ST
They need it.        T    F    F    F    F     t(e)∩ST
They want it.        F    F    F    F    F     t(e)∩ST
He will read.        F    F    T    F    T     t(e)>ST
They will get it.    F    F    T    F    F     t(e)>ST
\[ I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i) \tag{1} \]

\[ I\left(\tfrac{4}{10}, \tfrac{4}{10}, \tfrac{2}{10}\right) = .5288 + .5288 + .4644 = 1.522 \tag{2} \]
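
As a quick check of equations (1) and (2), the information measure can be sketched in a few lines of Python. The function name `information` is an illustrative choice, not something from the handout, and terms with probability zero are assumed to contribute nothing.

```python
import math

def information(probs):
    """I(P(v1), ..., P(vn)) in bits; zero-probability terms are skipped."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Class distribution of the ten sentences: 4 past, 4 present, 2 future.
h = information([4 / 10, 4 / 10, 2 / 10])
print(round(h, 3))  # 1.522
```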
\[ \mathrm{Remainder}(A) = \sum_{i} \frac{c(A_i)}{N}\, I\left(\frac{c(v_1)}{c(A_i)}, \ldots, \frac{c(v_n)}{c(A_i)}\right) \tag{3} \]

\[ \mathrm{Gain}(A) = I(P(v_1), \ldots, P(v_n)) - \mathrm{Remainder}(A) \tag{4} \]

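\[ R(\text{-ed}) = \tfrac{2}{10} I\left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right) + \tfrac{8}{10} I\left(\tfrac{3}{8}, \tfrac{3}{8}, \tfrac{2}{8}\right) = .2 \cdot 1 + .8 \cdot 1.56 = 1.45 \tag{5} \]

\[ R(\text{-s}) = \tfrac{1}{10} I(0, 1, 0) + \tfrac{9}{10} I\left(\tfrac{4}{9}, \tfrac{3}{9}, \tfrac{2}{9}\right) = .1 \cdot 0 + .9 \cdot 1.53 = 1.38 \tag{6} \]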
[1] Note that nothing in the written form disambiguates "They read", unless, perhaps, the past reading is more likely to be eventive and more likely to take a direct object.
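\[ R(\text{will}) = \tfrac{2}{10} I(0, 0, 1) + \tfrac{8}{10} I\left(\tfrac{4}{8}, \tfrac{4}{8}, 0\right) = .2 \cdot 0 + .8 \cdot 1 = .8 \tag{7} \]

\[ R(\text{be}) = \tfrac{2}{10} I\left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right) + \tfrac{8}{10} I\left(\tfrac{3}{8}, \tfrac{3}{8}, \tfrac{2}{8}\right) = .2 \cdot 1 + .8 \cdot 1.56 = 1.45 \tag{8} \]

\[ R(\text{[sing]}) = \tfrac{6}{10} I\left(\tfrac{3}{6}, \tfrac{2}{6}, \tfrac{1}{6}\right) + \tfrac{4}{10} I\left(\tfrac{1}{4}, \tfrac{2}{4}, \tfrac{1}{4}\right) = .6 \cdot 1.45 + .4 \cdot 1.5 = 1.475 \tag{9} \]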
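
The remainders in (5) to (9) can be recomputed from the table with a short Python sketch. It assumes c(A_i) is the number of sentences taking the i-th value of feature A and c(v_j) the number of those in class v_j. The names `ROWS`, `remainder`, and `gain` are illustrative, and the classes are spelled out as "past", "present", "future" for t(e)<ST, overlap, and t(e)>ST.

```python
import math
from collections import Counter

def information(counts):
    """Entropy in bits of a distribution given as raw counts (zeros skipped)."""
    counts = list(counts)
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

FEATURES = ["-ed", "-s", "will", "be", "[sing]"]
# Columns: -ed, -s, will, be, [sing], class (transcribed from the table).
ROWS = [
    ("F", "F", "F", "F", "T", "past"),     # He read.
    ("F", "F", "F", "T", "T", "past"),     # He was hungry.
    ("F", "F", "F", "F", "T", "past"),     # He ran.
    ("T", "F", "F", "F", "F", "past"),     # They liked it.
    ("F", "T", "F", "F", "T", "present"),  # He reads.
    ("F", "F", "F", "T", "T", "present"),  # He is happy.
    ("T", "F", "F", "F", "F", "present"),  # They need it.
    ("F", "F", "F", "F", "F", "present"),  # They want it.
    ("F", "F", "T", "F", "T", "future"),   # He will read.
    ("F", "F", "T", "F", "F", "future"),   # They will get it.
]

def remainder(rows, col):
    """Weighted entropy of the class labels after splitting on column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * information(Counter(g).values())
               for g in groups.values())

def gain(rows, col):
    """Entropy before the split minus the remainder after it."""
    before = information(Counter(r[-1] for r in rows).values())
    return before - remainder(rows, col)

remainders = {name: remainder(ROWS, i) for i, name in enumerate(FEATURES)}
gains = {name: gain(ROWS, i) for i, name in enumerate(FEATURES)}
best = max(gains, key=gains.get)

for name in FEATURES:
    print(f"{name:7s} remainder={remainders[name]:.3f} gain={gains[name]:.3f}")
print("best first split:", best)
```

The feature with the smallest remainder (equivalently, the largest gain) is the one chosen for the first split.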
            <<<< ∩∩∩∩ >>
           /            \
       will:T          will:F
         |                |
        >>           <<<< ∩∩∩∩

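\[ I\left(\tfrac{4}{8}, \tfrac{4}{8}\right) = 1 \tag{10} \]

\[ R(\text{-ed}) = \tfrac{2}{8} I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{6}{8} I\left(\tfrac{3}{6}, \tfrac{3}{6}\right) = .25 + .75 = 1 \tag{11} \]

\[ R(\text{-s}) = \tfrac{1}{8} I(0, 1) + \tfrac{7}{8} I\left(\tfrac{4}{7}, \tfrac{3}{7}\right) = 0 + .86 = .86 \tag{12} \]

\[ R(\text{be}) = \tfrac{2}{8} I\left(\tfrac{1}{2}, \tfrac{1}{2}\right) + \tfrac{6}{8} I\left(\tfrac{3}{6}, \tfrac{3}{6}\right) = .25 + .75 = 1 \tag{13} \]

\[ R(\text{[sing]}) = \tfrac{5}{8} I\left(\tfrac{3}{5}, \tfrac{2}{5}\right) + \tfrac{3}{8} I\left(\tfrac{1}{3}, \tfrac{2}{3}\right) = .61 + .34 = .95 \tag{14} \]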
            <<<< ∩∩∩∩ >>
           /            \
       will:T          will:F
         |                |
        >>           <<<< ∩∩∩∩
                     /        \
                  -s:T        -s:F
                    |           |
                    ∩       <<<< ∩∩∩
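
The two splits above follow a greedy pattern: pick the feature with the smallest remainder, partition, and repeat on each impure branch. A minimal Python sketch of that pattern follows. The depth limit and the tuple-based tree representation are assumptions made for illustration, and the handout does not name a specific algorithm.

```python
import math
from collections import Counter

def information(counts):
    """Entropy in bits of a distribution given as raw counts (zeros skipped)."""
    counts = list(counts)
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

FEATURES = ["-ed", "-s", "will", "be", "[sing]"]
# Columns: -ed, -s, will, be, [sing], class (same table as above).
ROWS = [
    ("F", "F", "F", "F", "T", "past"),
    ("F", "F", "F", "T", "T", "past"),
    ("F", "F", "F", "F", "T", "past"),
    ("T", "F", "F", "F", "F", "past"),
    ("F", "T", "F", "F", "T", "present"),
    ("F", "F", "F", "T", "T", "present"),
    ("T", "F", "F", "F", "F", "present"),
    ("F", "F", "F", "F", "F", "present"),
    ("F", "F", "T", "F", "T", "future"),
    ("F", "F", "T", "F", "F", "future"),
]

def remainder(rows, col):
    """Weighted entropy of the class labels after splitting on column `col`."""
    groups = {}
    for row in rows:
        groups.setdefault(row[col], []).append(row[-1])
    return sum(len(g) / len(rows) * information(Counter(g).values())
               for g in groups.values())

def build_tree(rows, cols, depth):
    """Return (feature, {value: subtree}) or, at a leaf, a dict of class counts."""
    labels = Counter(r[-1] for r in rows)
    if len(labels) == 1 or depth == 0 or not cols:
        return dict(labels)
    best = min(cols, key=lambda c: remainder(rows, c))  # smallest remainder wins
    rest = [c for c in cols if c != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, rest, depth - 1)
    return (FEATURES[best], branches)

# depth=2 stops after two splits, matching the tree drawn above.
tree = build_tree(ROWS, list(range(len(FEATURES))), depth=2)
print(tree)
```

Leaves keep the class counts instead of a single label because some sentences share identical feature values but differ in class ("He was hungry." and "He is happy."), so these features alone cannot separate every example.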

Sentence            -ed  -s  w-1   [sing]  t(e)

He read.             F    F   -      T     t(e)<ST
He was hungry.       F    F   was    T     t(e)<ST
He ran.              F    F   -      T     t(e)<ST
They liked it.       T    F   -      F     t(e)<ST
He reads.            F    T   -      T     t(e)∩ST
He is happy.         F    F   is     T     t(e)∩ST
They need it.        T    F   -      F     t(e)∩ST
They want it.        F    F   -      F     t(e)∩ST
He will read.        F    F   will   T     t(e)>ST
They will get it.    F    F   will   F     t(e)>ST

\[ R(\text{w-1}) = \tfrac{6}{10} I\left(\tfrac{3}{6}, \tfrac{3}{6}, 0\right) + \tfrac{1}{10} I(1, 0, 0) + \tfrac{1}{10} I(0, 1, 0) + \tfrac{2}{10} I(0, 0, 1) = .6 \tag{15} \]
                  <<<< ∩∩∩∩ >>
        /          |         |         \
     w-1:-      w-1:was    w-1:is    w-1:will
       |           |         |          |
    <<< ∩∩∩        <         ∩         >>

Sentence            -ed  -s  w-1   [sing]  t(e)

He read.             F    F   he     T     t(e)<ST
He was hungry.       F    F   was    T     t(e)<ST
He ran.              F    F   he     T     t(e)<ST
They liked it.       T    F   they   F     t(e)<ST
He reads.            F    T   he     T     t(e)∩ST
He is happy.         F    F   is     T     t(e)∩ST
They need it.        T    F   they   F     t(e)∩ST
They want it.        F    F   they   F     t(e)∩ST
He will read.        F    F   will   T     t(e)>ST
They will get it.    F    F   will   F     t(e)>ST

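\[ R(\text{w-1}) = \tfrac{3}{10} I\left(\tfrac{2}{3}, \tfrac{1}{3}, 0\right) + \tfrac{1}{10} I(1, 0, 0) + \tfrac{3}{10} I\left(\tfrac{1}{3}, \tfrac{2}{3}, 0\right) + \tfrac{1}{10} I(0, 1, 0) + \tfrac{2}{10} I(0, 0, 1) \tag{16} \]

\[ R(\text{w-1}) = .55 \tag{17} \]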
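
The two codings of w-1 can be compared directly. The Python sketch below recomputes (15) and (17); the list names `W_COARSE` and `W_FULL` are illustrative labels for the two w-1 columns, transcribed in table order.

```python
import math
from collections import Counter, defaultdict

def information(counts):
    """Entropy in bits of a distribution given as raw counts (zeros skipped)."""
    counts = list(counts)
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

def remainder(values, labels):
    """Weighted class entropy after grouping the examples by feature value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * information(Counter(g).values())
               for g in groups.values())

# Classes in table order: four past, four present, two future.
LABELS = ["past"] * 4 + ["present"] * 4 + ["future"] * 2
# w-1 with "-" standing in for every word other than was/is/will.
W_COARSE = ["-", "was", "-", "-", "-", "is", "-", "-", "will", "will"]
# w-1 with the subject words kept as separate values.
W_FULL = ["he", "was", "he", "they", "he", "is", "they", "they", "will", "will"]

r_coarse = remainder(W_COARSE, LABELS)
r_full = remainder(W_FULL, LABELS)
print(round(r_coarse, 2), round(r_full, 2))  # 0.6 0.55
```

The finer coding yields the smaller remainder, although it does so by splitting the data into more and smaller groups.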

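Venkataraman[2] credits Quinlan with the concept of GainRatio:

\[ \mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{-\sum_{v \in A} P(v) \log_2 P(v)} \tag{18} \]

For w-1 with values he, was, they, is, will:

\[ \mathrm{GainRatio}(\text{w-1}) = \frac{1.522 - .55}{I\left(\tfrac{3}{10}, \tfrac{1}{10}, \tfrac{3}{10}, \tfrac{1}{10}, \tfrac{2}{10}\right)} = \frac{.972}{2.17} = .45 \tag{19} \]

For w-1 with values -, was, is, will:

\[ \mathrm{GainRatio}(\text{w-1}) = \frac{1.522 - .6}{I\left(\tfrac{6}{10}, \tfrac{1}{10}, \tfrac{1}{10}, \tfrac{2}{10}\right)} = \frac{.922}{1.57} = .59 \tag{20} \]

\[ \mathrm{GainRatio}(\text{will}) = \frac{1.522 - .8}{I\left(\tfrac{8}{10}, \tfrac{2}{10}\right)} = \frac{.722}{.722} = 1 \tag{21} \]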
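
A Python sketch of (18), applied to the three features in (19) to (21), follows. The function and list names are illustrative. The sketch assumes every feature takes at least two distinct values, since otherwise the denominator is zero.

```python
import math
from collections import Counter, defaultdict

def information(counts):
    """Entropy in bits of a distribution given as raw counts (zeros skipped)."""
    counts = list(counts)
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

def remainder(values, labels):
    """Weighted class entropy after grouping the examples by feature value."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    return sum(len(g) / n * information(Counter(g).values())
               for g in groups.values())

def gain_ratio(values, labels):
    """Gain divided by the entropy of the feature's own value distribution."""
    gain = information(Counter(labels).values()) - remainder(values, labels)
    split_information = information(Counter(values).values())
    return gain / split_information  # undefined for a single-valued feature

LABELS = ["past"] * 4 + ["present"] * 4 + ["future"] * 2
W_FULL = ["he", "was", "he", "they", "he", "is", "they", "they", "will", "will"]
W_COARSE = ["-", "was", "-", "-", "-", "is", "-", "-", "will", "will"]
WILL = ["F"] * 8 + ["T"] * 2

gr_full = gain_ratio(W_FULL, LABELS)
gr_coarse = gain_ratio(W_COARSE, LABELS)
gr_will = gain_ratio(WILL, LABELS)
print(round(gr_full, 2), round(gr_coarse, 2), round(gr_will, 2))
```

The denominator grows with the number of feature values, so the many-valued w-1 codings score lower than will even though their raw gain is higher.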

1 χ²
• “Don’t use low numbers, especially not zero.”

\[ E = \frac{\mathrm{Total}(row) \cdot \mathrm{Total}(col)}{N} \tag{22} \]

'Expected' by the "independent hypothesis"[3]:

Observed                               Expected
          <    ∩    >   total                    <    ∩    >   total
-        30   30    0     60           -        24   24   12     60
was      10    0    0     10           was       4    4    2     10
is        0   10    0     10           is        4    4    2     10
will      0    0   20     20           will      8    8    4     20
total    40   40   20    100           total    40   40   20    100

Table 1: y: ind. var.; x: dep. var.

\[ \chi^2 = \sum \frac{(O - E)^2}{E} \tag{23} \]

\begin{align}
\chi^2 ={} & \frac{(30 - 24)^2}{24} + \frac{(30 - 24)^2}{24} + \frac{(0 - 12)^2}{12} + \frac{(10 - 4)^2}{4} + \frac{(0 - 4)^2}{4} + \frac{(0 - 2)^2}{2} \tag{24} \\
& + \frac{(0 - 4)^2}{4} + \frac{(10 - 4)^2}{4} + \frac{(0 - 2)^2}{2} + \frac{(0 - 8)^2}{8} + \frac{(0 - 8)^2}{8} + \frac{(20 - 4)^2}{4} = 125 \tag{25}
\end{align}

[2] http://www.speech.sri.com/people/anand/771/html/node29.html
[3] Null in the sense of 'to be nullified', not in the sense of maximum entropy.
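
The expected counts of (22) and the statistic of (23) to (25) can be recomputed from the observed counts in Table 1. This is a minimal pure-Python sketch: the dictionary layout and function names are illustrative, and it assumes no row or column total is zero, so that every expected count is positive.

```python
# Observed counts from Table 1: rows are w-1 values, columns are <, overlap, >.
OBSERVED = {
    "-":    [30, 30, 0],
    "was":  [10, 0, 0],
    "is":   [0, 10, 0],
    "will": [0, 0, 20],
}

def expected_counts(observed):
    """E = Total(row) * Total(col) / N for every cell."""
    rows = list(observed.values())
    n = sum(sum(r) for r in rows)
    col_totals = [sum(col) for col in zip(*rows)]
    return {key: [sum(r) * c / n for c in col_totals]
            for key, r in observed.items()}

def chi_squared(observed):
    """Sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(observed)
    return sum((o - e) ** 2 / e
               for key in observed
               for o, e in zip(observed[key], expected[key]))

expected = expected_counts(OBSERVED)
chi2 = chi_squared(OBSERVED)
print(expected["-"], chi2)  # [24.0, 24.0, 12.0] 125.0
```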

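\[ f^\circ = (\mathit{rows} - 1)(\mathit{cols} - 1) \tag{26} \]

gamma function:
\[ \Gamma(\alpha) = \int_0^\infty x^{\alpha - 1} e^{-x}\, dx \tag{27} \]

for integers:
\[ \Gamma(n) = (n - 1)! \tag{28} \]

gamma pdf:
\[ p(x) = \frac{\lambda^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x} \tag{29} \]

χ²:
\[ \alpha = \frac{f^\circ}{2}, \quad \lambda = \frac{1}{2} \tag{30} \]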
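\[ p(\chi^2) = \frac{\left(\tfrac{1}{2}\right)^{f^\circ / 2}}{\Gamma\left(\tfrac{f^\circ}{2}\right)}\, (\chi^2)^{\frac{f^\circ}{2} - 1} e^{-\frac{1}{2} \chi^2} \tag{31} \]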
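
Equations (26) to (31) translate directly into code using the standard library's `math.gamma`. The sketch below reads f° as the degrees of freedom of Table 1 (4 rows, 3 columns), which is an interpretation rather than something the handout states. The function names are illustrative, and the crude midpoint integration is only a sanity check that the density integrates to roughly 1.

```python
import math

def gamma_pdf(x, alpha, lam):
    """Gamma density as in (29); intended for x > 0."""
    return lam ** alpha / math.gamma(alpha) * x ** (alpha - 1) * math.exp(-lam * x)

def chi2_pdf(x, dof):
    """Chi-squared density (31): the gamma pdf with alpha = dof/2, lambda = 1/2."""
    return gamma_pdf(x, dof / 2, 0.5)

# Degrees of freedom for Table 1 via (26): (rows - 1)(cols - 1).
dof = (4 - 1) * (3 - 1)

# Midpoint-rule integral of the density over (0, 100).
step = 0.01
area = sum(chi2_pdf((i + 0.5) * step, dof) * step for i in range(10_000))

print(dof, math.gamma(5), round(area, 4))
```

Evaluating `chi2_pdf(125.0, dof)` gives the density at the statistic from (25); a significance test would need the tail area beyond that point, which this sketch does not compute.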