Recitation - Decision Trees
© Ben Galili IDC

The document discusses concepts related to gradients, gradient descent, hyperplanes, and decision trees in machine learning. It explains how to define hyperplanes, the process of finding linear separators, and the challenges with non-linearly separable data, such as the XOR problem. Additionally, it covers decision tree construction, impurity measures, and techniques for pruning trees to avoid overfitting.


¡ Gradient – the vector of the partial derivatives. For some $f(x_1, x_2, \ldots, x_n)$, the gradient will be:
$$\nabla f = \left(\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n}\right)$$
¡ The gradient points in the direction of the greatest rate of increase of the function, and its magnitude is the slope of the graph in that direction
¡ Gradient descent – going in the opposite
direction to the gradient (toward the
minimum)
¡ We need the learning rate, alpha, to
determine how fast or slow we will move
towards the minimum (optimal weights)
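
To make this concrete, here is a hedged Java sketch (not from the slides) of the update rule w ← w − α·∇f(w) on a toy quadratic; the objective, its gradient, and the constants are choices made only for the example:

public class GradientDescentDemo {
    public static void main(String[] args) {
        // Toy objective f(w) = (w1 - 3)^2 + (w2 + 1)^2, whose gradient is (2(w1 - 3), 2(w2 + 1)).
        double[] w = {0.0, 0.0};      // initial weights
        double alpha = 0.1;           // learning rate
        for (int step = 0; step < 100; step++) {
            double[] grad = {2 * (w[0] - 3), 2 * (w[1] + 1)};  // partial derivatives at w
            w[0] -= alpha * grad[0];  // move against the gradient direction
            w[1] -= alpha * grad[1];
        }
        System.out.printf("w = (%.4f, %.4f)%n", w[0], w[1]);   // converges to (3, -1), the minimum
    }
}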



¡ How do we define a hyperplane in the space?
§ The dimension of the hyperplane is n-1 (if n is the dimension of the space we work in)
§ All the points on the hyperplane satisfy the equation $w_1 x_1 + \cdots + w_n x_n = b \ (= w_0)$
▪ Where the $x_i$ are the point's coordinates
§ The hyperplane separates the space into two half-spaces:
▪ All the points for which the left-hand side of the equation is > b
▪ All the points for which the left-hand side of the equation is < b


In order to find the distance
from the hyperplane we −!" + !# = 1
need to normalized w, w0 by −!" + !# − 1 = 0
||w||
(for points on the plane, the −!" + !# = 0
normalization doesn’t
matter, why?)
!# −!" + 2!# = 1
−!" + 2!# − 1 = 0
−!" + 2!# = 0

!# = 1
!# − 1 = 0
!" !# = 0

© Ben Galili IDC


¡ We want to find a linear separator:
§ All points above the hyperplane, with result greater than 0, will belong to the +1 class (or -1)
§ All points below the hyperplane, with result lower than 0, will belong to the -1 class (or +1)
¡ So, what do we need to find?
§ The hyperplane weights $w \in \mathbb{R}^{n+1}$ (n hyperplane weights and the bias $w_0$)
§ We will predict 1 if $\sum_{i=1}^{n} w_i x_i + w_0 > 0$ and -1 otherwise
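
A hedged Java sketch of this prediction rule; storing the bias in weights[0] is a layout chosen only for the example:

public class LinearSeparator {
    // weights[0] is the bias w_0; weights[1..n] multiply the features x_1..x_n.
    static int predict(double[] weights, double[] x) {
        double sum = weights[0];                  // start from the bias w_0
        for (int i = 0; i < x.length; i++) {
            sum += weights[i + 1] * x[i];         // + w_i * x_i
        }
        return sum > 0 ? 1 : -1;                  // predict 1 if the sum is positive, else -1
    }

    public static void main(String[] args) {
        double[] andWeights = {-1.5, 1, 1};       // the AND example on the next slide
        System.out.println(predict(andWeights, new double[]{1, 1}));  // 1
        System.out.println(predict(andWeights, new double[]{1, 0}));  // -1
    }
}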
¡ $x_1$ AND $x_2$
[Figure: the four points of $x_1$ AND $x_2$ in the $(x_1, x_2)$ plane]
¡ Solution?
¡ If $1 \cdot x_1 + 1 \cdot x_2 - 1.5 > 0$ predict 1
¡ Otherwise predict -1
¡ i.e. $w_0 = -1.5$, $w_1 = 1$, $w_2 = 1$


¡ $x_1$ OR $x_2$
[Figure: the four points of $x_1$ OR $x_2$ in the $(x_1, x_2)$ plane]
¡ Solution?
¡ If $x_1 + x_2 - 0.5 > 0$ predict 1
¡ Otherwise predict -1


¡ $x_1$ XOR $x_2$
[Figure: the four points of $x_1$ XOR $x_2$ in the $(x_1, x_2)$ plane]
¡ Solution?
¡ There is no solution
¡ Many functions cannot be represented using a linear separator, i.e., they are not linearly separable
¡ We will talk about linear classifiers in a future recitation
¡ The problem with linear classifiers is that not all data is linearly separable
¡ We need more 'tools' to deal with more complex data


¡ Let's look at the classic XOR problem
[Figure: the four XOR points in the $(x_1, x_2)$ plane]
¡ There is no linear classifier in this dimension that can separate the classes
¡ How can we separate them?
¡ By intuition, where would you put the first line (horizontal or vertical) and why?
¡ Now we have 2 different parts, and we ask the same question for each one of them
¡ This procedure produces a decision tree
[Figure: the XOR points in the $(x_1, x_2)$ plane, split by axis-parallel lines]


¡ Decision Tree definitions:
§ Each internal node (all nodes except the leaves) tests an attribute
§ Each branch corresponds to an attribute value
§ Each leaf node assigns a classification


¡ Some facts on trees:
§ Any tree with no child limit (more than 2 children) can be converted to a binary tree (only 2 children)
§ Any polythetic tree (a node query can involve more than one property) can be converted to a monothetic tree (each node query involves only one property)


¡ For continuous variables we choose a threshold value to split the attribute
¡ For example:
§ we have a continuous attribute $x \in [0, 100]$. If we are testing this variable, we create a threshold value $t$ and ask: $x < t$ or $x \geq t$? (one common way to pick candidate values for $t$ is sketched below)
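
The slides don't specify how $t$ is chosen; one common approach, sketched below as an assumption, is to take the midpoints between consecutive distinct sorted values as candidate thresholds and score each with the impurity reduction defined later:

import java.util.Arrays;
import java.util.stream.IntStream;

public class Thresholds {
    // Candidate thresholds: midpoints between consecutive distinct sorted attribute values.
    static double[] candidateThresholds(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        return IntStream.range(0, sorted.length - 1)
                .filter(i -> sorted[i] != sorted[i + 1])            // skip duplicate values
                .mapToDouble(i -> (sorted[i] + sorted[i + 1]) / 2)  // midpoint t
                .toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(candidateThresholds(new double[]{40, 10, 10, 70})));
        // prints [25.0, 55.0]; each t defines the test x < t vs. x >= t
    }
}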

¡ While there are nodes in the queue do
§ Get next node n
§ If training examples in n are perfectly classified
▪ Then continue to the next node
§ Else
▪ A <- the "best" decision attribute for the set in n
▪ Assign A as the decision attribute for n
▪ For each value of A
▪ Create a new descendant of n
▪ Distribute training examples to the descendant nodes
▪ Insert the descendant nodes into the queue
¡ End While

But, how can we know which attribute is the best?
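
A hedged Java-style sketch of this loop; Dataset and its methods are hypothetical placeholders, this Node variant carries its data subset (unlike the minimal Node class shown at the end of the recitation), and only the control flow mirrors the pseudocode:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class TreeBuilder {
    interface Dataset {
        boolean isPure();                 // do all examples share one class?
        List<Dataset> splitBy(int attr);  // one subset per value of the attribute
    }

    static class Node {
        Node parent;
        List<Node> children = new ArrayList<>();
        int attributeIndex = -1;
        Dataset data;
        Node(Dataset data) { this.data = data; }
    }

    // Stub: would maximize the impurity reduction defined on the next slides.
    static int bestAttribute(Dataset d) { return 0; }

    static Node buildTree(Dataset trainingSet) {
        Node root = new Node(trainingSet);
        Queue<Node> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {                    // While there are nodes in the queue
            Node n = queue.poll();                    // Get next node n
            if (n.data.isPure()) continue;            // perfectly classified -> leave as a leaf
            int a = bestAttribute(n.data);            // the "best" decision attribute for n
            n.attributeIndex = a;
            for (Dataset subset : n.data.splitBy(a)) {  // one child per attribute value
                Node child = new Node(subset);
                child.parent = n;
                n.children.add(child);
                queue.add(child);                     // insert descendant nodes into the queue
            }
        }
        return root;
    }
}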



¡ We want to choose the attribute that brings us closer to perfect classification
¡ In order to do that we need to measure how far we are from perfect classification
¡ This measure is called impurity:
§ High impurity means we are far from perfect classification
§ Low impurity means we are close to perfect classification
* Look in the lecture for the formal definition
¡ How can we use impurity to choose the best attribute?
§ Calculate the impurity in the current node
§ Calculate a weighted average of the impurity over the children nodes after a split according to the test attribute
§ Subtract the second from the first and you get the impurity reduction
§ Choose the attribute that causes the largest impurity reduction
¡ The formula for the impurity reduction:
$$\Delta\varphi(S, A) = \varphi(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,\varphi(S_v)$$
* Where $\varphi$ is the impurity measure
¡ Important fact:
§ $\varphi$ measures the impurity according to the class distribution in a node
§ The instances are split according to the values of the test attribute A
§ This means that you split the instances according to an attribute's values and then calculate the impurity according to the class values
¡ There are 2 main implementations of the impurity criterion:

Gini:
§ Impurity: $GiniIndex(S) = 1 - \sum_{i=1}^{c} \left(\frac{|S_i|}{|S|}\right)^2$
§ Goodness of split: $Gini\_Gain(S, A) = GiniIndex(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,GiniIndex(S_v)$

Entropy:
§ Impurity: $Entropy(S) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log\frac{|S_i|}{|S|}$
§ Goodness of split: $Information\_Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,Entropy(S_v)$
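
A hedged Java sketch of these formulas; the map-based representation of a node's class counts and the helper names are choices made for the illustration, and the counts in main come from the exam example later in the recitation:

import java.util.List;
import java.util.Map;

public class Impurity {
    // Entropy of a node; "counts" maps each class label to its number of instances.
    static double entropy(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double h = 0.0;
        for (int c : counts.values()) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));   // log base 2
        }
        return h;
    }

    // Gini index of a node.
    static double gini(Map<String, Integer> counts) {
        double total = counts.values().stream().mapToInt(Integer::intValue).sum();
        double sumOfSquares = 0.0;
        for (int c : counts.values()) {
            double p = c / total;
            sumOfSquares += p * p;
        }
        return 1.0 - sumOfSquares;
    }

    // Goodness of split: Entropy(S) minus the size-weighted entropy of the children S_v
    // (the Gini version is analogous, with gini() in place of entropy()).
    static double informationGain(Map<String, Integer> parent, List<Map<String, Integer>> children) {
        double total = parent.values().stream().mapToInt(Integer::intValue).sum();
        double weighted = 0.0;
        for (Map<String, Integer> child : children) {
            double size = child.values().stream().mapToInt(Integer::intValue).sum();
            weighted += (size / total) * entropy(child);
        }
        return entropy(parent) - weighted;
    }

    public static void main(String[] args) {
        // Root of the exam example later in the recitation: 7 negative and 5 positive instances.
        Map<String, Integer> root = Map.of("-", 7, "+", 5);
        System.out.println(entropy(root));   // ~0.98
        System.out.println(gini(root));      // ~0.49
    }
}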


¡ Uniform distribution (k classes, each with $\frac{1}{k}$ of the instances):
§ GiniIndex = $1 - \sum_{i=1}^{k}\left(\frac{1}{k}\right)^2 = 1 - k\cdot\frac{1}{k^2} = 1 - \frac{1}{k}$, the maximal value (close to 1 for large k)
§ Entropy = $-\sum_{i=1}^{k}\frac{1}{k}\log\frac{1}{k} = \log k$, the maximal value (equal to 1 for two classes)

¡ Perfect distribution (all instances have the same class):
§ GiniIndex = $1 - \left(\frac{|S|}{|S|}\right)^2 = 1 - 1 = 0$
§ Entropy = $-\frac{|S|}{|S|}\log\frac{|S|}{|S|} = -1\cdot\log 1 = 0$


¡ We want to predict if you will pass the test
¡ We have a training set of last year's students (100 students) with 5 attributes – ID, Gender, Bagrut average, hours spent studying, and luck (1-10)
¡ Which attribute will InformationGain choose, and why?
¡ If an attribute has many values, InformationGain will tend to select it
¡ Imagine using the attribute DAY = [D1, ..., D14]


¡ To solve that we can use GainRatio:
$$GainRatio(S, A) = \frac{Information\_Gain(S, A)}{SplitInformation(S, A)}$$
¡ Where $SplitInformation(S, A)$ is the Entropy with respect to the attribute A:
$$SplitInformation(S, A) = -\sum_{v \in A} \frac{|S_v|}{|S|}\log\frac{|S_v|}{|S|}$$
* In contrast to what we used as the Entropy of S, which was with respect to the target class
¡ Example:
$$SplitInformation(S, Day) = -\sum_{i=1}^{14} \frac{1}{14}\log\frac{1}{14} = -\log\frac{1}{14} = 3.8074$$
$$GainRatio(S, A) = \frac{0.94}{3.8074} = 0.2469$$
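
A hedged Java sketch of SplitInformation and GainRatio; the method names are assumptions, and the numbers in main reuse the DAY example above:

import java.util.Arrays;

public class GainRatio {
    // SplitInformation: entropy of the split itself; subsetSizes holds |S_v| for each value v of A.
    static double splitInformation(int[] subsetSizes) {
        double total = Arrays.stream(subsetSizes).sum();
        double si = 0.0;
        for (int s : subsetSizes) {
            if (s == 0) continue;
            double p = s / total;
            si -= p * (Math.log(p) / Math.log(2));   // log base 2
        }
        return si;
    }

    static double gainRatio(double informationGain, int[] subsetSizes) {
        return informationGain / splitInformation(subsetSizes);
    }

    public static void main(String[] args) {
        int[] daySizes = new int[14];
        Arrays.fill(daySizes, 1);                        // DAY splits 14 instances into 14 singletons
        System.out.println(splitInformation(daySizes));  // ~3.8074
        System.out.println(gainRatio(0.94, daySizes));   // ~0.2469, as in the slide
    }
}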


¡ Decision trees tend to overfit
¡ This means that our tree gets too specific in
handling the training data



¡ This suggests that we should prune or cut off
some branches of the tree to make it smaller,
and therefore get better test error
¡ How do we do this?



¡ Post pruning:
§ We traverse the tree (starting from the leaf nodes) and go all the way up (finishing at the root), and at each node we decide whether or not to completely cut off that branch of the tree
§ We decide whether to cut the branch or not based on whether the splitting attribute helps or not
¡ We can also prune during tree building - in that case we simply don't create the children nodes
¡ The Chi Square test is supposed to tell us:
does splitting according to some attribute
give us a distribution which is completely
random or does it have some predicting
power?
¡ So we check if splitting according to the
chosen attribute gives a distribution which is
similar to exactly random



¡ What is exactly random?
¡ If we have in our data Y=0 10 times and Y=1 90 times, then the probability of Y=0 is 10% and the probability of Y=1 is 90%
¡ If we split according to $X_j$ and there are 50 instances where $X_j = 1$, then if splitting according to $X_j$ were completely random we would expect to see 50*0.1 = 5 instances where Y=0 and 50*0.9 = 45 instances where Y=1. If we don't see this, then splitting according to $X_j$ does not give a random distribution and has predictive power.


¡ The test itself (assume Y can only take values 0 / 1):
¡ $P(Y=0) \approx \frac{\#\{Y=0 \text{ instances}\}}{\#\{\text{instances}\}}$
¡ Call $D_f$ = number of instances where $X_j = f$
¡ $p_f$ = number of instances where $X_j = f$ and $Y = 0$
¡ $n_f$ = number of instances where $X_j = f$ and $Y = 1$
¡ $E_0 = D_f \cdot P(Y=0)$, $E_1 = D_f \cdot P(Y=1)$
¡ So the Chi Square statistic is:
$$\chi^2 = \sum_{f \in Values(X_j)} \frac{(p_f - E_0)^2}{E_0} + \frac{(n_f - E_1)^2}{E_1}$$
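
A hedged Java sketch of this statistic for a binary class; the parallel-array layout of the per-value counts is a choice made for the example, and the numbers in main match the worked example on the next slides:

public class ChiSquare {
    // pf[f] and nf[f] count the instances with attribute value f and Y = 0 / Y = 1 respectively.
    static double chiSquare(int[] pf, int[] nf) {
        double p = 0, n = 0;
        for (int f = 0; f < pf.length; f++) { p += pf[f]; n += nf[f]; }
        double pY0 = p / (p + n);                 // P(Y = 0) in the node
        double pY1 = n / (p + n);                 // P(Y = 1) in the node
        double chi2 = 0.0;
        for (int f = 0; f < pf.length; f++) {
            double df = pf[f] + nf[f];            // D_f: instances with X_j = f
            double e0 = df * pY0;                 // expected Y = 0 count if the split were random
            double e1 = df * pY1;                 // expected Y = 1 count
            chi2 += Math.pow(pf[f] - e0, 2) / e0 + Math.pow(nf[f] - e1, 2) / e1;
        }
        return chi2;
    }

    public static void main(String[] args) {
        // The worked example below: splitting on X2 in a node with p = 1, n = 5.
        System.out.println(chiSquare(new int[]{1, 0}, new int[]{0, 5}));  // ~6.0
    }
}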


¡ Once we have the Chi Square statistic we use
a chart to check if it is significant or not



¡ Example:

X1  X2  Y  Count
1   1   +  2
1   0   +  2
0   1   -  5
0   0   +  1

[Figure: a tree that splits on X1 at the root (X1=0 / X1=1) and then splits the X1=0 node, which holds D=6 instances with p=1 positive and n=5 negative, on X2 into X2=0 (D_0=1, p_0=1, n_0=0) and X2=1 (D_1=5, p_1=0, n_1=5)]

We test the split on X2:
$$\chi^2 = \sum_{f \in Values(X_2)} \frac{(p_f - E_0)^2}{E_0} + \frac{(n_f - E_1)^2}{E_1} = \frac{(p_0 - E_0)^2}{E_0} + \frac{(n_0 - E_1)^2}{E_1} + \frac{(p_1 - E_0)^2}{E_0} + \frac{(n_1 - E_1)^2}{E_1}$$
$$= \frac{(1 - \tfrac{1}{6})^2}{\tfrac{1}{6}} + \frac{(0 - \tfrac{5}{6})^2}{\tfrac{5}{6}} + \frac{(0 - \tfrac{5}{6})^2}{\tfrac{5}{6}} + \frac{(5 - \tfrac{25}{6})^2}{\tfrac{25}{6}} = \frac{25}{6} + \frac{5}{6} + \frac{5}{6} + \frac{1}{6} = 6$$

Now, we need to look at the chi square chart for the appropriate degrees of freedom (number of attribute values - 1, in the 2-classes case) with 95% confidence, i.e. a 0.05 p-value. Here X2 has 2 values, so there is 1 degree of freedom and the critical value is about 3.84; since 6 > 3.84, the split is significant.


¡ What calculations are needed to find the feature to split the root of the decision tree using Information Gain?
¡ Reminder:
$$Information\_Gain = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\,Entropy(S_v)$$
$$Entropy(S) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log\frac{|S_i|}{|S|}$$
§ c – number of classes
§ Values(A) – all the values of the feature A
¡ We need to calculate:
§ Entropy(root)
§ Weighted average of the Entropy according to "Attraction"
§ Weighted average of the Entropy according to "Weather"

Instance  Attraction  Weather  Classification
1         Swim        Hot      -
2         Dance       Hot      +
3         Casino      Hot      +
4         Golf        Hot      -
5         Swim        Mild     -
6         Casino      Mild     -
7         Dance       Mild     +
8         Golf        Mild     -
9         Ski         Mild     +
10        Ski         Cold     +
11        Casino      Cold     -
12        Dance       Cold     -


¡ Entropy(root):
$$Entropy(root) = -\left(\tfrac{7}{12}\log\tfrac{7}{12} + \tfrac{5}{12}\log\tfrac{5}{12}\right)$$

¡ Weighted average of the Entropy according to "Attraction":
$$\sum_{v \in Values(Attraction)} \tfrac{|S_v|}{|S|}\,Entropy(S_v) = -\left(\tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2} + \tfrac{3}{12}\left(\tfrac{2}{3}\log\tfrac{2}{3} + \tfrac{1}{3}\log\tfrac{1}{3}\right) + \tfrac{3}{12}\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right) + \tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2} + \tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2}\right)$$

¡ Weighted average of the Entropy according to "Weather":
$$\sum_{v \in Values(Weather)} \tfrac{|S_v|}{|S|}\,Entropy(S_v) = -\left(\tfrac{4}{12}\left(\tfrac{2}{4}\log\tfrac{2}{4} + \tfrac{2}{4}\log\tfrac{2}{4}\right) + \tfrac{5}{12}\left(\tfrac{2}{5}\log\tfrac{2}{5} + \tfrac{3}{5}\log\tfrac{3}{5}\right) + \tfrac{3}{12}\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right)\right)$$


¡ Put it all together in the Information Gain formula:

$$Information\_Gain(root, Attraction) = -\left(\tfrac{7}{12}\log\tfrac{7}{12} + \tfrac{5}{12}\log\tfrac{5}{12}\right) + \left(\tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2} + \tfrac{3}{12}\left(\tfrac{2}{3}\log\tfrac{2}{3} + \tfrac{1}{3}\log\tfrac{1}{3}\right) + \tfrac{3}{12}\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right) + \tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2} + \tfrac{2}{12}\cdot\tfrac{2}{2}\log\tfrac{2}{2}\right)$$

$$Information\_Gain(root, Weather) = -\left(\tfrac{7}{12}\log\tfrac{7}{12} + \tfrac{5}{12}\log\tfrac{5}{12}\right) + \left(\tfrac{4}{12}\left(\tfrac{2}{4}\log\tfrac{2}{4} + \tfrac{2}{4}\log\tfrac{2}{4}\right) + \tfrac{5}{12}\left(\tfrac{2}{5}\log\tfrac{2}{5} + \tfrac{3}{5}\log\tfrac{3}{5}\right) + \tfrac{3}{12}\left(\tfrac{1}{3}\log\tfrac{1}{3} + \tfrac{2}{3}\log\tfrac{2}{3}\right)\right)$$

(Numerically, $Information\_Gain(root, Attraction) \approx 0.52$ while $Information\_Gain(root, Weather) \approx 0.01$, so Attraction would be chosen for the root.)


¡ We need to create an object which represents a tree
¡ Let's try to state all of the properties that we need each node of the tree to have
¡ We need nodes, and we need each node to have children (if a node doesn't have children then it's called a leaf), and we need each node to have a parent, unless it is the root.


public class Node {
    Node[] children;
    Node parent;
}

¡ Now we have a tree: Node node = new Node();
¡ Set children to be nodes and that's it.
¡ For our purposes we need a bit more
¡ We need that if a node is a leaf node then we
know which value to return (i.e. if it’s a leaf
node and we are supposed to output 1 we
should know it).
¡ Also we need to know which attribute is the
splitting attribute



public class Node {
    Node[] children;
    Node parent;
    int attributeIndex;
    double returnValue;
}
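
As a hypothetical usage sketch (not part of the slides), such nodes could be wired by hand into the small tree that computes x_1 AND x_2 from earlier; the indices and return values are chosen only for the example, and the Node class above is assumed to be in the same package:

public class AndTreeDemo {
    public static void main(String[] args) {
        Node root = new Node();
        root.attributeIndex = 0;             // test x_1 at the root

        Node x1Zero = new Node();            // x_1 = 0: always -1
        x1Zero.parent = root;
        x1Zero.returnValue = -1;

        Node x1One = new Node();             // x_1 = 1: still depends on x_2
        x1One.parent = root;
        x1One.attributeIndex = 1;

        Node x2Zero = new Node();            // x_1 = 1, x_2 = 0 -> -1
        x2Zero.parent = x1One;
        x2Zero.returnValue = -1;

        Node x2One = new Node();             // x_1 = 1, x_2 = 1 -> +1
        x2One.parent = x1One;
        x2One.returnValue = 1;

        x1One.children = new Node[]{x2Zero, x2One};
        root.children = new Node[]{x1Zero, x1One};

        System.out.println(root.children.length);   // 2
    }
}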



¡ Implement a Decision Tree
¡ Use Chi square pruning with different p-values:
§ Zero – no pruning
§ 0.95
§ 0.75
§ 0.5
§ 0.25
§ 0.05
¡ You should find the relevant value according to the degrees of freedom in the current node
¡ Report for each p-value the following:
§ Training error
§ Test error
§ Max tree height (according to the test data)
§ Average tree height (according to the test data)