Decision Trees
MDT903: AI Practitioner
Important Terminology Related to Decision Trees
Root Node: Represents the entire population or sample; it is further divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
Leaf / Terminal Node: A node that does not split further is called a leaf or terminal node.
Pruning: Removing the sub-nodes of a decision node is called pruning; it is the opposite of splitting.
Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are the children of the parent node.
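As a rough illustration of this terminology (a sketch added for this handout, not part of the original slides; the attribute and class values are just examples), a tree node can be represented in Python as follows:

```python
class Node:
    """A single node of a decision tree (illustrative sketch)."""
    def __init__(self, attribute=None, label=None, parent=None):
        self.attribute = attribute  # test attribute for a decision node, None for a leaf
        self.label = label          # predicted class for a leaf/terminal node
        self.parent = parent        # parent node (None for the root node)
        self.children = {}          # attribute value -> child node (sub-node)

    def is_leaf(self):
        # A node that does not split further is a leaf/terminal node
        return not self.children


# Root node: represents the whole sample and is split on some attribute
root = Node(attribute="Outlook")
# Splitting: the root gets sub-nodes, making it the parent (and a decision node)
root.children["sunny"] = Node(attribute="Humidity", parent=root)
root.children["overcast"] = Node(label="Yes", parent=root)  # a leaf node
# Pruning: removing the sub-nodes of a decision node (opposite of splitting)
root.children["sunny"].children.clear()
```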
Advantages
Easy to understand: Decision tree output is very easy to understand, even for people from a non-analytical background. Reading and interpreting a tree requires no statistical knowledge, and its graphical representation is intuitive enough that users can easily relate it to their own hypotheses.
Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees, we can create new variables/features that have better power to predict the target variable. For example, when working on a problem with information spread across hundreds of variables, a decision tree helps identify the most significant ones.
Less data cleaning required: Decision trees require less data cleaning than some other modeling techniques and are, to a fair degree, not influenced by outliers or missing values.
Data type is not a constraint: They can handle both numerical and categorical variables.
Non-parametric method: Decision trees are considered a non-parametric method, meaning they make no assumptions about the distribution of the data or the structure of the classifier.
Disadvantages
DECISION TREE
An internal node is a test on an attribute.
A branch represents an outcome of the test, e.g., Color = red.
A leaf node represents a class label or a class-label distribution.
At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
A new case is classified by following the matching path down to a leaf node.
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: "Outlook" here is the weather forecast; it has no relation to the Microsoft email program.
Example Tree for “Play?”
Outlook = sunny    -> Humidity = high   -> No
                      Humidity = normal -> Yes
Outlook = overcast -> Yes
Outlook = rain     -> Windy = true      -> No
                      Windy = false     -> Yes
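To make "following a matching path to a leaf node" concrete, here is a minimal hard-coded Python version of the tree above (a sketch added for this handout, not code from the original slides):

```python
def classify(case):
    """Classify one weather record with the "Play?" tree shown above."""
    if case["Outlook"] == "sunny":
        # sunny branch: the decision depends on Humidity
        return "No" if case["Humidity"] == "high" else "Yes"
    if case["Outlook"] == "overcast":
        return "Yes"
    # rain branch: the decision depends on Windy
    return "No" if case["Windy"] else "Yes"


# A new case follows the path Outlook=sunny -> Humidity=normal -> Yes
print(classify({"Outlook": "sunny", "Humidity": "normal", "Windy": False}))  # Yes
```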
Building Decision Tree [Q93]
Choosing the Splitting Attribute
Which attribute to select?
A criterion for attribute selection
Which is the best attribute? The one that results in the smallest tree (note that the smallest tree is not the same as the shallowest tree).
Heuristic: choose the attribute that produces the "purest" nodes.
A popular impurity criterion is information gain: information gain increases with the average purity of the subsets that an attribute produces.
Strategy: choose the attribute that results in the greatest information gain.
*Claude Shannon, "Father of information theory"
Born: 30 April 1916; Died: 23 February 2001

Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today's digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today's information society would be unthinkable.

Shannon's master's thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master's theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution.

Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government. Many of Shannon's pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world.
Computing information
Information Gain
We now return to the problem of trying to
determine the best attribute to choose for a
particular node in a tree. The following measure
calculates a numerical value for a given
attribute, A, with respect to a set of examples, S.
Note that the values of attribute A will range
over a set of possibilities which we call
Values(A), and that, for a particular value from
that set, v, we write Sv for the set of examples
which have value v for attribute A.
The information gain of attribute A, relative to a collection of examples S, is calculated as:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)
Computing the information gain
Information gain = (information before split) − (information after split)

gain("Outlook") = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits
Information gain for the attributes of the weather data:

gain("Outlook")     = 0.247 bits   (most information gained)
gain("Temperature") = 0.029 bits
gain("Humidity")    = 0.152 bits
gain("Windy")       = 0.048 bits
Continuing to split
The final decision tree
Weather Data with ID code
ID Outlook Temperature Humidity Windy Play?
A sunny hot high false No
B sunny hot high true No
C overcast hot high false Yes
D rain mild high false Yes
E rain cool normal false Yes
F rain cool normal true No
G overcast cool normal true Yes
H sunny mild high false No
I sunny cool normal false Yes
J rain mild normal false Yes
K sunny mild normal true Yes
L overcast mild high true Yes
M overcast hot normal false Yes
N rain mild high true No
Split for ID Code Attribute
Splitting on the ID code produces 14 pure, single-example subsets, so its information gain is maximal (0.940 bits) even though the attribute is useless for prediction. The gain ratio corrects for this by dividing the gain by the intrinsic information of the split:

GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)
Computing the gain ratio
Example: intrinsic information for the ID code:

info([1, 1, ..., 1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits
Importance of attribute decreases as
intrinsic information gets larger
Example of gain ratio:

gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
Gain ratios for weather data
Outlook:
  Info: 0.693
  Gain: 0.940 − 0.693 = 0.247
  Split info: info([5,4,5]) = 1.577
  Gain ratio: 0.247 / 1.577 = 0.156

Temperature:
  Info: 0.911
  Gain: 0.940 − 0.911 = 0.029
  Split info: info([4,6,4]) = 1.362
  Gain ratio: 0.029 / 1.362 = 0.021

Humidity:
  Info: 0.788
  Gain: 0.940 − 0.788 = 0.152
  Split info: info([7,7]) = 1.000
  Gain ratio: 0.152 / 1.000 = 0.152

Windy:
  Info: 0.892
  Gain: 0.940 − 0.892 = 0.048
  Split info: info([8,6]) = 0.985
  Gain ratio: 0.048 / 0.985 = 0.049
gain_ratio("ID_code") = 0.940 bits / 3.807 bits = 0.246
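The split info and gain ratios above can be checked with the same info() helper (again a sketch added here, not part of the slides; the gain values are the ones computed earlier):

```python
from math import log2


def info(counts):
    """Entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)


# Outlook: gain 0.247, split info from its subset sizes [5, 4, 5]
print(round(info([5, 4, 5]), 3))          # 1.577 (intrinsic / split info)
print(round(0.247 / info([5, 4, 5]), 3))  # 0.157 (~0.156 on the slide)

# ID code: 14 subsets of size 1, so split info is log2(14) = 3.807 bits
split_info_id = info([1] * 14)
print(round(split_info_id, 3))            # 3.807
print(round(0.940 / split_info_id, 3))    # 0.247 (~0.246 on the slide)
```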
More on the gain ratio
"Outlook" still comes out top.
However, "ID code" has an even greater gain ratio; the standard fix is an ad hoc test that prevents splitting on that type of attribute.
A problem with the gain ratio is that it may overcompensate: it may choose an attribute just because its intrinsic information is very low.
Standard fix: first, only consider attributes with greater-than-average information gain; then compare them on gain ratio (a sketch of this rule follows below).
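One way to express that fix in code (an illustrative sketch; the gain and gain-ratio values are the ones from the slides above):

```python
# Gains and gain ratios for the four weather attributes (from the slides above)
gain = {"Outlook": 0.247, "Temperature": 0.029, "Humidity": 0.152, "Windy": 0.048}
gain_ratio = {"Outlook": 0.156, "Temperature": 0.021, "Humidity": 0.152, "Windy": 0.049}

avg_gain = sum(gain.values()) / len(gain)             # 0.119
# Keep only attributes with above-average gain, then pick the best gain ratio
candidates = [a for a in gain if gain[a] > avg_gain]  # ['Outlook', 'Humidity']
best = max(candidates, key=lambda a: gain_ratio[a])
print(best)                                           # Outlook
```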
*CART Splitting Criteria: Gini Index
If a data set T contains examples from n classes, the Gini index gini(T) is defined as

gini(T) = 1 − Σ_{j=1}^{n} p_j²

where p_j is the relative frequency of class j in T.
*Gini Index
After splitting T into two subsets T1 and T2 with sizes N1 and N2 (where N = N1 + N2), the Gini index of the split data is defined as

gini_split(T) = (N1 / N) · gini(T1) + (N2 / N) · gini(T2)
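Both formulas fit in a few lines of Python (a sketch added here; the class counts [9, 5] are the full weather data, and the split shown is the Humidity split with subsets [3, 4] and [6, 1]):

```python
def gini(counts):
    """Gini index of a node with the given class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)


def gini_split(counts1, counts2):
    """Weighted Gini index after splitting into two subsets T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)


print(round(gini([9, 5]), 3))                # 0.459 for the full weather data
print(round(gini_split([3, 4], [6, 1]), 3))  # 0.367 after splitting on Humidity
```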
Discussion
What is Random Forest? How does it work?
Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also performs dimensionality reduction, treats missing values and outlier values, and covers other essential steps of data exploration, and it does a fairly good job at all of them. It is a type of ensemble learning method, in which a group of weak models combine to form a powerful model.

In Random Forest we grow multiple trees, as opposed to the single tree of the CART model. To classify a new object based on its attributes, each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes.
It works in the following manner. Each tree is planted and grown as follows:

Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement. This sample is the training set for growing the tree.
If there are M input variables, a number m < M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.
Each tree is grown to the largest extent possible; there is no pruning.
New data are predicted by aggregating the predictions of the ntree trees (majority vote for classification, average for regression); see the sketch below.
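As a concrete illustration of this procedure (my own sketch, not part of the slides), scikit-learn's RandomForestClassifier exposes ntree as n_estimators and m as max_features; the iris data set is used here only because it ships with the library:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# ntree = 100 bootstrap-sampled trees, m = 2 features tried at each split,
# each tree fully grown (no pruning); prediction is a majority vote.
forest = RandomForestClassifier(n_estimators=100, max_features=2, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))  # test-set accuracy of the majority vote
print(forest.feature_importances_)   # the "importance of variable" output
```

The feature_importances_ attribute corresponds to the "importance of variable" output mentioned in the advantages below.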
Advantages of Random Forest
This algorithm can solve both types of problems, classification and regression, and does a decent job on both fronts.
One of the most appealing benefits of Random Forest is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. Further, the model outputs the importance of each variable, which can be a very handy feature on some data sets.
It has an effective method for estimating missing data and
maintains accuracy when a large proportion of the data are
missing.
It has methods for balancing errors in data sets where
classes are imbalanced.
The capabilities of the above can be extended to unlabeled
data, leading to unsupervised clustering, data views and
outlier detection.
Random Forest involves sampling the input data with replacement, called bootstrap sampling. Roughly one third of the data is not used for training each tree and can be used for testing; these are called the out-of-bag samples. The error estimated on these out-of-bag samples is known as the out-of-bag error. Studies of out-of-bag error estimates give evidence that the out-of-bag estimate is as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set.
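In scikit-learn this estimate is available directly (a sketch continuing the earlier example; passing oob_score=True makes the forest score each sample only with the trees that did not see it during training):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X, y)

print(forest.oob_score_)  # out-of-bag accuracy; no set-aside test set needed
```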
Disadvantages of Random Forest
It surely does a good job at classification, but not as good at regression problems, since it does not give precise predictions of a continuous target. In the case of regression, it cannot predict beyond the range of values seen in the training data, and it may over-fit data sets that are particularly noisy.
Random Forest can feel like a black-box approach for statistical modelers: you have very little control over what the model does. At best, you can try different parameters and random seeds.
References
KDnuggets Data Mining Course by G. Piatetsky-Shapiro and G. Parker.
Chapter 7 of the textbook by J. Han and M. Kamber, Data Mining: Concepts and Techniques.
Several slides adapted from material by Witten & Eibe.