Decision Tree Intro MDT903

Decision trees are supervised learning algorithms used for classification, capable of handling both categorical and continuous variables. They involve splitting data into homogeneous subsets based on significant attributes, with key concepts including root nodes, decision nodes, and pruning. While decision trees are easy to interpret and require minimal data cleaning, they can suffer from overfitting and may not perform well with continuous variables.


Classification:

Decision Trees

MDT903: AI Practitioner
Decision Tree

• A decision tree is a type of supervised learning algorithm (one with a pre-defined target variable) that is mostly used in classification problems.
• It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (sub-populations) based on the most significant splitter/differentiator among the input variables.

Types of Decision Trees

• The type of decision tree depends on the type of target variable we have. It can be of two types:
• Categorical Variable Decision Tree: A decision tree with a categorical target variable is called a categorical variable decision tree. Example: a target variable such as "Will the student play cricket?", i.e. YES or NO.
• Continuous Variable Decision Tree: A decision tree with a continuous target variable is called a continuous variable decision tree.

Important Terminology related to Decision Trees

• Root Node: Represents the entire population or sample; it gets divided further into two or more homogeneous sets.
• Splitting: The process of dividing a node into two or more sub-nodes.
• Decision Node: A sub-node that splits into further sub-nodes.
• Leaf / Terminal Node: A node that does not split.
• Pruning: Removing the sub-nodes of a decision node; it can be seen as the opposite of splitting.
• Branch / Sub-Tree: A sub-section of the entire tree.
• Parent and Child Node: A node that is divided into sub-nodes is called the parent of those sub-nodes, and the sub-nodes are the children of the parent node.

Advantages

• Easy to understand: Decision tree output is easy to understand, even for people from a non-analytical background. Reading and interpreting a tree requires no statistical knowledge, its graphical representation is intuitive, and users can easily relate it to their hypotheses.
• Useful in data exploration: A decision tree is one of the fastest ways to identify the most significant variables and the relations between two or more variables. With the help of decision trees we can create new variables/features that have better power to predict the target variable. For example, when a problem has hundreds of input variables, a decision tree helps identify the most significant ones.
• Less data cleaning required: It requires less data cleaning than some other modeling techniques; to a fair degree it is not influenced by outliers and missing values.
• Data type is not a constraint: It can handle both numerical and categorical variables.
• Non-parametric method: A decision tree is considered a non-parametric method, meaning it makes no assumptions about the underlying data distribution or the classifier structure.

Disadvantages

• Overfitting: Overfitting is one of the most practical difficulties for decision tree models. This problem is addressed by setting constraints on the model parameters and by pruning.
• Not ideal for continuous variables: When working with continuous numerical variables, a decision tree loses information when it discretizes the variables into categories.

Outline

• Top-Down Decision Tree Construction
• Choosing the Splitting Attribute
• Information Gain and Gain Ratio

DECISION TREE

• An internal node is a test on an attribute.
• A branch represents an outcome of the test, e.g., Color = red.
• A leaf node represents a class label or a class label distribution.
• At each node, one attribute is chosen to split the training examples into classes that are as distinct as possible.
• A new case is classified by following the matching path down to a leaf node.

Weather Data: Play or not Play?

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Note: "Outlook" here is the weather forecast; it has no relation to the Microsoft Outlook email program.

Example Tree for "Play?"

    Outlook?
      sunny    → Humidity?
                   high   → No
                   normal → Yes
      overcast → Yes
      rain     → Windy?
                   true   → No
                   false  → Yes

Building Decision Tree [Q93]

• Top-down tree construction
  • At the start, all training examples are at the root.
  • Partition the examples recursively, choosing one attribute at a time (a minimal code sketch follows below).
• Bottom-up tree pruning
  • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

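To make the top-down procedure concrete, here is a minimal Python sketch, not taken from the slides: it grows a tree over categorical attributes stored as dicts and classifies a new case by following the matching path to a leaf. The function names and the pluggable choose_attribute goodness function are illustrative assumptions, not an official implementation.

```python
# Minimal sketch of top-down tree construction for categorical attributes.
# Examples are dicts; choose_attribute is any goodness function
# (e.g. the information gain defined later in these slides).
from collections import Counter

def majority_class(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def build_tree(examples, attributes, target, choose_attribute):
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:                # node is pure: make it a leaf
        return labels.pop()
    if not attributes:                  # nothing left to split on
        return majority_class(examples, target)
    # Pick the attribute that scores best under the supplied goodness function.
    best = max(attributes, key=lambda a: choose_attribute(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target, choose_attribute)
    return tree

def classify(tree, case):
    # Follow the matching path from the root down to a leaf node.
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute][case[attribute]]
    return tree
```
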
Choosing the Splitting Attribute

• At each node, the available attributes are evaluated on the basis of how well they separate the classes of the training examples. A goodness function is used for this purpose.
• Typical goodness functions:
  • information gain (ID3/C4.5)
  • information gain ratio
  • gini index

Which attribute to select?

A criterion for attribute selection

• Which is the best attribute?
  • The one which will result in the smallest tree
  • Smallest tree is not the same as shallowest tree!
• Heuristic: choose the attribute that produces the "purest" nodes
• Popular impurity criterion: information gain
  • Information gain increases with the average purity of the subsets that an attribute produces
• Strategy: choose the attribute that results in the greatest information gain

*Claude Shannon, "Father of information theory"
Born: 30 April 1916. Died: 23 February 2001.

Claude Shannon, who has died aged 84, perhaps more than anyone laid the groundwork for today's digital revolution. His exposition of information theory, stating that all information could be represented mathematically as a succession of noughts and ones, facilitated the digital manipulation of data without which today's information society would be unthinkable.

Shannon's master's thesis, obtained in 1940 at MIT, demonstrated that problem solving could be achieved by manipulating the symbols 0 and 1 in a process that could be carried out automatically with electrical circuitry. That dissertation has been hailed as one of the most significant master's theses of the 20th century. Eight years later, Shannon published another landmark paper, A Mathematical Theory of Communication, generally taken as his most important scientific contribution.

Shannon applied the same radical approach to cryptography research, in which he later became a consultant to the US government.

Many of Shannon's pioneering insights were developed before they could be applied in practical form. He was truly a remarkable man, yet unknown to most of the world.

Computing information

• Information is measured in bits
  • Given a probability distribution, the info required to predict an event is the distribution's entropy
  • Entropy gives the information required in bits (this can involve fractions of bits!)
• Formula for computing the entropy (a short code sketch follows below):

    entropy(p_1, p_2, …, p_n) = −p_1 log p_1 − p_2 log p_2 − … − p_n log p_n

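As a small illustration (not part of the original slides), the entropy formula above can be computed directly; the function name and the example values are assumptions made only for this sketch.

```python
# Entropy of a discrete distribution, using log base 2 so the result is in bits.
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Example: the 14-example weather data has 9 "Yes" and 5 "No" labels.
print(f"{entropy([9/14, 5/14]):.3f}")  # 0.940 bits
```
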
Information Gain

• We now return to the problem of trying to determine the best attribute to choose for a particular node in a tree. The following measure calculates a numerical value for a given attribute, A, with respect to a set of examples, S. Note that the values of attribute A range over a set of possibilities which we call Values(A), and that, for a particular value v from that set, we write S_v for the set of examples which have value v for attribute A.
• The information gain of attribute A, relative to a collection of examples S, is calculated as:

    Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

• The information gain of an attribute can be seen as the expected reduction in entropy caused by knowing the value of attribute A (a code sketch follows after this slide).

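For concreteness, here is a minimal sketch (again, not from the slides) of the information-gain measure for examples stored as dicts; it reuses the entropy() helper from the previous sketch, and all names are illustrative.

```python
from collections import Counter

def class_entropy(examples, target):
    # Entropy of the class (target) distribution in a set of examples.
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def information_gain(examples, attribute, target):
    # Entropy before the split minus the weighted entropy after the split.
    before = class_entropy(examples, target)
    after = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        after += len(subset) / len(examples) * class_entropy(subset, target)
    return before - after
```
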
Example: attribute "Outlook"

Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No

Computing the information gain

• Information gain = (information before split) − (information after split)

    gain(Outlook) = info([9,5]) − info([2,3], [4,0], [3,2]) = 0.940 − 0.693 = 0.247 bits

• Information gain for the attributes of the weather data (a short code check follows below):

    gain(Outlook)     = 0.247 bits   ← most information gained
    gain(Temperature) = 0.029 bits
    gain(Humidity)    = 0.152 bits
    gain(Windy)       = 0.048 bits

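As a quick check (illustrative only, not from the slides), applying the information_gain() sketch above to the Outlook column of the weather data reproduces the 0.247-bits figure.

```python
# Outlook value and Play label for each of the 14 weather examples.
rows = [
    ("sunny", "No"), ("sunny", "No"), ("overcast", "Yes"), ("rain", "Yes"),
    ("rain", "Yes"), ("rain", "No"), ("overcast", "Yes"), ("sunny", "No"),
    ("sunny", "Yes"), ("rain", "Yes"), ("sunny", "Yes"), ("overcast", "Yes"),
    ("overcast", "Yes"), ("rain", "No"),
]
examples = [{"Outlook": o, "Play": p} for o, p in rows]
print(f"{information_gain(examples, 'Outlook', 'Play'):.3f}")  # 0.247
```
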
Continuing to split

• Gains computed on the examples in the Outlook = sunny branch:

    gain(Humidity)    = 0.971 bits
    gain(Temperature) = 0.571 bits
    gain(Windy)       = 0.020 bits

The final decision tree

• Note: not all leaves need to be pure; sometimes identical instances have different classes
• Splitting stops when the data can't be split any further

Highly-branching attributes

• Problematic: attributes with a large number of values (extreme case: an ID code)
• Subsets are more likely to be pure if there is a large number of values
  ⇒ Information gain is biased towards choosing attributes with a large number of values
  ⇒ This may result in overfitting (selection of an attribute that is non-optimal for prediction)

Weather Data with ID code

ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No

Split for ID Code Attribute

• Splitting on the ID code gives one branch per example, so the entropy of the split is 0: each leaf node is "pure", containing only one case.
• Information gain is therefore maximal for the ID code:

    gain(ID code) = info([9,5]) − 0 = 0.940 bits

Gain ratio

• Gain ratio: a modification of the information gain that reduces its bias towards highly-branching attributes
• Gain ratio should be
  • large when data is evenly spread across branches
  • small when all data belong to one branch
• Gain ratio takes the number and size of branches into account when choosing an attribute
  • It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much information do we need to tell which branch an instance belongs to?)

Gain Ratio and Intrinsic Info.

• Intrinsic information: the entropy of the distribution of instances into branches

    IntrinsicInfo(S, A) = − Σ_i (|S_i| / |S|) · log2(|S_i| / |S|)

• Gain ratio (Quinlan '86) normalizes the information gain by the intrinsic information:

    GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)

Computing the gain ratio

• Example: intrinsic information of the ID code attribute

    info([1,1,…,1]) = 14 × (−1/14 × log2(1/14)) = 3.807 bits

• The importance of an attribute decreases as its intrinsic information gets larger
• Definition of gain ratio (a code sketch follows below):

    gain_ratio(Attribute) = gain(Attribute) / intrinsic_info(Attribute)

• Example:

    gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246

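A minimal sketch of the gain-ratio computation (not from the slides), assuming the entropy() and information_gain() helpers from the earlier sketches; names are illustrative.

```python
from collections import Counter

def intrinsic_info(examples, attribute):
    # Entropy of the distribution of instances into the attribute's branches.
    counts = Counter(ex[attribute] for ex in examples)
    total = sum(counts.values())
    return entropy([c / total for c in counts.values()])

def gain_ratio(examples, attribute, target):
    split_info = intrinsic_info(examples, attribute)
    if split_info == 0:       # attribute has a single value: no real split
        return 0.0
    return information_gain(examples, attribute, target) / split_info
```
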
Gain ratios for weather data

Attribute    Info   Gain                    Split info              Gain ratio
Outlook      0.693  0.940 − 0.693 = 0.247   info([5,4,5]) = 1.577   0.247 / 1.577 = 0.156
Temperature  0.911  0.940 − 0.911 = 0.029   info([4,6,4]) = 1.557   0.029 / 1.557 = 0.019
Humidity     0.788  0.940 − 0.788 = 0.152   info([7,7])   = 1.000   0.152 / 1.000 = 0.152
Windy        0.892  0.940 − 0.892 = 0.048   info([8,6])   = 0.985   0.048 / 0.985 = 0.049

For comparison: gain_ratio(ID code) = 0.940 bits / 3.807 bits = 0.246

More on the gain ratio

• "Outlook" still comes out on top
• However, "ID code" has an even greater gain ratio
  • Standard fix: an ad hoc test to prevent splitting on that type of attribute
• Problem with gain ratio: it may overcompensate
  • It may choose an attribute just because its intrinsic information is very low
  • Standard fix:
    • First, only consider attributes with greater-than-average information gain
    • Then, compare them on gain ratio

*CART Splitting Criteria: Gini Index

• If a data set T contains examples from n classes, the gini index gini(T) is defined as

    gini(T) = 1 − Σ_{j=1..n} p_j²

  where p_j is the relative frequency of class j in T.
• gini(T) is minimized (reaching 0) when T is skewed towards a single class, and is largest when the classes are evenly represented.

*Gini Index

• After splitting T into two subsets T1 and T2 with sizes N1 and N2 (where N = N1 + N2), the gini index of the split data is defined as

    gini_split(T) = (N1/N) · gini(T1) + (N2/N) · gini(T2)

• The attribute providing the smallest gini_split(T) is chosen to split the node (a short code sketch follows below).

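A minimal sketch of the Gini criterion above for examples stored as dicts (not from the slides; names are illustrative).

```python
from collections import Counter

def gini(examples, target):
    # 1 minus the sum of squared class frequencies.
    counts = Counter(ex[target] for ex in examples)
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def gini_split(left, right, target):
    # Size-weighted Gini of a binary split into two subsets.
    n = len(left) + len(right)
    return len(left) / n * gini(left, target) + len(right) / n * gini(right, target)
```
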
Discussion

• The algorithm for top-down induction of decision trees ("ID3") was developed by Ross Quinlan
  • Gain ratio is just one modification of this basic algorithm
  • It led to the development of C4.5, which can deal with numeric attributes, missing values, and noisy data
• Similar approach: CART (to be covered later)
• There are many other attribute selection criteria

Summary

• Top-Down Decision Tree Construction
• Choosing the Splitting Attribute
  • Information Gain is biased towards attributes with a large number of values
  • Gain Ratio takes the number and size of branches into account when choosing an attribute

What is Random Forest? How does it work?

• Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also undertakes dimensionality reduction, treats missing values and outlier values, and handles other essential steps of data exploration, and it does a fairly good job. It is a type of ensemble learning method, where a group of weak models combine to form a powerful model.
• In Random Forest, we grow multiple trees, as opposed to a single tree in the CART model. To classify a new object based on its attributes, each tree gives a classification and we say the tree "votes" for that class. The forest chooses the classification having the most votes.

• It works in the following manner; each tree is planted and grown as follows (a minimal library-based sketch follows after this list):
  • Assume the number of cases in the training set is N. A sample of these N cases is taken at random, but with replacement. This sample will be the training set for growing the tree.
  • If there are M input variables, a number m < M is specified such that, at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant while we grow the forest.
  • Each tree is grown to the largest extent possible; there is no pruning.
  • New data are predicted by aggregating the predictions of the ntree trees (i.e., majority vote for classification, average for regression).

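As a practical illustration (not part of the original slides), a random forest with these ingredients, bootstrap samples, m randomly chosen variables per split, and majority voting, can be fitted with scikit-learn; the dataset and parameter values below are assumptions chosen only for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees ("ntree")
    max_features="sqrt",   # m variables tried at each split (m < M)
    bootstrap=True,        # sample N cases with replacement for each tree
    oob_score=True,        # keep the out-of-bag estimate (see the later slides)
    random_state=0,
)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
print("out-of-bag estimate:", forest.oob_score_)
```

Here max_features="sqrt" corresponds to the common choice of m ≈ √M for classification forests.
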
Advantages of Random Forest

• This algorithm can solve both types of problems, i.e. classification and regression, and gives a decent estimate on both fronts.
• One of the most appealing benefits of Random Forest is its power to handle large data sets with high dimensionality. It can handle thousands of input variables and identify the most significant ones, so it is considered one of the dimensionality reduction methods. Further, the model outputs variable importance scores, which can be a very handy feature.

• It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data is missing.
• It has methods for balancing errors in data sets where classes are imbalanced.
• These capabilities can be extended to unlabeled data, leading to unsupervised clustering, data views and outlier detection.
• Random Forest samples the input data with replacement, which is called bootstrap sampling. Roughly one third of the data is not used for training a given tree and can be used to test it; these are called the out-of-bag samples, and the error estimated on them is known as the out-of-bag error. Studies of out-of-bag error estimates show that the out-of-bag estimate is about as accurate as using a test set of the same size as the training set. Therefore, using the out-of-bag error estimate removes the need for a set-aside test set (a small illustration of the out-of-bag fraction follows below).

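As a small aside (not from the slides), the "one third" figure can be checked by simulation: with bootstrap sampling, the expected fraction of cases left out of a sample of size N is (1 − 1/N)^N ≈ 1/e ≈ 0.37, which the slides round to roughly one third.

```python
import random

N = 10_000
# Draw N cases with replacement; the set keeps only the distinct ones.
sample = {random.randrange(N) for _ in range(N)}
print(1 - len(sample) / N)   # ~0.37 of the cases are "out of bag"
```
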
Disadvantages of Random Forest

• It does a good job at classification, but it is not as good for regression problems, since it does not give precise continuous predictions. In the case of regression it does not predict beyond the range seen in the training data, and the trees may over-fit data sets that are particularly noisy.
• Random Forest can feel like a black-box approach for statistical modelers: you have very little control over what the model does. At best, you can try different parameters and random seeds!

References

• KDnuggets: Data Mining Course by G. Piatetsky-Shapiro and G. Parker
• Chapter 7 of the textbook: J. Han and M. Kamber, Data Mining: Concepts and Techniques
• Several slides adapted from Witten & Eibe (credited in the original slide footers)
