DS4 - CLS-Decision Tree
2
Decision Tree
• Example: a student’s rules for studying or playing
3
How to build a decision tree from data?
• Given data, how do we build a classifier model (a decision tree)?
4
Decision tree
Golf play: Yes (red), No (blue)
5
Decision tree
• Classification tree: to separate a dataset into classes
belonging to the response variable
• Intuitive and easy to set up
• Easy to interpret
• Usually has two classes: Yes or No (1 or 0)
• But can have more than two categories
• Regression trees are used for numeric prediction problems
6
Golf data example
7
Ideas to build decision tree
• Define the order of attributes at each step
• For problems with many attributes, each taking many different values, finding the optimal tree is often not feasible
• A simple method is to select, at each step, the best attribute based on some criterion
• For each selected attribute, we divide the data into child nodes corresponding to the values of that attribute, and then continue to apply this method to each child node
• Choosing the best attribute at each step like this is called greedy selection (sketched below)
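• A minimal Python sketch of this greedy procedure, assuming categorical attributes stored as dictionaries; the helper names (build_tree, best_attribute) are illustrative only:

```python
from collections import Counter
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Greedy step: pick the attribute whose split leaves the lowest weighted entropy."""
    def weighted_entropy(attr):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[attr], []).append(y)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return min(attributes, key=weighted_entropy)  # lowest weighted entropy = highest gain

def build_tree(rows, labels, attributes):
    """Recursively split the data; stop when a node is pure or no attributes remain."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf node: majority class
    attr = best_attribute(rows, labels, attributes)
    branches = {}
    for value in {row[attr] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[attr] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     [a for a in attributes if a != attr])
    return (attr, branches)
```

Each internal node records the attribute tested and one branch per attribute value, mirroring the flowchart structure of the earlier slides.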
8
How to reduce uncertainty
• Imagine a box that can contain one of three colored balls
inside—red, yellow, and blue
• Without opening the box, if one had to “predict” which
colored ball is inside, then they are basically dealing with a
lack of information or uncertainty.
• What is the highest number of “yes/no” questions that can
be asked to reduce this uncertainty and, thus, increase our
information?
1. Is it red? No.
2. Is it yellow? No.
Then it must be blue.
• That is two questions.
ball-in-a-box problem
9
How to reduce uncertainty
• The maximum number of binary questions needed to
reduce uncertainty is essentially log(T), where the log is
taken to base 2 and T is the number of possible outcomes
• If there was only one color, that is, one outcome, then log(1)
= 0, which means there is no uncertainty
• If there are T events with equal probability of occurrence P,
then T = 1/P
• Claude Shannon defined entropy as $\log_2(1/P)$, or $-\log_2(P)$, where $P$ is the probability of an event occurring
• If the probability for all events is not identical, a weighted expression is needed and, thus, entropy, $H$, is adjusted as follows:
  $H = -\sum_{k} p_k \log_2(p_k)$
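• A quick numeric check of these formulas (illustrative probabilities only):

```python
import math

# log2(T): binary questions needed for T equally likely outcomes.
print(math.log2(1))   # 0.0 -> a single possible outcome, no uncertainty
print(math.log2(4))   # 2.0 -> four equally likely outcomes need two questions

# Weighted (Shannon) entropy for unequal probabilities.
probs = [0.5, 0.25, 0.25]
print(sum(-p * math.log2(p) for p in probs))   # 1.5 bits
```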
10
Entropy
• Graph of the binary entropy function $H(p) = -p \log_2(p) - (1-p) \log_2(1-p)$
• Pure node: $H = 0$ when $p = 0$ or $p = 1$
• Impure node: $H$ reaches its maximum value of 1 when $p = 0.5$
11
Entropy and Gini index
• If the dataset had 100 samples with 50% of each class, then the entropy of the dataset is given by $H = -0.5\log_2(0.5) - 0.5\log_2(0.5) = 1$
• On the other hand, if the data can be partitioned into two sets of 50 samples each that exclusively contain all members and all nonmembers, the entropy of either of these two partitioned sets is given by $H = -1\log_2(1) = 0$
• Any other proportion of samples within a dataset will yield entropy values between 0 and 1 (which is the maximum)
• The Gini index ($G$) is similar to the entropy measure in its characteristics and is defined as $G = 1 - \sum_k p_k^2$
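• A quick check of both measures on the two cases above:

```python
import math

def entropy(probs):
    """H = -sum(p * log2(p)) over classes with nonzero probability."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """G = 1 - sum(p^2)."""
    return 1 - sum(p ** 2 for p in probs)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))   # 1.0 0.5 -> maximum impurity
print(entropy([1.0]), gini([1.0]))             # 0.0 0.0 -> a pure partition
```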
13
Split criteria
• The measure of impurity of a dataset must be at a
maximum when all possible classes are equally
represented.
• The measure of impurity of a dataset must be zero
when only one class is represented.
• Measures such as entropy or Gini index easily meet
these criteria and are used to build decision trees.
• Different criteria will build different trees through
different biases.
14
Build Decision Tree
• Two steps:
• Step 1: Where to Split Data?
• Step 2: When to Stop Splitting Data?
• The Iterative Dichotomiser 3 (ID3) algorithm
• Another algorithm is the Classification and Regression Tree (CART)
15
Step 1: Where to Split Data?
16
Algorithm
• Consider a non-leaf node with $N$ data points, of which $N_c$ points belong to class $c$ ($c = 1, \dots, C$). The entropy of this node is:
  $H(S) = -\sum_{c=1}^{C} \frac{N_c}{N} \log_2 \frac{N_c}{N}$   (1)
• Select an attribute $x$. Based on $x$, the data points are split into $K$ child nodes $S_1, \dots, S_K$, with the number of points in each child node being $m_1, \dots, m_K$. Define:
  $H(x, S) = \sum_{k=1}^{K} \frac{m_k}{N} H(S_k)$   (2)
• The information gain based on attribute $x$ is defined as:
  $G(x, S) = H(S) - H(x, S)$
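• Equations (1) and (2) translate directly into code; a minimal sketch working from class counts:

```python
import math

def node_entropy(class_counts):
    """Eq. (1): H(S) = -sum_c (N_c / N) * log2(N_c / N)."""
    n = sum(class_counts)
    return sum(-(c / n) * math.log2(c / n) for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts):
    """G(x, S) = H(S) - sum_k (m_k / N) * H(S_k), with the sum from Eq. (2)."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * node_entropy(child) for child in child_counts)
    return node_entropy(parent_counts) - weighted
```

The next slides apply this computation to the golf data.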
17
Example – Football team play or not?
18
Entropy at root node
With 9 Yes and 5 No among the 14 examples:
$H(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$
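• A quick check of this value (assuming the 9 Yes / 5 No split of the 14 examples):

```python
import math

p_yes, p_no = 9 / 14, 5 / 14
print(-p_yes * math.log2(p_yes) - p_no * math.log2(p_no))   # ~0.940
```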
19
Consider the Outlook attribute
20
Outlook attribute
• Splitting on Outlook gives: sunny (2 Yes, 3 No), overcast (4 Yes, 0 No), rain (3 Yes, 2 No)
  $H(\text{Outlook}, S) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971) = 0.69$
• Information gain: $G(\text{Outlook}, S) = 0.94 - 0.69 = 0.25$
21
Temperature, Humidity, Wind attributes
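• Repeating the gain computation for every attribute; a sketch assuming the per-value Yes/No counts of the classic 14-row golf table (the counts below are assumptions, not shown on the slide):

```python
import math

def H(yes, no):
    """Binary entropy from class counts."""
    total = yes + no
    return sum(-(c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

root = H(9, 5)                                   # ~0.94
# (Yes, No) counts per attribute value, assumed from the classic golf table.
splits = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],     # sunny, overcast, rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],     # hot, mild, cool
    "Humidity":    [(3, 4), (6, 1)],             # high, normal
    "Wind":        [(6, 2), (3, 3)],             # weak, strong
}
for name, counts in splits.items():
    weighted = sum((y + n) / 14 * H(y, n) for y, n in counts)
    print(name, round(root - weighted, 3))
# Outlook gives the largest gain (~0.25), so it is chosen as the first split.
```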
22
Golf data example
23
• Start by partitioning the
data on each of the four
regular attributes
• Let us start with
Outlook. There are
three categories for this
variable: sunny,
overcast, and rain.
24
• For numeric variables, possible split points to examine are
essentially averages of available values. For example, the
first potential split point for Humidity could be Average
[65,70], which is 67.5, the next potential split point could be
Average [70,75], which is 72.5, and so on.
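• Candidate split points for a numeric attribute are simply the midpoints of consecutive sorted values; an illustrative subset of Humidity values:

```python
# Sorted, distinct Humidity values (an illustrative subset of the golf data).
values = [65, 70, 75, 80, 85, 90, 95]

# The midpoint between each pair of consecutive values is a candidate split point.
candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
print(candidates)   # [67.5, 72.5, 77.5, 82.5, 87.5, 92.5]
```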
25
26
27
Step 2: When to Stop Splitting Data?
• The algorithm would need to be instructed when to stop.
• There are several situations where the process can be terminated:
• No attribute satisfies a minimum information gain threshold
• A maximal depth is reached: as the tree grows larger, not only does
interpretation get harder, but a situation called “overfitting” is induced.
• There are fewer than a certain number of examples in the current subtree: again, a mechanism to prevent overfitting.
• To prevent overfitting, tree growth may need to be restricted or
reduced, using a process called pruning
• Pre-pruning: pruning occurs before or during the growth of the tree (see the sketch below).
• There are also methods that will not restrict the number of
branches and allow the tree to grow as deep as the data will allow,
and then trim or prune those branches that do not effectively
change the classification error rates. This is called post-pruning.
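• These stopping rules map directly onto the pre-pruning hyperparameters of common implementations; a sketch using scikit-learn's DecisionTreeClassifier (the parameter values are arbitrary examples):

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: restrict growth while the tree is being built.
clf = DecisionTreeClassifier(
    criterion="entropy",          # impurity measure ("gini" is the default)
    max_depth=4,                  # stop when a maximal depth is reached
    min_samples_leaf=5,           # stop when a subtree would get too few examples
    min_impurity_decrease=0.01,   # require a minimum impurity (gain-like) reduction
)
# clf.fit(X_train, y_train)       # X_train / y_train are assumed to exist
```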
28
Post pruning
• Reduced error pruning: relies on a validation set
• Regularization: add a regularization term to the loss function that grows with the number of leaf nodes (see the sketch below)
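• A sketch of post-pruning with scikit-learn's cost-complexity pruning, choosing the pruning strength on a held-out validation set in the spirit of reduced error pruning (X_train, y_train, X_val, y_val are assumed to exist):

```python
from sklearn.tree import DecisionTreeClassifier

def prune_with_validation(X_train, y_train, X_val, y_val):
    """Grow a full tree, then pick the ccp_alpha that scores best on the validation set."""
    path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
    best_alpha, best_score = 0.0, -1.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha).fit(X_train, y_train)
        score = tree.score(X_val, y_val)          # validation accuracy
        if score > best_score:
            best_alpha, best_score = alpha, score
    return DecisionTreeClassifier(ccp_alpha=best_alpha).fit(X_train, y_train)
```

Larger ccp_alpha values correspond to stronger regularization, i.e., more branches pruned and fewer leaf nodes.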
29
Decision Tree – Model building
1. Calculate the Shannon entropy of the target variable (no partition)
2. Calculate the weighted entropy of the target variable under each independent variable
3. Compute the information gain
4. Choose the independent variable with the highest information gain as the splitting node
5. Repeat this process. If the entropy of a branch is zero, that branch becomes a leaf node
30
Summary of Decision Tree
• A decision tree model takes the form of a decision flowchart
• An attribute is tested at each node
• At the end of each decision tree path is a leaf node, where a prediction is made about the target variable based on the conditions set forth by the decision path
• The nodes split the dataset into subsets
• In a decision tree, the idea is to split the dataset based on the homogeneity of the data
• A rigorous measure of impurity is needed, one that meets certain criteria and is based on computing the proportion of the data that belongs to a class
31
• RapidMiner process files and data sets
• http://www.introdatascience.com/uploads/4/2/1/5
/42154413/second_ed_rm_process.zip
32