DS4 - CLS-Decision Tree

This document provides an overview of decision trees for classification. It explains how decision trees work to reduce uncertainty and classify data by building a tree structure that recursively partitions the data space and performs predictions at the leaf nodes. The key steps discussed are selecting the best attributes to split on at each node to maximize information gain or minimize impurity, and determining when to stop splitting the data to avoid overfitting.

Data Science for Business

Lesson 4: Classification - Decision Tree

Dr. Le, Hai Ha


Contents
 Decision Tree
 Reduce Uncertainty
 Build Decision Tree
 Summary of Decision Tree

2
Decision Tree
• Example: a student’s rules for studying or playing

3
How to build decision tree from data?
• Given data, how to build a classifier model (decision tree)?

4
Decision tree
Golf play: Yes (red), No (blue)

5
Decision tree
• Classification tree: to separate a dataset into classes
belonging to the response variable
• Intuitive and easy to set up
• Easy to interpret
• Usually has two classes: Yes or No (1 or 0)
• But can have more than two categories
• Regression trees: are used for numeric prediction problems

6
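As an illustration beyond the slides, the two tree types correspond to separate estimators in scikit-learn; a minimal sketch, assuming scikit-learn and its bundled iris data are available:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification tree: separates the dataset into the classes of the response variable.
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(clf.predict(X[:2]))        # predicted class labels

# Regression tree: numeric prediction (here, petal width from the other three features).
reg = DecisionTreeRegressor(max_depth=2).fit(X[:, :3], X[:, 3])
print(reg.predict(X[:2, :3]))    # predicted numeric values
```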
Golf data example

7
Ideas to build decision tree
• Define the order of attributes at each step
• For problems with many attributes where each
attribute has many different values, finding the
optimal tree is often not feasible
• A simple method is, at each step, to select the best
attribute based on some criterion
• For each selected attribute, we divide the data into
child nodes corresponding to the values of that
attribute and then continue to apply this method to
each child node.
• Choosing the best attribute at each step like this is
called greedy selection.
8
How to reduce uncertainty
• Imagine a box that can contain one of three colored balls
inside—red, yellow, and blue
• Without opening the box, if one had to “predict” which
colored ball is inside, then they are basically dealing with a
lack of information or uncertainty.
• At most how many “yes/no” questions need to be asked
to remove this uncertainty and, thus, increase our
information?
1. Is it red? No.
2. Is it yellow? No.
Then it must be blue.
• That is two questions.

ball-in-a-box problem
9
How to reduce uncertainty
• The maximum number of binary questions needed to
reduce uncertainty is essentially log(T), where the log is
taken to base 2 and T is the number of possible outcomes.
For the ball-in-a-box example, T = 3 and log2(3) ≈ 1.58,
so two questions are enough
• If there was only one color, that is, one outcome, then log(1)
= 0, which means there is no uncertainty
• If there are T events with equal probability of occurrence P,
then T = 1/P
• Claude Shannon defined Entropy as H = log2(1/P), or
H = −log2(P), where P is the probability of an event occurring
• If the probability of all events is not identical, a weighted
expression is needed and, thus, entropy, H, is adjusted as
follows:
H = − Σ_k p_k log2(p_k), where p_k is the probability of outcome k

10
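As a quick illustration (not from the slides), the weighted entropy formula above can be computed directly; a minimal Python sketch with illustrative values:

```python
import math

def entropy(probabilities):
    """Shannon entropy H = -sum(p * log2(p)), skipping zero-probability events."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))            # 0.0   -> one certain outcome, no uncertainty
print(entropy([1/3, 1/3, 1/3]))  # ~1.58 -> the three-ball box, about two questions
print(entropy([0.5, 0.5]))       # 1.0   -> two equally likely classes
```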
Entropy
• Graph of the binary entropy function
H(p) = −p log2(p) − (1−p) log2(1−p)
• The node is pure when p = 0 or p = 1, where H(p) = 0
• The node is most impure when p = 0.5, where H reaches its
maximum value of 1
11
Entropy and Gini index
• If the dataset had 100 samples with 50% of each class, then the
entropy of the dataset is given by
H = −0.5 log2(0.5) − 0.5 log2(0.5) = 1
• On the other hand, if the data can be partitioned into two sets of
50 samples each that exclusively contain all members and all
nonmembers, the entropy of either of these two partitioned sets
is given by H = −1 log2(1) = 0
• Any other proportion of samples within a dataset will yield
entropy values between 0 and 1 (which is the maximum)
• The Gini index (G) is similar to the entropy measure in its
characteristics and is defined as
G = 1 − Σ_k p_k^2
• The value of G ranges between 0 and a maximum value of 0.5, but
otherwise has properties identical to entropy
12
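The same comparison in code (a sketch; the helper names are illustrative):

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def gini(probs):
    """Gini index G = 1 - sum(p^2)."""
    return 1 - sum(p * p for p in probs)

print(entropy([0.5, 0.5]), gini([0.5, 0.5]))  # 1.0 0.5 -> both at their maximum
print(entropy([1.0]), gini([1.0]))            # 0.0 0.0 -> pure set, no impurity
```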
Measure of impurity
• Every split tries to make the child nodes more pure.

13
Split criteria
• The measure of impurity of a dataset must be at a
maximum when all possible classes are equally
represented.
• The measure of impurity of a dataset must be zero
when only one class is represented.
• Measures such as entropy or Gini index easily meet
these criteria and are used to build decision trees.
• Different criteria will build different trees through
different biases.

14
Build Decision Tree
• Two steps:
• Step 1: Where to Split Data?
• Step 2: When to Stop Splitting Data?
• The Iterative Dichotomizer (ID3) algorithm
• Another algorithm is Classification and Regression
Trees (CART)

15
Step 1: Where to Split Data?

16
Algorithm
• Consider a non-leaf node whose data contains N points, with N_c
points belonging to class c (c = 1, …, C). The entropy of this node is:
H(S) = − Σ_{c=1}^{C} (N_c/N) log2(N_c/N)   (1)
• Select an attribute x. Based on x, the data points are split into K
child nodes S_1, …, S_K, with the number of points in each child
node being m_1, …, m_K. Define the weighted entropy after the split:
H(x, S) = Σ_{k=1}^{K} (m_k/N) H(S_k)   (2)
• The information gain based on attribute x is defined as:
G(x, S) = H(S) − H(x, S)
• In ID3, at each node, the selected attribute is determined
based on:
x* = argmax_x G(x, S), which is equivalent to x* = argmin_x H(x, S)
17
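A sketch of this split criterion in Python (function and variable names are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum((N_c / N) * log2(N_c / N)) over the classes c in the node."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """G(x, S) = H(S) - sum((m_k / N) * H(S_k)) for one candidate attribute x."""
    n = len(labels)
    children = {}
    for value, label in zip(attribute_values, labels):
        children.setdefault(value, []).append(label)
    weighted = sum(len(child) / n * entropy(child) for child in children.values())
    return entropy(labels) - weighted
```

ID3 would evaluate information_gain for every remaining attribute and split on the one with the largest value.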
Example – Football team play or not?

18
Entropy at root node
At the root, 9 of the 14 examples are “Yes” and 5 are “No”, so
H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) ≈ 0.94

19
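A quick check of this value, assuming the 9 Yes / 5 No split of the 14 examples above:

```python
import math

counts = [9, 5]  # assumed Yes / No counts at the root of the example
n = sum(counts)
print(round(sum(-(c / n) * math.log2(c / n) for c in counts), 2))  # 0.94
```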
Consider outlook attribute

20
Outlook attribute
Splitting on Outlook (sunny: 2 Yes / 3 No, overcast: 4 Yes / 0 No,
rain: 3 Yes / 2 No) gives a weighted entropy of
H(Outlook, S) = (5/14)(0.97) + (4/14)(0) + (5/14)(0.97) ≈ 0.69,
so the information gain is 0.94 − 0.69 = 0.25

21
Temperature, Humidity, Wind attributes

Outlook has the highest information gain and is selected in the first step

22
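The same computation for all four attributes, assuming the class counts of the classic 14-row golf (play-tennis) data that this example appears to use; a hedged sketch:

```python
import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

# (Yes, No) counts per attribute value -- assumed from the classic golf dataset.
splits = {
    "Outlook":     [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rain
    "Temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
    "Humidity":    [(3, 4), (6, 1)],          # high, normal
    "Wind":        [(6, 2), (3, 3)],          # weak, strong
}

root_entropy = entropy([9, 5])  # ~0.94
for name, children in splits.items():
    total = sum(sum(pair) for pair in children)
    weighted = sum(sum(pair) / total * entropy(pair) for pair in children)
    print(name, round(root_entropy - weighted, 3))
# Outlook yields the largest gain (~0.247), which is why it is chosen first.
```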
Golf data example

23
• Start by partitioning the
data on each of the four
regular attributes
• Let us start with
Outlook. There are
three categories for this
variable: sunny,
overcast, and rain.

24
• For numeric variables, possible split points to examine are
essentially averages of consecutive sorted values. For example, the
first potential split point for Humidity could be Average
[65,70], which is 67.5, the next potential split point could be
Average [70,75], which is 72.5, and so on.

25
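A small sketch of generating those candidate split points as midpoints of consecutive sorted values (the sample Humidity values are illustrative):

```python
def candidate_split_points(values):
    """Midpoints between consecutive distinct sorted values of a numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_split_points([65, 70, 75, 80]))  # [67.5, 72.5, 77.5]
```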
26
27
Step 2: When to Stop Splitting Data?
• The algorithm would need to be instructed when to stop.
• There are several situations where the process can be terminated:
• No attribute satisfies a minimum information gain threshold
• A maximal depth is reached: as the tree grows larger, not only does
interpretation get harder, but a situation called “overfitting” is induced.
• There are fewer than a certain number of examples in the current subtree:
again, a mechanism to prevent overfitting.
• To prevent overfitting, tree growth may need to be restricted or
reduced, using a process called pruning
• Pre-pruning: the pruning occurs before or during the growth of
the tree.
• There are also methods that will not restrict the number of
branches and allow the tree to grow as deep as the data will allow,
and then trim or prune those branches that do not effectively
change the classification error rates. This is called post-pruning.

28
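As one concrete illustration (not from the slides), these stopping rules appear as pre-pruning hyperparameters in scikit-learn; a minimal sketch, assuming scikit-learn and its bundled iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: restrict growth while the tree is being built.
tree = DecisionTreeClassifier(
    criterion="entropy",         # impurity measure used for splits
    max_depth=3,                 # stop at a maximal depth
    min_samples_leaf=5,          # require a minimum number of examples per leaf
    min_impurity_decrease=0.01,  # require a minimum gain for any split
).fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```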
Post pruning
• Reduced error pruning: relies on a validation set
• Regularization: add a regularization term to the loss function;
the term grows with the number of leaf nodes, for example
L = Σ_{k=1}^{K} (m_k/N) H(S_k) + λK,
where K is the number of leaf nodes and λ > 0 is a regularization weight
• First, construct a decision tree in which every point in the
training set is correctly classified (the entropy of every leaf node is
zero). Now the data loss is zero but the regularization term can be
large, making L large.
• Then prune the leaf nodes so that L decreases. The
pruning is repeated until L can no longer be reduced.

29
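For comparison (also beyond the slides), scikit-learn implements a closely related post-pruning scheme, minimal cost-complexity pruning, where the ccp_alpha parameter plays the role of the regularization weight; a hedged sketch:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full tree first, then prune: larger ccp_alpha penalizes extra leaves more.
full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)  # candidate alpha values

for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(criterion="entropy", ccp_alpha=alpha,
                                    random_state=0).fit(X, y)
    print(round(float(alpha), 4), pruned.get_n_leaves())
```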
Decision Tree – Model building
1. Calculate Shannon Entropy (no partition)
2. Calculate weighted entropy of each independent
variable on the target variable
3. Compute Information gain
4. Choose the independent variable with the
highest information gain as the split node
5. Repeat this process on each subset. When the entropy of a
subset is zero, that node becomes a leaf node

30
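A compact recursive sketch of these five steps (names and data layout are illustrative; not a complete ID3 implementation):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    """rows: list of dicts mapping attribute name -> value; labels: target values."""
    # Stop: the subset is pure (entropy zero) or no attributes are left to split on.
    if entropy(labels) == 0 or not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Pick the attribute with the highest information gain (lowest weighted entropy).
    def weighted_entropy(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(attributes, key=weighted_entropy)
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, lab) for r, lab in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return tree
```

Called on the golf rows with attributes such as Outlook, Temperature, Humidity, and Wind, this sketch would be expected to reproduce the kind of tree shown in the earlier slides.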
Summary of Decision Tree
• A decision tree model takes the form of a decision flowchart
• An attribute is tested at each node
• At the end of each decision path is a leaf node where a
prediction is made about the target variable based on the
conditions set forth along that path.
• The nodes split the dataset into subsets.
• In a decision tree, the idea is to split the dataset based on
the homogeneity of data.
• A rigorous measure of impurity is needed, which meets
certain criteria, based on computing a proportion of the
data that belong to a class.

31
• RapidMiner process files and data sets
• http://www.introdatascience.com/uploads/4/2/1/5/42154413/second_ed_rm_process.zip

32
