
Machine Learning:

Decision Tree
Part – 1
Dr. Oybek Eraliev,
Department of Computer Engineering
Inha University in Tashkent.
Email: [email protected]



Content

Ø What is a Decision Tree?
Ø Decision Tree Terminologies
Ø How do decision tree algorithms work?
Ø Entropy
Ø How do Decision Trees use Entropy?
Ø Information Gain
Ø Pruning



What is a Decision Tree?
Introduction

Ø Decision trees are a popular machine learning algorithm that can be used for both regression and classification tasks.

Ø They are easy to understand, interpret, and implement, making them an ideal choice for beginners in the field of machine learning.

Ø In this lecture, we will cover all aspects of the decision tree algorithm, including the working principles, different types of decision trees, the process of building decision trees, and how to evaluate and optimize them.


Ø A decision tree, which has a hierarchical structure made up of a root node, branches, internal nodes, and leaf nodes, is a non-parametric supervised learning approach used for classification and regression applications.

Ø It is a tool with applications spanning several different areas. As the name suggests, it uses a flowchart-like tree structure to show the predictions that result from a series of feature-based splits.

Ø It starts with a root node and ends with a decision made by the leaves.



What is a Decision Tree?
Types of Decision Tree
Ø ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.

Ø C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.

Ø CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks. (A short scikit-learn sketch follows.)
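As a concrete illustration (a minimal sketch, not from the original slides): scikit-learn's CART-style DecisionTreeClassifier exposes the impurity measure through its criterion parameter, so "gini" gives CART-style splits while "entropy" gives the entropy/information-gain style splits associated with ID3 and C4.5. The iris dataset here is only a stand-in.

```python
# Minimal sketch: choosing the split criterion in scikit-learn.
# The iris dataset is a stand-in; substitute your own data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# CART-style tree using Gini impurity (scikit-learn's default criterion).
gini_tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Same estimator, but splitting on entropy / information gain (ID3/C4.5 style).
entropy_tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

print("Gini tree depth:   ", gini_tree.get_depth())
print("Entropy tree depth:", entropy_tree.get_depth())
```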



Decision Tree Terminologies

Ø Root Node: The initial node at the beginning of a decision tree, where the entire population or dataset starts dividing based on various features or conditions.

Ø Decision Nodes: Nodes resulting from the splitting of root nodes are known as decision nodes. These nodes represent intermediate decisions or conditions within the tree.


Ø Leaf Nodes: Nodes where further splitting is not possible, often indicating the final classification or outcome. Leaf nodes are also referred to as terminal nodes.

Ø Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a subsection of a decision tree is referred to as a sub-tree. It represents a specific portion of the decision tree.


Ø Pruning: The process of removing or cutting down specific nodes in a decision tree to prevent overfitting and simplify the model.

Ø Branch / Sub-Tree: A subsection of the entire decision tree is referred to as a branch or sub-tree. It represents a specific path of decisions and outcomes within the tree.


Ø Parent and Child Node: In a decision tree, a node that is divided into sub-nodes is known as a parent node, and the sub-nodes emerging from it are referred to as child nodes.

Ø The parent node represents a decision or condition, while the child nodes represent the potential outcomes or further decisions based on that condition.



How do decision tree algorithms work?

The decision tree algorithm works in a few simple steps (a small sketch follows this list):

• Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
• Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
• Branching Out: Based on the answer to that question, it divides the
data into smaller subsets, creating new branches. Each branch
represents a possible route through the tree.
• Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.
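A minimal sketch of these steps with scikit-learn (the tiny energy/motivation dataset below is made up for illustration): export_text prints the questions the tree has learned, from the root question down to the leaf nodes.

```python
# Minimal sketch: fit a small tree and print the learned "questions".
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data. Features: [energy (0 = low, 1 = high), motivation (0, 1, 2)].
X = [[1, 2], [1, 1], [0, 0], [0, 1], [1, 0], [0, 2], [1, 1], [0, 0]]
y = [1, 1, 0, 0, 1, 1, 1, 0]   # 1 = goes to the gym, 0 = does not

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# The printout starts at the root node, branches on each question,
# and ends at the leaf nodes holding the predicted classes.
print(export_text(tree, feature_names=["energy", "motivation"]))
```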
How do decision tree algorithms work?
Decision Tree Assumptions

Ø Several assumptions are made to build effective models when creating decision trees. These assumptions help guide the tree's construction and impact its performance. Here are some common assumptions and considerations when creating decision trees:
• Binary Splits: Decision trees typically make binary splits, meaning each node divides the data into two subsets based on a single feature or condition. This assumes that each decision can be represented as a binary choice.
• Recursive Partitioning: Decision trees use a recursive partitioning process, where each node is divided into child nodes, and this process continues until a stopping criterion is met. This assumes that data can be effectively subdivided into smaller, more manageable subsets.

Ø Feature Independence: Decision trees often assume that the features used for splitting nodes are independent. In practice, feature independence may not hold, but decision trees can still perform well if features are correlated.

Ø Homogeneity: Decision trees aim to create homogeneous subgroups in each node, meaning that the samples within a node are as similar as possible regarding the target variable. This assumption helps in achieving clear decision boundaries.


Ø Top-Down Greedy Approach: Decision trees are constructed using a top-down, greedy approach, where each split is chosen to maximize information gain or minimize impurity at the current node. This may not always result in the globally optimal tree.

Ø Categorical and Numerical Features: Decision trees can handle both categorical and numerical features. However, they may require different splitting strategies for each type.


Ø Overfitting: Decision trees are prone to overfitting when they capture noise in the data. Pruning and setting appropriate stopping criteria are used to address this.

Ø Impurity Measures: Decision trees use impurity measures such as Gini impurity or entropy to evaluate how well a split separates classes. The choice of impurity measure can impact tree construction.

Ø No Missing Values: Decision trees assume that there are no missing values in the dataset or that missing values have been appropriately handled through imputation or other methods.


Ø Equal Importance of Features: Decision trees may assume equal importance for all features unless feature scaling or weighting is applied to emphasize certain features.

Ø No Outliers: Decision trees are sensitive to outliers, and extreme values can influence their construction. Preprocessing or robust methods may be needed to handle outliers effectively.

Ø Sensitivity to Sample Size: Small datasets may lead to overfitting, and large datasets may result in overly complex trees. The sample size and tree depth should be balanced.



Entropy

Ø Entropy is nothing but the uncertainty in our dataset, a measure of disorder. Let me try to explain this with the help of an example.

Ø Suppose a group of friends is deciding which movie to watch together on Sunday. There are two choices, "Lucy" and "Titanic", and everyone has to state their preference. After everyone gives their answer, we see that "Lucy" gets 4 votes and "Titanic" gets 5 votes. Isn't it hard to choose one movie now, since the votes for the two movies are almost equal?


Ø This is exactly what we call disorder: there is a nearly equal number of votes for both movies, and we can't really decide which movie we should watch. It would have been much easier if "Lucy" had received 8 votes and "Titanic" only 2.

Ø Then we could easily say that the majority of votes are for "Lucy", hence everyone will watch that movie.

Ø In a decision tree, the output is mostly "yes" or "no".


Ø The formula for Entropy is shown below (a small code sketch follows):

E(S) = −p₊ log₂(p₊) − p₋ log₂(p₋)

Ø Here,
• p₊ is the probability of the positive class
• p₋ is the probability of the negative class
• S is the subset of training examples


How do Decision Trees use Entropy?

Ø Entropy basically measures the impurity of a node. Impurity is the degree of randomness; it tells us how random our data is.

Ø A pure sub-split means that you should be getting either all "yes" or all "no".


Ø Suppose a feature has 8 "yes" and 4 "no" initially. After the first split, the left node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".

Ø We see here that the split is not pure. Why? Because we can still see some negative classes in both nodes. In order to build a decision tree, we need to calculate the impurity of each split, and when the purity is 100%, we make the node a leaf node. (A quick numeric check of these nodes follows.)
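A quick numeric check of this example (a sketch; the helper simply applies the entropy formula to the yes/no counts of each node):

```python
import math

def entropy(yes: int, no: int) -> float:
    """Entropy of a node containing `yes` positive and `no` negative samples."""
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c)

print(round(entropy(8, 4), 3))   # parent node: 0.918
print(round(entropy(5, 2), 3))   # left node:   0.863 (purer)
print(round(entropy(3, 2), 3))   # right node:  0.971 (less pure)
```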



Ø To check the impurity of feature 2 and feature 3, we use the entropy formula.



Ø We can clearly see from the tree itself that the left node has lower entropy, i.e., more purity, than the right node, since the left node has a greater number of "yes" answers and it is easier to decide there.

Ø Always remember: the higher the entropy, the lower the purity and the higher the impurity.


Ø As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. By using entropy we get the impurity of a particular node, but we don't yet know whether the entropy of that node has decreased relative to the parent entropy after the split.

Ø For this, we bring in a new metric called "information gain", which tells us how much the parent entropy has decreased after splitting on some feature.



Information Gain

Ø Information gain measures the reduction in uncertainty given some feature, and it is also the deciding factor for which attribute should be selected as a decision node or root node. (A small code sketch follows.)

Information Gain = E(Y) − E(Y|X)

Ø It is simply the entropy of the full dataset minus the entropy of the dataset given some feature.
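A minimal sketch of this formula in code (the helper names are illustrative, not from the slides): information gain is the parent entropy minus the weighted entropy of the child groups produced by the split.

```python
import math
from collections import Counter

def entropy(labels) -> float:
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent_labels, child_label_groups) -> float:
    """IG = E(Y) - E(Y|X): parent entropy minus the weighted child entropies."""
    total = len(parent_labels)
    weighted = sum((len(g) / total) * entropy(g) for g in child_label_groups)
    return entropy(parent_labels) - weighted

# The earlier toy split: 8 "yes" / 4 "no" divided into (5 yes, 2 no) and (3 yes, 2 no).
parent = ["yes"] * 8 + ["no"] * 4
left, right = ["yes"] * 5 + ["no"] * 2, ["yes"] * 3 + ["no"] * 2
print(round(information_gain(parent, [left, right]), 3))   # ~0.01: a weak split
```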


Ø To understand this better, let's consider an example. Suppose our entire population has a total of 30 instances, and the dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't. (A quick numeric check of the parent entropy follows the feature list below.)

Ø Now we have two features to predict whether he/she will go to the gym or not:
• Feature 1 is "Energy", which takes two values, "high" and "low".
• Feature 2 is "Motivation", which takes three values, "No motivation", "Neutral", and "Highly motivated".
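A quick numeric check of the parent entropy for this example (a sketch; 16 of the 30 people go to the gym and 14 do not):

```python
import math

p_go, p_not = 16 / 30, 14 / 30
parent_entropy = -p_go * math.log2(p_go) - p_not * math.log2(p_not)
print(round(parent_entropy, 3))   # 0.997, i.e. the ~0.99 used on the next slides
```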


Ø Let's see how our decision tree will be made using these 2 features.

Ø We'll use information gain to decide which feature should be the root node and which feature should be placed after the split.


Ø Let's calculate the entropy of the parent node: E(Parent) = −(16/30)·log₂(16/30) − (14/30)·log₂(14/30) ≈ 0.99.

Ø To get the conditional entropy E(Parent|feature), we take the weighted average of the entropy of each child node, where each weight is the fraction of samples that reaches that child.


Ø Now that we have the values of E(Parent) and E(Parent|Energy), the information gain is:

Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37

Ø Our parent entropy was near 0.99, and after looking at this value of information gain, we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy" our root node.


Ø Similarly, we will do this with the other feature, "Motivation", and calculate its information gain.

Ø E(Parent) = 0.99
Ø E(Parent | Motivation = "No motivation") = −(7/8)·log₂(7/8) − (1/8)·log₂(1/8) = 0.54
Ø E(Parent | Motivation = "Neutral") = −(4/10)·log₂(4/10) − (6/10)·log₂(6/10) = 0.97
Ø E(Parent | Motivation = "Highly motivated") = −(5/12)·log₂(5/12) − (7/12)·log₂(7/12) = 0.98


Ø To get the weighted average of the entropy of each child node, we do the following (a short code check follows):

Ø E(Parent|Motivation) = (8/30)·0.54 + (10/30)·0.97 + (12/30)·0.98 = 0.86

Ø Information Gain = E(Parent) − E(Parent|Motivation) = 0.99 − 0.86 = 0.13
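These numbers can be reproduced in a few lines (a sketch; the per-group yes/no counts 7-1, 4-6 and 5-7 are the ones implied by the fractions above, and the small difference in the last digit comes from rounding each entropy before subtracting):

```python
import math

def entropy(a: int, b: int) -> float:
    total = a + b
    return -sum((c / total) * math.log2(c / total) for c in (a, b) if c)

parent = entropy(16, 14)                           # ~0.99
groups = [(7, 1), (4, 6), (5, 7)]                  # No motivation, Neutral, Highly motivated
total = sum(a + b for a, b in groups)              # 30 people in all
weighted = sum(((a + b) / total) * entropy(a, b) for a, b in groups)

print(round(parent, 2), round(weighted, 2), round(parent - weighted, 2))
# 1.0 0.86 0.14  (the slides round each step first: 0.99 - 0.86 = 0.13)
```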


Ø We now see that the "Energy" feature gives a larger reduction (0.37) than the "Motivation" feature (0.13). Hence we select the feature with the highest information gain and then split the node based on that feature.

Ø In this example, "Energy" will be our root node, and we do the same for the sub-nodes. Here we can see that when the energy is "high" the entropy is low, and hence we can say a person will definitely go to the gym if he has high energy. But what if the energy is low? We will again split that node based on the remaining feature, "Motivation".



Dataset for implementation of Decision Tree Model

(Figures: a sample dataset for the decision tree model, with the corresponding entropy calculations for the left node and for the right node of the chosen split.)




Pruning
When to Stop Splitting?
Ø There are many ways to tackle this problem through hyperparameter tuning. We can set the maximum depth of our decision tree using the max_depth parameter.

Ø The larger the value of max_depth, the more complex your tree will be. The training error will of course decrease if we increase the max_depth value, but when our test data comes into the picture, we will get very poor accuracy.

Ø Hence you need a value that will neither overfit nor underfit the data, and for this you can use GridSearchCV, as in the sketch below.
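A minimal sketch of such a search (the iris dataset is only a placeholder; swap in your own X and y):

```python
# Minimal sketch: tuning max_depth with GridSearchCV on placeholder data.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                       # stand-in dataset

param_grid = {"max_depth": [2, 3, 4, 5, 6, None]}       # None = grow until pure leaves
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)                              # depth that balances under- and overfitting
print(round(search.best_score_, 3))                     # mean cross-validated accuracy
```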

Ø Another way is to set the minimum number of samples for each split, denoted by min_samples_split. Here we specify the minimum number of samples required to perform a split.

Ø For example, we can require a minimum of 10 samples to reach a decision. That means if a node has fewer than 10 samples, then using this parameter we can stop the further splitting of this node and make it a leaf node.

There are more hyperparameters, such as (a short sketch follows this list):

o min_samples_leaf – the minimum number of samples required to be in a leaf node. Increasing this number constrains the tree further and reduces the risk of overfitting.

o max_features – helps us decide how many features to consider when looking for the best split.


Pruning

Ø Pruning is another method that can help us avoid overfitting. It improves the performance of the decision tree by cutting off the nodes or sub-nodes that are not significant, and it removes branches that have very low importance.
Ø There are mainly 2 ways of pruning (a short sketch follows this list):
• Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node if it has low importance while growing the tree.
• Post-pruning – once our tree is built to its full depth, we can start pruning the nodes based on their significance.
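As one concrete way to do post-pruning (a sketch; scikit-learn implements it as minimal cost-complexity pruning through the ccp_alpha parameter, while pre-pruning is done with the max_depth / min_samples_* parameters above; the dataset and train/test split are placeholders):

```python
# Minimal sketch: post-pruning via cost-complexity pruning (ccp_alpha).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)              # stand-in dataset
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Candidate pruning strengths for the fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best = None
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)                       # guard against tiny negative values
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = pruned.score(X_te, y_te)
    if best is None or score > best[0]:
        best = (round(score, 3), round(alpha, 5), pruned.get_n_leaves())

print(best)   # (test accuracy, chosen alpha, number of leaves after pruning)
```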
