Business Data Mining - Week 10

Week 10 - LAQs

Explain how the C4.5 algorithm works to create a decision tree


for business data mining, discussing its key steps, advantages,
limitations, and practical implications, and elucidate how
decision rules are generated from decision trees to assist business
decision-making.
-----------------------------------------------------
The C4.5 algorithm is a popular decision tree algorithm used for data mining and
classification tasks. Here's a detailed explanation of how it works, its key steps, advantages,
limitations, practical implications, and how decision rules are generated from decision trees
to assist business decision-making:

How C4.5 Algorithm Works:

1. Initialization: C4.5 starts with the entire dataset and considers all the attributes as
potential splitting criteria.

2. Attribute Selection: It evaluates each attribute to find the one that best splits the dataset
into classes, using measures like information gain or gain ratio based on entropy. The
attribute with the highest information gain (or gain ratio) is selected as the splitting criterion
for the current node (see the sketch after this list).

3. Splitting: The dataset is divided into subsets based on the chosen attribute. Each subset
corresponds to a different value of the selected attribute.

4. Recursion: Steps 2 and 3 are recursively applied to each subset until one of the stopping
conditions is met. Stopping conditions may include reaching a certain tree depth, having all
instances belong to the same class, or no attributes left to split on.

5. Pruning: After the tree is fully grown, post-pruning techniques like reduced-error pruning
or cost-complexity pruning may be applied to avoid overfitting.
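
To make the attribute-selection step concrete, here is a minimal Python sketch of computing entropy, information gain, and gain ratio for a candidate attribute; the tiny dataset and attribute names are invented for illustration, not taken from the text.

```python
# Minimal sketch of C4.5-style attribute selection (step 2).
# The toy dataset and attribute names below are illustrative only.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain and gain ratio for splitting `rows` on attribute `attr`."""
    total = len(labels)
    parent = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = parent - remainder
    split_info = -sum((len(g) / total) * math.log2(len(g) / total) for g in groups.values())
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Toy example: pick the attribute with the highest gain ratio.
rows = [{"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"},
        {"Outlook": "Rain", "Wind": "Weak"}, {"Outlook": "Overcast", "Wind": "Weak"}]
labels = ["No", "No", "Yes", "Yes"]
best = max(["Outlook", "Wind"], key=lambda a: gain_ratio(rows, labels, a)[1])
print(best)
```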

Advantages of C4.5:

- Handles both categorical and numerical attributes.
- Can handle missing attribute values by estimating them.
- Generates human-readable decision trees that are easy to interpret.
- Robust to noise and irrelevant attributes.
- Can deal with multi-class classification problems.

Limitations of C4.5:

- Can be computationally expensive, especially with large datasets or many attributes.
- Prone to overfitting, especially with noisy data.
- May create biased trees if the attributes have different numbers of values.
- Not suitable for regression tasks (predicting continuous values).
- Decision trees can be unstable with small variations in the data.

Practical Implications:

- Interpretability: Decision trees generated by C4.5 are easily interpretable, making them
useful for explaining the reasoning behind classification decisions to stakeholders.

- Feature Selection: By examining which attributes are selected for splitting, businesses
can gain insights into the most influential factors affecting their outcomes.

- Automation: Decision trees can automate decision-making processes, allowing businesses
to make quick and consistent decisions based on historical data.

Generating Decision Rules:

Decision rules can be extracted directly from the decision tree structure. Each path from the
root to a leaf node represents a decision rule. The conditions along the path correspond to
the attribute-value pairs that lead to a particular class assignment. These rules can then be
translated into actionable guidelines for business decision-making.
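
As an illustration only (scikit-learn trains CART trees rather than C4.5 trees, but rule extraction works the same way), the sketch below fits a small tree on made-up data and prints its root-to-leaf paths with export_text; each printed path is one decision rule.

```python
# Illustrative sketch: reading decision rules off root-to-leaf paths.
# scikit-learn implements CART, not C4.5, but the rule-extraction idea is identical.
# The toy features ("age", "income") and labels are invented for this example.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 40000], [45, 90000], [35, 60000], [52, 110000], [23, 30000], [40, 80000]]
y = ["reject", "approve", "reject", "approve", "reject", "approve"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each printed path (a chain of attribute-value conditions ending in a class)
# corresponds to one decision rule that can become a business guideline.
print(export_text(clf, feature_names=["age", "income"]))
```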

In summary, the C4.5 algorithm is a powerful tool for business data mining, offering
advantages like interpretability and feature selection. However, it's essential to be aware of
its limitations and potential pitfalls, such as overfitting and computational complexity, to
use it effectively in real-world applications.

What is the C4.5 algorithm and how does it work?
Sumit Saha · Towards Data Science · Aug 20, 2018

The C4.5 algorithm is used in Data Mining as a Decision Tree Classifier which can be
employed to generate a decision, based on a certain sample of data (univariate or
multivariate predictors).
So, before we dive straight into C4.5, let’s discuss a little about Decision Trees
and how they can be used as classifiers.

Decision Trees

Example of a Decision Tree

A Decision Tree looks something like this flowchart. Let’s say you’d like to
plan your activities for today but you are introduced to some conditions
which would influence your decision.

In the above figure, we notice that one of the major factors which influences
the decision is Parents Visiting. So, if it is true, then a quick decision is made
and we choose to go to the Cinema. What if they don't visit?

This opens up an array of other conditions. Now, if the Weather is Sunny or
Rainy, we either go to Play Tennis or Stay In, respectively. But if it is
Windy, I check how much Money I possess. If I have a healthy amount to spend,
i.e. Rich, I go Shopping; otherwise I go to the Cinema.
Remember that the root of the tree is always the variable which minimizes the
cost function. In this example, Parents Visiting has two equally likely outcomes
(50% each), leading to easier decision-making if you think about it. But what if
Weather were selected as the root? Then we'd have a 33.33% chance of each of its
three outcomes, which can increase our chances of making a wrong decision because
there are more test cases to consider.

It would be more understandable if we go through the concepts of
Information Gain and Entropy.

Information Gain
If you have acquired information over time which helps you accurately
predict whether something is going to happen, then learning that the predicted
event actually occurred gives you no new information. But if the situation
goes south and an unexpected outcome occurs, it counts as useful and
necessary information.

Similar is the concept of Information gain.

The more you know about a topic, the less new information you are apt to get
about it. To be more concise: If you know an event is very probable, it is no
surprise when it happens, that is, it gives you little information that it actually
happened.

From the above statement we can formulate that the amount of information
gained is inversely proportional to the probability of an event happening. We
can also say that the more uncertain an outcome is (the higher its entropy), the
more information we gain by observing it, because entropy quantifies the
uncertainty of an event.

Say we are looking at a coin toss. The probability of expecting any side of a
fair coin is 50%. If the coin is unfair such that either the probability of
acquiring a HEAD or TAIL is 1.00 then we say that the entropy is minimum
because without any sort of trials we can predict the outcome of the coin
toss.
In the plotted graph of binary entropy (not reproduced here), we notice that the
maximum amount of information is gained, due to maximised uncertainty of a
particular event, when the probability of each of the events is equal, i.e. p = q = 0.5.

E = -p·log2(p) - q·log2(q), the entropy of the coin-toss event

p = probability of HEAD as an outcome

q = probability of TAIL as an outcome
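
A quick numerical check of the coin-toss discussion, using the binary entropy formula above: entropy is maximal for a fair coin (p = q = 0.5) and drops toward zero as the coin becomes more biased.

```python
# Binary entropy of a coin toss: maximal at p = q = 0.5, minimal as p approaches 1.
import math

def binary_entropy(p):
    q = 1.0 - p
    # Treat 0 * log2(0) as 0 so the function is defined at p = 0 and p = 1.
    return -sum(x * math.log2(x) for x in (p, q) if x > 0)

print(binary_entropy(0.5))   # 1.0   -> fair coin, maximum uncertainty
print(binary_entropy(0.99))  # ~0.08 -> heavily biased coin, little uncertainty
```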


In the case of Decision Trees, it is essential that the nodes are arranged such
that the entropy decreases as we split downwards. This basically means that the
more appropriately the splitting is done, the easier it becomes to come to a
definite decision.

So, we check every node against every splitting possibility. The Information Gain
Ratio works with the proportions of observations falling into each branch relative
to the total number of observations, (m/N = p) and (n/N = q), where m + n = N and
p + q = 1. After splitting, if the entropy of the next node is less than the entropy
before splitting, and if this value is the smallest among all possible test cases
for splitting, then the node is split into its purest constituents.

In our example we find that Parents Visiting decreases entropy more than the
other options do. Hence, we go with that option.

Pruning
The Decision Tree in our original example is quite simple, but it is not so
when the dataset is huge and there are more variables to take into
consideration. This is where Pruning is required. Pruning refers to the
removal of those branches in our decision tree which we feel do not
contribute significantly to our decision process.

Let’s assume that our example data has a variable called Vehicle which
relates to or is derivative of the condition Money when it has the value Rich.
Now if Vehicle is Available, we go for Shopping via Car but if it is not
available we go shopping through any other means of transport. But in the
end we go for Shopping.

This implies that the Vehicle variable is not of much significance and can be
ruled out while constructing a Decision Tree.

The concept of Pruning enables us to avoid overfitting of the regression or
classification model, so that for a small sample of data the errors in
measurement are not included while generating the model.
Pseudocode
Pseudocode
1. Check for the above base cases.

2. For each attribute a, find the normalised information gain ratio from
splitting on a.

3. Let a_best be the attribute with the highest normalized information gain.

4. Create a decision node that splits on a_best.

5. Recur on the sublists obtained by splitting on a_best, and add those nodes as
children of node.
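
For illustration, a compact Python sketch of this recursion under simplifying assumptions (categorical attributes only; no continuous splits, missing-value handling, or pruning); the function and field names are mine, not C4.5's.

```python
# Compact, illustrative sketch of the recursion in the pseudocode above.
# Assumes categorical attributes; omits continuous splits, missing values and pruning.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    n, parent = len(labels), entropy(labels)
    groups = {}
    for r, y in zip(rows, labels):
        groups.setdefault(r[attr], []).append(y)
    gain = parent - sum(len(g) / n * entropy(g) for g in groups.values())
    split = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    return gain / split if split > 0 else 0.0

def build_tree(rows, labels, attributes):
    # Base cases: pure node, or nothing left to split on -> leaf with the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    a_best = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    node = {"attribute": a_best, "children": {}}
    remaining = [a for a in attributes if a != a_best]
    for value in set(r[a_best] for r in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[a_best] == value]
        sub_rows, sub_labels = zip(*sub)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels), remaining)
    return node
```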

Advantages of C4.5 over other Decision Tree systems:

1. The algorithm inherently employs a single-pass pruning process to mitigate
overfitting.

2. It can work with both Discrete and Continuous Data

3. C4.5 can handle the issue of incomplete data very well

We should also keep in mind that C4.5 is not the best algorithm out there but
it does certainly prove to be useful in certain cases.


Understanding C4.5 Decision tree algorithm
Sumitkrsharma · Jan 25, 2022

The C4.5 algorithm is an improvement over the ID3 algorithm; "C" indicates that the
algorithm is written in C, and 4.5 specifies the version of the algorithm. The
splitting criterion used by C4.5 is the normalized information gain (difference in
entropy). The attribute with the highest normalized information gain is chosen to
make the decision. The C4.5 algorithm then recurses on the partitioned sublists.
In-depth Understand of algorithm:

This algorithm has a few base cases.

All the samples in the list belong to the same class. When this happens, it simply
creates a leaf node for the decision tree saying to choose that class.

None of the features provide any information gain. In this case, C4.5 creates a
decision node higher up the tree using the expected value of the class.

Instance of previously-unseen class encountered. Again, C4.5 creates a decision
node higher up the tree using the expected value.
Steps in algorithm:

· Check for the above base cases.

· For each attribute a, find the normalized information gain ratio from splitting
on a.

· Let a_best be the attribute with the highest normalized information gain.

· Create a decision node that splits on a_best.

· Recurse on the sublists obtained by splitting on a_best, and add those nodes as
children of node.

Let's understand the numerical working of the C4.5 algorithm on an example dataset
(14 instances with attributes such as Outlook, Wind, Humidity and Temperature, and a
Yes/No Decision: 9 Yes and 5 No).
Note: logarithm with base 2 is used in all calculations.
Source: https://sefiks.com/2018/05/13/a-step-by-step-c4-5-decision-tree-example/

Entropy(Decision) = Σ -p(I)·log2 p(I) = -p(Yes)·log2 p(Yes) - p(No)·log2 p(No)
= -(9/14)·log2(9/14) - (5/14)·log2(5/14) = 0.940

Here, we need to calculate gain ratios instead of gains.

GainRatio(A) = Gain(A) / SplitInfo(A)

SplitInfo(A) = -Σ (|Dj|/|D|) · log2(|Dj|/|D|)

Let’s calculate for Wind Attribute:

Gain(Decision, Wind) = Entropy(Decision) - Σ ( p(Decision|Wind) · Entropy(Decision|Wind) )

Gain(Decision, Wind) = Entropy(Decision) - [ p(Decision|Wind=Weak) · Entropy(Decision|Wind=Weak) ]
- [ p(Decision|Wind=Strong) · Entropy(Decision|Wind=Strong) ]

Entropy(Decision|Wind=Weak) = -p(No)·log2 p(No) - p(Yes)·log2 p(Yes)
= -(2/8)·log2(2/8) - (6/8)·log2(6/8) = 0.811

Entropy(Decision|Wind=Strong) = -(3/6)·log2(3/6) - (3/6)·log2(3/6) = 1

Gain(Decision, Wind) = 0.940 - (8/14)·(0.811) - (6/14)·(1) = 0.940 - 0.463 - 0.428 = 0.049

There are 8 decisions for weak wind, and 6 decisions for strong wind.

SplitInfo(Decision, Wind) = -(8/14)·log2(8/14) - (6/14)·log2(6/14) = 0.461 + 0.524 = 0.985

GainRatio(Decision, Wind) = Gain(Decision, Wind) / SplitInfo(Decision, Wind) = 0.049 / 0.985 = 0.049
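
To double-check the arithmetic, a few lines of Python reproduce the Wind figures above from the class counts (9 Yes / 5 No overall, 6 Yes / 2 No for weak wind, 3 / 3 for strong wind).

```python
# Reproducing the Wind calculation above from the class counts in the text.
from math import log2

def H(*counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

parent = H(9, 5)                      # 0.940
weak, strong = H(6, 2), H(3, 3)       # 0.811, 1.0
gain = parent - (8/14) * weak - (6/14) * strong   # ~0.049 (0.048 without rounding)
split_info = H(8, 6)                              # ~0.985
print(round(parent, 3), round(gain, 3), round(split_info, 3), round(gain / split_info, 3))
```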

Similarly, gain and gain ratio can be computed for the other attributes: Outlook, and
threshold splits on the continuous attributes Humidity (at 80) and Temperature (at 83).
For brevity, let's consider gain as the splitting criterion here, and I request you to
follow the same steps with Gain Ratio.

If we use gain, then Outlook will be the root node because it has the highest gain value.

Performing similar steps for all attributes under Outlook, the resulting tree looks like:

https://sefiks.com/2018/05/13/a-step-by-step-c4-5-decision-tree-example/

Improvements in C4.5 over ID3:

· Handling both continuous and discrete attributes
· Handling training data with missing attribute values
· Handling attributes with differing costs
· Pruning trees after creation

Limitations:
The main limitation of C4.5 is its reliance on information entropy: it tends to give
poor results for attributes with a large number of distinct values.

In this post we covered the C4.5 decision tree algorithm in detail, with its definition,
pseudocode and a numerical example. In coming posts we will continue our learning path
and explore more machine learning algorithms.

Thank you, learners!

References:

- Wikipedia, the free encyclopedia (en.wikipedia.org)
- A Step By Step C4.5 Decision Tree Example - Sefik Ilkin Serengil (sefiks.com)

Decision Tree
Last Updated : 17 May, 2024
Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and
intuitive way to make decisions based on data by modeling the relationships
between different variables. This article is all about what decision trees are,
how they work, their advantages and disadvantages, and their applications.

What is a Decision Tree?


A decision tree is a flowchart-like structure used to make decisions or
predictions. It consists of nodes representing decisions or tests on attributes,
branches representing the outcome of these decisions, and leaf nodes
representing final outcomes or predictions. Each internal node corresponds to
a test on an attribute, each branch corresponds to the result of the test, and
each leaf node corresponds to a class label or a continuous value.

Structure of a Decision Tree


1. Root Node: Represents the entire dataset and the initial decision to be
made.
2. Internal Nodes: Represent decisions or tests on attributes. Each internal
node has one or more branches.
3. Branches: Represent the outcome of a decision or test, leading to another
node.
4. Leaf Nodes: Represent the final decision or prediction. No further splits
occur at these nodes.

How Decision Trees Work?


The process of creating a decision tree involves:

1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the
selected attribute.
3. Repeating the Process: The process is repeated recursively for each
subset, creating a new internal node or leaf node until a stopping criterion
is met (e.g., all instances in a node belong to the same class or a
predefined depth is reached).
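
These steps map directly onto a library call. A minimal sketch with scikit-learn follows; note that scikit-learn implements the CART algorithm (binary splits, Gini impurity by default), and criterion="entropy" switches it to the entropy-based criterion discussed here. The toy weather-style data is invented.

```python
# Minimal sketch of the select/split/repeat loop using scikit-learn's CART trees.
# criterion="entropy" uses the entropy metric discussed here; data is invented.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Encoded features: [outlook (0=sunny, 1=overcast, 2=rainy), windy (0/1)]
X = [[0, 0], [0, 1], [1, 0], [2, 0], [2, 1], [1, 1]]
y = ["no", "no", "yes", "yes", "no", "yes"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

print(clf.predict([[1, 0]]))          # predicted class for an overcast, calm day
print(tree.export_text(clf, feature_names=["outlook", "windy"]))
```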

Metrics for Splitting


Gini Impurity: Measures the likelihood of an incorrect classification of a
new instance if it was randomly classified according to the distribution of
classes in the dataset.
Gini = 1 - Σ_i (p_i)^2, where p_i is the probability of an instance being
classified into a particular class.

Entropy: Measures the amount of uncertainty or impurity in the dataset.
Entropy = -Σ_i p_i · log2(p_i), where p_i is the probability of an instance
being classified into a particular class.

Information Gain: Measures the reduction in entropy or Gini impurity after
a dataset is split on an attribute.
Information Gain = Entropy(parent) - Σ_i (|D_i| / |D|) · Entropy(D_i), where D_i is
the subset of D after splitting on an attribute.
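
For a feel of how the two impurity measures compare, here is a small sketch computing Gini impurity and entropy for one example class distribution (the counts are arbitrary).

```python
# Comparing the two impurity measures on one class distribution (counts are made up).
from math import log2

def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

counts = [9, 5]                 # e.g. 9 instances of one class, 5 of the other
print(gini(counts))             # ~0.459
print(entropy(counts))          # ~0.940
```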

Advantages of Decision Trees


Simplicity and Interpretability: Decision trees are easy to understand and
interpret. The visual representation closely mirrors human decision-
making processes.
Versatility: Can be used for both classification and regression tasks.
No Need for Feature Scaling: Decision trees do not require normalization
or scaling of the data.
Handles Non-linear Relationships: Capable of capturing non-linear
relationships between features and target variables.

Disadvantages of Decision Trees


Overfitting: Decision trees can easily overfit the training data, especially if
they are deep with many nodes.
Instability: Small variations in the data can result in a completely different
tree being generated.
Bias towards Features with More Levels: Features with more levels can
dominate the tree structure.

Pruning
To overcome overfitting, pruning techniques are used. Pruning reduces the
size of the tree by removing nodes that provide little power in classifying
instances. There are two main types of pruning:

Pre-pruning (Early Stopping): Stops the tree from growing once it meets
certain criteria (e.g., maximum depth, minimum number of samples per
leaf).
Post-pruning: Removes branches from a fully grown tree that do not
provide significant power.
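
Both flavours of pruning are exposed in scikit-learn: pre-pruning through stopping criteria such as max_depth and min_samples_leaf, and post-pruning through cost-complexity pruning (ccp_alpha). A hedged sketch on a bundled dataset:

```python
# Pre-pruning (early stopping) vs. post-pruning (cost-complexity) in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Post-pruning: grow fully, then prune back weak branches with a complexity penalty.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print(pre_pruned.get_depth(), pre_pruned.get_n_leaves())
print(post_pruned.get_depth(), post_pruned.get_n_leaves())
```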

Applications of Decision Trees


Business Decision Making: Used in strategic planning and resource
allocation.
Healthcare: Assists in diagnosing diseases and suggesting treatment
plans.
Finance: Helps in credit scoring and risk assessment.
Marketing: Used to segment customers and predict customer behavior.

What is Decision Tree? [A Step-by-Step Guide]
Anshul Saini · 31 May, 2024

Introduction
Decision trees are a popular machine learning algorithm that can be used for
both regression and classification tasks. They are easy to understand,
interpret, and implement, making them an ideal choice for beginners in the
field of machine learning. In this comprehensive guide, we will cover all

aspects of the decision tree algorithm, including the working principles,
different types of decision trees, the process of building decision trees, and
how to evaluate and optimize decision trees. By the end of this article, you will
have a complete understanding of decision trees and decision tree examples
and how they can be used to solve real-world problems.
This article was published as a part of the Data Science Blogathon!

What is a Decision Tree?

A decision tree is a non-parametric supervised learning algorithm for


classification and regression tasks. It has a hierarchical tree structure
consisting of a root node, branches, internal nodes, and leaf nodes. Decision
trees are used for classification and regression tasks, providing easy-to-
understand models.
A decision tree is a hierarchical model used in decision support that depicts
decisions and their potential outcomes, incorporating chance events, resource
expenses, and utility. This algorithmic model utilizes conditional control
statements and is non-parametric, supervised learning, useful for both
classification and regression tasks. The tree structure is comprised of a root
node, branches, internal nodes, and leaf nodes, forming a hierarchical, tree-
like structure.
It is a tool that has applications spanning several different areas. Decision
trees can be used for classification as well as regression problems. The name
itself suggests that it uses a flowchart like a tree structure to show the
predictions that result from a series of feature-based splits. It starts with a
root node and ends with a decision made by leaves.

Decision Tree Terminologies


Before learning more about decision trees, let's get familiar with some of the
terminologies:

Root Node: The initial node at the beginning of a decision tree, where the
entire population or dataset starts dividing based on various features or
conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are
known as decision nodes. These nodes represent intermediate decisions or
conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating
the final classification or outcome. Leaf nodes are also referred to as
terminal nodes.
Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a
sub-section of a decision tree is referred to as a sub-tree. It represents a
specific portion of the decision tree.
Pruning: The process of removing or cutting down specific nodes in a
decision tree to prevent overfitting and simplify the model.
Branch / Sub-Tree: A subsection of the entire decision tree is referred to
as a branch or sub-tree. It represents a specific path of decisions and
outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-
nodes is known as a parent node, and the sub-nodes emerging from it are
referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or
further decisions based on that condition.

Example of Decision Tree


Let’s understand decision trees with the help of an example:

Decision trees are upside down which means the root is at the top and then
this root is split into several nodes. Decision trees are nothing but a
bunch of if-else statements in layman terms. It checks if the condition is true
and if it is then it goes to the next node attached to that decision.
In the below diagram the tree will first ask what is the weather? Is it sunny,
cloudy, or rainy? If yes then it will go to the next feature which is humidity and
wind. It will again check if there is a strong wind or weak, if it’s a weak wind
and it’s rainy then the person may go and play.

Did you notice anything in the above flowchart? We see that if the weather is
cloudy then we must go to play. Why didn’t it split more? Why did it stop
there?
To answer this question, we need to know about a few more concepts like
entropy, information gain, and Gini index. But in simple terms, I can say here
that the output for the training dataset is always yes for cloudy weather, since
there is no disorderliness here we don’t need to split the node further.
The goal of machine learning is to decrease uncertainty or disorders from the
dataset and for this, we use decision trees.
Now you must be thinking: how do I know what should be the root node? What
Guide]how do I know what should be the root node? what
should be the decision node? when should I stop splitting? To decide this,
there is a metric called “Entropy” which is the amount of uncertainty in the
dataset.

How do decision tree algorithms work?


The decision tree algorithm works in a few simple steps:
Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
Branching Out: Based on the answer to that question, it divides the data
into smaller subsets, creating new branches. Each branch represents a
possible route through the tree.
Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.

Decision Tree Assumptions


Several assumptions are made to build effective models when creating
decision trees. These assumptions help guide the tree’s construction and
impact its performance. Here are some common assumptions and
considerations when creating decision trees:
Binary Splits
Decision trees typically make binary splits, meaning each node divides the
data into two subsets based on a single feature or condition. This assumes
that each decision can be represented as a binary choice.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is
divided into child nodes, and this process continues until a stopping criterion
is met. This assumes that data can be effectively subdivided into smaller, more
manageable subsets.
Feature Independence
Decision trees often assume that the features used for splitting nodes are
independent. In practice, feature independence may not hold, but decision
trees can still perform well if features are correlated.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning
that the samples within a node are as similar as possible regarding the target
variable. This assumption helps in achieving clear decision boundaries.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where
each split is chosen to maximize information gain or minimize impurity at the
current node. This may not always result in the globally optimal tree.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features. However,
they may require different splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the data.
Pruning and setting appropriate stopping criteria are used to address this
assumption.
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy to
evaluate how well a split separates classes. The choice of impurity measure
can impact tree construction.
No Missing Values
Decision trees assume that there are no missing values in the dataset or that
missing values have been appropriately handled through imputation or other
methods.
Equal Importance of Features
Decision trees may assume equal importance for all features unless feature
scaling or weighting is applied to emphasize certain features.
No Outliers
Decision trees are sensitive to outliers, and extreme values can influence their
construction. Preprocessing or robust methods may be needed to handle
outliers effectively.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result in overly
complex trees. The sample size and tree depth should be balanced.

Entropy
Entropy is nothing but the uncertainty in our dataset, or a measure of disorder.
Let me try to explain this with the help of an example.
Suppose you have a group of friends who decides which movie they can watch
together on Sunday. There are 2 choices for movies, one is “Lucy” and the
second is “Titanic” and now everyone has to tell their choice. After everyone
gives their answer we see that “Lucy” gets 4 votes and “Titanic” gets 5 votes.
Which movie do we watch now? Isn’t it hard to choose 1 movie now because
the votes for both the movies are somewhat equal.
This is exactly what we call disorderness, there is an equal number of votes for
both the movies, and we can’t really decide which movie we should watch. It
would have been much easier if the votes for “Lucy” were 8 and for “Titanic” it
was 2. Here we could easily say that the majority of votes are for “Lucy” hence
everyone will be watching this movie.
In a decision tree, the output is mostly “yes” or “no”
The formula for Entropy is:
E(S) = -p+ · log2(p+) - p- · log2(p-)
Here,
p+ is the probability of the positive class
p- is the probability of the negative class
S is the subset of the training examples

How do Decision Trees use Entropy?


Now we know what entropy is and what is its formula, Next, we need to know
that how exactly does it work in this algorithm.
Entropy basically measures the impurity of a node. Impurity is the degree of
randomness; it tells how random our data is. A pure sub-split means that either
you should be getting "yes", or you should be getting "no".
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split the left
node gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here the split is not pure, why? Because we can still see some negative
classes in both the nodes. In order to make a decision tree, we need to
calculate the impurity of each split, and when the purity is 100%, we make it as
a leaf node.
To check the impurity of feature 2 and feature 3, we will take the help of the
Entropy formula.

For feature 3,

We can clearly see from the tree itself that left node has low entropy or more
purity than right node since left node has a greater number of “yes” and it is
easy to decide here.
Always remember that the higher the Entropy, the lower will be the purity and
the higher will be the impurity.
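
Plugging the counts from this example into the entropy formula confirms the point: the left node (5 "yes", 2 "no") has lower entropy, and hence higher purity, than the right node (3 "yes", 2 "no").

```python
# Entropy of the two child nodes described above: left = 5 yes / 2 no, right = 3 yes / 2 no.
from math import log2

def entropy(p_pos, p_neg):
    return -sum(p * log2(p) for p in (p_pos, p_neg) if p > 0)

left = entropy(5/7, 2/7)    # ~0.863 -> purer node (mostly "yes")
right = entropy(3/5, 2/5)   # ~0.971 -> closer to 50/50, higher impurity
print(left, right)
```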
As mentioned earlier the goal of machine learning is to decrease the
uncertainty or impurity in the dataset, here by using the entropy we are
getting the impurity of a particular node, we don’t know if the parent entropy
or the entropy of a particular node has decreased or not.
For this, we bring in a new metric called "Information gain" which tells us how
new metric called “Information gain” which tells us how
much the parent entropy has decreased after splitting it with some feature.

Information Gain
Information gain measures the reduction of uncertainty given some feature
and it is also a deciding factor for which attribute should be selected as a
decision node or root node.

It is just entropy of the full dataset – entropy of the dataset given some
feature.
To understand this better let’s consider an example:Suppose our entire
population has a total of 30 instances. The dataset is to predict whether the
person will go to the gym or not. Let’s say 16 people go to the gym and 14
people don’t
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral”
and “Highly motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use
information gain to decide which feature should be the root node and which
feature should be placed after the split.

Image Source: Author


Let’s calculate the entropy
To see the weighted average of entropy of each node, we will do as follows:
average of entropy of each node we will do as follows:

Now we have the value of E(Parent) and E(Parent|Energy), information gain will
be:

Our parent entropy was near 0.99 and after looking at this value of information
gain, we can say that the entropy of the dataset will decrease by 0.37 if we
make “Energy” as our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its
information gain.

Image Source: Author


Let’s calculate the entropy here:

To see the weighted average of entropy of each node we will do as follows:

Now we have the value of E(Parent) and E(Parent|Motivation), information gain


will be:
We now see that the "Energy" feature gives more reduction, which is 0.37, than
“Energy” feature gives more reduction which is 0.37 than
the “Motivation” feature. Hence we will select the feature which has the
highest information gain and then split the node based on that feature.
In this example “Energy” will be our root node and we’ll do the same for sub-
nodes. Here we can see that when the energy is “high” the entropy is low and
hence we can say a person will definitely go to the gym if he has high energy,
but what if the energy is low? We will again split the node based on the new
feature which is “Motivation”.
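
The parent entropy in this example can be checked numerically: 16 "go" and 14 "don't go" out of 30 gives roughly 0.997, the "near 0.99" quoted above. The per-branch counts for the Energy split were shown in a figure that is not reproduced here, so the branch counts in the sketch below are placeholders, included only to show how the weighted average and information gain are computed.

```python
# Numeric check of the gym example: parent entropy for 16 "go" vs 14 "don't go" (~0.99).
# The per-branch counts for the Energy split came from a figure not reproduced here,
# so the branch counts below are placeholders purely to illustrate the computation.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

parent = entropy([16, 14])                 # ~0.997, the "near 0.99" parent entropy
branches = [[12, 1], [4, 13]]              # placeholder counts for Energy = high / low
total = sum(sum(b) for b in branches)
weighted = sum(sum(b) / total * entropy(b) for b in branches)
print(parent, parent - weighted)           # information gain for this (assumed) split
```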

When to Stop Splitting?


You must be asking this question to yourself that when do we stop growing
our Decision tree? Usually, real-world datasets have a large number of
features, which will result in a large number of splits, which in turn gives a
huge tree. Such trees take time to build and can lead to overfitting. That
means the tree will give very good accuracy on the training dataset but will
give bad accuracy in test data.
There are many ways to tackle this problem through hyperparameter tuning.
We can set the maximum depth of our decision tree using the max_depth
parameter. The more the value of max_depth, the more complex your tree will
be. The training error will of course decrease if we increase the max_depth
value but when our test data comes into the picture, we will get a very bad
accuracy. Hence you need a value that will not overfit as well as underfit our
data and for this, you can use GridSearchCV.
Another way is to set the minimum number of samples for each split. It is
denoted by min_samples_split. Here we specify the minimum number of
samples required to do a split. For example, we can use a minimum of 10
samples to reach a decision. That means if a node has less than 10 samples
then using this parameter, we can stop the further splitting of this node and
make it a leaf node.
There are more hyperparameters such as :
min_samples_leaf – represents the minimum number of samples required
to be in a leaf node. Increasing this number constrains the tree more, which
reduces the risk of overfitting (though setting it too high can cause underfitting).
max_features – it helps us decide what number of features to consider
when looking for the best split.
To read more about these hyperparameters, you can read about them here.
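
As a sketch of how such a search might look with GridSearchCV over the hyperparameters mentioned above (the grid values are arbitrary examples, not recommendations):

```python
# Sketch of tuning the hyperparameters mentioned above with GridSearchCV.
# The grid values are arbitrary examples, not recommendations.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
    "max_features": [None, "sqrt"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```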

Pruning
Pruning is another method that can help us avoid overfitting. It helps in
improving the performance of the Decision tree by cutting the nodes or sub-
nodes which are not significant. Additionally, it removes the branches which
have very low importance.
have very low importance.
There are mainly 2 ways for pruning:
Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.
Post-pruning – once our tree is built to its depth, we can start pruning the
nodes based on their significance.
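
In scikit-learn, post-pruning is done via cost-complexity pruning: grow the full tree, compute the candidate ccp_alpha values, and refit with one of them. A minimal sketch:

```python
# Post-pruning sketch: compute the cost-complexity pruning path of a fully grown
# tree, then refit with one of the suggested ccp_alpha values.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)   # candidate alphas + impurities

# Refit with a mid-range alpha: larger alpha -> more aggressive pruning, smaller tree.
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
print(full_tree.get_n_leaves(), "->", pruned.get_n_leaves())
```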

Decision tree example


Suppose you wish to choose whether to go outside and play or not. You could
make a choice based on the weather. For that, here’s a decision tree:
Is the weather sunny?
Yes branch → next node: Is it windy?
  Yes: remain indoors; it's too windy for comfortable play.
  No: go play; pleasant, sunny weather is ideal for outdoor recreation.
No branch → next node: Is it raining?
  Yes: remain indoors; playing outside is uncomfortable due to the rain.
  No: go play! It's gloomy but not raining, so it could be a nice day to be outside.
Beyond predicting the weather, decision trees are utilized for a wide range of
tasks, such as identifying spam emails and forecasting loan approvals. They
are a popular option for many machine learning applications since they are
simple to comprehend and interpret.

Conclusion
To summarize, in this article we learned about decision trees: on what basis the tree
splits its nodes, how we can stop overfitting, and why linear regression doesn't work
well for classification problems. To check out the full implementation of decision
trees, please refer to my GitHub repository. We hope you like this article and that it
gives you a clear understanding of the decision tree algorithm and the decision tree
examples discussed here.
