Business Data Mining Week 10
1. Initialization: C4.5 starts with the entire dataset and considers all the attributes as
potential splitting criteria.
2. Attribute Selection: It evaluates each attribute to find the one that best splits the dataset
into classes, using measures like information gain or gain ratio based on entropy. The
attribute with the highest information gain (or gain ratio) is selected as the splitting criterion
for the current node.
3. Splitting: The dataset is divided into subsets based on the chosen attribute. Each subset
corresponds to a different value of the selected attribute.
4. Recursion: Steps 2 and 3 are recursively applied to each subset until one of the stopping
conditions is met. Stopping conditions may include reaching a certain tree depth, having all
instances belong to the same class, or no attributes left to split on.
5. Pruning: After the tree is fully grown, post-pruning techniques like reduced-error pruning
or cost-complexity pruning may be applied to avoid overfitting.
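To make the recursion in steps 1-5 concrete, here is a minimal, simplified sketch of the build loop. It is an illustration, not Quinlan's full implementation: attribute names and the "label" key are arbitrary, and continuous attributes, missing values, and pruning are omitted.

```python
# A simplified sketch of the C4.5-style recursive build described in steps 1-5
# (categorical attributes only; continuous splits and pruning are omitted).
from collections import Counter
from math import log2

def entropy(rows, label="label"):
    counts = Counter(r[label] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(rows, attr, label="label"):
    total = len(rows)
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r)
    # Information gain: parent entropy minus weighted child entropy
    gain = entropy(rows, label) - sum(
        len(s) / total * entropy(s, label) for s in subsets.values()
    )
    # Split information penalises attributes with many distinct values
    split_info = -sum(
        (len(s) / total) * log2(len(s) / total) for s in subsets.values()
    )
    return gain / split_info if split_info > 0 else 0.0

def build_tree(rows, attrs, label="label"):
    classes = {r[label] for r in rows}
    if len(classes) == 1:            # all samples share one class -> leaf
        return classes.pop()
    if not attrs:                    # no attributes left -> majority-class leaf
        return Counter(r[label] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain_ratio(rows, a, label))
    tree = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        tree[best][value] = build_tree(subset, [a for a in attrs if a != best], label)
    return tree

# Tiny made-up example, just to show the recursion terminating in leaves.
rows = [
    {"outlook": "sunny", "wind": "weak", "label": "no"},
    {"outlook": "sunny", "wind": "strong", "label": "no"},
    {"outlook": "overcast", "wind": "weak", "label": "yes"},
    {"outlook": "rain", "wind": "weak", "label": "yes"},
    {"outlook": "rain", "wind": "strong", "label": "no"},
]
print(build_tree(rows, ["outlook", "wind"]))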
Advantages of C4.5:
- Interpretability: the resulting tree can be read directly as a set of decision rules.
- Feature selection: the attributes chosen for splitting reveal the most influential factors in the data.
Limitations of C4.5:
- Fully grown trees are prone to overfitting, which is why post-pruning is needed.
- Training can be computationally expensive on large datasets, and entropy-based splitting gives poor results for attributes with many distinct values.
Practical Implications:
- Feature Selection: By examining which attributes are selected for splitting, businesses
can gain insights into the most influential factors affecting their outcomes.
Decision rules can be extracted directly from the decision tree structure. Each path from the
root to a leaf node represents a decision rule. The conditions along the path correspond to
the attribute-value pairs that lead to a particular class assignment. These rules can then be
translated into actionable guidelines for business decision-making.
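To make rule extraction concrete, here is a hedged sketch using scikit-learn (which implements CART-style trees with an entropy option, not C4.5 itself); the features, data, and labels below are hypothetical.

```python
# Hypothetical illustration: printing root-to-leaf paths as rules with scikit-learn
# (scikit-learn implements CART, not C4.5, but the rule-extraction idea is the same).
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 1], [40, 0], [35, 1], [50, 0], [23, 0], [45, 1]]   # [age, owns_card]
y = ["churn", "stay", "stay", "stay", "churn", "stay"]        # made-up labels

model = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(model, feature_names=["age", "owns_card"]))
```

Each printed root-to-leaf path reads directly as an if-then rule of the kind described above.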
In summary, the C4.5 algorithm is a powerful tool for business data mining, offering
advantages like interpretability and feature selection. However, it's essential to be aware of
its limitations and potential pitfalls, such as overfitting and computational complexity, to
use it effectively in real-world applications.
Decision Trees
A Decision Tree looks something like this flowchart. Let’s say you’d like to plan your activities for today, but there are some conditions that would influence your decision.
In the above figure, we notice that one of the major factors influencing the decision is whether the Parents are Visiting. If they are, a quick decision is made and we go to the Cinema. What if they don’t visit?
Information Gain
If you have acquired information over time which helps you accurately predict whether something is going to happen, then the news that the predicted event actually occurred is not new information. But if the situation goes south and an unexpected outcome occurs, that counts as useful and necessary information.
The more you know about a topic, the less new information you are apt to get
about it. To be more concise: If you know an event is very probable, it is no
surprise when it happens, that is, it gives you little information that it actually
happened.
From the above statement we can formulate that the amount of information gained from observing an event is inversely proportional to the probability of that event happening. Entropy quantifies this uncertainty: when an outcome is highly probable, the entropy is low and observing it gives us very little new information.
Say we are looking at a coin toss. The probability of getting either side of a fair coin is 50%. If the coin is unfair, such that the probability of getting a HEAD (or a TAIL) is 1.00, then we say the entropy is at its minimum, because without any trials at all we can already predict the outcome of the toss.
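A tiny sketch makes the coin-toss intuition concrete (binary entropy in bits):

```python
# Binary entropy H(p) in bits; 0*log2(0) is treated as 0.
from math import log2

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

print(entropy(0.5))  # 1.0  -> fair coin, maximum uncertainty
print(entropy(1.0))  # 0.0  -> biased coin, outcome known in advance
print(entropy(0.9))  # ~0.47 -> mostly predictable, little surprise
```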
In the plotted graph of entropy against probability, we notice that the maximum amount of information is gained, due to maximised uncertainty, when the probabilities of the events are equal, i.e., p = q = 0.5.
So, we check every node against every splitting possibility. For the Information Gain Ratio we work with the proportions of observations to the total number of observations, m/N = p and n/N = q, where m + n = N and p + q = 1. After splitting, if the entropy of the next node is lower than the entropy before splitting, and if this value is the lowest among all possible test cases for splitting, then the node is split into its purest constituents.
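For reference, the standard C4.5 splitting quantities can be written as follows, where S is the set of samples at a node and S_1, ..., S_k are the subsets produced by splitting on attribute A:

```latex
\begin{aligned}
E(S) &= -\sum_{c} p_c \log_2 p_c\\
\mathrm{Gain}(S, A) &= E(S) - \sum_{i=1}^{k} \frac{|S_i|}{|S|}\, E(S_i)\\
\mathrm{SplitInfo}(S, A) &= -\sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}\\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{aligned}
```

C4.5 selects the attribute with the highest gain ratio, which penalises attributes that split the data into many small subsets.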
Pruning
The Decision Tree in our original example is quite simple, but it is not so
when the dataset is huge and there are more variables to take into
consideration. This is where Pruning is required. Pruning refers to the
removal of those branches in our decision tree which we feel do not
contribute significantly to our decision process.
Let’s assume that our example data has a variable called Vehicle, which appears under the condition Money when it has the value Rich. Now, if a Vehicle is Available we go Shopping by car, and if it is not available we go Shopping by some other means of transport. Either way, in the end we go Shopping.
This implies that the Vehicle variable is not of much significance and can be
ruled out while constructing a Decision Tree.
We should also keep in mind that C4.5 is not the best algorithm out there but
it does certainly prove to be useful in certain cases.
C4.5 has a few base cases:
- All the samples in the list belong to the same class. When this happens, it simply creates a leaf node for the decision tree saying to choose that class.
- None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
The general algorithm then proceeds as follows:
- Check for the above base cases.
- For each attribute a, find the normalized information gain ratio from splitting on a.
- Let a_best be the attribute with the highest normalized information gain.
- Create a decision node that splits on a_best.
- Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
In the example dataset there are 8 decisions for weak wind and 6 decisions for strong wind. Here we use information gain as the splitting criterion; the same steps can be followed with the Gain Ratio.
If we use gain, then Outlook will be the root node because it has the highest gain value.
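As a rough illustration of where that conclusion comes from, the sketch below computes the gain for the Wind attribute; the per-class breakdown (overall 9 yes / 5 no; weak wind 6 yes / 2 no; strong wind 3 yes / 3 no) is assumed from the classic play-tennis dataset used in the linked tutorial.

```python
# Illustrative sketch: information gain for the Wind attribute, assuming the
# classic play-tennis counts (9 yes / 5 no overall; weak: 6/2; strong: 3/3).
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

parent = entropy([9, 5])                                   # ~0.940
children = (8 / 14) * entropy([6, 2]) + (6 / 14) * entropy([3, 3])
print(round(parent - children, 3))                         # Gain(Decision, Wind) ~ 0.048
```

Repeating the same computation for Outlook, Temperature and Humidity shows that Outlook has the highest gain, which is why it becomes the root node.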
Performing similar steps for all attributes under Outlook yields the resultant tree shown in the tutorial below:
https://sefiks.com/2018/05/13/a-step-by-step-c4-5-decision-tree-example/
Limitations:
A limitation of C4.5 is its reliance on information entropy: it gives poor results for attributes with a large number of distinct values.
References:
Decision Tree
Decision trees are a popular and powerful tool used in various fields such as
machine learning, data mining, and statistics. They provide a clear and
intuitive way to make decisions based on data by modeling the relationships
between different variables. This article is all about what decision trees are,
how they work, their advantages and disadvantages, and their applications.
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the
selected attribute.
3. Repeating the Process: The process is repeated recursively for each
subset, creating a new internal node or leaf node until a stopping criterion
is met (e.g., all instances in a node belong to the same class or a
predefined depth is reached).
Pruning
To overcome overfitting, pruning techniques are used. Pruning reduces the
size of the tree by removing nodes that provide little power in classifying
instances. There are two main types of pruning:
Pre-pruning (Early Stopping): Stops the tree from growing once it meets
certain criteria (e.g., maximum depth, minimum number of samples per
leaf).
Post-pruning: Removes branches from a fully grown tree that do not
provide significant power.
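As a hedged illustration of both the building steps and the two pruning styles (using scikit-learn's CART-style trees rather than C4.5; the dataset and parameter values are arbitrary choices):

```python
# Illustrative only: building a tree with an entropy criterion, then
# pre-pruning (max_depth, min_samples_leaf) vs. post-pruning (ccp_alpha).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning (early stopping): constrain the tree while it grows.
pre_pruned = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                    min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the tree, then prune weak branches via a complexity penalty.
post_pruned = DecisionTreeClassifier(criterion="entropy",
                                     ccp_alpha=0.01).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))
```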
Introduction
Decision trees are a popular machine learning algorithm that can be used for
both regression and classification tasks. They are easy to understand,
interpret, and implement, making them an ideal choice for beginners in the
field of machine learning. In this comprehensive guide, we will cover all
aspects of the decision tree algorithm, including the working principles,
different types of decision trees, the process of building decision trees, and
how to evaluate and optimize decision trees. By the end of this article, you will
have a complete understanding of decision trees and decision tree examples
and how they can be used to solve real-world problems.
Root Node: The initial node at the beginning of a decision tree, where the
entire population or dataset starts dividing based on various features or
conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are
known as decision nodes. These nodes represent intermediate decisions or
conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating
the final classification or outcome. Leaf nodes are also referred to as
terminal nodes.
Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a
sub-section of a decision tree is referred to as a sub-tree. It represents a
specific portion of the decision tree.
Pruning: The process of removing or cutting down specific nodes in a
decision tree to prevent overfitting and simplify the model.
Branch / Sub-Tree: A subsection of the entire decision tree is referred to
as a branch or sub-tree. It represents a specific path of decisions and
outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-
nodes is known as a parent node, and the sub-nodes emerging from it are
referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or
further decisions based on that condition.
Decision trees are drawn upside down, which means the root is at the top, and this root is then split into several nodes. In layman's terms, decision trees are nothing but a bunch of if-else statements. The tree checks whether a condition is true, and if it is, it moves to the next node attached to that decision.
In the below diagram the tree will first ask: what is the weather? Is it sunny, cloudy, or rainy? Depending on the answer, it moves on to the next feature, humidity or wind. It will then check whether the wind is strong or weak; if the wind is weak and it is rainy, the person may go out and play.
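In code, that tree is literally nested if-else checks. The sketch below is a minimal illustration; the exact outcome on each branch is an assumption based on the description above.

```python
# A decision tree as plain nested if-else statements (sketch of the flowchart above).
def play_outside(weather, humidity, wind):
    if weather == "cloudy":
        return "play"                                   # cloudy always leads to play
    if weather == "sunny":
        return "play" if humidity == "normal" else "don't play"
    if weather == "rainy":
        return "play" if wind == "weak" else "don't play"

print(play_outside("rainy", "normal", "weak"))          # play
print(play_outside("sunny", "high", "weak"))            # don't play
```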
Did you notice anything in the above flowchart? We see that if the weather is
cloudy then we must go to play. Why didn’t it split more? Why did it stop
there?
To answer this question, we need to know about a few more concepts like entropy, information gain, and the Gini index. But in simple terms, I can say that the output for the training dataset is always yes for cloudy weather; since there is no disorderliness here, we don't need to split the node further.
The goal of machine learning is to decrease uncertainty or disorders from the
dataset and for this, we use decision trees.
Now you must be thinking: how do I know what should be the root node? What should be the decision node? When should I stop splitting? To decide this, there is a metric called “Entropy”, which is the amount of uncertainty in the dataset.
Entropy
Entropy is nothing but the uncertainty in our dataset or a measure of disorder.
Let me try to explain this with the help of an example.
Suppose you have a group of friends who decides which movie they can watch
together on Sunday. There are 2 choices for movies, one is “Lucy” and the
second is “Titanic” and now everyone has to tell their choice. After everyone
gives their answer we see that “Lucy” gets 4 votes and “Titanic” gets 5 votes.
Which movie do we watch now? Isn’t it hard to choose 1 movie now because
the votes for both the movies are somewhat equal.
This is exactly what we call disorder: there is an equal number of votes for both movies, and we can't really decide which movie we should watch. It would have been much easier if the votes for “Lucy” were 8 and for “Titanic” only 2. Then we could easily say that the majority of votes are for “Lucy”, and hence everyone would be watching that movie.
In a decision tree, the output is mostly “yes” or “no”.
The formula for Entropy is shown below:
E(S) = -(p+) log2(p+) - (p–) log2(p–)
Here,
p+ is the probability of the positive class
p– is the probability of the negative class
S is the subset of the training examples
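Plugging the movie-vote example into this formula (4 votes for “Lucy” and 5 for “Titanic”, 9 votes in total) gives a worked value:

```latex
E(S) = -\tfrac{4}{9}\log_2\tfrac{4}{9} - \tfrac{5}{9}\log_2\tfrac{5}{9} \approx 0.99
```

An entropy close to 1 reflects the near 50/50 split that made the choice hard; a lopsided 8-to-2 vote would give a much lower entropy (about 0.72).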
For feature 3, we can clearly see from the tree itself that the left node has lower entropy, or more purity, than the right node, since the left node has a greater number of “yes” instances and it is easier to decide there.
Always remember that the higher the Entropy, the lower will be the purity and
the higher will be the impurity.
As mentioned earlier, the goal of machine learning is to decrease the uncertainty or impurity in the dataset. By using entropy we get the impurity of a particular node, but we don't yet know whether the entropy has decreased compared with the parent node.
For this, we bring in a new metric called “Information gain”, which tells us how much the parent entropy has decreased after splitting on some feature.
Information Gain
Information gain measures the reduction of uncertainty given some feature
and it is also a deciding factor for which attribute should be selected as a
decision node or root node.
It is simply the entropy of the full dataset minus the entropy of the dataset given some feature.
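In symbols (the standard definition, consistent with the gain formula given earlier), for a feature A that splits the set S into subsets S_v:

```latex
IG(S, A) = E(S) - \sum_{v} \frac{|S_v|}{|S|}\, E(S_v)
```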
To understand this better, let's consider an example. Suppose our entire population has a total of 30 instances. The dataset is used to predict whether a person will go to the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral”
and “Highly motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use
information gain to decide which feature should be the root node and which
feature should be placed after the split.
Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
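Filling in the numbers quoted in the text (the conditional entropy E(Parent|Energy) ≈ 0.62 is inferred from the stated parent entropy of about 0.99 and the stated gain of 0.37):

```latex
\begin{aligned}
E(\text{Parent}) &= -\tfrac{16}{30}\log_2\tfrac{16}{30} - \tfrac{14}{30}\log_2\tfrac{14}{30} \approx 0.99\\
IG(\text{Parent}, \text{Energy}) &= E(\text{Parent}) - E(\text{Parent}\mid\text{Energy}) \approx 0.99 - 0.62 = 0.37
\end{aligned}
```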
Our parent entropy was near 0.99 and after looking at this value of information
gain, we can say that the entropy of the dataset will decrease by 0.37 if we
make “Energy” as our root node.
Similarly, we will do this with the other feature “Motivation” and calculate its
information gain.
Pruning
Pruning is another method that can help us avoid overfitting. It helps in improving the performance of the Decision tree by cutting the nodes or sub-nodes which are not significant. Additionally, it removes the branches which have very low importance.
There are mainly 2 ways for pruning:
Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.
Post-pruning – once our tree is built to its depth, we can start pruning the
nodes based on their significance.
Conclusion
To summarize, in this article we learned about decision trees: on what basis the tree splits its nodes, how pruning can stop overfitting, and why linear regression doesn't work in the case of classification problems. To check out a full implementation of decision trees, please refer to the original author's GitHub repository. We hope this article gives you a clear understanding of the decision tree algorithm and the decision tree examples discussed here.