Decision Tree
Two forms of data analysis can be used to extract models describing important classes
or to predict future data trends. These two forms are as follows:
1. Classification
2. Prediction
We use classification and prediction to extract models that represent the data classes
and to predict future data trends. Classification predicts the categorical labels of data
using a learned model. This analysis gives us a better understanding of the data at a
large scale.
Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky or a prediction model to
predict the expenditures in dollars of potential customers on computer equipment
given their income and occupation.
What is Classification?
Classification is the task of identifying the category, or class label, of a new observation.
First, a set of data is used as training data: the input data and the corresponding
outputs are given to the algorithm, so the training data set includes the input data
and their associated class labels. Using the training dataset, the algorithm derives a
model, or classifier. The derived model can be a decision tree, a mathematical
formula, or a neural network. In classification, when unlabeled data is given to the
model, it should find the class to which it belongs. The new data provided to the model
is the test data set.
For example, a bank needs to analyze whether giving a loan to a particular customer is
risky or not. Based on observable data for multiple loan borrowers, a classification
model can be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, number and type of deposits, historical
credit ranking, and so on. The target would be the credit ranking, the predictors would
be the other characteristics, and the data would represent a case for each consumer.
In this example, a model is constructed to find the categorical label. The labels are risky
or safe.
A decision tree is a flowchart-like structure consisting of a root node, branches,
internal nodes, and leaf nodes. Decision trees are used for classification and
regression tasks, providing easy-to-interpret models for both kinds of problems.
The name itself suggests that it uses a flowchart-like tree structure to show the
predictions that result from a series of feature-based splits. It starts with a root
node and ends with a decision made by the leaves.
Decision Tree Terminologies
Before learning more about decision trees let’s get familiar with some of the
terminologies:
Root Node: The initial node at the beginning of a decision tree, where the
entire population or dataset starts dividing based on various features or
conditions.
Decision Nodes: Nodes resulting from the splitting of root nodes are
known as decision nodes. These nodes represent intermediate decisions
or conditions within the tree.
Leaf Nodes: Nodes where further splitting is not possible, often indicating
the final classification or outcome. Leaf nodes are also referred to as
terminal nodes.
Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a
sub-section of a decision tree is referred to as a sub-tree. It represents a
specific portion of the decision tree.
Pruning: The process of removing or cutting down specific nodes in a
decision tree to prevent overfitting and simplify the model.
Branch / Sub-Tree: A subsection of the entire decision tree is referred to
as a branch or sub-tree. It represents a specific path of decisions and
outcomes within the tree.
Parent and Child Node: In a decision tree, a node that is divided into sub-
nodes is known as a parent node, and the sub-nodes emerging from it are
referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or further
decisions based on that condition.
Example of Decision Tree
Let’s understand decision trees with the help of an example:
Decision trees are drawn upside down, which means the root is at the top, and this
root is then split into several nodes. In layman's terms, a decision tree is nothing but
a bunch of if-else statements.
How do decision tree algorithms work?
1. Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
2. Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
3. Branching Out: Based on the answer to that question, it divides the data
into smaller subsets, creating new branches. Each branch represents a
possible route through the tree.
4. Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.
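To make these steps concrete, here is a minimal sketch of fitting and querying a decision tree classifier with scikit-learn. The tiny loan-style dataset and the feature names are invented purely for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Each row: [income_in_thousands, years_employed]; label 1 = risky, 0 = safe
# (this tiny dataset is made up for illustration only)
X = [[25, 1], [40, 5], [60, 10], [22, 0], [80, 12], [30, 2]]
y = [1, 0, 0, 1, 0, 1]

# The algorithm starts at the root with all samples and repeatedly picks
# the feature/threshold that best separates "risky" from "safe"
clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Classify a new, unlabeled applicant (the test data)
print(clf.predict([[35, 3]]))

# Print the learned tree: the root question first, leaves at the end
print(export_text(clf, feature_names=["income", "years_employed"]))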
Algorithm: Generate_decision_tree
Input:
Data partition, D, which is a set of training tuples and their associated class labels.
attribute list, the set of candidate attributes.
Attribute selection method, a procedure to determine the splitting criterion that best
partitions the data tuples into individual classes.
This criterion includes a splitting attribute and either a splitting point or a splitting
subset.
Output:
A Decision Tree
Method:
create a node N;
if the tuples in D are all of the same class C, then return N as a leaf node labeled with class C;
if the attribute list is empty, then return N as a leaf node labeled with the majority class in D;
apply the Attribute selection method to D and the attribute list to find the best splitting criterion, and label node N with it;
if the splitting attribute is discrete-valued and multiway splits are allowed, remove the splitting attribute from the attribute list;
for each outcome j of the splitting criterion, let Dj be the set of tuples in D satisfying outcome j:
if Dj is empty, attach a leaf labeled with the majority class in D to node N;
otherwise, attach the node returned by Generate_decision_tree(Dj, attribute list) to node N;
return N.
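Below is a compact Python sketch of the same recursive procedure for categorical attributes. The helper names (majority_class, attribute_selection_method) are illustrative, not from any particular library, and the attribute-selection function is left for the caller to supply (for example, one based on information gain or the Gini index).

from collections import Counter

def majority_class(labels):
    # majority voting, used when a branch cannot be split any further
    return Counter(labels).most_common(1)[0][0]

def generate_decision_tree(rows, labels, attribute_list, attribute_selection_method):
    # rows: list of dicts mapping attribute name -> categorical value
    # all tuples are of the same class -> return a leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # attribute list is empty -> return a leaf with the majority class
    if not attribute_list:
        return majority_class(labels)
    # pick the "best" splitting attribute using the supplied selection measure
    attr = attribute_selection_method(rows, labels, attribute_list)
    node = {"attribute": attr, "branches": {}}
    remaining = [a for a in attribute_list if a != attr]
    # one branch per outcome (value) of the splitting attribute seen in the data
    for value in set(row[attr] for row in rows):
        branch = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        branch_rows = [r for r, _ in branch]
        branch_labels = [l for _, l in branch]
        node["branches"][value] = generate_decision_tree(
            branch_rows, branch_labels, remaining, attribute_selection_method)
    return node

A call would look like generate_decision_tree(data, labels, ["Energy", "Motivation"], selector), where selector returns the attribute with the highest information gain or lowest Gini index.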
Attribute Selection Measures
o Information Gain
o Gini Index
1. Information Gain:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the
randomness in the data. Entropy can be calculated as:
Entropy(S) = −(p+) log₂(p+) − (p−) log₂(p−)
Here,
p+ is the probability of the positive class,
p− is the probability of the negative class, and
S is the subset of training examples.
How do Decision Trees use Entropy?
Suppose a feature has 8 "yes" and 4 "no" initially; after the first split, the left node
gets 5 "yes" and 2 "no", whereas the right node gets 3 "yes" and 2 "no".
We see here that the split is not pure. Why? Because we can still see some negative
classes in both nodes. To build a decision tree, we need to calculate the impurity of
each split, and when the purity is 100%, we make it a leaf node.
To check the impurity of the left and right nodes, we take the help of the entropy formula.
For the left node (5 "yes", 2 "no"):
Entropy = −(5/7) log₂(5/7) − (2/7) log₂(2/7) ≈ 0.86
For the right node (3 "yes", 2 "no"):
Entropy = −(3/5) log₂(3/5) − (2/5) log₂(2/5) ≈ 0.97
We can clearly see that the left node has lower entropy, i.e., more purity, than the right
node, since the left node has a greater proportion of "yes" and it is easier to make a
decision there.
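These entropy values can be verified with a few lines of Python (a throwaway helper written for this example, not part of any library):

import math

def entropy(pos, neg):
    # two-class entropy; 0 * log2(0) is treated as 0
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(8, 4))  # before the split: ~0.918
print(entropy(5, 2))  # left node:  ~0.863 (purer)
print(entropy(3, 2))  # right node: ~0.971 (less pure)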
Always remember that the higher the entropy, the lower will be the purity and the higher will be the impurity.
Information Gain
Information gain measures the reduction of uncertainty given some feature and it
is also a deciding factor for which attribute should be selected as a decision node
or root node.
It is just entropy of the full dataset – entropy of the dataset given some feature.
To understand this better, let's consider an example. Suppose our entire population
has a total of 30 instances. The dataset is used to predict whether a person will go
to the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
Feature 1 is “Energy” which takes two values “high” and “low”
Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral”
and “Highly motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use
information gain to decide which feature should be the root node and which
feature should be placed after the split.
The parent entropy is E(Parent) = −(16/30) log₂(16/30) − (14/30) log₂(14/30) ≈ 0.99, and
the weighted entropy after splitting on "Energy" comes out to E(Parent|Energy) ≈ 0.62.
Now that we have the values of E(Parent) and E(Parent|Energy), the information gain will be:
Information Gain = E(Parent) − E(Parent|Energy) ≈ 0.99 − 0.62 = 0.37
Our parent entropy was near 0.99, and after looking at this value of information gain,
we can say that the entropy of the dataset will decrease by 0.37 if we make "Energy"
our root node.
Similarly, we will do this with the other feature, "Motivation", and calculate its
information gain: compute the weighted entropy E(Parent|Motivation) over its three
values ("No motivation", "Neutral", "Highly motivated"), and the information gain
will be E(Parent) − E(Parent|Motivation).
We now see that the "Energy" feature gives a larger reduction in entropy (0.37) than
the "Motivation" feature. Hence, we select the feature with the highest information
gain and split the node based on that feature.
In this example, "Energy" will be our root node, and we'll do the same for the sub-nodes.
Here we can see that when the energy is "high" the entropy is low, and hence we can
say a person is very likely to go to the gym if he or she has high energy. But what if
the energy is "low"? We will again split that node based on the remaining feature,
"Motivation".
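The sketch below recomputes the parent entropy for the 16-versus-14 gym example and shows how the information gain of a split is computed from per-branch class counts. The branch counts passed in the last line are hypothetical, used only to show the call shape, since the exact "Energy" split counts are not given in the text.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def information_gain(parent_counts, branch_counts):
    # branch_counts: one list of class counts per branch of the split
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

print(entropy([16, 14]))  # parent entropy, ~0.997 (the ~0.99 used above)

# Hypothetical split counts for "Energy" = high / low, for illustration only
print(information_gain([16, 14], [[12, 2], [4, 12]]))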
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 − Σⱼ (pⱼ)²
where pⱼ is the proportion of class j in the node.
1) For BPS = 1:
P(target = 1 | BPS = 1) = 8/10
P(target = 0 | BPS = 1) = 2/10
Gini(BPS = 1) = 1 − [(8/10)² + (2/10)²]
= 1 − [0.64 + 0.04]
= 0.32
2) For BPS = 0:
P(target = 0 | BPS = 0) = 4/4 = 1
P(target = 1 | BPS = 0) = 0
Gini(BPS = 0) = 1 − [(1)² + (0)²]
= 1 − 1
= 0
i) For HChol = 1:
P(target = 1 | HChol = 1) = 7/11
P(target = 0 | HChol = 1) = 4/11
Gini(HChol = 1) = 1 − [(7/11)² + (4/11)²]
≈ 0.46
ii) For HChol = 0:
P(target = 1 | HChol = 0) = 1/3
P(target = 0 | HChol = 0) = 2/3
Gini(HChol = 0) = 1 − [(1/3)² + (2/3)²]
≈ 0.44
The weighted Gini index for HighBPS, (10/14) × 0.32 + (4/14) × 0 ≈ 0.23, is lower than that
for HChol, (11/14) × 0.46 + (3/14) × 0.44 ≈ 0.46. Therefore HighBPS is used as the root node
for constructing the decision tree, and the rest of the tree is built from there.
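The Gini numbers above can be reproduced with a short Python sketch (class counts are written as (target = 1, target = 0) within each branch):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(branches):
    # weighted average of the per-branch Gini indices
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini(b) for b in branches)

print(gini([8, 2]), gini([0, 4]))       # 0.32 and 0.0   (BPS = 1, BPS = 0)
print(gini([7, 4]), gini([1, 2]))       # ~0.46 and ~0.44 (HChol = 1, HChol = 0)
print(weighted_gini([[8, 2], [0, 4]]))  # ~0.23 for splitting on HighBPS
print(weighted_gini([[7, 4], [1, 2]]))  # ~0.46 for splitting on HChol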
Overfitting
Overfitting occurs when our machine learning model tries to cover all the data points,
or more data points than required, in the given dataset. Because of this, the model
starts capturing the noise and inaccurate values present in the dataset, and all these
factors reduce the efficiency and accuracy of the model. The overfitted model has low
bias and high variance.
In decision trees, in order to fit the data (even noisy data), the model keeps generating
new nodes, and ultimately the tree becomes too complex to interpret. The decision tree
predicts well for the training data but can be inaccurate for new data. If a decision tree
model is allowed to train to its full potential, it can overfit the training data. There are
techniques to prevent the overfitting of decision trees, such as pruning.
What is Pruning?
Pre-Pruning
Pre-Pruning, also known as ‘Early Stopping’ or ‘Forward Pruning’,
stops the growth of the decision tree — preventing it from reaching
its full depth. It stops the non-significant branches from generating
in a decision tree. Pre-Pruning involves the tuning of
the hyperparameters prior to training the model.
The hyperparameters that can be tuned for pre-pruning or early
stopping are max_depth, min_samples_leaf, and
min_samples_split.
Post-Pruning
You must be asking yourself: when do we stop growing our tree? Large decision trees
take time to build and can lead to overfitting. That means the tree will give very good
accuracy on the training dataset but bad accuracy on the test data.
There are many ways to tackle this problem through hyperparameter tuning. We can
set the maximum depth of our decision tree using the max_depth parameter. The
greater the value of max_depth, the more complex the tree will be. The training error
will of course decrease if we increase max_depth, but when our test data comes into
the picture, we will get a very bad accuracy. Hence you need a value that will neither
overfit nor underfit the data, and a hyperparameter search (for example, GridSearchCV)
can be used to find it.
Another way is to set the minimum number of samples for each split, using the
min_samples_split parameter: the minimum number of samples a node must contain
before it can be split. For example, if this parameter is set to 10 and a node has fewer
than 10 samples, we stop further splitting of that node and make it a leaf node.
A related parameter, min_samples_leaf, specifies the minimum number of samples
required to be in a leaf node. Increasing this number makes the tree simpler and
reduces the chance of overfitting, although setting it too high can cause underfitting.
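Here is a minimal scikit-learn sketch of setting these pre-pruning hyperparameters; the dataset and the specific values (4, 10, 5) are illustrative only.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: fits the training set almost perfectly, may overfit
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruned tree: growth is stopped early by the hyperparameters below
pruned_tree = DecisionTreeClassifier(
    max_depth=4,           # limit the depth of the tree
    min_samples_split=10,  # a node needs at least 10 samples to be split
    min_samples_leaf=5,    # every leaf must contain at least 5 samples
    random_state=0,
).fit(X_train, y_train)

for name, model in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, model.score(X_train, y_train), model.score(X_test, y_test))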
Pruning
Pruning is another method that can help us avoid overfitting. It helps in improving
the performance of the Decision tree by cutting the nodes or sub-nodes which
are not significant. Additionally, it removes the branches which have very low
importance.
Pre-pruning – we can stop growing the tree earlier, which means we can prune or
remove a node during tree growth if it turns out not to be significant.
Post-pruning – once our tree is built to its full depth, we can start pruning the nodes
based on their significance.
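For post-pruning, scikit-learn's decision trees support cost-complexity pruning through the ccp_alpha parameter. A hedged sketch is below; the alpha value 0.01 is arbitrary, and in practice it would be chosen by cross-validation over path.ccp_alphas.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree to full depth, then inspect the cost-complexity pruning path
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
print("candidate alphas:", len(path.ccp_alphas))

# A larger ccp_alpha prunes more aggressively, giving a smaller tree
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
print("leaves:", full_tree.get_n_leaves(), "->", pruned_tree.get_n_leaves())
print("test accuracy:", full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))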
a) What are the uses of the training data set and the test data set in a decision tree classification scheme?
b) Define entropy gain and the Gini index.
c) Write an algorithm for decision tree construction and mention the criterion for splitting an
attribute.
d) Generate classification rules from a decision tree for the above database using entropy gain
computation.