
Classification and Prediction in Data Mining

Two forms of data analysis can be used to extract models describing important classes
or to predict future data trends. These two forms are as follows:

1. Classification
2. Prediction

We use classification and prediction to extract a model that represents the data classes
or predicts future data trends. Classification assigns categorical labels to data, while
prediction estimates continuous values. This kind of analysis gives us a good
understanding of the data at a large scale.

Classification models predict categorical class labels, and prediction models predict
continuous-valued functions. For example, we can build a classification model to
categorize bank loan applications as either safe or risky or a prediction model to
predict the expenditures in dollars of potential customers on computer equipment
given their income and occupation.
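As a concrete illustration of the two kinds of models, the sketch below fits a decision-tree classifier for a safe/risky label and a decision-tree regressor for a dollar amount. It assumes scikit-learn; the applicant data, labels, and amounts are invented purely for illustration and are not from the example above.

# Minimal sketch, assuming scikit-learn; the data and labels are hypothetical.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical applicants: [income in $1000s, years in current job]
X = [[25, 1], [48, 6], [62, 10], [30, 2], [80, 15], [22, 0]]

# Classification model: categorical class label for each loan application
y_label = ["risky", "safe", "safe", "risky", "safe", "risky"]
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_label)
print(clf.predict([[55, 8]]))   # -> a categorical label, e.g. ['safe']

# Prediction model: continuous value (expenditure on computer equipment, $)
y_spend = [300, 900, 1200, 400, 2000, 250]
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_spend)
print(reg.predict([[55, 8]]))   # -> a continuous dollar estimate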

What is Classification?
Classification identifies the category, or class label, of a new observation. First, a
set of data is used as training data: the input data and their associated class labels
are given to the algorithm. Using this training dataset, the algorithm derives a model,
or classifier. The derived model can be a decision tree, a mathematical formula, or a
neural network. In classification, when unlabeled data is given to the model, it should
find the class to which that data belongs. The new data provided to the model is the
test data set.

Classification is the process of assigning a class to a record. One simple example of
classification is to check whether it is raining or not; the answer can be either yes or
no, so there is a fixed number of choices. Sometimes there are more than two classes to
choose from; that is called multiclass classification.

A bank, for example, needs to analyze whether giving a loan to a particular customer is
risky or not. Based on observable data for multiple loan borrowers, a classification
model may be established that forecasts credit risk. The data could track job records,
homeownership or leasing, years of residency, the number and type of deposits,
historical credit ranking, and so on. The target would be the credit ranking, the
predictors would be the other characteristics, and each consumer would represent one
case. In this example, a model is constructed to find the categorical label; the labels
are risky or safe.

What is a Decision Tree?

A decision tree is a non-parametric supervised learning algorithm that can be used for
both classification and regression tasks. It has a hierarchical tree structure
consisting of a root node, branches, internal nodes, and leaf nodes, and it produces
models that are easy to understand. As the name suggests, it uses a flowchart-like tree
structure to show the predictions that result from a series of feature-based splits.
It starts with a root node and ends with a decision made by the leaves.
Decision Tree Terminologies
Before learning more about decision trees, let's get familiar with some of the
terminology:
 Root Node: The initial node at the beginning of a decision tree, where the
entire population or dataset starts dividing based on various features or
conditions.
 Decision Nodes: Nodes resulting from the splitting of root nodes are
known as decision nodes. These nodes represent intermediate decisions
or conditions within the tree.
 Leaf Nodes: Nodes where further splitting is not possible, often indicating
the final classification or outcome. Leaf nodes are also referred to as
terminal nodes.
 Sub-Tree: Similar to a subsection of a graph being called a sub-graph, a
sub-section of a decision tree is referred to as a sub-tree. It represents a
specific portion of the decision tree.
 Pruning: The process of removing or cutting down specific nodes in a
decision tree to prevent overfitting and simplify the model.
 Branch / Sub-Tree: A subsection of the entire decision tree is referred to
as a branch or sub-tree. It represents a specific path of decisions and
outcomes within the tree.
 Parent and Child Node: In a decision tree, a node that is divided into sub-
nodes is known as a parent node, and the sub-nodes emerging from it are
referred to as child nodes. The parent node represents a decision or
condition, while the child nodes represent the potential outcomes or further
decisions based on that condition.
Example of Decision Tree
Let's understand decision trees with the help of an example. Decision trees are drawn
upside down, which means the root is at the top, and this root is then split into
several nodes. In layman's terms, decision trees are nothing but a bunch of if-else
statements.
How do decision tree algorithms work?
1. Starting at the Root: The algorithm begins at the top, called the “root
node,” representing the entire dataset.
2. Asking the Best Questions: It looks for the most important feature or
question that splits the data into the most distinct groups. This is like
asking a question at a fork in the tree.
3. Branching Out: Based on the answer to that question, it divides the data
into smaller subsets, creating new branches. Each branch represents a
possible route through the tree.
4. Repeating the Process: The algorithm continues asking questions and
splitting the data at each branch until it reaches the final “leaf nodes,”
representing the predicted outcomes or classifications.
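The following sketch shows these four steps in practice. It assumes scikit-learn and its bundled iris toy dataset (which is not part of the text above): the fitted tree starts at a root question, branches on feature thresholds, and ends in leaf nodes carrying the predicted class.

# Small sketch using scikit-learn's iris dataset (an assumption, not the
# original example) to make the root/branch/leaf structure visible.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# export_text prints each split ("question") and the leaf reached on each path
print(export_text(tree, feature_names=list(data.feature_names)))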

Algorithm: Generate_decision_tree
Input:
Data partition D, which is a set of training tuples and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the splitting criterion that best
partitions the data tuples into individual classes. This criterion consists of a
splitting attribute and, possibly, either a split point or a splitting subset.
Output:
A decision tree.

Method
create a node N;
if the tuples in D are all of the same class C then
    return N as a leaf node labeled with the class C;
if attribute_list is empty then
    return N as a leaf node labeled with the majority class in D; // majority voting
apply Attribute_selection_method(D, attribute_list) to find the best splitting_criterion;
label node N with splitting_criterion;
if splitting_attribute is discrete-valued and multiway splits are allowed then
    // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
for each outcome j of splitting_criterion
    let Dj be the set of data tuples in D satisfying outcome j;
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
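Below is a hedged Python sketch of the procedure above. It assumes discrete-valued attributes and uses information gain as the Attribute_selection_method; the data layout (a list of dicts plus a label list) and the tiny weather example at the end are illustrative choices, not something prescribed by the text.

# Illustrative sketch of Generate_decision_tree for discrete attributes.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # entropy of the whole partition minus the weighted entropy of each branch
    total = entropy(labels)
    for value in set(r[attr] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[attr] == value]
        total -= len(idx) / len(rows) * entropy([labels[i] for i in idx])
    return total

def generate_decision_tree(rows, labels, attribute_list):
    # all tuples of the same class C -> leaf labeled with C
    if len(set(labels)) == 1:
        return labels[0]
    # attribute list empty -> leaf labeled with the majority class (majority voting)
    if not attribute_list:
        return Counter(labels).most_common(1)[0][0]
    # pick the best splitting attribute and branch on each of its outcomes
    best = max(attribute_list, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != best]   # remove splitting attribute
    node = {best: {}}
    for value in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        node[best][value] = generate_decision_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

# Tiny usage example with made-up weather data
rows = [{"outlook": "sunny", "windy": "no"}, {"outlook": "rain", "windy": "yes"},
        {"outlook": "sunny", "windy": "yes"}, {"outlook": "rain", "windy": "no"}]
labels = ["yes", "no", "yes", "yes"]
print(generate_decision_tree(rows, labels, ["outlook", "windy"]))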

Decision Tree Assumptions


Binary Splits
Decision trees typically make binary splits, meaning each node divides the data
into two subsets based on a single feature or condition.
Recursive Partitioning
Decision trees use a recursive partitioning process, where each node is divided
into child nodes, and this process continues until a stopping criterion is met.
Feature Independence
Decision trees assume that the features used for splitting nodes are independent.
Homogeneity
Decision trees aim to create homogeneous subgroups in each node, meaning that
the samples within a node are as similar as possible regarding the target variable.
Top-Down Greedy Approach
Decision trees are constructed using a top-down, greedy approach, where each
split is chosen to maximize information gain or minimize impurity. This may not
always result in the globally optimal tree.
Categorical and Numerical Features
Decision trees can handle both categorical and numerical features. However, they
may require different splitting strategies for each type.
Overfitting
Decision trees are prone to overfitting when they capture noise in the data. Pruning
and setting appropriate stopping criteria are used to address this issue.
Impurity Measures
Decision trees use impurity measures such as Gini impurity or entropy to evaluate
how well a split separates the classes. The choice of impurity measure can affect
tree construction.
No Missing Values
Decision trees assume that there are no missing values in the dataset.
Equal Importance of Features
Decision trees may assume equal importance for all features.
No Outliers
Decision trees are sensitive to outliers, and extreme values can influence their
construction.
Sensitivity to Sample Size
Small datasets may lead to overfitting, and large datasets may result in overly
complex trees. The sample size and tree depth should be balanced.
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute
for the root node and for the sub-nodes. To solve this problem, we use a technique
called an attribute selection measure, or ASM. With such a measure, we can easily
select the best attribute for each node of the tree. There are two popular ASM
techniques:

o Information Gain
o Gini Index

1. Information Gain:

o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)


Where,

o S = the total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
Entropy
Entropy is nothing but the uncertainty in our dataset, a measure of disorder. Let me
try to explain this with the help of an example.
Suppose a group of friends has to decide which movie to watch together on Sunday.
There are two choices, "Lucy" and "Titanic", and everyone has to state their
preference. After everyone answers, we see that "Lucy" gets 4 votes and "Titanic" gets
5 votes. Which movie do we watch now? It is hard to choose one movie, because the votes
for both movies are roughly equal.
This is exactly what we call disorder: there is a nearly equal number of votes for both
movies, and we can't really decide which movie we should watch. It would have been much
easier if "Lucy" had received 8 votes and "Titanic" only 2; then we could easily say
that the majority of votes are for "Lucy", so everyone will be watching that movie.
In a decision tree, the output is mostly "yes" or "no".
For this binary case, the formula for entropy is:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

Here,
 p+ is the probability of the positive class
 p- is the probability of the negative class
 S is the subset of the training examples
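A quick numeric check of this formula on the movie-vote example above (a small sketch using Python's math module): 4 votes versus 5 gives an entropy close to 1 (high disorder), while 8 versus 2 gives a much lower value.

import math

def entropy(p_plus, p_minus):
    # Entropy(S) = -p+ log2(p+) - p- log2(p-); terms with probability 0 contribute 0
    return -sum(p * math.log2(p) for p in (p_plus, p_minus) if p > 0)

print(round(entropy(4/9, 5/9), 3))    # ~0.991 -> votes 4 vs 5: very hard to decide
print(round(entropy(8/10, 2/10), 3))  # ~0.722 -> votes 8 vs 2: much clearer choice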
How do Decision Trees use Entropy?
Suppose a node has 8 "yes" and 4 "no" initially; after the first split, the left child
node gets 5 "yes" and 2 "no", whereas the right child node gets 3 "yes" and 2 "no".
We see that the split is not pure. Why? Because we can still see some negative classes
in both nodes. To build a decision tree, we need to calculate the impurity of each
split, and when the purity is 100% we make the node a leaf node. To check the impurity
of the two child nodes, we take the help of the entropy formula, as in the check below.
We can clearly see that the left node has lower entropy, or more purity, than the right
node, since the left node has a greater proportion of "yes" answers and it is easier to
decide there. Always remember: the higher the entropy, the lower the purity and the
higher the impurity.
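Plugging the counts from this split into the entropy formula (a small sketch with a count-based helper) confirms the statement: the left node (5 yes, 2 no) is purer than the right node (3 yes, 2 no).

import math

def entropy_from_counts(yes, no):
    total = yes + no
    return -sum((c / total) * math.log2(c / total) for c in (yes, no) if c > 0)

print(round(entropy_from_counts(5, 2), 3))  # left node  -> ~0.863 (lower entropy, purer)
print(round(entropy_from_counts(3, 2), 3))  # right node -> ~0.971 (higher entropy, less pure)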

Information Gain
Information gain measures the reduction in uncertainty given some feature, and it is
also the deciding factor for which attribute should be selected as a decision node or
as the root node.

It is simply the entropy of the full dataset minus the entropy of the dataset given
some feature.
To understand this better, let's consider an example. Suppose our entire population has
a total of 30 instances, and the dataset is used to predict whether a person will go to
the gym or not. Let's say 16 people go to the gym and 14 people don't.
Now we have two features to predict whether he/she will go to the gym or not.
 Feature 1 is “Energy” which takes two values “high” and “low”
 Feature 2 is “Motivation” which takes 3 values “No motivation”, “Neutral”
and “Highly motivated”.
Let’s see how our decision tree will be made using these 2 features. We’ll use
information gain to decide which feature should be the root node and which
feature should be placed after the split.

Let's calculate the parent entropy first. With 16 people who go to the gym ("yes") and
14 who don't ("no") out of 30:

E(Parent) = -(16/30) log2(16/30) - (14/30) log2(14/30) ≈ 0.99

Splitting on "Energy", we take the weighted average of the entropy of each child node
to get E(Parent|Energy) (the per-branch counts were shown in a figure that is not
reproduced here). The information gain is then:

Information Gain = E(Parent) - E(Parent|Energy) ≈ 0.37

Our parent entropy was near 0.99, and from this value of information gain we can say
that the entropy of the dataset will decrease by about 0.37 if we make "Energy" our
root node.

Similarly, we do the same with the other feature, "Motivation": calculate the entropy
of its child nodes, take their weighted average to get E(Parent|Motivation), and
compute its information gain as E(Parent) - E(Parent|Motivation) (the per-branch counts
were again shown in a figure that is not reproduced here).

We find that the "Energy" feature gives a larger reduction in entropy, 0.37, than the
"Motivation" feature. Hence, we select the feature with the highest information gain
and then split the node based on that feature.

In this example, "Energy" will be our root node, and we do the same for the sub-nodes.
Here we can see that when the energy is "high" the entropy is low, and hence we can say
a person will definitely go to the gym if he has high energy. But what if the energy is
low? Then we again split the node based on the new feature, which is "Motivation".
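Because the per-branch counts for "Energy" and "Motivation" appeared only in the missing figures, the sketch below shows just the mechanics: it computes the parent entropy from the stated 16/14 split and defines an information-gain helper you could feed the branch counts into. The branch counts in the last line are purely hypothetical.

import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branch_counts):
    # branch_counts: one (yes, no) tuple per child node of the split
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - weighted

print(round(entropy([16, 14]), 3))  # parent entropy ~0.997, i.e. "near 0.99" as stated

# Hypothetical branch counts, only to show how the helper is called:
print(information_gain([16, 14], [(13, 2), (3, 12)]))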


2. Gini Index:

o The Gini index is a measure of impurity or purity used while creating a decision tree
in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o The CART algorithm only creates binary splits, and it uses the Gini index to choose
them.
o The Gini index can be calculated using the formula below:

Gini Index = 1 - Σj (pj)^2

where pj is the probability of an object being classified to class j.
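A minimal helper for this formula (a plain-Python sketch; the class counts passed in are whatever node you apply it to):

def gini_index(class_counts):
    # Gini = 1 - sum_j (p_j)^2, where p_j is the fraction of samples in class j
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

print(gini_index([8, 2]))   # 0.32 -> fairly impure node
print(gini_index([10, 0]))  # 0.0  -> pure node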

Gini index example

Consider a dataset of 14 rows and 4 columns (the table itself is not reproduced here).
The target (dependent) variable indicates whether heart disease occurs in a person,
depending on HighBP, HighCholesterol, and FBS (fasting blood sugar).
Note: the original values are converted into 1 and 0, giving a numeric (binary)
encoding.
The factor which gives the least Gini index is the winner, i.e., the decision tree is
built based on that factor.
Now we find the Gini index for the individual columns.
1. Gini index for HighBP:

(The decision tree figure for HighBP is not reproduced here.)

Probability for the parent node:
P1 = P(BP=1) = 10/14
P0 = P(BP=0) = 4/14

Now we calculate for the child nodes:

1) For BP = 1,
P(target=1 | BP=1) = 8/10
P(target=0 | BP=1) = 2/10
Gini(BP=1) = 1 - [(8/10)^2 + (2/10)^2]
= 0.32

2) For BP = 0,
P(target=0 | BP=0) = 4/4 = 1
P(target=1 | BP=0) = 0
Gini(BP=0) = 1 - [(1)^2 + (0)^2]
= 0

Weighted Gini index
= P0*Gini(BP=0) + P1*Gini(BP=1) = 4/14*0 + 10/14*0.32
= 0.229

2. Gini index for HighCholesterol:

Probability of the parent node:
P1 = P(Chol=1) = 11/14
P0 = P(Chol=0) = 3/14

i) For Chol = 1,
P(target=1 | Chol=1) = 7/11
P(target=0 | Chol=1) = 4/11
Gini(Chol=1) = 1 - [(7/11)^2 + (4/11)^2]
= 0.46

ii) For Chol = 0,
P(target=1 | Chol=0) = 1/3
P(target=0 | Chol=0) = 2/3
Gini(Chol=0) = 1 - [(1/3)^2 + (2/3)^2]
= 0.44

Weighted Gini index = P0*Gini(Chol=0) + P1*Gini(Chol=1)
= 3/14*0.44 + 11/14*0.46
= 0.46

3. Gini index for FBS:

(The decision tree figure for FBS is not reproduced here.)

Probability of the parent node:
P1 = P(FBS=1) = 2/14
P0 = P(FBS=0) = 12/14

i) For FBS = 1,
P(target=1 | FBS=1) = 2/2 = 1
P(target=0 | FBS=1) = 0
Gini(FBS=1) = 1 - [(1)^2 + (0)^2]
= 0

ii) For FBS = 0,
P(target=1 | FBS=0) = 6/12 = 0.5
P(target=0 | FBS=0) = 6/12 = 0.5
Gini(FBS=0) = 1 - [(0.5)^2 + (0.5)^2]
= 0.5

Weighted Gini index = P0*Gini(FBS=0) + P1*Gini(FBS=1)
= 12/14*0.5 + 2/14*0
= 0.43

Comparing the Gini indexes: HighBP has the lowest weighted Gini index (0.229, versus
about 0.43 for FBS and about 0.46 for HighCholesterol), so it is the winner.

HighBP is therefore used as the root node for constructing the decision tree, and the
rest of the tree is built from there.
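As a sanity check, the sketch below recomputes the weighted Gini index of each column from the counts used in the example (the gini_index helper mirrors the one shown earlier):

def gini_index(class_counts):
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def weighted_gini(branches):
    # branches: one (target=1 count, target=0 count) tuple per attribute value
    total = sum(sum(b) for b in branches)
    return sum(sum(b) / total * gini_index(b) for b in branches)

print(round(weighted_gini([(8, 2), (0, 4)]), 3))   # HighBP          -> 0.229
print(round(weighted_gini([(7, 4), (1, 2)]), 3))   # HighCholesterol -> ~0.459
print(round(weighted_gini([(2, 0), (6, 6)]), 3))   # FBS             -> ~0.429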


Overfitting
Overfitting occurs when our machine learning model tries to cover all of the data
points, or more data points than necessary, in the given dataset. Because of this, the
model starts capturing the noise and inaccurate values present in the data, and all of
these factors reduce the efficiency and accuracy of the model. An overfitted model has
low bias and high variance.

Overfitting in Decision Trees

In decision trees, in order to fit the data (even noisy data), the model keeps
generating new nodes, and ultimately the tree becomes too complex to interpret. The
decision tree predicts well on the training data but can be inaccurate for new data.
If a decision tree model is allowed to train to its full potential, it can overfit the
training data. There are techniques to prevent the overfitting of decision trees.
What is Pruning?

Pruning is a technique that removes parts of the decision tree and prevents it from
growing to its full depth. Pruning removes those parts of the decision tree that do not
have the power to classify instances. Pruning can be of two types: pre-pruning and
post-pruning.

An unpruned tree is denser, more complex, and has a higher variance than its pruned
counterpart, resulting in overfitting.

Pre-Pruning
Pre-pruning, also known as 'early stopping' or 'forward pruning', stops the growth of
the decision tree, preventing it from reaching its full depth. It stops non-significant
branches from being generated. Pre-pruning involves tuning the hyperparameters prior to
training the model. The hyperparameters that can be tuned for pre-pruning or early
stopping are max_depth, min_samples_leaf, and min_samples_split.

 max_depth: Specifies the maximum depth of the tree. If None, then nodes are expanded
until all leaves are pure or until all leaves contain less than min_samples_split
samples.

 min_samples_leaf: Specifies the minimum number of samples required at a leaf node.

 min_samples_split: Specifies the minimum number of samples required to split an
internal node.
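A hedged example of pre-pruning with these hyperparameters, assuming scikit-learn and its breast-cancer toy dataset; the particular values are arbitrary choices for illustration, not recommendations.

# Pre-pruning sketch: constrain tree growth up front via hyperparameters.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pruned = DecisionTreeClassifier(
    max_depth=4,            # stop growing beyond depth 4
    min_samples_split=20,   # need at least 20 samples to split a node
    min_samples_leaf=5,     # every leaf must keep at least 5 samples
    random_state=0,
).fit(X_train, y_train)

print("train accuracy:", pruned.score(X_train, y_train))
print("test accuracy :", pruned.score(X_test, y_test))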

Post-Pruning

Post-pruning, or 'backward pruning', is a technique that eliminates branches from a
"completely grown" decision tree model to reduce its complexity and variance. This
technique allows the decision tree to grow to its full depth and then removes branches
to prevent the model from overfitting. In post-pruning, non-significant branches of the
model are removed using the cost complexity pruning (CCP) technique.
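A hedged sketch of post-pruning with scikit-learn's cost complexity pruning: the tree is first grown fully, candidate ccp_alpha values are read off the pruning path, and the tree is refit with a non-zero alpha. The dataset and the choice of the middle alpha are arbitrary illustrations.

# Post-pruning sketch via cost complexity pruning (CCP) in scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)   # fully grown tree
path = full.cost_complexity_pruning_path(X_train, y_train)            # candidate alphas

alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]   # arbitrary middle candidate
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)

print("leaves before pruning:", full.get_n_leaves())
print("leaves after pruning :", pruned.get_n_leaves())
print("test accuracy (pruned):", pruned.score(X_test, y_test))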

When to Stop Splitting?

You must be asking yourself: when do we stop growing our decision tree? Real-world
datasets usually have a large number of features, which results in a large number of
splits, which in turn gives a huge tree. Such trees take time to build and can lead to
overfitting. That means the tree will give very good accuracy on the training dataset
but bad accuracy on the test data.

There are many ways to tackle this problem through hyperparameter tuning. We can set
the maximum depth of our decision tree using the max_depth parameter. The higher the
value of max_depth, the more complex the tree will be. The training error will of
course decrease as we increase max_depth, but when our test data comes into the
picture, we will get very bad accuracy. Hence you need a value that will neither
overfit nor underfit the data, and for this you can use GridSearchCV.
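For instance, a GridSearchCV sweep over max_depth (a sketch assuming scikit-learn; the depth grid and dataset are arbitrary choices) picks the depth that generalizes best under cross-validation:

# Sketch: tune max_depth with GridSearchCV instead of guessing it.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6, None]},
    cv=5,
)
search.fit(X, y)

print("best max_depth:", search.best_params_["max_depth"])
print("cross-validated accuracy:", round(search.best_score_, 3))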

Another way is to set the minimum number of samples for each split, denoted by
min_samples_split. Here we specify the minimum number of samples required to perform a
split. For example, we can require a minimum of 10 samples to reach a decision. That
means if a node has fewer than 10 samples, then, using this parameter, we stop further
splitting of that node and make it a leaf node.

There are more hyperparameters, such as:

 min_samples_leaf – represents the minimum number of samples required to be in a leaf
node. Increasing this number constrains the tree, which reduces the risk of overfitting
(and, if set too high, can cause underfitting).

 max_features – helps us decide how many features to consider when looking for the
best split.

To read more about these hyperparameters, refer to the scikit-learn documentation.

Pruning

Pruning is another method that can help us avoid overfitting. It improves the
performance of the decision tree by cutting the nodes or sub-nodes that are not
significant, and it removes the branches that have very low importance.

There are mainly two ways of pruning:

 Pre-pruning – we can stop growing the tree earlier, which means we can
prune/remove/cut a node if it has low importance while growing the tree.

 Post-pruning – once our tree is built to its full depth, we can start pruning the
nodes based on their significance.

a) What are the uses of the training data set and the test data set in a decision tree classification scheme?
b) Define entropy gain (information gain) and the Gini index.

c) Write an algorithm for decision tree construction and mention the criterion for splitting an
attribute.

d) Generate classification rules from a decision tree for the above database using entropy gain
computation.
