ML Unit 5
Structure of MultiLayer Perceptron Neural Network
This network has three main layers that combine to form a complete Artificial Neural
Network. These layers are as follows:
Input Layer
It is the initial or starting layer of the multilayer perceptron. It takes input from the training
dataset and forwards it to the hidden layer. There are n input nodes in the input layer, where n
equals the number of features in the dataset. Each component of the input vector is passed to
every node of the hidden layer.
Hidden Layer
It is the heart of the artificial neural network: all of the network's computation happens here.
Each edge into a hidden node carries a weight that is multiplied by the value of the node feeding
it, and the hidden layer applies the activation function to the weighted sum.
There can be one or more hidden layers in the model.
The number of hidden nodes must be chosen carefully: too few nodes leave the model unable to
handle complex data, while too many nodes lead to overfitting.
Output Layer
This layer gives the estimated output of the neural network. The number of nodes in the
output layer depends on the type of problem: for a single target variable, use one node; for an
N-class classification problem, use N nodes in the output layer.
Working of MultiLayer Perceptron Neural Network
The input nodes represent the features of the dataset.
Each input node passes its value to the hidden layer.
In the hidden layer, each edge has a weight that is multiplied by the corresponding input value.
All the weighted values arriving at a hidden node are summed together to generate that node's output.
The activation function is applied in the hidden layer to determine which nodes are activated.
The hidden layer's output is passed to the output layer.
At the output layer, the difference between the predicted and actual output is calculated.
After computing the predicted output, the model uses backpropagation to update the weights (see the sketch below).
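To make the forward pass above concrete, here is a minimal NumPy sketch assuming a single hidden layer with sigmoid activations; the layer sizes, weights and variable names are illustrative only, not part of the source.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, W_hidden, b_hidden, W_out, b_out):
    """One forward pass through a single-hidden-layer MLP."""
    # Hidden layer: weighted sum of inputs, then activation
    hidden_out = sigmoid(W_hidden @ x + b_hidden)
    # Output layer: weighted sum of hidden activations, then activation
    y_pred = sigmoid(W_out @ hidden_out + b_out)
    return hidden_out, y_pred

# Illustrative sizes: 3 input features, 4 hidden nodes, 1 output node
rng = np.random.default_rng(0)
x = rng.random(3)
W_hidden, b_hidden = rng.random((4, 3)), np.zeros(4)
W_out, b_out = rng.random((1, 4)), np.zeros(1)
hidden_out, y_pred = forward_pass(x, W_hidden, b_hidden, W_out, b_out)
print("predicted output:", y_pred)
```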
Backpropagation:
In machine learning, backpropagation is an effective algorithm used to train
artificial neural networks, especially feed-forward networks. It is an iterative
algorithm that helps minimize the cost function by determining which weights and
biases should be adjusted. During every epoch, the model learns by adapting the
weights and biases to reduce the loss, moving down along the gradient of the error.
Backpropagation is therefore paired with an optimization algorithm such as gradient
descent or stochastic gradient descent.
Computing the gradient in the backpropagation algorithm is what allows the cost
function to be minimized; it is implemented using the chain rule from calculus to
propagate the error back through the layers of the network.
Fig. (a): A simple illustration of how backpropagation works by adjusting the weights.
Note that our target output is 0.5 but we obtained 0.67. To calculate the error, we can
use the formula below:
Error_j = y_target − y_j
Error = 0.5 − 0.67 = −0.17
Using this error value, we backpropagate.
Implementing Backward Propagation
Each weight in the network is changed by
Δw_ij = η δ_j O_i
δ_j = O_j (1 − O_j)(t_j − O_j)   (if j is an output unit)
δ_j = O_j (1 − O_j) Σ_k δ_k w_kj   (if j is a hidden unit)
where η is the learning rate, O_i is the output of the unit feeding edge i→j, O_j is the output of
unit j, t_j is the target output of unit j, and δ_j is the error term of unit j.
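Below is a small, self-contained NumPy sketch of these delta rules for one training example; the network, target and learning rate are made up, and bias updates are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, hidden_out, y_pred, W_hidden, W_out, eta=0.5):
    """One application of the delta rules above (biases omitted for brevity)."""
    # Output unit error term: delta_j = O_j (1 - O_j) (t_j - O_j)
    delta_out = y_pred * (1 - y_pred) * (t - y_pred)
    # Hidden unit error term: delta_j = O_j (1 - O_j) * sum_k delta_k * w_kj
    delta_hidden = hidden_out * (1 - hidden_out) * (W_out.T @ delta_out)
    # Weight changes: delta_w_ij = eta * delta_j * O_i (O_i feeds edge i -> j)
    W_out = W_out + eta * np.outer(delta_out, hidden_out)
    W_hidden = W_hidden + eta * np.outer(delta_hidden, x)
    return W_hidden, W_out

# Tiny illustrative setup: 3 inputs, 4 hidden units, 1 output, target 0.5
rng = np.random.default_rng(1)
x, t = rng.random(3), np.array([0.5])
W_hidden, W_out = rng.random((4, 3)), rng.random((1, 4))
hidden_out = sigmoid(W_hidden @ x)
y_pred = sigmoid(W_out @ hidden_out)
W_hidden, W_out = backprop_step(x, t, hidden_out, y_pred, W_hidden, W_out)
print("updated output:", sigmoid(W_out @ sigmoid(W_hidden @ x)))
```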
Decision Tree
Note: A decision tree can contain categorical data (YES/NO) as well as numeric data.
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.
Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.
In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. The algorithm compares the value of the root attribute with the
corresponding attribute of the record (from the real dataset) and, based on the comparison,
follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes
and moves further. It continues this process until it reaches a leaf node of the tree. The
complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step 3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes. (A code sketch of this
recursion follows below.)
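As referenced above, here is a compact, self-contained sketch of Steps 1–5 in Python. The entropy-based `best_attribute` helper stands in for the Attribute Selection Measure discussed in the next section, and the toy rows echo the job-offer example that follows; everything here is illustrative, not a production implementation.

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """ASM sketch: pick the attribute whose split gives the lowest weighted entropy."""
    def weighted_entropy(attr):
        values = {r[attr] for r in rows}
        return sum(
            len(sub) / len(rows) * entropy([r[target] for r in sub])
            for v in values
            for sub in [[r for r in rows if r[attr] == v]]
        )
    return min(attributes, key=weighted_entropy)

def build_tree(rows, attributes, target="label"):
    """Recursively grow a decision tree following Steps 1-5 above."""
    labels = [r[target] for r in rows]
    # Stop: all rows share one class, or no attributes remain -> leaf node
    if len(set(labels)) == 1 or not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    best = best_attribute(rows, attributes, target)
    node = {"attribute": best, "children": {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = build_tree(
            subset, [a for a in attributes if a != best], target)
    return node

# Hypothetical job-offer data in the spirit of the example below
rows = [
    {"salary": "high", "distance": "near", "label": "accept"},
    {"salary": "high", "distance": "far", "label": "decline"},
    {"salary": "low", "distance": "near", "label": "decline"},
    {"salary": "low", "distance": "far", "label": "decline"},
]
print(build_tree(rows, ["salary", "distance"]))
```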
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique called the
Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after the segmentation of a
dataset based on an attribute; the attribute with the highest information gain is chosen for
the split.
o It can be calculated using the formula below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where S is the total number of samples, P(yes) is the probability of yes and P(no) is the
probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the formula below:
Gini Index = 1 − Σⱼ Pⱼ²
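A minimal helper (my own sketch, not from the source) implementing this formula over the class counts at a node:

```python
def gini_index(class_counts):
    """Gini index = 1 - sum(p_j^2) over the class proportions at a node."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

# A pure node has Gini 0; a 50/50 binary node has the maximum value 0.5
print(gini_index([10, 0]))   # 0.0
print(gini_index([5, 5]))    # 0.5
```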
Pruning
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is therefore known as pruning. There are mainly two types of tree
pruning techniques used: pre-pruning and post-pruning, both discussed in detail later in this unit.
All of the impurity measures mentioned below differ in formula but align in goal.
Make sure you understand that the impurity measure is calculated for each leaf node, and its
weighted average is the corresponding impurity measure for the root node, based on which we
decide the split.
Let's take an example with Entropy and solve it to see the exact formulation.
Entropy
Entropy of a node = −Σᵢ pᵢ log₂ pᵢ
where pᵢ is the proportion of samples of class i at that node. After taking the weighted average
over the leaf nodes a feature creates, we need to check whether this feature brings the largest
reduction in impurity.
In the example, 14 companies are classified by liability status and cross-tabulated against
Credit Rating (Excellent, Good, Poor); the number 3 in the table implies that out of the total 14
companies there were 3 companies which got an 'Excellent' rating. [Contingency table and
worked entropy calculation not reproduced here.]
To decide whether Credit Rating should be the first split, entropy is calculated for each leaf
node of the Credit Rating split and then the weighted average is taken over the split. Note that
the greater the entropy, the worse the current feature is for a split at the present level.
We then calculate the information gain (higher is better, equivalently lower conditional entropy):
Information Gain = Entropy(parent) − weighted average entropy of the children
So we get 0.375 as the information gain from Credit Rating as the metric for classifying the
data over liability status. If we had, say, stock price as an independent feature, we would have
done the same thing for it as well. Then we would have compared the results, and the one with
higher information gain would have been our first decision variable for splitting.
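As a rough illustration of this calculation (entropy per leaf, weighted average over the split, then information gain), here is a small Python sketch. The per-rating class counts below are hypothetical stand-ins, since the original table is not reproduced here, so the printed result will not match the 0.375 quoted above.

```python
import math

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_per_branch):
    """IG = entropy(parent) - weighted average entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in child_counts_per_branch)
    return entropy(parent_counts) - weighted

# Hypothetical liability counts [normal, high] for 14 companies,
# split by Credit Rating = Excellent / Good / Poor
parent = [7, 7]
children = [[3, 0], [4, 3], [0, 4]]
print(information_gain(parent, children))
```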
Gini Index
Gini Index = 1 − Σᵢ pᵢ²
As with entropy, the Gini index is computed for each leaf node and weighted over the split, and
the feature offering the highest reduction in impurity is chosen as the decision variable for splitting.
Classification Error
Classification Error = 1 − maxᵢ pᵢ, computed per node and weighted over the split in the same way.
Gain Ratio
Impurity measures such as entropy and the Gini index tend to favour attributes that have a
large number of distinct values. Therefore the Gain Ratio, the information gain divided by the
split information (intrinsic value) of the attribute, is computed and used to determine the
goodness of a split. Every splitting criterion has its own significance and usage depending on
the problem at hand.
Twoing Criteria
The Gini index may encounter problems when the domain of the target attribute is relatively
wide. In this case it is possible to employ a binary criterion called the twoing criterion, defined as:
Twoing(s, t) = (P_L · P_R / 4) · [ Σᵢ | p(i|t_L) − p(i|t_R) | ]²
where p(i|t) denotes the fraction of records belonging to class i at a given node t, and P_L and
P_R are the fractions of records sent by split s to the left and right child nodes t_L and t_R.
Highlights:
1. Binary classification: these measures are primarily used for binary splits, i.e. two leaf nodes;
when a multilevel split exists, we can convert it into binary splits.
2. Impurity indices (such as the entropy behind Information Gain, and the Gini Index) are
concave functions, and we need to maximize the reduction in impurity. Graphically they lead
to the same choice of feature for splitting but follow different paths, while Classification Error
does not. In the (omitted) figure, the Entropy (or Gini) method is compared with Classification
Error: we compare the impurity before (red mark) and after (green dot) splitting and want the
vertical distance between these two points to be as large as possible; it is quite intuitive that
Classification Error, being a straight line, leaves no space between these two points.
3. Categorical vs numerical features: for a categorical feature the candidate splits simply
follow the categories, while for a numerical feature our work is to take the average of each
pair of consecutive observations (arranged in ascending order) and then check the split's
entropy reduction taking each such average as a cutoff. The one providing the maximum
reduction in impurity is chosen as the cutoff value. (A code sketch of this follows below.)
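A short sketch of point 3 above, for illustration only: candidate cutoffs are the midpoints of consecutive sorted values, and the one with the largest entropy reduction is kept. The data in the example call are made up.

```python
import math

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_numeric_cutoff(values, labels):
    """Try the midpoint of each consecutive pair of sorted values as a cutoff."""
    pairs = sorted(zip(values, labels))
    parent_h = entropy([l for _, l in pairs])
    best = (None, -1.0)
    for i in range(len(pairs) - 1):
        cutoff = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [l for v, l in pairs if v <= cutoff]
        right = [l for v, l in pairs if v > cutoff]
        if not left or not right:
            continue
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = parent_h - weighted
        if gain > best[1]:
            best = (cutoff, gain)
    return best

print(best_numeric_cutoff([25, 30, 45, 50], ["no", "no", "yes", "yes"]))  # (37.5, 1.0)
```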
ID3 Algorithm
ID3 stands for Iterative Dichotomiser 3; it is a learning algorithm for decision trees
introduced by Quinlan Ross in 1986. ID3 is an iterative algorithm in which a subset (window)
of the training set is chosen at random to build a decision tree. This tree classifies every
object within the window correctly. The tree then tries to classify all the other objects that
are not in the window, and if it gives the correct answer for all of them the algorithm
terminates. If not, the incorrectly classified objects are added to the window and the process
continues until a correct decision tree is found. This method is fast and finds the correct
decision tree in a few iterations. Consider an arbitrary collection C of objects. If C is empty or
contains only objects of a single class, then the decision tree is a simple tree with just a leaf
node labelled with that class. Otherwise, let T be a test on an object with outcomes
{O₁, O₂, O₃, …, Ow}. Each object in C will give one of these outcomes for the test T, so T
partitions C into {C₁, C₂, C₃, …, Cw}, where Cᵢ contains the objects having outcome Oᵢ. We can
visualize this as a diagram in which the test T at the root branches C into the subsets C₁, …, Cw.
When we replace each individual Cᵢ in this picture with a decision tree for Cᵢ, we get a
decision tree for all of C. This is a divide-and-conquer strategy which will eventually yield
single-object subsets that satisfy the one-class requirement for a leaf. So as long as we have a
test which gives a non-trivial partition of any set of objects, this procedure will always
produce a decision tree that correctly classifies each object in C. For simplicity, let us
consider the test to be branching on the values of an attribute. For choosing the root of a
tree, ID3 uses an information-based approach that depends on two assumptions.
Let C contain p objects of class P and n of class N. These assumptions are:
1. A correct decision tree for C will classify objects in the same proportion as they appear in
C. The probability that an arbitrary object belongs to class P is p/(p + n), and the probability
that it belongs to class N is n/(p + n).
2. A decision tree can be regarded as a source of the message 'P' or 'N'; the expected
information needed to generate this message is
I(p, n) = −(p/(p + n)) log₂(p/(p + n)) − (n/(p + n)) log₂(n/(p + n))
Let us consider an attribute A as the root with values {A₁, A₂, …, Av}. A will partition C
into {C₁, C₂, …, Cv}, where Cᵢ has those objects in C that have value Aᵢ of A. Now consider Cᵢ
having pᵢ objects of class P and nᵢ objects of class N. The expected information required for
the subtree for Cᵢ is I(pᵢ, nᵢ). The expected information required for the tree with A as root is
obtained as the weighted average
E(A) = Σᵢ ((pᵢ + nᵢ)/(p + n)) · I(pᵢ, nᵢ)
where the weight for the i-th branch is the proportion of objects in C that belong to Cᵢ. The
information that is gained by selecting A as root is then given by:
gain(A) = I(p, n) − E(A)
Here I is called the entropy. ID3 chooses the attribute to branch on for which the information
gain is maximum: it examines all the attributes, selects the A which maximizes gain(A), and
then uses the same process recursively to form decision trees for the subsets {C₁, C₂, …, Cv}
until all the instances within a branch belong to the same class.
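A tiny sketch of these quantities — I(p, n), the weighted average E(A), and gain(A) — with hypothetical class counts (pᵢ, nᵢ) per attribute value; the helper names are mine, not from the source.

```python
import math

def I(p, n):
    """Expected information I(p, n) for p objects of class P and n of class N."""
    total = p + n
    return -sum((x / total) * math.log2(x / total) for x in (p, n) if x > 0)

def gain(p, n, branches):
    """gain(A) = I(p, n) - E(A), where E(A) is the weighted average of I(p_i, n_i)."""
    E_A = sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)
    return I(p, n) - E_A

# Hypothetical: 9 objects of class P, 5 of class N; attribute A has 3 values
print(gain(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ≈ 0.247 for these counts
```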
Information gain is biased towards tests with many outcomes. Consider a feature that
uniquely identifies each instance of a training set: if we split on this feature, it would result in
many branches, each containing instances of a single class alone (in other words, pure). This
gives maximum information gain and hence causes the tree to overfit the training set.
Gain Ratio
This is a modification of information gain to deal with the problem mentioned above. It
reduces the bias towards multi-valued attributes. Consider a training dataset which contains
p and n objects of class P and N respectively, and let the attribute A have values
{A₁, A₂, …, Av}. Let the number of objects with value Aᵢ of attribute A belonging to classes P
and N be pᵢ and nᵢ respectively. Now we can define the Intrinsic Value (IV) of A as:
IV(A) = −Σᵢ ((pᵢ + nᵢ)/(p + n)) log₂((pᵢ + nᵢ)/(p + n))
IV(A) measures the information content of the value of attribute A. The Gain Ratio, or
Information Gain Ratio, is defined as the ratio between the information gain and the
intrinsic value:
Gain Ratio(A) = gain(A) / IV(A)
We try to pick an attribute for which the Gain Ratio is as large as possible. This ratio is not
defined when IV(A) = 0, and the gain ratio may tend to favour attributes for which the
intrinsic value is very small. When all the attributes are binary, the gain ratio criterion has
been found to produce smaller trees.
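A sketch of the Intrinsic Value and Gain Ratio as defined above, again with hypothetical counts; the I and gain helpers are repeated from the previous snippet so this runs on its own.

```python
import math

def I(p, n):
    """Expected information for p objects of class P and n of class N."""
    t = p + n
    return -sum((x / t) * math.log2(x / t) for x in (p, n) if x > 0)

def gain(p, n, branches):
    """gain(A) = I(p, n) - weighted average of I(p_i, n_i) over the branches."""
    return I(p, n) - sum((pi + ni) / (p + n) * I(pi, ni) for pi, ni in branches)

def intrinsic_value(p, n, branches):
    """IV(A) = -sum of (p_i + n_i)/(p + n) * log2((p_i + n_i)/(p + n))."""
    t = p + n
    return -sum(((pi + ni) / t) * math.log2((pi + ni) / t)
                for pi, ni in branches if pi + ni > 0)

def gain_ratio(p, n, branches):
    """Gain Ratio = gain(A) / IV(A); undefined when IV(A) is 0."""
    iv = intrinsic_value(p, n, branches)
    return float("nan") if iv == 0 else gain(p, n, branches) / iv

# Hypothetical counts: 9 objects of class P, 5 of N; attribute with 3 values
print(gain_ratio(9, 5, [(2, 3), (4, 0), (3, 2)]))   # ≈ 0.16
```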
C4.5 Algorithm
This is another algorithm used to create a decision tree, and it is an extension of the ID3
algorithm. Given a training dataset S = {S₁, S₂, …}, C4.5 grows the initial tree using the divide-
and-conquer approach as follows:
If all the instances in S belong to the same class, or if S is small, then the tree is a leaf
and is given the label of that class.
Otherwise, choose a test based on a single attribute which has two or more outcomes,
and make the test the root of the tree with a branch for each outcome of the test.
Partition S into corresponding subsets S₁, S₂, …, based on the outcome of each case.
Apply the procedure recursively to each of the subsets S₁, S₂, ….
Here the splitting criterion is Gain Ratio. The attributes can be either numeric or nominal,
and this determines the format of the test outcomes. If an attribute A is numeric, the test has
the form {A ≤ h, A > h}, where h is a threshold found by sorting S on the values of A and then
choosing the split between successive values that maximizes the Gain Ratio. The initial tree
is pruned to avoid overfitting by removing branches that do not help and replacing them with
leaf nodes. Unlike ID3, C4.5 handles missing values: missing values are marked separately
and are not used for calculating information gain and entropy.
CART Algorithm
CART (Classification and Regression Trees) is a decision tree technique that produces either a
classification tree, when the dependent variable is categorical, or a regression tree, when the
dependent variable is numeric.
Classification Trees :
Consider a dataset D with features X = x₁, x₂, …, xn and let y = y₁, y₂, …, ym be the set of all
possible classes. Tree-based classifiers are formed by making repetitive splits on X and on the
subsequently created subsets of X. For example, X could be divided such that {x | x₃ ≤ 53.5}
and {x | x₃ > 53.5}. Then the first set could be divided further into X₁ = {x | x₃ ≤ 53.5, x₁ ≤ 29.5}
and X₂ = {x | x₃ ≤ 53.5, x₁ > 29.5}, and the other set could be split into X₃ = {x | x₃ > 53.5, x₁ ≤ 74.5}
and X₄ = {x | x₃ > 53.5, x₁ > 74.5}. This can be applied to problems with multiple classes as well.
When we divide X into subsets, these subsets need not be divided using the same variable, i.e.
one subset could be split based on x₁ and another on x₂. Now we need to determine how best
to split X into subsets and how to split the subsets themselves. CART uses binary partitioning
recursively to create a binary tree. There are three issues which CART addresses:
Identifying the Variables to create the split and determining the rule for creating the
split.
Determine if the node of a tree is terminal node or not.
Assigning a predicted class to each terminal node.
Creating Partition :
At each step, for an attribute xᵢ which is either numerical or ordinal, a subset of X can be
divided with a plane orthogonal to the xᵢ axis, such that one of the newly created subsets has
xᵢ ≤ sᵢ and the other has xᵢ > sᵢ. When an attribute xᵢ is nominal, with labels belonging to a
finite set Dᵢ, a subset of X can be divided such that one of the newly created subsets has
xᵢ ∈ Sᵢ while the other has xᵢ ∉ Sᵢ, where Sᵢ is a proper subset of Dᵢ.
When Dᵢ contains d members, there are 2^(d−1) − 1 splits of this form to be considered. Splits
can also be done with more than one variable. Two or more continuous or ordinal variables
can be involved in a linear combination split in which a hyperplane which is not
perpendicular to one of the axes is used to split the subset of X. For example, one of the
subsets created could contain points for which 1.4x₂ − 10x₃ ≤ 10 and the other subset points for
which 1.4x₂ − 10x₃ > 10. Similarly, two or more nominal variables can be involved in a Boolean
split. For example, consider two nominal variables, gender and result (pass or fail), which are
used to create a split. In this case one subset could contain males and females who have
passed and other could contain all the males and females who have not passed.
However, by using linear combination and Boolean splits the resulting tree becomes less
interpretable, and the computing time increases since there are more candidate splits. Using
only single-variable splits makes the resulting tree invariant to transformations of the
variables, whereas with a linear combination split, transformations of the variables can
change the resulting tree. On the other hand, a linear combination split can yield a classifier
with fewer terminal nodes, at the cost of interpretability. At the time of recursive partitioning,
all the possible ways of splitting X are considered and the one that leads to maximum purity is
chosen. This can be achieved using an impurity function based on the proportions of samples
that belong to the possible classes. One such function is the Gini impurity, which measures
how often a randomly chosen element from a set would be incorrectly labelled if it were
labelled randomly according to the distribution of labels in the subset.
Let X contain items belonging to J classes and let pᵢ be the proportion of samples labelled
with class i in the set, where i ∈ {1, 2, 3, …, J}. The Gini impurity for a set of items with J
classes is then calculated as:
I_G(p) = Σᵢ pᵢ (1 − pᵢ) = 1 − Σᵢ pᵢ²
So in order to select a way to split the subset of X all the possible ways of splitting can be
considered and the one which will result in the greatest decrease in node impurity is
chosen.
To assign a class to a Terminal node a plurality rule is used : ie the class that is assigned to a
terminal node is the class that has largest number of samples in that node. If there is a node
where there is a tie in two or more classes for having largest number of samples, then if a
new datapoint x belongs to that node, then the prediction is arbitrarily selected from among
these classes.
The trickiest part of creating a decision tree is choosing the right size for the tree. If we keep
on creating nodes, the tree becomes complex and the resulting decision tree will overfit. On
the other hand, if the tree contains only a few terminal nodes, it is not using enough
information in the training sample to make predictions, which leads to underfitting. In order
to determine the right
size of the tree, we can keep an independent test sample, which is a collection of examples
that comes from the same population or same distribution as the training set but not used
for training the model. Now for this test set, misclassification rate is calculated, which is the
proportion of cases in the test set that are misclassified when predicted classes are obtained
using the tree created from the training set. Initially, as a tree is being grown, the
misclassification rate on the test set decreases as more nodes are added, but after some point
it starts to get worse as the tree becomes more complex. We could also use cross-validation
to estimate the misclassification rate. The question, then, is how to grow the best tree, or
how to create a set of candidate trees from which the best one can be selected based on the
estimated misclassification rates. One method is to grow a very large tree by splitting
subsets in the current partition of X even if a split doesn't lead to an appreciable decrease in
impurity. Now by using pruning, a finite sequence of smaller trees can be generated, where
in the pruning process the splits that were made are removed and a tree having a fewer
number of nodes is produced. Now in the sequence of trees, the first tree produced by
pruning will be a subtree of the original tree, and a second pruning step creates a subtree of
the first subtree and so on. Now for each of these trees, misclassification rate is calculated
and compared and the best performing tree in the sequence is chosen as the final classifier.
Regression Trees :
CART creates regression trees the same way it creates a tree for classification but with some
differences. For each terminal node, instead of a class, a numerical value is assigned,
computed as the sample mean or sample median of the response values for the training
samples corresponding to that node. During the tree-growing
process, the split selected at each stage is the one that leads to the greatest reduction in the
sum of absolute differences between the response values for the training samples
corresponding to a particular node and their sample median. The sum of square or absolute
differences is also used for tree pruning.
There are two techniques for pruning a decision tree: pre-pruning and post-pruning.
Post-pruning
In post-pruning, a decision tree is generated first and then non-significant branches are
removed so as to reduce the misclassification rate. This can be done either by converting the
tree to a set of rules, or by retaining the decision tree but replacing some of its subtrees with
leaf nodes. There are various methods of pruning a tree; some of them are discussed below.
Reduced-Error Pruning (REP)
Reduced-error pruning was introduced by Quinlan in 1987 and is one of the simplest pruning
strategies. In practice, however, REP is seldom used for decision tree pruning since it requires
a separate set of examples for pruning. In REP, each node is considered a candidate for
pruning. The available data is divided into three sets: one for training (train set), one for
pruning (validation set) and one for testing (test set). A subtree can be replaced by a leaf node
when the resultant tree performs no worse than the original tree on the validation set. Pruning
is done iteratively until further pruning is harmful. This method is very effective if the dataset
is large enough.
Error-Complexity Pruning
In this method, a series of trees pruned by different amounts is generated, and one of these
trees is selected by examining the number of misclassifications. While pruning, the method
takes into account both the errors and the complexity of the tree. Before the pruning process,
each leaf contains only examples which belong to one class; as pruning progresses, the leaves
include examples from different classes and each leaf is allocated the class which occurs most
frequently. The error rate is then calculated as the proportion of training examples that do
not belong to that class. When the sub-tree is
pruned, the expected error rate is that of the starting node of the sub-tree since it becomes
a leaf node after pruning. When a sub-tree is not pruned then the error rate is the average
of the error rates at the leaves, weighted by the number of examples at each leaf. Pruning
gives rise to an increase in the error rate, and dividing this increase by the number of leaves
in the sub-tree gives a measure of the increase in error per pruned leaf for that sub-tree. This
is the error-rate complexity measure. The error cost of node t is given by:
R(t) = r(t) · p(t)
where r(t) is the error rate of the node,
r(t) = (number of misclassified examples at t) / (number of examples at t),
and p(t) is the proportion of all examples that reach node t.
When a node is not pruned, the error cost of the sub-tree T_t is the sum over its leaves:
R(T_t) = Σ over the leaves i of T_t of R(i)
The complexity cost is the cost of one extra leaf in the tree, denoted α. The total cost of the
sub-tree is then:
R_α(T_t) = R(T_t) + α · |leaves(T_t)|
Equating the cost of the node when pruned with the cost of keeping the sub-tree and solving
for α gives
α = (R(t) − R(T_t)) / (|leaves(T_t)| − 1)
so α measures the increase in error per pruned leaf. The algorithm first computes α for each
sub-tree except the first, and then selects the sub-tree with the smallest value of α for
pruning.
This process is repeated till there are no sub-trees left and this will yield a series of
increasingly pruned trees. The final tree chosen is the one with the lowest misclassification
rate; for this we need an independent test data set. According to Breiman's method, the
smallest tree with a misclassification rate within one standard error of the minimum
misclassification rate is chosen as the final tree. The standard error of the misclassification
rate is given as:
SE = √( R(1 − R) / N )
where R is the misclassification rate of the pruned tree and N is the number of examples in
the test set.
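A small numeric sketch of this selection rule (standard error of the misclassification rate, then the smallest tree within one standard error of the minimum); the candidate trees and rates below are made up.

```python
import math

def standard_error(R, N):
    """SE of a misclassification rate R estimated on N test examples."""
    return math.sqrt(R * (1 - R) / N)

# Hypothetical pruned-tree sequence: (number of leaves, test misclassification rate)
candidates = [(40, 0.18), (22, 0.15), (12, 0.16), (6, 0.21)]
N = 500
best_R = min(r for _, r in candidates)
threshold = best_R + standard_error(best_R, N)
# One-SE rule: smallest tree whose error is within one SE of the minimum
chosen = min((leaves, r) for leaves, r in candidates if r <= threshold)
print(chosen)   # (12, 0.16) for these made-up numbers
```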
Minimum-Error Pruning
This method is used to find a single tree that minimizes the error rate while classifying
independent sets of data. Consider a dataset with k classes and n examples, of which the
greatest number (nₑ) belong to class e. If the tree predicts class e for all future examples,
then the expected error rate of pruning at a node, assuming that each class is equally likely
a priori, is given as:
E = (n − nₑ + k − 1) / (n + k)
where n is the number of examples at the node, nₑ is the number of examples in the majority
class e, and k is the number of classes.
Now for each node in the tree, calculate the expected error rate if that sub-tree is pruned.
Now calculate the expected error rate if the node is not pruned. Now do the process
recursively for each node and if pruning the node leads to increase in expected error rate,
then keep the sub-tree otherwise prune it. The final tree obtained will be pruned tree that
minimizes the expected error rate in classifying the independent data.
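A tiny illustration of the expected-error formula above (the standard estimate under the equal-prior assumption); the counts are made up.

```python
def expected_error_rate(n, n_e, k):
    """Expected error if a node is pruned to a leaf predicting the majority class e:
    E = (n - n_e + k - 1) / (n + k), assuming the k classes are equally likely a priori."""
    return (n - n_e + k - 1) / (n + k)

# 20 examples at the node, 14 in the majority class, 2 classes
print(expected_error_rate(20, 14, 2))   # ≈ 0.318
```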
Pre-pruning
This is a method that is used to control the development of a decision tree by removing the
non-significant nodes. This is a top-down approach. Pre-pruning is not exactly a "pruning"
technique, since it does not prune the branches of an existing tree; it only suppresses the
growth of the tree if the addition of branches does not improve the performance of the
overall tree.
Chi-square pruning
Chi-square pruning is based on a contingency table whose rows and columns correspond to
the values of the nominal attribute and to the classes. Let pᵢ be the probability that an
observation falls in row i and qⱼ the probability that it falls in column j.
Under the null hypothesis these probabilities are independent, so the product of these two
probabilities is the probability that an observation falls into cell (i, j). Now consider an
attribute A; under the null hypothesis, A is independent of the class labels. Using the
chi-squared test statistic
χ² = Σᵢⱼ (Oᵢⱼ − Eᵢⱼ)² / Eᵢⱼ
where Oᵢⱼ and Eᵢⱼ are the observed and expected counts in cell (i, j), we can determine the
confidence with which we can reject the null hypothesis, i.e. retain a subtree instead of
pruning it. If the χ² value is greater than a threshold t, then the information gain due to the
split is significant, so we keep the sub-tree; if the χ² value is less than the threshold t, then the
information gained by the split is less significant and we can prune the sub-tree.
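As a rough sketch of this test in practice (not from the source), scipy's chi2_contingency can compute the statistic for a hypothetical attribute-value-by-class table; comparing the p-value with a significance level is equivalent to comparing χ² with the corresponding critical threshold.

```python
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows = values of attribute A, columns = classes
observed = [[18, 2],
            [9, 11],
            [4, 16]]

chi2, p_value, dof, expected = chi2_contingency(observed)

# Pre-pruning decision: keep the split only if the association is significant
alpha = 0.05
if p_value < alpha:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: split is significant, keep the sub-tree")
else:
    print(f"chi2={chi2:.2f}, p={p_value:.4f}: split not significant, prune the sub-tree")
```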
**Strengths:**
1. **Easy to understand**: Decision trees are simple to comprehend, even for non-technical
stakeholders. They provide a visual representation of the decision-making process, making it
easy to communicate complex decisions.
2. **Flexible**: Decision trees can handle both categorical and continuous data, and can be
used for both classification and regression problems.
3. **Handling missing values**: Decision trees can handle missing values, for example by
treating "missing" as a separate branch or by using imputation techniques.
4. **Scalability**: Decision trees can be used for large datasets and can handle high-
dimensional data.
5. **Robust to outliers**: Decision trees are fairly robust to outliers, as they split on local
thresholds and don't rely on global distributional assumptions.
**Weaknesses:**
1. **Unpruned trees can be complex**: If not pruned, decision trees can become too
complex, leading to overfitting and poor performance.
2. **Difficulty in handling correlated features**: Decision trees can struggle with correlated
features, as they may split on one feature and then fail to consider the other correlated
features.
3. **Not ideal for all types of data**: A single decision tree produces piecewise-constant
predictions, so it can struggle with smooth continuous outcomes or with very large numbers
of classes.
4. **May not handle categorical data with many categories well**: Decision trees may not
perform well when there are many categories in a categorical variable, as they may not be
able to split effectively on that variable.
It's worth noting that these weaknesses can be addressed by various techniques, such as
pruning, limiting tree depth or the minimum number of samples per leaf, and ensemble
methods (for example, random forests and gradient boosting).
Overall, decision trees are a powerful and widely used machine learning algorithm, but they
require careful consideration of their strengths and weaknesses in order to achieve good
performance on a given problem.