ML - Module 3
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. In the lazy learner case, classification is done on the
basis of the most related data stored in the training dataset. It takes less time
in training but more time for predictions. Example: K-NN algorithm, Case-
based reasoning
2. Eager Learners: Eager learners develop a classification model from the
training dataset before receiving the test dataset. In contrast to lazy learners,
eager learners take more time in learning and less time in
prediction. Example: Decision Trees, Naïve Bayes, ANN.
Based on the decision boundary, classifiers can also be divided into:
1. Linear classifiers
2. Non-Linear classifiers
Linear Classifiers: Linear classifiers classify data into labels based on a linear
combination of input features. These classifiers therefore separate data using a line,
a plane, or a hyperplane (a plane in more than 2 dimensions). They can only be used
to classify data that is linearly separable, although they can be modified to classify
non-linearly separable data (a minimal sketch is given after the list of linear classifier
algorithms below).
In the figure above, we have two classes, namely 'O' and '+'. To differentiate
between the two classes, an arbitrary line is drawn, ensuring that both
classes are on distinct sides.
Since we can tell one class apart from the other, these classes are called
‘linearly-separable.’
However, an infinite number of lines can be drawn to distinguish the two
classes.
The exact location of this plane/hyperplane depends on the type of the linear
classifier.
Linear classifier algorithms
o Logistic Regression
o Linear Discriminant Classifier
o Perceptron
o Support Vector Machine (linear kernel)
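A minimal sketch, assuming scikit-learn is available (the toy dataset and model choices are illustrative, not from the notes): two of the linear classifiers listed above are fit on a linearly separable dataset, and each learns a separating line of the form w1*x1 + w2*x2 + b = 0.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression, Perceptron

# Two well-separated blobs -> linearly separable classes, like 'O' and '+'
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)

for model in (LogisticRegression(), Perceptron()):
    model.fit(X, y)
    print(type(model).__name__, "training accuracy:", model.score(X, y))
    # Each model learns a line w1*x1 + w2*x2 + b = 0 separating the classes
    print("  weights:", model.coef_, "intercept:", model.intercept_)
```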
Non-Linear Classifiers: Non-Linear Classification refers to categorizing those
instances that are not linearly separable.
In the figure above, we have two classes, namely 'O' and 'X'. To differentiate
between the two classes, it is impossible to draw an arbitrary straight line that
ensures both classes are on distinct sides.
We notice that even if we draw a straight line, there would be points of the
first class present between the data points of the second class.
In such cases, piece-wise linear or non-linear classification boundaries are
required to distinguish the two classes.
Non-linear classifier algorithms
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
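A minimal sketch, assuming scikit-learn (dataset and parameters are illustrative): one of the non-linear classifiers above (K-Nearest Neighbours) is compared against a linear classifier on interleaving "moons", data that no single straight line can separate.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)                # straight-line boundary
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)    # non-linear boundary

print("Linear classifier accuracy :", linear.score(X_te, y_te))
print("KNN (non-linear) accuracy  :", knn.score(X_te, y_te))
```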
Evaluating a Classification model
For evaluating a Classification model, we have the following ways:
1. Log Loss or Cross-Entropy Loss:
o For a good binary classifier model, the value of log loss should be near 0.
o The value of log loss increases if the predicted value deviates from the actual
value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, it can be calculated as:
Log Loss = -(1/N) Σ [ yᵢ log(p(yᵢ)) + (1 - yᵢ) log(1 - p(yᵢ)) ]
Here yᵢ represents the actual class and p(yᵢ) is the predicted probability of that class.
2. AUC-ROC curve:
o ROC curve stands for Receiver Operating Characteristics Curve and AUC
stands for Area Under the Curve.
o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on
the Y-axis and FPR (False Positive Rate) is on the X-axis. (A minimal sketch computing
both log loss and AUC is given below.)
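A minimal sketch, assuming scikit-learn (the labels and predicted probabilities are illustrative values): log loss is computed by hand from the formula above and checked against the library, and the ROC curve points and AUC are computed from the same predictions.

```python
import numpy as np
from sklearn.metrics import log_loss, roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                   # actual classes y_i
p_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3])  # predicted P(y = 1)

# Log loss: -(1/N) * sum[ y*log(p) + (1-y)*log(1-p) ] -- near 0 for a good model
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
print("manual log loss :", manual)
print("sklearn log_loss:", log_loss(y_true, p_pred))

# ROC curve: TPR vs FPR at every threshold; AUC closer to 1 is better
fpr, tpr, thresholds = roc_curve(y_true, p_pred)
print("FPR:", fpr)
print("TPR:", tpr)
print("AUC:", roc_auc_score(y_true, p_pred))
```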
Use cases of Classification Algorithms
o Email Spam Detection
o Speech Recognition
o Drugs Classification
Difference between Regression and Classification

| Regression Algorithm | Classification Algorithm |
|---|---|
| In Regression, the output variable must be of continuous nature or real value. | In Classification, the output variable must be a discrete value. |
| The task of the regression algorithm is to map the input value (x) with the continuous output variable (y). | The task of the classification algorithm is to map the input value (x) with the discrete output variable (y). |
| Regression algorithms are used with continuous data. | Classification algorithms are used with discrete data. |
| In Regression, we try to find the best fit line, which can predict the output more accurately. | In Classification, we try to find the decision boundary, which can divide the dataset into different classes. |
| Regression algorithms can be used to solve regression problems such as weather prediction, house price prediction, etc. | Classification algorithms can be used to solve classification problems such as speech recognition, identification of cancer cells, etc. |
| The regression algorithm can be further divided into Linear and Non-linear Regression. | The classification algorithms can be divided into Binary Classifiers and Multi-class Classifiers. |
Logistic Regression
Logistic Regression: Logistic regression is a Supervised Learning technique. It is used
for predicting the categorical dependent variable using a given set of independent
variables.
Logistic Regression is very similar to Linear Regression except in how they are
used: Linear Regression is used for solving regression problems, whereas
Logistic Regression is used for solving classification problems.
The curve from the logistic function indicates the likelihood of something such
as whether the cells are cancerous or not, a mouse is obese or not based on
its weight, etc.
o Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
Consider, for example, an organization that wants to decide salary increases based on
employee performance. For this purpose, a linear regression algorithm will help them
decide. Plotting a regression line by considering the employee's performance as the
independent variable, and the salary increase as the dependent variable, will make
their task easier.
Now, what if the organization wants to know whether an employee would get a
promotion or not based on their performance? The above linear graph won’t be
suitable in this case. As such, we clip the line at zero and one, and convert it into a
sigmoid curve (S curve).
Based on the threshold values, the organization can decide whether an employee
will get a salary increase or not.
The odds are defined as the ratio of the probability of success to the probability of failure:
θ = p / (1 - p)
The values of the odds range from 0 to ∞, while the values of probability lie between
0 and 1.
Starting from the linear regression equation:
y = β0 + β1 * x
Here, β0 is the y-intercept and β1 is the slope.
Writing the odds in terms of this linear combination, let Y = e^(β0 + β1 * x), so that
p(x) / (1 - p(x)) = Y. Then:
p(x) = Y - Y · p(x)
p(x) + Y · p(x) = Y
p(x)(1 + Y) = Y
p(x) = Y / (1 + Y)
The equation of the sigmoid function is:
f(x) = 1 / (1 + e^(-x))
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go
beyond this limit, so it forms a curve like the "S" form. The S-form curve is
called the sigmoid function or the logistic function.
o In Logistic Regression, y can be between 0 and 1 only, so let's divide the
above equation by (1 - y):
y / (1 - y); this is 0 for y = 0 and infinity for y = 1.
o But we need a range between -∞ and +∞, so taking the logarithm of the
equation gives:
log[ y / (1 - y) ] = β0 + β1 * x
For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome.
It does not require scaling of the input features and needs little hyperparameter
tuning.
It gives a measure of how relevant a predictor is (coefficient size) and its
direction of association (positive or negative).
Companies can predict whether they will gain or lose money in the next
quarter, year, or month based on their current performance.
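A minimal sketch, assuming scikit-learn (the performance scores and promotion labels are hypothetical, made up to mirror the promotion example above): logistic regression outputs a probability on the sigmoid curve, and a 0.5 threshold turns it into a yes/no decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

performance = np.array([[2.0], [3.5], [4.0], [5.5], [6.0], [7.5], [8.0], [9.0]])
promoted    = np.array([ 0,     0,     0,     0,     1,     1,     1,     1  ])

model = LogisticRegression().fit(performance, promoted)

# Probability on the sigmoid curve, then the thresholded class decision
print(model.predict_proba([[6.5]]))   # [P(no promotion), P(promotion)]
print(model.predict([[6.5]]))         # class decided by the 0.5 threshold
```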
Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is one of the
commonly used dimensionality reduction techniques for supervised classification
problems in machine learning, particularly for problems with more than two classes.
It is also known as Normal Discriminant Analysis (NDA) or
Discriminant Function Analysis (DFA).
This can be used to project the features of higher dimensional space into
lower-dimensional space in order to reduce resources and dimensional costs.
It is also considered a pre-processing step for modelling differences in ML and
applications of pattern classification.
Whenever there is a requirement to separate two or more classes having
multiple features efficiently, the Linear Discriminant Analysis model is
considered the most common technique to solve such classification problems.
For example, if we have two classes with multiple features and need to
separate them efficiently, classifying them using a single feature may
show overlapping.
Let's consider an example where we have two classes in a 2-D plane having an X-Y
axis, and we need to classify them efficiently. Here, LDA uses an X-Y axis to create a
new axis by separating them using a straight line and projecting data onto a new axis.
Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:
o Maximize the distance between the means of the two classes.
o Minimize the variation (scatter) within each class.
Using the above two conditions, LDA generates a new axis in such a way that it
maximizes the distance between the means of the two classes and minimizes the
variation within each class.
In other words, we can say that the new axis will increase the separation between
the data points of the two classes and plot them onto the new axis.
But Linear Discriminant Analysis fails when the means of the distributions are
shared, as it becomes impossible for LDA to find a new axis that makes both
classes linearly separable. In such cases, we use non-linear discriminant analysis.
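A minimal sketch, assuming scikit-learn (the Iris dataset is used only as an illustration): LDA acts both as a classifier and as a supervised dimensionality reduction step, projecting 4-dimensional features onto at most 2 discriminant axes for 3 classes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_new = lda.fit_transform(X, y)       # project 4-D features onto 2 discriminant axes

print("reduced shape:", X_new.shape)  # (150, 2)
print("training accuracy:", lda.score(X, y))
```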
Shortcomings of LDA
o Linear decision boundaries may not effectively separate non-linearly
separable classes. More flexible boundaries are desired.
o In cases where the number of features exceeds the number of observations,
LDA might not perform as desired. This is called the Small Sample Size (SSS)
problem. Regularization is required.
Difference between LDA and PCA
o PCA is an unsupervised algorithm that does not care about classes and labels
and only aims to find the principal components to maximize the variance in
the given dataset. At the same time, LDA is a supervised algorithm that aims
to find the linear discriminants to represent the axes that maximize separation
between different classes of data.
o LDA is much more suitable for multi-class classification tasks than PCA.
However, PCA tends to perform better when the number of samples per
class is comparatively small.
o Both LDA and PCA are used as dimensionality reduction techniques;
sometimes PCA is applied first, followed by LDA.
Support Vector Machine (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
new data points in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed Support Vector Machine.
The distance between the vectors and the hyperplane is called as margin. And
the goal of SVM is to maximize this margin. The hyperplane with maximum
margin is called the optimal hyperplane.
In SVM, we take the output of the linear function and if that output is greater
than 1, we identify it with one class, and if the output is less than -1, we identify
it with another class. Since the threshold values are changed to 1 and -1 in SVM, we
obtain a reinforcement range of values ([-1, 1]) which acts as the margin.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is called
a Linear SVM classifier.
o Non-linear SVM (Kernel SVM): Non-linear SVM is used for non-linearly
separable data, which means that if a dataset cannot be classified by using a
straight line, then such data is termed non-linear data, and the classifier used is
called a Non-linear SVM classifier. Kernel SVM has more flexibility for non-
linear data because you can add more features to fit a hyperplane instead of a
two-dimensional space. (A minimal sketch comparing the two is given below.)
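A minimal sketch, assuming scikit-learn (the "moons" dataset and kernel choices are illustrative): the same SVC class behaves as a Linear SVM or a Kernel SVM depending on the kernel argument.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_tr, y_tr)   # straight-line boundary
kernel_svm = SVC(kernel="rbf").fit(X_tr, y_tr)      # flexible non-linear boundary

print("Linear SVM accuracy:", linear_svm.score(X_te, y_te))
print("Kernel SVM accuracy:", kernel_svm.score(X_te, y_te))
print("Support vectors per class:", kernel_svm.n_support_)
```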
Support Vectors: The data points or vectors that are closest to the hyperplane
and which affect the position of the hyperplane are termed support vectors. Since
these vectors support the hyperplane, they are called support vectors.
Outliers in SVM
Here we have one blue ball within the boundary of the red balls. So how does SVM
classify the data? It's simple: the blue ball within the boundary of the red ones is an
outlier of the blue balls. The SVM algorithm has the characteristic of ignoring the
outlier and finding the best hyperplane that maximizes the margin. SVM is robust to
outliers.
SVM Kernel
SVM Kernel: The SVM kernel is a function that takes a low-dimensional input space
and transforms it into a higher-dimensional space, i.e. it converts a non-separable
problem into a separable problem. It is mostly useful in non-linear separation
problems. Simply put, the kernel performs some complex data transformations and
then finds the boundary that separates the data based on the labels or outputs defined.
Advantages of SVM
Effective in high dimensional cases.
It works really well with a clear margin of separation.
It is memory efficient, as it uses a subset of training points in the decision
function called support vectors.
Effective on datasets with multiple features, like financial or medical data.
Effective in cases where no. of features is greater than the no. of data points.
Different kernel functions can be specified for the decision function. You can
use common kernels, but it's also possible to specify custom kernels.
Disadvantages of SVM
If the number of features is a lot bigger than the number of data points,
avoiding over-fitting when choosing kernel functions and regularization term
is crucial.
It also doesn't perform very well when the dataset has more noise, i.e. when
the target classes are overlapping.
SVMs don't directly provide probability estimates. Those are calculated using
an expensive five-fold cross-validation.
Works best on small sample sets because of its high training time.
Perceptron
Perceptron: Perceptron is also understood as an Artificial Neuron or neural network
unit that helps to detect certain input data computations in business intelligence.
The perceptron model is also treated as one of the best and simplest types of
artificial neural networks.
Input Nodes or Input Layer: This is the primary component of Perceptron which
accepts the initial data into the system for further processing. Each input node
contains a real numerical value.
Weight and Bias: The weight parameter represents the strength of the connection
between units. This is another important parameter of the Perceptron
components. Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, bias can be considered as the intercept in
a linear equation.
Activation Function: These are the final and important components that help to
determine whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.
This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the
weight of input is indicative of the strength of a node. Similarly, an input's bias value
gives the ability to shift the activation function curve up or down.
Step-1: In the first step, all input values are multiplied by their corresponding weight
values and added together to determine the weighted sum. A special term called
bias 'b' is added to this weighted sum to improve the model's performance:
∑(wi * xi) + b
Step-2: In the second step, an activation function is applied to the above-
mentioned weighted sum, which gives us an output either in binary form or as a
continuous value, as follows:
Y = f(∑wi*xi + b)
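A minimal sketch of the two steps above (the input values, weights, and bias are illustrative numbers, not from the notes): the weighted sum plus bias is computed first, then a step activation function decides whether the neuron fires.

```python
import numpy as np

def step(z):
    # fires (1) only when the weighted sum plus bias is positive
    return 1 if z > 0 else 0

def perceptron_output(x, w, b):
    z = np.dot(w, x) + b          # Step 1: sum(w_i * x_i) + b
    return step(z)                # Step 2: Y = f(sum(w_i * x_i) + b)

x = np.array([1.0, 0.0, 1.0])     # input values
w = np.array([0.5, -0.6, 0.2])    # weights
b = -0.1                          # bias

print(perceptron_output(x, w, b)) # -> 1 or 0
```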
In a single-layer perceptron model, the algorithm does not contain any recorded data,
so it begins with randomly allocated values for the weight parameters. It then sums up
all the weighted inputs. If the total sum of all inputs is more than a pre-determined
value (the threshold), the model gets activated and shows the output value as +1.
o Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified
as per the model's requirement. In this stage, the error between the actual
output and the desired output is propagated backward, starting at the output
layer and ending at the input layer.
A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. Further, it can also implement logic gates such as AND, OR,
XOR, NAND, NOT, XNOR, NOR.
Back Propagation: Back Propagation is the process of updating and finding the
optimal values of weights or coefficients which helps the model to minimize the
error i.e. difference between the actual and predicted values.
f(x) = 1 if w·x + b > 0
f(x) = 0 otherwise
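A minimal sketch of the perceptron learning rule using the step function above (the AND-gate data, learning rate, and epoch count are illustrative choices): weights and bias are nudged whenever the prediction disagrees with the target.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                         # AND gate targets

w, b, lr = np.zeros(2), 0.0, 0.1
for epoch in range(20):
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b > 0 else 0   # f(x) from the rule above
        w += lr * (target - pred) * xi             # update weights on error
        b += lr * (target - pred)                  # update bias on error

print("weights:", w, "bias:", b)
print("predictions:", [1 if np.dot(w, xi) + b > 0 else 0 for xi in X])
```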
Characteristics of Perceptron
The perceptron model has the following characteristics.
Artificial Neural Network: ANNs are composed of multiple nodes, which imitate the
biological neurons of the human brain. The neurons are connected by links in the
various layers of the network, and they interact with each other. The nodes can take
input data and perform simple operations on the data. The result of these operations
is passed to other neurons. The output at each node is called its activation or node
value. Each link is associated with a weight. ANNs are capable of learning, which takes
place by altering the weight values.
The typical Artificial Neural Network looks something like the given figure.
There are around 100 billion neurons in the human brain. Each neuron has
an association point somewhere in the range of 1,000 to 100,000. In the
human brain, data is stored in a distributed manner, and we can extract more
than one piece of this data in parallel from our memory when necessary. We
can say that the human brain is made up of incredibly amazing parallel
processors.
The relationship between a biological neuron and an ANN:

| Biological Neuron | Artificial Neural Network |
|---|---|
| Dendrites | Inputs |
| Synapse | Weights |
| Axon | Output |
Input Layer: The input layer accepts the inputs, in several different formats, provided
by the programmer.
Hidden Layer: The hidden layer is present in between the input and output layers. It
performs all the calculations to find hidden features and patterns.
Output Layer: The input goes through a series of transformations using the hidden
layer, which finally results in output that is conveyed using this layer.
The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.
1. Feed-Forward ANN
2. FeedBack ANN
1. Feed-Forward ANN: In this ANN, the information flow is unidirectional. A unit
sends information to other units from which it does not receive any information.
There are no feedback loops. They are used in pattern generation, recognition, or
classification. They have fixed inputs and outputs. (A minimal forward-pass sketch is
given after these descriptions.)
2. FeedBack ANN: Here, feedback loops are allowed. In this type of ANN, the output
returns into the network to achieve the best-evolved results internally. Feedback
networks feed information back into themselves and are well suited to solving
optimization problems. Feedback ANNs are used for internal system error corrections.
They are used in content-addressable memories.
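A minimal sketch of the unidirectional (feed-forward) flow described above (layer sizes, random weights, and the sigmoid choice are illustrative): information moves from the input layer through a hidden layer to the output layer, with no feedback loops.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.2, 0.8])             # input layer (3 features)

W1 = np.random.randn(4, 3) * 0.1          # weights: input -> hidden (4 hidden nodes)
b1 = np.zeros(4)
W2 = np.random.randn(2, 4) * 0.1          # weights: hidden -> output (2 output nodes)
b2 = np.zeros(2)

hidden = sigmoid(W1 @ x + b1)             # weighted sum + bias, then activation
output = sigmoid(W2 @ hidden + b2)        # information only flows forward
print("network output:", output)
```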
Radial Basis Function Network (RBFN)
The input vector is the n-dimensional vector that you are trying to classify. The entire
input vector is shown to each of the RBF neurons.
Each RBF neuron stores a “prototype” vector which is just one of the vectors from
the training set. Each RBF neuron compares the input vector to its prototype, and
outputs a value between 0 and 1 which is a measure of similarity.
The shape of the RBF neuron’s response is a bell curve, as illustrated in the network
architecture diagram.
The prototype vector is also often called the neuron’s “center”, since it’s the value at
the center of the bell curve.
The output of the network consists of a set of nodes, one per category that we are
trying to classify. Each output node computes a sort of score for the associated
category. Typically, a classification decision is made by assigning the input to the
category with the highest score.
The score is computed by taking a weighted sum of the activation values from every
RBF neuron.
There are different possible choices of similarity functions, but the most popular is
based on the Gaussian. The equation for a Gaussian with a one-dimensional
input is:
g(x) = (1 / (σ√(2π))) · e^(-(x - μ)² / (2σ²))
The hidden layer contains Gaussian transfer functions whose outputs are inversely
proportional to the distance of the input from the neuron's center.
Where x is the input, mu is the mean, and sigma is the standard deviation. This
produces the familiar bell curve shown below, which is centered at the mean, mu (in
the below plot the mean is 5 and sigma is 1).
The RBF neuron activation function is slightly different, and is typically written as:
φ(x) = e^(-β ‖x - μ‖²)
Here the RBFN is viewed as a “3-layer network” where the input vector is the first
layer, the second “hidden” layer is the RBF neurons, and the third layer is the output
layer containing linear combination neurons.
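A minimal sketch of an RBF neuron (the input, prototype centers, β, and output weights are illustrative values): each neuron outputs a Gaussian similarity to its prototype, and the output layer takes a weighted sum of those activations to score a category.

```python
import numpy as np

def rbf_activation(x, center, beta=1.0):
    # phi(x) = exp(-beta * ||x - center||^2): 1 at the center, -> 0 far away
    return np.exp(-beta * np.sum((x - center) ** 2))

x = np.array([1.0, 2.0])
centers = [np.array([1.0, 2.0]), np.array([4.0, 0.0]), np.array([-1.0, 3.0])]
weights = np.array([0.7, -0.2, 0.5])      # output-layer weights for one category

activations = np.array([rbf_activation(x, c) for c in centers])
score = weights @ activations             # category score = weighted sum of activations
print("activations:", activations)
print("category score:", score)
```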
Recurrent Neural Network
Recurrent Neural Network (RNN): A Recurrent Neural Network (RNN) is a type
of neural network where the output from the previous step is fed as input to the
current step.
Commonly used activation functions in RNNs are:
Sigmoid Function
Tanh Function
ReLU Function
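A minimal sketch of one recurrent step (layer sizes, random weights, and the random sequence are illustrative): the previous hidden state is fed back in along with the current input, and tanh is used as the activation function, i.e. h_t = tanh(W_x·x_t + W_h·h_(t-1) + b).

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

hidden_size, input_size = 4, 3
W_x = np.random.randn(hidden_size, input_size) * 0.1
W_h = np.random.randn(hidden_size, hidden_size) * 0.1
b = np.zeros(hidden_size)

h = np.zeros(hidden_size)                      # initial hidden state
sequence = [np.random.randn(input_size) for _ in range(5)]
for x_t in sequence:
    h = rnn_step(x_t, h, W_x, W_h, b)          # output of previous step feeds the next
print("final hidden state:", h)
```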
Types of RNNs
There are different types of RNNs with varying architectures. Some examples are:
One To One: Here there is a single (xt,yt) pair. Traditional neural networks employ a
one to one architecture.
One To Many: In one to many networks, a single input at xt can produce multiple
outputs, e.g., (yt0,yt1,yt2). Music generation is an example area, where one to many
networks are employed.
Many To One: In this case many inputs from different time steps produce a single
output. For example, (xt,xt+1,xt+2) can produce a single output yt. Such networks are
employed in sentiment analysis or emotion detection, where the class label depends
upon a sequence of words.
Many To Many: There are many possibilities for many to many. An example is shown
above, where two inputs produce three outputs. Many to many networks are applied
in machine translation, e.g., English to French or vice versa translation systems.
Advantages of Recurrent Neural Network
1. An RNN remembers information across time steps, which makes it useful for
time-series prediction.
2. Recurrent neural networks are even used with convolutional layers to extend
the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.
Gated Recurrent Units (GRU): These networks are designed to handle the vanishing
gradient problem. They have a reset and update gate. These gates determine which
information is to be retained for future predictions.
Long Short Term Memory (LSTM): LSTMs were also designed to address the
vanishing gradient problem in RNNs. LSTMs use three gates, called the input, output,
and forget gates. Similar to GRUs, these gates determine which information to retain.
Convolutional Neural Network
Convolutional Neural Network: A Convolutional Neural Network (CNN) is one of the
techniques used for image classification and image recognition in neural networks. It
is designed to process data through multiple layers of arrays. The primary difference
between a CNN and other neural networks is that a CNN takes its input as a two-
dimensional array and operates directly on the images rather than relying on the
separate feature extraction that other neural networks perform.
A CNN takes an image as input, which is processed and classified under a certain
category such as dog, cat, lion, tiger, etc. The computer sees an image as an array of
pixels whose size depends on the resolution of the image. Based on the image
resolution, it will see h * w * d, where h = height, w = width, and d = depth (number of
channels). For example, an RGB image is a 6 * 6 * 3 array of the matrix, and a
grayscale image is a 4 * 4 * 1 array of the matrix.
In a CNN, each input image passes through a sequence of convolution layers with
filters (also known as kernels), along with pooling and fully connected layers. After
that, we apply the softmax function to classify the object with probabilistic values
between 0 and 1.
How Does a Computer read an image?
The image is broken into 3 color channels, which are Red, Green, and Blue. Each of
these color channels is mapped to the image's pixels.
Some neurons fire when exposed to vertical edges and some when shown
horizontal or diagonal edges. CNN utilizes the spatial correlations that exist within the
input data. Each concurrent layer of the neural network connects some of the input
neurons. This region is called a local receptive field. The local receptive field focuses
on the hidden neurons.
The hidden neuron processes the input data inside the mentioned field, not realizing
the changes outside the specific boundary.
A CNN typically contains the following layers:
o Convolutional layer
o ReLU layer
o Pooling layer
o Fully connected layer
1. Convolution Layer
The convolution layer is the first layer used to extract features from an input image. By
learning image features using small squares of input data, the convolution layer
preserves the relationship between pixels. It is a mathematical operation that takes
two inputs, such as an image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1.
Let's start by considering a 5*5 image whose pixel values are 0 and 1, and a 3*3 filter
matrix.
The result of convolving the 5*5 image matrix with the 3*3 filter matrix is called the
"Feature Map" and is shown as the output.
Convolution of an image with different filters can perform operations such as blur,
sharpen, and edge detection. (A minimal sketch of the convolution operation is given
below.)
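A minimal sketch of the convolution described above (the specific 0/1 pixel values and filter values are illustrative): sliding a 3*3 filter over a 5*5 image produces a 3*3 feature map, matching the (h - fh + 1) * (w - fw + 1) output size given earlier.

```python
import numpy as np

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])

kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

fh, fw = kernel.shape
out_h, out_w = image.shape[0] - fh + 1, image.shape[1] - fw + 1
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        # element-wise multiply the patch with the filter and sum the result
        feature_map[i, j] = np.sum(image[i:i+fh, j:j+fw] * kernel)

print(feature_map)   # the 3x3 Feature Map
```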
Strides
Stride is the number of pixels by which the filter is shifted over the input matrix. When
the stride is 1, we move the filter 1 pixel at a time; similarly, when the stride is 2, we
move the filter 2 pixels at a time. The following figure shows how the convolution
would work with a stride of 2.
Padding
Padding plays a crucial role in building a convolutional neural network. Without
padding, the image shrinks after every convolution, and if we take a neural network
with hundreds of layers, it will give us a very small image after filtering at the end.
If we take a three by three filter on top of a grayscale image and do the convolving
then what will happen?
It is clear from the above picture that the pixel in the corner gets covered only once,
while the middle pixels get covered many times. This means the middle pixels
contribute much more information than the corner pixels, so there are two downsides:
o Shrinking outputs
o Losing information on the corner of the image.
2. ReLU Layer
In this layer, we remove every negative value from the filtered images and replace
it with zero.
3. Pooling Layer
The pooling layer plays a vital role in pre-processing an image. The pooling layer
reduces the number of parameters when the images are too large. Pooling
is "downscaling" of the image obtained from the previous layers; it can be compared
to shrinking an image to reduce its density. Spatial pooling is also called
downsampling or subsampling; it reduces the dimensionality of each feature map
while retaining the essential information. The following are the types of spatial pooling.
Max Pooling: Max pooling takes the largest element from each rectangular pooling
region of the feature map.
Average Pooling: Average pooling performs down-scaling by dividing the input
into rectangular pooling regions and computing the average value of each region.
Sum Pooling: The sub-regions for sum pooling and mean pooling are set the same as
for max pooling, but instead of using the max function we use the sum or the mean.
(A minimal pooling sketch is given below.)
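A minimal sketch of the pooling operations above (the 4x4 feature map values are illustrative): a 2x2 pooling region with stride 2 halves the height and width, keeping either the strongest or the average response in each region.

```python
import numpy as np

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 1],
                        [3, 4, 2, 8]])

def pool(fm, size=2, mode="max"):
    h, w = fm.shape[0] // size, fm.shape[1] // size
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            region = fm[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = region.max() if mode == "max" else region.mean()
    return out

print("max pooling:\n", pool(feature_map, mode="max"))
print("average pooling:\n", pool(feature_map, mode="avg"))
```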
Flattening
Flattening is nothing but converting a 3D or 2D matrix into a 1D input for the model.
This is the last step in processing the image: it connects the inputs to a fully
connected dense layer for further classification.
Activation Functions
1). Linear Function:
Equation: A linear function has an equation similar to that of a straight line, i.e.
y = x
No matter how many layers we have, if all are linear in nature, the final
activation function of last layer is nothing but just a linear function of the
input of first layer.
Range: -inf to +inf
Uses: Linear activation function is used at just one place i.e. output layer.
Issues: If we differentiate the linear function, the result no longer depends on
the input "x" and becomes a constant, so it won't introduce any useful
non-linear behaviour into our algorithm.
3). Tanh Function:
The activation that almost always works better than the sigmoid function is the Tanh
function, also known as the Tangent Hyperbolic function. It is actually a
mathematically shifted version of the sigmoid function. Both are similar and
can be derived from each other.
Equation: f(x) = tanh(x) = 2 / (1 + e^(-2x)) - 1, or tanh(x) = 2 * sigmoid(2x) - 1
Value Range: -1 to +1
Nature: non-linear
Uses: Usually used in the hidden layers of a neural network, as its values lie
between -1 and 1; the mean of the hidden layer outputs therefore comes out to be
0 or very close to it, which helps in centring the data by bringing the mean close to 0.
This makes learning for the next layer much easier.
The tanh functions have been used mostly in RNN for natural language
processing and speech recognition tasks.
4). ReLu (Rectified Linear Unit) Function:
The ReLU is the most used activation function in the world right now, since it
is used in almost all convolutional neural networks and deep learning models.
Equation: A(x) = max(0,x). It gives an output x if x is positive and 0 otherwise.
Value Range: [0, inf)
Nature: non-linear, which means we can easily backpropagate the errors and
have multiple layers of neurons being activated by the ReLU function.
Uses: ReLU is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. At any time only a few neurons are
activated, making the network sparse and therefore efficient and easy to compute.
In simple words, ReLU learns much faster than the sigmoid and Tanh functions.
The basic rule of thumb is: if you really don't know which activation function to
use, simply use ReLU, as it is a general-purpose activation function for hidden layers
and is used in most cases these days.
It easily overfits compared to the sigmoid function, which is one of its main
limitations. Some techniques like dropout are used to reduce the overfitting.
5). Softmax Function:
The softmax function is also a type of sigmoid function, but it is handy when we
are trying to handle multi-class classification problems.
Nature: non-linear
Uses: Usually used when trying to handle multiple classes; the softmax
function is commonly found in the output layer of image classification
problems. The softmax function squeezes the output for each class
between 0 and 1 and divides each by the sum of the outputs.
Output: The softmax function is ideally used in the output layer of the
classifier where we are actually trying to attain the probabilities to define the
class of each input.
If your output is for multi-class classification, then Softmax is very useful for
predicting the probabilities of each class. (A minimal sketch is given below.)
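A minimal sketch of the softmax behaviour described above (the raw class scores are illustrative values): the scores are exponentiated and divided by their sum, so the outputs lie between 0 and 1 and add up to 1.

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])          # raw outputs for 3 classes
probs = softmax(scores)
print(probs)                                # approx. [0.659, 0.242, 0.099]
print(probs.sum())                          # 1.0 -> predicted class = argmax
```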
Decision Tree
Decision Tree: Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
The decisions or the test are performed on the basis of features of the given
dataset.
It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure.
A decision tree simply asks a question, and based on the answer (Yes/No), it
further splits the tree into subtrees.
A decision tree can contain categorical data (YES/NO) as well as numeric data.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches the leaf node of
the tree. The complete process can be better understood using the following
algorithm:
o Step-1: Begin the tree with the root node, says S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where you
cannot further classify the nodes; the final node is then called a leaf node.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or Not. So, to solve this problem, the decision
tree starts with the root node (Salary attribute by ASM). The root node splits further
into the next decision node (distance from the office) and one leaf node based on the
corresponding labels. The next decision node further gets split into one decision
node (Cab facility) and one leaf node. Finally, the decision node splits into two leaf
nodes (Accepted offers and Declined offer). Consider the below diagram:
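A minimal sketch, assuming scikit-learn (the feature columns, numeric values, and labels are hypothetical, made up to mirror the job-offer example above): a decision tree learns splits on salary, distance from the office, and cab facility, and prints its learned rules.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# columns: salary (lakhs), distance_km, cab_facility (1 = yes)  -- illustrative data
X = np.array([[3, 20, 0], [9, 25, 0], [10, 5, 0], [12, 22, 1],
              [4, 8, 1], [11, 30, 0], [13, 4, 1], [5, 15, 0]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 0])    # 1 = Accepted offer, 0 = Declined offer

tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[11, 6, 1]]))         # predict the decision for a new offer
```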
There are two popular Attribute Selection Measures (ASM):
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and the node/attribute having the highest information gain is split first. It
can be calculated using the below formula:
Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]
Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)
Where S is the total number of samples, P(yes) is the probability of yes, and P(no) is
the probability of no.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o Gini Index is a score that evaluates how accurate a split is among the classified
groups. Gini index evaluates a score in the range between 0 and 1, where 0 is
when all observations belong to one class, and 1 is a random distribution of
the elements within classes.
o An attribute with the low Gini index should be preferred as compared to the
high Gini index. We prefer to have a Gini index score as low as possible.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o Gini index can be calculated using the below formula:
Gini Index = 1 - Σj (Pj)²
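A minimal sketch of the two attribute selection measures above (the 9-yes/5-no class counts and the candidate split are illustrative values): entropy and the Gini index are computed from class proportions, and information gain is the parent entropy minus the weighted child entropy.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

parent = ["yes"] * 9 + ["no"] * 5                                  # 9 yes / 5 no
left  = ["yes"] * 6 + ["no"] * 1                                   # a candidate split
right = ["yes"] * 3 + ["no"] * 4

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("Entropy(S):", entropy(parent))
print("Information Gain:", entropy(parent) - weighted_child)
print("Gini(S):", gini(parent))
```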
A too-large tree increases the risk of overfitting, and a small tree may not capture all
the important features of the dataset. A technique that decreases the size
of the learning tree without reducing accuracy is known as Pruning. There are mainly
two types of tree pruning technology used:
o Cost Complexity Pruning
o Reduced Error Pruning
Advantages of the Decision Tree
o Decision Trees usually mimic human thinking ability while making a decision,
so they are easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
o Decision trees perform classification without requiring much computation.
o Decision trees are able to handle both continuous and categorical variables.
o Decision trees provide a clear indication of which fields are most important
for prediction or classification.