
Module 3

Linear Classification, Logistic Regression, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Perceptron, Support Vector Machines + Kernels, Artificial Neural Networks + Back Propagation, Decision Trees, Bayes Optimal Classifier, Naive Bayes.
Linear Classification
Classification Algorithm
Classification: The Classification algorithm is a Supervised Learning technique that is
used to identify the category of new observations on the basis of training data. In
Classification, a program learns from a given dataset of observations and then
classifies new observations into a number of classes or groups.

Unlike regression, the output variable of Classification is a category, not a
continuous value, such as "Green or Blue", "fruit or animal", etc.
The algorithm which implements the classification on a dataset is known as a
classifier. There are two types of Classifications:

o Binary Classifier: If the classification problem has only two possible
outcomes, then it is called a Binary Classifier. Examples: YES or NO, MALE
or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two
outcomes, then it is called a Multi-class Classifier. Examples: classification
of types of crops, classification of types of music.

Learners in Classification Problems

In classification problems, there are two types of learners:

1. Lazy Learners: A Lazy Learner first stores the training dataset and waits until it
receives the test dataset. In the Lazy Learner case, classification is done on the
basis of the most related data stored in the training dataset. It takes less time
in training but more time for predictions. Examples: K-NN algorithm, Case-
based reasoning.
2. Eager Learners: Eager Learners develop a classification model based on a
training dataset before receiving a test dataset. Opposite to Lazy Learners,
an Eager Learner takes more time in learning and less time in
prediction. Examples: Decision Trees, Naïve Bayes, ANN.

Types of ML Classification Algorithms

Classification Algorithms can be divided into two main categories:

1. Linear classifiers
2. Non-Linear classifiers
Linear Classifiers: Linear classifiers classify data into labels based on a linear
combination of input features. Therefore, these classifiers separate data using a line,
a plane, or a hyperplane (a plane in more than 2 dimensions). They can only be used
to classify data that is linearly separable, though they can be modified to classify
non-linearly separable data.

Linear Classification refers to categorizing a set of data points into a discrete
class based on a linear combination of its explanatory variables.

Consider two classes, 'O' and '+', plotted in a plane. To differentiate
between the two classes, an arbitrary line is drawn, ensuring that both
classes lie on distinct sides.
Since we can tell one class apart from the other, these classes are called
'linearly separable.'
However, an infinite number of lines can be drawn to distinguish the two
classes.
The exact location of this plane/hyperplane depends on the type of linear
classifier.
Linear classifier algorithms:
o Logistic Regression
o Support Vector Machines (linear kernel)
o Linear Discriminant Classifier
o Perceptron
Non-Linear Classifiers: Non-Linear Classification refers to categorizing those
instances that are not linearly separable.

Some of the classifiers that use non-linear functions to separate classes
are the Quadratic Discriminant Classifier, Multi-Layer Perceptron (MLP), Decision
Trees, Random Forest, and K-Nearest Neighbours (KNN).

Consider two classes, 'O' and 'X', whose points are interleaved. To differentiate
between the two classes, it is impossible to draw an arbitrary straight line that
ensures both classes lie on distinct sides.
Even if we draw a straight line, there would be points of the
first class present among the data points of the second class.
In such cases, piece-wise linear or non-linear classification boundaries are
required to distinguish the two classes.
Non-linear classifier algorithms:
o K-Nearest Neighbours
o Kernel SVM
o Naïve Bayes
o Decision Tree Classification
o Random Forest Classification
Evaluating a Classification model
For evaluating a Classification model, we have the following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a
probability value between 0 and 1.

o For a good binary classifier model, the value of log loss should be near 0.

o The value of log loss increases as the predicted probability deviates from the
actual label.

o A lower log loss represents a higher accuracy of the model.

For a binary problem over N examples, the log loss can be written as:

Log Loss = -(1/N) Σ [ yi · log(p(yi)) + (1 - yi) · log(1 - p(yi)) ]

Here yi represents the actual class and p(yi) is the predicted probability of that class:

 p(yi) is the probability of class 1.

 1 - p(yi) is the probability of class 0.

(A code sketch of both evaluation metrics follows after this list.)

2. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve and AUC
stands for Area Under the Curve.

o It is a graph that shows the performance of the classification model at
different thresholds.

o To visualize the performance of a classification model across thresholds, we
use the AUC-ROC curve.

o The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) is on
the Y-axis and FPR (False Positive Rate) is on the X-axis.
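The sketch below illustrates both metrics on made-up labels and scores; it is a minimal example using NumPy and scikit-learn, not part of the original notes:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([1, 0, 1, 1, 0, 0])                # actual classes yi
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1])    # predicted probabilities p(yi)

# Log loss (binary cross-entropy): near 0 for a good classifier
eps = 1e-15
p = np.clip(y_prob, eps, 1 - eps)                    # avoid log(0)
log_loss = -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
print("log loss:", log_loss)

# ROC curve (TPR vs FPR at different thresholds) and the area under it
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
print("AUC:", roc_auc_score(y_true, y_prob))
```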
Use cases of Classification Algorithms
o Email Spam Detection

o Speech Recognition

o Identifications of Cancer tumor cells.

o Drugs Classification

o Biometric Identification, etc.

Difference between Regression and Classification

o In Regression, the output variable must be of continuous nature or real value,
whereas in Classification, the output variable must be a discrete value.
o The task of the regression algorithm is to map the input value (x) to the
continuous output variable (y), whereas the task of the classification algorithm
is to map the input value (x) to the discrete output variable (y).
o Regression Algorithms are used with continuous data, whereas Classification
Algorithms are used with discrete data.
o In Regression, we try to find the best-fit line, which can predict the output more
accurately, whereas in Classification, we try to find the decision boundary, which
can divide the dataset into different classes.
o Regression algorithms can be used to solve regression problems such as weather
prediction and house price prediction, whereas classification algorithms can be
used to solve classification problems such as speech recognition and
identification of cancer cells.
o The Regression algorithm can be further divided into Linear and Non-linear
Regression, whereas the Classification algorithms can be divided into Binary
Classifiers and Multi-class Classifiers.
Logistic Regression
Logistic Regression: Logistic regression is a Supervised Learning technique. It is used
for predicting the categorical dependent variable using a given set of independent
variables.

Logistic regression predicts the output of a categorical dependent variable.
Therefore, the outcome must be a categorical or discrete value. It can be
either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values
0 and 1, it gives the probabilistic values which lie between 0 and 1.

Logistic Regression is quite similar to Linear Regression except in how
they are used. Linear Regression is used for solving regression problems,
whereas Logistic Regression is used for solving classification problems.

In Logistic Regression, instead of fitting a regression line, we fit an "S"-shaped
logistic function, which predicts two maximum values (0 or 1).

The curve from the logistic function indicates the likelihood of something, such
as whether cells are cancerous or not, or whether a mouse is obese or not based
on its weight, etc.

Logistic Regression is a significant machine learning algorithm because it has
the ability to provide probabilities and classify new data using continuous and
discrete datasets.

Logistic Regression can be used to classify observations using different
types of data and can easily determine the most effective variables for
the classification.
Note: Logistic regression uses the concept of predictive modeling like regression;
therefore, it is called logistic regression, but it is used to classify samples; therefore,
it falls under the classification algorithms.

Type of Logistic Regression


On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.

o Multinomial: In multinomial Logistic regression, there can be 3 or more
possible unordered types of the dependent variable, such as "cat", "dog", or
"sheep".

o Ordinal: In ordinal Logistic regression, there can be 3 or more possible


ordered types of dependent variables, such as "low", "Medium", or "High".

How Does the Logistic Regression Algorithm Work?


Consider the following example: An organization wants to determine an employee’s
salary increase based on their performance.

For this purpose, a linear regression algorithm will help them decide. Plotting a
regression line by considering the employee’s performance as the independent
variable, and the salary increase as the dependent variable will make their task
easier.
Now, what if the organization wants to know whether an employee would get a
promotion or not based on their performance? The above linear graph won’t be
suitable in this case. As such, we clip the line at zero and one, and convert it into a
sigmoid curve (S curve).

Based on the threshold values, the organization can decide whether an employee
will get a salary increase or not.

To understand logistic regression, let's go over the odds of success.

Odds (𝜃) = Probability of an event happening / Probability of an event not happening

𝜃 = p / (1 - p)

The values of odds range from zero to ∞, while the values of probability lie between
zero and one.

Consider the equation of a straight line:

y = β0 + β1·x
Here, 𝛽0 is the y-intercept

𝛽1 is the slope of the line

x is the value of the x coordinate

y is the value of the prediction

Now, to predict the odds of success, we model the log odds with the straight line:

log( p(x) / (1 - p(x)) ) = β0 + β1·x

Exponentiating both sides, we have:

p(x) / (1 - p(x)) = e^(β0 + β1·x)

Let Y = e^(β0 + β1·x). Then:

p(x) / (1 - p(x)) = Y

p(x) = Y(1 - p(x))

p(x) = Y - Y·p(x)

p(x) + Y·p(x) = Y

p(x)(1 + Y) = Y

p(x) = Y / (1 + Y)

Substituting Y back, we obtain the equation of the sigmoid function:

p(x) = e^(β0 + β1·x) / (1 + e^(β0 + β1·x)) = 1 / (1 + e^-(β0 + β1·x))

The sigmoid curve obtained from the above equation is an "S"-shaped curve clipped between 0 and 1.
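As a quick numerical illustration of this function, here is a minimal NumPy sketch; the coefficients β0 and β1 are hypothetical values chosen for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta1 = -4.0, 2.0               # hypothetical coefficients
x = np.linspace(-2, 6, 9)
p = sigmoid(beta0 + beta1 * x)         # p(x) = 1 / (1 + e^-(b0 + b1*x))
print(np.round(p, 3))                  # rises from near 0 to near 1 (the S curve)
```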

Logistic Function (Sigmoid Function)


o The sigmoid function is a mathematical function used to map predicted
values to probabilities.

o It maps any real value into another value within the range of 0 to 1.

o The output of logistic regression must be between 0 and 1, and it cannot go
beyond this limit, so it forms a curve like the "S" shape. The S-form curve is
called the sigmoid function or the logistic function.

o In logistic regression, we use the concept of a threshold value, which
defines the probability of either 0 or 1: values above the threshold
tend to 1, and values below the threshold tend to 0.

o Mathematically, the sigmoid function can be written as:

f(x) = 1 / (1 + e^(-x))
Logistic Regression Equation
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are given
below:

o We know the equation of the straight line can be written as:

y = β0 + β1·x1 + β2·x2 + … + βn·xn

o In Logistic Regression y can be between 0 and 1 only, so let's divide the
above equation's left side by (1 - y):

y / (1 - y); 0 for y = 0, and infinity for y = 1

o But we need a range between -∞ and +∞, so we take the logarithm of the
equation, and it becomes:

log[ y / (1 - y) ] = β0 + β1·x1 + β2·x2 + … + βn·xn

The above equation is the final equation for Logistic Regression.
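In practice, the coefficients are estimated from training data by maximum likelihood. A minimal scikit-learn sketch on synthetic data (illustrative only, not part of the original notes):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

print("intercept (b0):", model.intercept_)   # learned β0
print("coefficients  :", model.coef_)        # learned β1..βn
print(model.predict_proba(X[:3]))            # probabilities between 0 and 1
```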

Assumption in a Logistic Regression Algorithm


 In a binary logistic regression, the dependent variable must be binary

 For a binary regression, the factor level 1 of the dependent variable should
represent the desired outcome

 Only meaningful variables should be included

 The independent variables should be independent of each other. This means


the model should have little or no multicollinearity

 The independent variables are linearly related to the log odds

 Logistic regression requires quite large sample sizes


Advantages of the Logistic Regression Algorithm
 Logistic regression performs better when the data is linearly separable

 It does not require many computational resources, and the model is highly
interpretable

 Input features do not require scaling, and the algorithm needs little
hyperparameter tuning

 It is easy to implement and train a model using logistic regression

 It gives a measure of how relevant a predictor is (coefficient size), and its
direction of association (positive or negative)

Disadvantages of logistic regression

 Logistic regression cannot predict a continuous outcome.

 Logistic regression assumes linearity between the log odds of the dependent
variable and the predictor (independent) variables.
 Logistic regression may not be accurate if the sample size is too small.

Applications of Logistic Regression


 Using the logistic regression algorithm, banks can predict whether a customer
would default on loans or not

 To predict the weather conditions of a certain place (sunny, windy, rainy,


humid, etc.)

 E-commerce companies can identify buyers who are likely to purchase a
certain product

 Companies can predict whether they will gain or lose money in the next
quarter, year, or month based on their current performance

 To classify objects based on their features and attributes


Linear Discriminant Analysis
Discriminant Analysis: Discriminant analysis is a way to classify targets based on some
distributional assumptions about the data, and then use the fitted model to predict
the class of new observations.

Linear Discriminant Analysis (LDA): Linear Discriminant Analysis (LDA) is one of the
commonly used dimensionality reduction techniques for supervised classification
problems in machine learning, and it can handle two-class as well as multi-class
problems. It is also known as Normal Discriminant Analysis (NDA) or
Discriminant Function Analysis (DFA).

LDA can be used to project the features of a higher-dimensional space into a
lower-dimensional space in order to reduce resources and dimensional costs.
It is also considered a pre-processing step for modelling differences in ML and
applications of pattern classification.
Whenever there is a requirement to separate two or more classes having
multiple features efficiently, the Linear Discriminant Analysis model is
considered the most common technique to solve such classification problems.
For example, suppose we have two classes with multiple features and need to
separate them efficiently. If we classify them using a single feature, the classes
may overlap.

To overcome this overlapping issue in the classification process, we can
increase the number of features used for classification.

How Linear Discriminant Analysis (LDA) works


Using LDA, we can transform a 2-D or 3-D feature space into a lower-dimensional space, such as a 1-D axis.

Let's consider an example where we have two classes in a 2-D plane having an X-Y
axis, and we need to classify them efficiently. Here, LDA uses an X-Y axis to create a
new axis by separating them using a straight line and projecting data onto a new axis.

Hence, we can maximize the separation between these classes and reduce the 2-D
plane into 1-D.
To create a new axis, Linear Discriminant Analysis uses the following criteria:

o It maximizes the distance between means of two classes.


o It minimizes the variance within the individual class.

Using the above two conditions, LDA generates a new axis in such a way that it can
maximize the distance between the means of the two classes and minimizes the
variation within each class.

In other words, we can say that the new axis will increase the separation between
the data points of the two classes and plot them onto the new axis.

However, Linear Discriminant Analysis fails when the means of the distributions are
shared, as it becomes impossible for LDA to find a new axis that makes both
classes linearly separable. In such cases, we use non-linear discriminant analysis.
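A brief scikit-learn sketch of this idea, using LDA both to reduce dimensionality and to classify; the Iris dataset is used purely as an illustration:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)

X_reduced = lda.fit_transform(X, y)               # project onto 2 discriminant axes
print(X_reduced.shape)                            # (150, 2)
print(lda.predict(X[:5]))                         # LDA can also classify directly
```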

Extension to Linear Discriminant Analysis (LDA)


LDA has so many extensions and variations as follows:

1. Quadratic Discriminant Analysis (QDA): For multiple input variables, each
class deploys its own estimate of variance.
2. Flexible Discriminant Analysis (FDA): It is used when non-linear
combinations of inputs are used, such as splines.
3. Regularized Discriminant Analysis (RDA): This uses regularization in the estimate
of the variance (actually covariance) and hence moderates the influence of
different variables on LDA.
Advantages of LDA

o LDA handles the multiple classification problems with well-separated classes


quite efficiently.
o LDA can also be used in data pre-processing to reduce the number of features,
just as PCA, which reduces the computing cost significantly.
o LDA is also used in face detection algorithms. In Fisherfaces, LDA is used to
extract useful data from different faces. Coupled with eigenfaces, it produces
effective results.

Shortcomings of LDA
o Linear decision boundaries may not effectively separate non-linearly
separable classes. More flexible boundaries are desired.
o In cases where the number of features exceeds the number of observations,
LDA might not perform as desired. This is called the Small Sample Size (SSS)
problem. Regularization is required.

Assumptions for LDA


o Assumes the data to be normally distributed (a Gaussian distribution of data
points), i.e., each feature makes a bell-shaped curve when plotted.
o Each of the classes has an identical covariance matrix.

Applications of LDA

o Face Recognition: Face recognition is a popular application of computer
vision, where each face is represented as a combination of a number of
pixel values. In this case, LDA is used to minimize the number of features to a
manageable number before going through the classification process. It
generates a new template in which each dimension consists of a linear
combination of pixel values. If a linear combination is generated using Fisher's
linear discriminant, then it is called a Fisherface.
o Medical: In the medical field, LDA has a great application in classifying the
patient disease on the basis of various parameters of patient health and the
medical treatment which is going on. On such parameters, it classifies disease
as mild, moderate, or severe. This classification helps the doctors in either
increasing or decreasing the pace of the treatment.
o Customer Identification: LDA is applied in customer identification. With the help
of LDA, we can easily identify and select the features that specify the group of
customers who are likely to purchase a specific product in a shopping mall.
o For Predictions: LDA can also be used for making predictions and thus in
decision making. For example, "will you buy this product?" will give a predicted
result of one of two possible classes: buying or not buying.
o In Learning: Nowadays, robots are being trained for learning and talking to
simulate human work, and it can also be considered a classification problem.
In this case, LDA builds similar groups on the basis of different parameters,
including pitches, frequencies, sound, tunes, etc.

Difference between Linear Discriminant Analysis and Principle


Component Analysis

o PCA is an unsupervised algorithm that does not care about classes and labels
and only aims to find the principal components to maximize the variance in
the given dataset. At the same time, LDA is a supervised algorithm that aims
to find the linear discriminants to represent the axes that maximize separation
between different classes of data.
o LDA is much more suitable for multi-class classification tasks than
PCA. However, PCA is assumed to perform well with a comparatively small
sample size.
o Both LDA and PCA are used as dimensionality reduction techniques; when
combined, PCA is applied first, followed by LDA.

How to Prepare Data for LDA


Below are some suggestions that one should always consider while preparing the
data to build the LDA model:

o Classification Problems: LDA is mainly applied for classification problems to


classify the categorical output variable. It is suitable for both binary and multi-
class classification problems.
o Gaussian Distribution: The standard LDA model applies the Gaussian
Distribution of the input variables. One should review the univariate
distribution of each attribute and transform them into more Gaussian-looking
distributions. For e.g., use log and root for exponential distributions and Box-
Cox for skewed distributions.
o Remove Outliers: It is good to firstly remove the outliers from your data
because these outliers can skew the basic statistics used to separate classes in
LDA, such as the mean and the standard deviation.
o Same Variance: As LDA always assumes that all input variables have the
same variance, it is always better to standardize the data first before
implementing an LDA model, so that each feature has a mean of 0 and a
standard deviation of 1.

Quadratic Discriminant Analysis


Quadratic Discriminant Analysis: QDA is a technique used to classify a target variable
into two or more classes based on a set of independent variables, i.e., multi-class
classification. This technique is a non-linear equivalent of linear discriminant analysis.

This is a variant of LDA and uses quadratic combinations of independent


variables to predict the class in the dependent variable.
It does not assume equal covariance of the classes, but the assumption of
Normal Distribution still holds.
In QDA, an individual covariance matrix is estimated for every class of
observations. So, it has a greater number of effective parameters than LDA.
QDA is particularly useful if there is prior knowledge that individual classes
exhibit distinct covariances.
QDA cannot be used as a dimensionality reduction technique.
QDA creates a quadratic decision boundary.
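A minimal scikit-learn sketch of QDA; the synthetic dataset and settings below are our own illustration:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X, y = make_classification(n_samples=300, n_features=5, random_state=1)
qda = QuadraticDiscriminantAnalysis().fit(X, y)   # one covariance estimate per class

print(qda.predict(X[:5]))                         # predicted classes
print(qda.score(X, y))                            # accuracy; boundary is quadratic
```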
Support Vector Machine
Support Vector Machine: Support Vector Machine or SVM is one of the most popular
Supervised Learning algorithms, which is used for Classification as well as Regression
problems. However, primarily, it is used for Classification problems in Machine
Learning.

The goal of the SVM algorithm is to create the best line or decision boundary
that can segregate n-dimensional space into classes so that we can easily put
the new data point in the correct category in the future. This best decision
boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane.
These extreme cases are called support vectors, and hence the algorithm is
termed the Support Vector Machine.
The distance between the vectors and the hyperplane is called the margin, and
the goal of SVM is to maximize this margin. The hyperplane with maximum
margin is called the optimal hyperplane.
In SVM, we take the output of the linear function: if that output is greater
than 1, we identify the point with one class, and if the output is less than -1, we
identify it with the other class. Since the threshold values are changed to 1 and
-1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as
the margin.
SVM algorithm can be used for Face detection, image classification, text
categorization, etc.
Types of SVM

o Linear SVM: Linear SVM is used for linearly separable data. If a
dataset can be classified into two classes by using a single straight line, then
such data is termed linearly separable data, and the classifier used is called
a Linear SVM classifier.
o Non-linear SVM (Kernel SVM): Non-Linear SVM is used for non-linearly
separable data. If a dataset cannot be classified by using a
straight line, then such data is termed non-linear data, and the classifier used
is called a Non-linear SVM classifier. Kernel SVM has more flexibility for non-
linear data because you can add more features to fit a hyperplane instead of a
two-dimensional space.

Hyperplane and Support Vectors in the SVM algorithm


Hyperplane: There can be multiple lines/decision boundaries to segregate the
classes in n-dimensional space, but we need to find out the best decision boundary
that helps to classify the data points. This best boundary is known as the hyperplane
of SVM.

The dimensions of the hyperplane depend on the number of features present in
the dataset: if there are 2 features, then the hyperplane will be a straight line,
and if there are 3 features, then the hyperplane will be a 2-dimensional plane.
We always create a hyperplane that has a maximum margin, which means the
maximum distance between the data points.

Support Vectors: The data points or vectors that are closest to the hyperplane
and which affect the position of the hyperplane are termed Support Vectors. Since
these vectors support the hyperplane, they are called support vectors.

Outliers in SVM
Suppose we have one blue ball inside the boundary of the red balls. How does SVM
classify the data? It's simple! The blue ball within the boundary of the red ones is an
outlier of the blue balls. The SVM algorithm has the characteristic of ignoring outliers
and finding the best hyperplane that maximizes the margin. SVM is robust to outliers.
SVM Kernel
SVM Kernel: The SVM kernel is a function that takes a low-dimensional input space
and transforms it into a higher-dimensional space, i.e., it converts a non-separable
problem into a separable problem. It is mostly useful in non-linear separation
problems. Simply put, the kernel performs some extremely complex data
transformations, then finds the process to separate the data based on the labels or
outputs defined.
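A short scikit-learn sketch contrasting a linear SVM with a kernel (RBF) SVM on data that is not linearly separable; the dataset and parameters are illustrative choices:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # not linearly separable

linear_svm = SVC(kernel="linear").fit(X, y)
kernel_svm = SVC(kernel="rbf", gamma=1.0).fit(X, y)  # implicit higher-dim. mapping

print("linear kernel:", linear_svm.score(X, y))      # typically lower on this data
print("RBF kernel   :", kernel_svm.score(X, y))      # typically higher
```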

Advantages of SVM
 Effective in high dimensional cases.
 It works really well with a clear margin of separation.
 It is memory efficient, as it uses a subset of training points in the decision
function called support vectors.
 Effective on datasets with multiple features, like financial or medical data.
 Effective in cases where the number of features is greater than the number of
data points.
 Different kernel functions can be specified for the decision function. You can
use common kernels, but it's also possible to specify custom kernels.

Disadvantages of SVM
 If the number of features is a lot bigger than the number of data points,
avoiding over-fitting when choosing kernel functions and regularization term
is crucial.
 It also doesn't perform very well when the dataset has more noise, i.e., the
target classes are overlapping.
 SVMs don't directly provide probability estimates. Those are calculated using
an expensive five-fold cross-validation.
 Works best on small sample sets because of its high training time.
Perceptron
Perceptron: Perceptron is also understood as an Artificial Neuron or neural network
unit that helps to detect certain input data computations in business intelligence.
Perceptron model is also treated as one of the best and simplest types of Artificial
Neural networks.

We can consider it as a single-layer neural network with four main


parameters, i.e., input values, weights and Bias, net sum, and an activation
function.
Perceptron is a building block of an Artificial Neural Network. In the
mid-20th century, Frank Rosenblatt invented the Perceptron for
performing certain calculations to detect input data capabilities or business
intelligence.
Perceptron is a linear Machine Learning algorithm used for supervised
learning for various binary classifiers.
This algorithm enables neurons to learn and process the elements of the
training set one by one.

Basic Components of Perceptron


Frank Rosenblatt designed the perceptron model as a binary classifier which
contains three main components. These are as follows:

Input Nodes or Input Layer: This is the primary component of the Perceptron, which
accepts the initial data into the system for further processing. Each input node
contains a real numerical value.
Weight and Bias: The weight parameter represents the strength of the connection
between units. This is another very important parameter of the Perceptron's
components. Weight is directly proportional to the strength of the associated input
neuron in deciding the output. Further, bias can be considered as the intercept
in a linear equation.

Activation Function: These are the final and important components that help to
determine whether the neuron will fire or not. Activation Function can be considered
primarily as a step function.

How does Perceptron work


In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and
Bias, net sum, and an activation function. The perceptron model begins with the
multiplication of all input values and their weights, then adds these values together
to create the weighted sum. Then this weighted sum is applied to the activation
function 'f' to obtain the desired output. This activation function is also known as
the step function and is represented by 'f'.

This step function or Activation function plays a vital role in ensuring that output is
mapped between required values (0,1) or (-1,1). It is important to note that the
weight of input is indicative of the strength of a node. Similarly, an input's bias value
gives the ability to shift the activation function curve up or down.

Perceptron model works in two important steps as follows:


Step-1: In the first step first, multiply all input values with corresponding weight
values and then add them to determine the weighted sum. Mathematically, we can
calculate the weighted sum as follows:

∑wi·xi = w1·x1 + w2·x2 + … + wn·xn

Add a special term called bias 'b' to this weighted sum to improve the model's
performance.

∑wi*xi + b

Step-2: In the second step, an activation function is applied with the above-
mentioned weighted sum, which gives us output either in binary form or a
continuous value as follows:

Y = f(∑wi*xi + b)
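To make the two steps concrete, here is a minimal NumPy perceptron trained on the AND gate; the learning rule and learning rate below are standard textbook choices, not something specified in these notes:

```python
import numpy as np

def step(z):
    return 1 if z > 0 else 0                    # activation (step) function f

def train_perceptron(X, y, lr=0.1, epochs=10):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = step(np.dot(w, xi) + b)      # Step 1 (weighted sum) + Step 2 (f)
            w += lr * (target - pred) * xi      # nudge weights on error
            b += lr * (target - pred)           # nudge bias on error
    return w, b

# AND gate: linearly separable, so a perceptron can learn it
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([step(np.dot(w, xi) + b) for xi in X])    # [0, 0, 0, 1]
```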

Types of Perceptron Models


Based on the layers, Perceptron models are divided into two types. These are as
follows:

1. Single-layer Perceptron Model


2. Multi-layer Perceptron model

Single Layer Perceptron Model:


This is one of the easiest types of artificial neural networks (ANN). A single-layered
perceptron model consists of a feed-forward network and includes a threshold
transfer function inside the model. The main objective of the single-layer perceptron
model is to analyze linearly separable objects with binary outcomes.

In a single-layer perceptron model, the algorithm does not use recorded data, so it
begins with randomly allocated weight parameters. It then sums up all the
weighted inputs. If the total sum of all inputs is more than a
pre-determined value, the model gets activated and shows the output value as +1.

If the outcome matches the pre-determined or threshold value, then the performance
of this model is stated as satisfied, and the weights do not change. However,
this model has a few discrepancies, triggered when multiple weighted input
values are fed into the model. Hence, to obtain the desired output and minimize
errors, some changes to the weights may be necessary.

"Single-layer perceptron can learn only linearly separable patterns."


Multi-Layered Perceptron Model:
Like a single-layer perceptron model, a multi-layer perceptron model also has the
same model structure but has a greater number of hidden layers.

A multi-layer perceptron model is typically trained with the Backpropagation
algorithm, which executes in two stages as follows:

o Forward Stage: Activation functions start from the input layer in the forward
stage and terminate on the output layer.
o Backward Stage: In the backward stage, weight and bias values are modified
as per the model's requirement. In this stage, the error between the actual
output and the demanded output is propagated backward, starting at the
output layer and ending at the input layer.

Hence, a multi-layered perceptron model can be considered as multiple artificial
neural networks having various layers in which the activation function does not
remain linear, unlike in a single-layer perceptron model. Instead of linear, the
activation function can be sigmoid, TanH, ReLU, etc., for deployment.

A multi-layer perceptron model has greater processing power and can process linear
and non-linear patterns. Further, it can also implement logic gates such as AND, OR,
XOR, NAND, NOT, XNOR, NOR.

Back Propagation: Back Propagation is the process of updating and finding the
optimal values of weights or coefficients which helps the model to minimize the
error i.e. difference between the actual and predicted values.

Advantages of Multi-Layer Perceptron:

o It can be used to solve complex nonlinear problems.


o It works well with both small and large input data.
o It helps us to obtain quick predictions after the training.
o It helps to obtain the same accuracy ratio with large as well as small data.

Disadvantages of Multi-Layer Perceptron:

o In Multi-layer perceptron, computations are difficult and time-consuming.


o The model functioning depends on the quality of the training.
o In a multi-layer Perceptron, it is difficult to predict how much each independent
variable affects the dependent variable.
Perceptron Function
The perceptron function 'f(x)' is achieved as output by multiplying the input 'x' with
the learned weight coefficients 'w'.

Mathematically, we can express it as follows:

f(x) = 1 if w·x + b > 0

f(x) = 0 otherwise

o 'w' represents real-valued weights vector


o 'b' represents the bias
o 'x' represents a vector of input x values.

Characteristics of Perceptron
The perceptron model has the following characteristics.

1. Perceptron is a machine learning algorithm for supervised learning of binary


classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, weights are multiplied with input features, and the decision is made
whether the neuron is fired or not.
4. The activation function applies a step rule to check whether the weighted
sum is greater than zero.
5. The linear decision boundary is drawn, enabling the distinction between the
two linearly separable classes +1 and -1.
6. If the added sum of all input values is more than the threshold value, it must
have an output signal; otherwise, no output will be shown.

Limitations of Perceptron Model


A perceptron model has limitations as follows:

o The output of a perceptron can only be a binary number (0 or 1) due to the


hard limit transfer function.
o Perceptron can only be used to classify the linearly separable sets of input
vectors. If input vectors are non-linear, it is not easy to classify them properly.
Artificial Neural Network
Biological neural networks: The human brain is composed of about 86 billion nerve
cells called neurons. Each neuron is connected to thousands of other cells by axons.
Inputs from sensory organs are accepted by dendrites. These inputs create electric
impulses, which quickly travel through the neural network. A neuron can then send
the message to another neuron to handle the issue, or not send it forward.

Artificial Neural Network: ANNs are composed of multiple nodes, which imitate
biological neurons of human brain. The neurons are connected by links in various
layers of the networks and they interact with each other. The nodes can take input
data and perform simple operations on the data. The result of these operations is
passed to other neurons. The output at each node is called its activation or node
value. Each link is associated with weight. ANNs are capable of learning, which takes
place by altering weight values.
A typical Artificial Neural Network is organized as layers of interconnected nodes, as described below.

There are around 86 billion neurons in the human brain. Each neuron has
an association point somewhere in the range of 1,000 to 100,000. In the
human brain, data is stored in a distributed manner, and we can
extract more than one piece of this data, when necessary, from our memory
in parallel. We can say that the human brain is made up of incredibly amazing
parallel processors.

Dendrites from Biological Neural Network represent inputs in Artificial Neural


Networks, cell nucleus represents Nodes, synapse represents Weights, and
Axon represents Output.

Relationship between Biological neural network and artificial neural network:

Biological Neural Network Artificial Neural Network

Dendrites Inputs

Cell nucleus Nodes

Synapse Weights

Axon Output

The architecture of an artificial neural network


Artificial neural network consists of a large number of artificial neurons, which are
termed units arranged in a sequence of layers. Artificial Neural Network primarily
consists of three layers.
Input Layer: As the name suggests, it accepts inputs in several different formats
provided by the programmer.

Hidden Layer: The hidden layer presents in-between input and output layers. It
performs all the calculations to find hidden features and patterns.

Output Layer: The input goes through a series of transformations using the hidden
layer, which finally results in output that is conveyed using this layer.

The artificial neural network takes input and computes the weighted sum of the
inputs and includes a bias. This computation is represented in the form of a transfer
function.

The weighted total is then passed as input to an activation function to
produce the output. Activation functions decide whether a node should fire or not.
Only the nodes that fire make it to the output layer. There are distinct activation
functions available that can be applied depending on the sort of task we are
performing.
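To make this computation concrete, here is a toy NumPy forward pass through one hidden layer; the weights are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                    # input layer (3 features)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)     # input -> hidden weights and bias
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)     # hidden -> output weights and bias

h = np.tanh(x @ W1 + b1)    # weighted sum plus bias, passed through an activation
out = h @ W2 + b2           # output layer (2 nodes)
print(out)
```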

Types of Artificial Neural Networks


There are two types of Artificial Neural Network

1. Feed-Forward ANN
2. FeedBack ANN
1. Feed-Forward ANN: In this ANN, the information flow is unidirectional. A unit
sends information to another unit from which it does not receive any information.
There are no feedback loops. They are used in pattern generation, recognition, or
classification. They have fixed inputs and outputs.

2. FeedBack ANN: Here, feedback loops are allowed. In this type of ANN, the output
returns into the network to accomplish the best-evolved results internally. The
feedback networks feed information back into themselves and are well suited to
solving optimization problems. Internal system error corrections utilize feedback
ANNs. They are used in content-addressable memories.

Artificial Neural Network Algorithms


o Bayesian Networks
o Genetic Algorithm
o Back Propagation Algorithm
Advantages of Artificial Neural Network
o Parallel processing capability: Artificial neural networks have a numerical
structure that can perform more than one task simultaneously.
o Storing data on the entire network: Data is stored on the whole network, not
in a database as in traditional programming. The disappearance of a couple of
pieces of data in one place doesn't prevent the network from working.
o Capability to work with incomplete knowledge: After ANN training, the
network may produce output even with inadequate data. The loss of
performance here depends on the significance of the missing data.
o Having fault tolerance: Corruption of one or more cells of an ANN does not
prevent it from generating output, and this feature makes the network fault-
tolerant.

Disadvantages of Artificial Neural Network


o Assurance of proper network structure: There is no particular guideline for
determining the structure of artificial neural networks. The appropriate
network structure is accomplished through experience, trial, and error.
o Hardware dependence: Artificial neural networks need processors with
parallel processing power, as per their structure. The realization of the network
is therefore dependent on suitable hardware.
o Unrecognized behavior of the network: It is the most significant issue of ANN.
When ANN produces a testing solution, it does not provide insight concerning
why and how. It decreases trust in the network.
o The duration of the network is unknown: The network is reduced to a specific
value of the error, and this value does not give us optimum results.
o Difficulty of showing the issue to the network: ANNs can work with numerical
data. Problems must be converted into numerical values before being
introduced to ANN. The presentation mechanism to be resolved here will
directly impact the performance of the network. It relies on the user's
abilities.

Applications of Neural Networks


o Aerospace − Aircraft autopilots, aircraft fault detection.
o Automotive − Automobile guidance systems.
o Military − Weapon orientation and steering, target tracking, object
discrimination, facial recognition, signal/image identification.
o Speech − Speech recognition and classification, text to speech conversion.
o Transportation − Truck Brake system diagnosis, vehicle scheduling, routing
systems.
o Software − Pattern Recognition in facial recognition, optical character
recognition, etc
o Signal Processing − Neural networks can be trained to process an audio signal
and filter it appropriately in the hearing aids.
o Industrial − Manufacturing process control, product design and analysis,
quality inspection systems, welding quality analysis, paper quality prediction,
machine maintenance analysis, project bidding, planning, and management.
o Medical − Cancer cell analysis, EEG and ECG analysis, prosthetic design,
transplant time optimizer.
o Electronics − Code sequence prediction, IC chip layout, chip failure analysis,
machine vision, voice synthesis.
o Financial − Real estate appraisal, loan advisor, corporate bond rating,
portfolio trading program, corporate financial analysis, currency value
prediction, document readers, credit application evaluators.
o Telecommunications − Image and data compression, automated information
services, real-time spoken language translation.
o Control − ANNs are often used to make steering decisions of physical
vehicles.

Types of neural networks models


1. Feed Forward Neural Network

a) Single Layer Feed Forward Neural Network (Single-layer Perceptron)


b) Multilayer Feed Forward Neural Network (Multi-layer Perceptron)

2. Radial Basis Functional Neural Network


3. Recurrent Neural Network
4. Convolutional Neural Network
5. Modular Neural Network
6. Sequence to Sequence Models
Radial Basis Functional Neural Network
Radial Basis Functional Neural Network: RBFNs are specific types of neural
networks that follow a feed-forward approach and make use of radial basis functions
as activation functions.

An RBFN performs classification by measuring the input’s similarity to


examples from the training set. Each RBFN neuron stores a “prototype”,
which is just one of the examples from the training set. When we want to
classify a new input, each neuron computes the Euclidean distance
between the input and its prototype. Roughly speaking, if the input more
closely resembles the class A prototypes than the class B prototypes, it is
classified as class A.
Using a set of prototypes along with other training examples, neurons look
at the distance between an input and a prototype, using what is called an
input vector.
RBFNs are mostly used for function approximation, time-series prediction,
regression, and classification.

RBFNN Network Architecture


The typical architecture of an RBF Network consists of
an input vector, a layer of RBF neurons, and an output layer with one node per
category or class of data.
The Input Vector

The input vector is the n-dimensional vector that you are trying to classify. The entire
input vector is shown to each of the RBF neurons.

The RBF Neurons

Each RBF neuron stores a “prototype” vector which is just one of the vectors from
the training set. Each RBF neuron compares the input vector to its prototype, and
outputs a value between 0 and 1 which is a measure of similarity.

The shape of the RBF neuron's response is a bell curve.

The neuron’s response value is also called its “activation” value.

The prototype vector is also often called the neuron’s “center”, since it’s the value at
the center of the bell curve.

The Output Nodes

The output of the network consists of a set of nodes, one per category that we are
trying to classify. Each output node computes a sort of score for the associated
category. Typically, a classification decision is made by assigning the input to the
category with the highest score.

The score is computed by taking a weighted sum of the activation values from every
RBF neuron.

There are different possible choices of similarity functions, but the most popular is
based on the Gaussian. Below is the equation for a Gaussian with a one-dimensional
input:

g(x) = (1 / (σ√(2π))) · e^( -(x - μ)² / (2σ²) )

The hidden layer contains Gaussian transfer functions whose outputs are inversely
proportional to the distance of the input from the neuron's center.

Here x is the input, μ (mu) is the mean, and σ (sigma) is the standard deviation. This
produces the familiar bell curve centered at the mean μ (for example, a curve with
mean 5 and sigma 1).

The RBF neuron activation function is slightly different, and is typically written as:

φ(x) = e^( -β‖x - μ‖² )

where μ is the prototype vector and β controls the width of the bell curve.
RBFN as a Neural Network

Viewed as a neural network, the RBFN is a "3-layer network" where the input vector
is the first layer, the second "hidden" layer is the RBF neurons, and the third layer is
the output layer containing linear combination neurons.
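A small NumPy sketch of this scoring scheme; the prototypes and output weights below are arbitrary values chosen for illustration:

```python
import numpy as np

def rbf(x, center, beta=1.0):
    """Gaussian similarity: 1 at the prototype, decaying with distance."""
    return np.exp(-beta * np.sum((x - center) ** 2))

prototypes = np.array([[0.0, 0.0], [3.0, 3.0]])     # one stored example per neuron
weights = np.array([[1.0, -0.5],
                    [-0.5, 1.0]])                   # output weights, one row per class

x = np.array([2.5, 2.8])                            # new input to classify
activations = np.array([rbf(x, c) for c in prototypes])
scores = weights @ activations                      # weighted sum per category
print("predicted class:", scores.argmax())          # closer to prototype 1 -> class 1
```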
Recurrent Neural Network
Recurrent Neural Network(RNN): Recurrent Neural Network(RNN) are a type
of Neural Network where the output from previous step are fed as input to the
current step.

The recurrent neural network is a type of ANN that follows a
sequential approach. In traditional neural networks, we assume that all inputs
and outputs are independent of one another; RNNs instead let earlier steps
influence later ones. These types of neural networks are called recurrent
because they perform mathematical computations sequentially.
A recurrent neural network looks similar to a traditional neural network
except that a memory state is added to the neurons. The computation is to
include a simple memory.
RNNs have a "memory" which remembers all information about what has been
calculated. An RNN uses the same parameters for each input, as it performs the
same task on all the inputs or hidden layers to produce the output. This reduces
the complexity of parameters, unlike other neural networks.
Recurrent Networks are designed to recognize patterns in sequences of data,
such as text, genomes, handwriting, the spoken word, and numerical time
series data emanating from sensors, stock markets, and government agencies.
A recurrent neural network (RNN) is a kind of artificial neural network mainly
used in speech recognition and natural language processing (NLP).
A recurrent neural network uses a backpropagation algorithm for training, but
backpropagation happens for every timestamp, which is why it is commonly
called as backpropagation through time.
How RNN works

 RNN converts the independent activations into dependent activations by


providing the same weights and biases to all the layers, which reduces the
complexity of increasing parameters. And it provides a standard platform for
memorization of the previous outputs by providing previous output as an input
to the next layer.
 Hence all the layers can be joined together such that the weights and bias of all
the hidden layers is the same, into a single recurrent layer.

 Formula for calculating the current state:

ht = f(ht-1, xt)

Where: ht -> current state

ht-1 -> previous state

xt -> input state

 Formula for applying the activation function (tanh):

ht = tanh(whh·ht-1 + wxh·xt)

Where: whh -> weight at recurrent neuron

wxh -> weight at input neuron

 Formula for calculating output:

yt = why·ht

Where: yt -> output

why -> weight at output layer
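These formulas translate directly into a few lines of NumPy; the sketch below uses random placeholder weights and arbitrary dimensions purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
w_hh = rng.normal(size=(4, 4))    # weight at recurrent neuron (whh)
w_xh = rng.normal(size=(4, 3))    # weight at input neuron (wxh)
w_hy = rng.normal(size=(2, 4))    # weight at output layer (why)

h = np.zeros(4)                   # initial hidden state
sequence = [rng.normal(size=3) for _ in range(5)]   # 5 input steps, 3 features each
for x_t in sequence:
    h = np.tanh(w_hh @ h + w_xh @ x_t)   # ht = tanh(whh·ht-1 + wxh·xt)
y = w_hy @ h                             # yt = why·ht
print(y)
```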


Training through RNN
1. A single time step of the input is provided to the network.
2. Then its current state is calculated using the current input and the previous state.
3. The current ht becomes ht-1 for the next time step.
4. One can go as many time steps as the problem demands and combine the
information from all the previous states.
5. Once all the time steps are completed, the final current state is used to
calculate the output.
6. The output is then compared to the actual output, i.e., the target output, and
the error is generated.
7. The error is then back-propagated through the network to update the weights,
and hence the network (RNN) is trained.

The Activation Function


We can use any activation function we like in the RNN. Common choices are:

 Sigmoid Function
 Tanh Function
 ReLU Function

Types of RNNs
There are different types of RNNs with varying architectures. Some examples are:

One To One: Here there is a single (xt,yt) pair. Traditional neural networks employ a
one to one architecture.

One To Many: In one to many networks, a single input at xt can produce multiple
outputs, e.g., (yt0,yt1,yt2). Music generation is an example area, where one to many
networks are employed.
Many To One: In this case many inputs from different time steps produce a single
output. For example, (xt,xt+1,xt+2) can produce a single output yt. Such networks are
employed in sentiment analysis or emotion detection, where the class label depends
upon a sequence of words.

Many To Many: There are many possibilities for many to many. For example, two
inputs can produce three outputs. Many to many networks are applied
in machine translation, e.g., English-to-French or vice-versa translation systems.

Advantages of Recurrent Neural Network


1. An RNN remembers information through time. It is useful in
time-series prediction because of this ability to remember previous inputs;
the Long Short-Term Memory (LSTM) architecture extends it further.

2. Recurrent neural networks are even used with convolutional layers to extend
the effective pixel neighborhood.
Disadvantages of Recurrent Neural Network
1. Gradient vanishing and exploding problems.
2. Training an RNN is a very difficult task.
3. It cannot process very long sequences if using tanh or relu as an activation
function.

Applications of Recurrent Neural Network


1. Machine Translation: We make use of Recurrent Neural Networks in
translation engines to translate text from one language to another. They do this
in combination with other models like LSTMs (Long Short-Term Memory).

2. Speech Recognition: Recurrent Neural Networks have replaced the traditional
speech recognition models that made use of Hidden Markov Models. These
Recurrent Neural Networks, along with LSTMs, are better poised to classify
speech and convert it into text without loss of context.
3. Sentiment Analysis: We make use of sentiment analysis to determine the
positivity, negativity, or neutrality of a sentence. RNNs are adept at handling
data sequentially to find the sentiment of a sentence.
4. Automatic Image Tagger: RNNs, in conjunction with convolutional neural
networks, can detect the images and provide their descriptions in the form of tags.
For example, a picture of a fox jumping over the fence is better explained
appropriately using RNNs.

Different RNN Architectures


There are different variations of RNNs that are being applied practically in machine
learning problems:
Bidirectional recurrent neural networks (BRNN): In BRNN, inputs from future time
steps are used to improve the accuracy of the network. It is like having knowledge of
the first and last words of a sentence to predict the middle words.

Gated Recurrent Units (GRU): These networks are designed to handle the vanishing
gradient problem. They have a reset and update gate. These gates determine which
information is to be retained for future predictions.
Long Short Term Memory (LSTM): LSTMs were also designed to address the
vanishing gradient problem in RNNs. LSTMs use three gates called the input, output,
and forget gates. Similar to GRUs, these gates determine which information to retain.
Convolutional Neural Network
Convolutional Neural Network: The Convolutional Neural Network is one of the
techniques for image classification and image recognition in neural networks. It is
designed to process data through multiple layers of arrays. The primary difference
between a CNN and other neural networks is that a CNN takes input as a two-
dimensional array and operates directly on the images rather than relying on
separate feature extraction as other neural networks do.

Convolutional Neural Network (CNN or ConvNet) is a type of feed-forward


artificial network where the connectivity pattern between its neurons is
inspired by the organization of the animal visual cortex.
The visual cortex has a small region of cells that are sensitive to specific
regions of the visual field. Some individual neuronal cells in our brain respond
in the presence of edges of a certain orientation.
Convolutional Neural Networks, which are also called ConvNets, are
nothing but neural networks that share their parameters.
With three or four convolutional layers it is viable to recognize handwritten
digits and with 25 layers it is possible to differentiate human faces.

A CNN takes an image as input, which is classified and processed under a certain
category such as dog, cat, lion, tiger, etc. The computer sees an image as an array of
pixels, depending on the resolution of the image. Based on the image resolution, it
will see h * w * d, where h = height, w = width, and d = dimension (number of
channels). For example, a 6 * 6 RGB image is a 6 * 6 * 3 array, and a 4 * 4 grayscale
image is a 4 * 4 * 1 array.

In CNN, each input image will pass through a sequence of convolution layers along
with pooling, fully connected layers, filters (Also known as kernels). After that, we
will apply the Soft-max function to classify an object with probabilistic values 0 and 1.
How Does a Computer read an image?
The image is broken into 3 color-channels which is Red, Green, and Blue. Each of
these color channels is mapped to the image's pixel.

Some neurons fire when exposed to vertical edges and some when shown
horizontal or diagonal edges. CNNs utilize the spatial correlations that exist within
the input data. Each concurrent layer of the neural network connects some input
neurons. This region is called a local receptive field. The local receptive field focuses
on hidden neurons.

The hidden neuron processes the input data inside the mentioned field, not realizing
the changes outside the specific boundary.

Layers in Convolutional Neural Networks


Convolutional Neural Networks have the following 4 layers:

o Convolutional layer
o ReLU Layer
o Pooling
o Fully Connected

1. Convolution Layer
Convolution layer is the first layer to extract features from an input image. By
learning image features using a small square of input data, the convolutional layer
preserves the relationship between pixels. It is a mathematical operation which takes
two inputs such as image matrix and a kernel or filter.
o The dimension of the image matrix is h×w×d.
o The dimension of the filter is fh×fw×d.
o The dimension of the output is (h-fh+1)×(w-fw+1)×1.

Let's start by considering a 5*5 image whose pixel values are 0 and 1, and a filter
matrix of 3*3.

The output of convolving the 5*5 image matrix with the 3*3 filter matrix is called
the "Feature Map".

Convolution of an image with different filters can perform operations such as
blurring, sharpening, and edge detection.
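A naive NumPy implementation of this operation (no stride or padding; the image and filter below are illustrative):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the filter over the image and sum the elementwise products."""
    h, w = image.shape
    fh, fw = kernel.shape
    out = np.zeros((h - fh + 1, w - fw + 1))        # (h-fh+1) x (w-fw+1) output
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + fh, j:j + fw] * kernel)
    return out

image = np.random.randint(0, 2, size=(5, 5))        # 5x5 image of 0s and 1s
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])                      # 3x3 filter
print(convolve2d(image, kernel))                    # 3x3 feature map
```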

Strides
Stride is the number of pixels by which the filter shifts over the input matrix. When
the stride is 1, we move the filter 1 pixel at a time; similarly, when the stride
is 2, we move the filter 2 pixels at a time.
Padding
Padding plays a crucial role in building a convolutional neural network. Every
convolution shrinks the image slightly, so in a neural network with hundreds of
layers we would be left with a very small image after the final filter.

If we take a three-by-three filter on top of a grayscale image and do the convolving,
what will happen?

It is clear from the above picture that the pixel in the corner is covered only
once, while the middle pixel is covered several times, which means the filter
extracts more information from the middle pixels. There are two downsides:

o Shrinking outputs
o Losing information at the corners of the image.

To overcome these, we introduce padding. "Padding is an
additional layer of pixels (usually zeros) added around the border of an image."
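
Combining stride and padding, the output size of a convolution follows a standard
formula (stated here as a general fact, not taken from these notes):

Output height = floor((h - fh + 2p) / s) + 1 (and similarly for the width)

where p is the amount of padding and s is the stride. With h = 5, fh = 3, p = 0 and
s = 1 this gives 3, matching the 5×5 example above; with p = 1 it gives 5, so the
output keeps the input's size ("same" padding).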
2. ReLU Layer
The Rectified Linear Unit (ReLU) transform activates a node only if the input is
above zero. While the input is below zero the output is zero, but once the input
rises above zero the output has a linear relationship with the input (the output
equals the input).

In this layer, we remove every negative value from the filtered images and replace
it with zero.

This is done so that negative values cannot cancel out positive ones when values
are summed in later layers.
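
In NumPy the whole layer is a one-liner (the feature-map values below are
assumptions for illustration):

import numpy as np

feature_map = np.array([[-3.0, 1.5],    # assumed filtered-image values
                        [2.0, -0.5]])
activated = np.maximum(0, feature_map)  # every negative value becomes 0
print(activated)                        # [[0.  1.5]
                                        #  [2.  0. ]]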

3. Pooling Layer
The pooling layer plays a vital role in reducing the size of the representation. It
reduces the number of parameters when the image is too large. Pooling
is a "downscaling" of the image obtained from the previous layers; it can be
compared to shrinking an image to reduce its density. Spatial pooling, also called
downsampling or subsampling, reduces the dimensionality of each feature map but
retains the essential information. The following are the types of spatial pooling.

We do this by implementing the following 4 steps:

o Pick a window size (usually 2 or 3)
o Pick a stride (usually 2)
o Walk your window across your filtered images
o From each window, take the maximum value
Max Pooling: Max pooling is a sample-based discretization process. The main
objective of max pooling is to downscale an input representation, reducing its
dimensionality while allowing assumptions to be made about the features contained
in the binned sub-regions.

Max pooling is performed by applying a max filter to non-overlapping sub-regions of
the initial representation.

Average Pooling: Average pooling performs down-scaling by dividing the input
into rectangular pooling regions and computing the average value of each region.

Sum Pooling: The sub-regions for sum pooling and mean pooling are set the same as
for max pooling, but instead of taking the maximum we take the sum or the mean.
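
The 4 steps listed earlier (window, stride, walk, take the maximum) can be sketched
in NumPy as follows; the 4×4 input values, the 2×2 window, and the stride of 2 are
illustrative assumptions:

import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: slide a window over x and keep the maximum of each window."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

x = np.array([[1, 3, 2, 4],    # assumed filtered image
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
print(max_pool(x))             # [[6. 8.]
                               #  [3. 4.]]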
Flattening
Flattening is nothing but converting the pooled 3D or 2D feature maps into a 1D
input vector for the model. This is the last step in processing the image before
connecting the inputs to a fully connected dense layer for further classification.
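
In NumPy this is again a one-liner (the shape below is an assumed example):

import numpy as np

feature_maps = np.zeros((4, 4, 8))   # assumed: 8 pooled feature maps of size 4x4
flat = feature_maps.reshape(-1)      # 1D vector of 4 * 4 * 8 = 128 values
print(flat.shape)                    # (128,)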

4. Fully Connected (Dense) Layer


The fully connected layer (dense layer) is a layer where the input from the other
layers is flattened into a vector. It transforms the output into the desired number
of classes for the network.
In the above diagram, the feature-map matrix is converted into a vector x1, x2,
x3, ..., xn with the help of the fully connected layer. We combine these features to
create the model and apply an activation function such as softmax or sigmoid to
classify the outputs as a car, dog, truck, etc. This is the final layer, where the
actual classification happens.

Applications of Convolutional Neural Network (CNN)


 Image Recognition
 Face Recognition
 Scene Labelling
 Object Detection

Activation Functions in ANN


Activation Function: An activation function decides whether a neuron should be
activated or not by calculating the weighted sum of its inputs and adding a bias to it.

The purpose of the activation function is to introduce non-linearity into the
output of a neuron.
A neural network without an activation function is essentially just a linear
regression model.
The activation function applies a non-linear transformation to the input,
making the network capable of learning and performing more complex tasks.
Types of Activation Functions

1). Linear Function

 Equation: The linear function has the equation of a straight line, i.e.
y = x.
 No matter how many layers we have, if all of them are linear in nature, the
final activation of the last layer is nothing but a linear function of the
input of the first layer.
 Range: -inf to +inf
 Uses: The linear activation function is used in just one place, i.e. the
output layer.
 Issues: The derivative of a linear function is a constant, so it no longer
depends on the input "x"; the gradient carries no information about the
input, and the function cannot introduce any non-linear behavior into our
algorithm.

2). Sigmoid Function

 It is a function which is plotted as an 'S'-shaped graph.
 Equation: A = 1/(1 + e^(-x))
 Nature: Non-linear. Notice that for X values between -2 and 2, the curve is
very steep. This means that small changes in x in this region bring about
large changes in the value of Y.
 Value Range: 0 to 1
 Uses: Usually used in the output layer of a binary classifier, where the result
is either 0 or 1. As the value of the sigmoid function lies between 0 and 1
only, the result can easily be predicted to be 1 if the value is greater than
0.5 and 0 otherwise.
 It is especially used for models where we have to predict a probability as
the output. Since the probability of anything exists only in the range of 0
to 1, sigmoid is the right choice.
 If your output is for binary classification, the sigmoid function is a very
natural choice for the output layer.

3). Tanh Function

 The activation that works almost always better than the sigmoid function is
the Tanh function, also known as the Tangent Hyperbolic function. It is
actually a mathematically shifted version of the sigmoid function: both are
similar and can be derived from each other.
 Equation: f(x) = tanh(x) = 2/(1 + e^(-2x)) - 1, OR tanh(x) = 2 * sigmoid(2x) - 1
 Value Range: -1 to +1
 Nature: non-linear
 Uses: Usually used in the hidden layers of a neural network, as its values lie
between -1 and 1; the mean of the hidden layer's outputs therefore comes out
to be 0 or very close to it, which helps in centering the data. This makes
learning for the next layer much easier.
 The tanh function has been used mostly in RNNs for natural language
processing and speech recognition tasks.
4). ReLu (Rectified Linear Unit) Function:

 ReLU is the most used activation function in the world right now, since it
is used in almost all convolutional neural networks and deep learning models.
 Equation: A(x) = max(0, x). It gives an output of x if x is positive and 0
otherwise.
 Value Range: [0, inf)
 Nature: non-linear, which means we can easily backpropagate the errors and
have multiple layers of neurons being activated by the ReLU function.
 Uses: ReLU is less computationally expensive than tanh and sigmoid because it
involves simpler mathematical operations. Only a few neurons are activated at
a time, making the ANN sparse and therefore efficient and easy to compute.
 In simple words, ReLU learns much faster than the sigmoid and tanh functions.
 The basic rule of thumb is: if you really don't know what activation function
to use, simply use ReLU, as it is a general-purpose activation function for
hidden layers and is used in most cases these days.
 It overfits more easily than the sigmoid function, which is one of its main
limitations; techniques like dropout are used to reduce the overfitting.
5). Softmax Function:

 The softmax function is also a type of sigmoid function, but it is handy when
we are trying to handle multi-class classification problems.
 Nature: non-linear
 Uses: Usually used when handling multiple classes; the softmax function is
commonly found in the output layer of image classification problems. The
softmax function squeezes the output for each class between 0 and 1 and
divides by the sum of the outputs, so the results form a probability
distribution.
 Output: The softmax function is ideally used in the output layer of the
classifier, where we are actually trying to attain the probabilities that
define the class of each input.
 If your output is for multi-class classification, softmax is very useful for
predicting the probability of each class.
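
The activation functions above are easy to write down directly. Here is a small
NumPy sketch of the standard formulations (the example input values are assumptions):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # range (0, 1)

def relu(x):
    return np.maximum(0, x)           # range [0, inf)

def softmax(x):
    e = np.exp(x - np.max(x))         # subtract the max for numerical stability
    return e / e.sum()                # positive outputs that sum to 1

z = np.array([-2.0, 0.0, 3.0])        # assumed pre-activation values
print(sigmoid(z))                     # [0.119 0.5   0.953] (approximately)
print(np.tanh(z))                     # [-0.964 0.     0.995] (approximately)
print(relu(z))                        # [0. 0. 3.]
print(softmax(z))                     # [0.006 0.047 0.946] (approximately)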

Decision Tree
Decision Tree: Decision Tree is a Supervised learning technique that can be used for
both classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.

A Decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test,
and each leaf node (terminal node) holds a class label.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple
branches, whereas Leaf nodes are the outputs of those decisions and do not
contain any further branches.

The decisions or the test are performed on the basis of features of the given
dataset.

It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.

It is called a decision tree because, similar to a tree, it starts with the root
node, which expands on further branches and constructs a tree-like structure.

In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.

A decision tree simply asks a question and, based on the answer (Yes/No),
further splits into subtrees.

A decision tree can contain categorical data (YES/NO) as well as numeric data.

The diagram below explains the general structure of a decision tree:

Decision Tree Terminologies
Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.

Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.

Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.

Branch/Sub Tree: A subtree formed by splitting the tree.

Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: A node that divides into sub-nodes is called the parent node
of those sub-nodes, and the sub-nodes are called its child nodes.

How does the Decision Tree algorithm Work


In a decision tree, for predicting the class of the given dataset, the algorithm starts
from the root node of the tree. This algorithm compares the values of root attribute
with the record (real dataset) attribute and, based on the comparison, follows the
branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues this process until it reaches a leaf node
of the tree. The complete process can be better understood using the following
algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values of the best
attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset
created in Step-3. Continue this process until a stage is reached where the
nodes cannot be classified further; these final nodes are the leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide
whether he should accept the offer or not. To solve this problem, the decision
tree starts with the root node (the Salary attribute, chosen by ASM). The root node
splits further into the next decision node (distance from the office) and one leaf
node based on the corresponding labels. The next decision node further splits into
one decision node (cab facility) and one leaf node. Finally, the decision node splits
into two leaf nodes (Accepted offer and Declined offer). Consider the below diagram:
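
To make this concrete, here is a minimal scikit-learn sketch of training and
inspecting such a tree; the toy encoding of the job-offer data (salary in lakhs,
distance in km, cab facility as 0/1) is an assumption invented to mirror the
example above, not data from these notes:

from sklearn.tree import DecisionTreeClassifier, export_text

# Assumed toy data: [salary, distance, cab_facility], label 1 = accept, 0 = decline.
X = [[12, 5, 1], [6, 20, 0], [15, 25, 1], [9, 8, 0], [14, 30, 0], [7, 4, 1]]
y = [1, 0, 1, 1, 0, 0]

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
print(tree.predict([[10, 12, 1]]))   # predicted decision for a new offer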

Attribute Selection Measures


While implementing a decision tree, the main issue is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is
a technique called the Attribute Selection Measure, or ASM. With this
measurement, we can easily select the best attribute for the nodes of the tree.
There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the
segmentation of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the
decision tree.
o A decision tree algorithm always tries to maximize the value of information
gain, and a node/attribute having the highest information gain is split first. It
can be calculated using the below formula:

Information Gain = Entropy(S) - [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies
randomness in data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

o S = the set of samples
o P(yes) = probability of yes
o P(no) = probability of no
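
A quick worked computation in Python may help; the 14-sample, 9-yes/5-no split
below is an assumed illustration (echoing the classic play-tennis dataset), not
data from these notes:

import math

def entropy(p_yes, p_no):
    """Binary entropy in bits; a term with probability 0 contributes 0."""
    return sum(-p * math.log2(p) for p in (p_yes, p_no) if p > 0)

# Assumed: 14 samples, 9 labelled "yes" and 5 labelled "no".
parent = entropy(9/14, 5/14)
print(parent)                          # ~0.940 bits

# Information gain of a split into subsets of sizes 8 (6 yes / 2 no)
# and 6 (3 yes / 3 no): parent entropy minus the weighted-average child entropy.
children = (8/14) * entropy(6/8, 2/8) + (6/14) * entropy(3/6, 3/6)
print(parent - children)               # ~0.048 bits of information gained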

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision
tree in the CART(Classification and Regression Tree) algorithm.
o Gini Index is a score that evaluates how mixed the classified groups of a split
are. It lies in the range between 0 and 1, where 0 means all observations
belong to one class and values near 1 mean the elements are randomly
distributed across many classes (for two classes the maximum is 0.5).
o An attribute with the low Gini index should be preferred as compared to the
high Gini index. We prefer to have a Gini index score as low as possible.
o It only creates binary splits, and the CART algorithm uses the Gini index to
create binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 - ∑j (Pj)²
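
In code the formula is a one-liner; the class counts below are assumptions chosen
to show the two extremes:

def gini(counts):
    """Gini index: 1 - sum of squared class probabilities."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([10, 0]))   # 0.0 -> pure node: all observations in one class
print(gini([5, 5]))    # 0.5 -> maximally mixed node for two classes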


Pruning: Getting an Optimal Decision tree
Pruning: Pruning is a process of deleting the unnecessary nodes from a tree in order
to get the optimal decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture
all the important features of the dataset. A technique that decreases the size
of the learning tree without reducing accuracy is therefore known as Pruning. There
are mainly two tree-pruning techniques in use:

o Cost Complexity Pruning
o Reduced Error Pruning
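
Cost complexity pruning is exposed directly in scikit-learn through the ccp_alpha
parameter of DecisionTreeClassifier. A minimal sketch, reusing the assumed toy
job-offer data from the earlier example (the alpha value is also an illustrative
assumption):

from sklearn.tree import DecisionTreeClassifier

# Assumed toy data as in the job-offer sketch above.
X = [[12, 5, 1], [6, 20, 0], [15, 25, 1], [9, 8, 0], [14, 30, 0], [7, 4, 1]]
y = [1, 0, 1, 1, 0, 0]

full_tree = DecisionTreeClassifier().fit(X, y)                  # grown without pruning
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.05).fit(X, y)  # cost complexity pruning
# A larger ccp_alpha removes more branches, so the pruned tree has fewer nodes.
print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)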

Advantages of Decision Tree

o Decision Trees usually mimic human thinking ability while making a decision,
so it is easy to understand.
o The logic behind the decision tree can be easily understood because it shows a
tree-like structure.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.
o Decision trees perform classification without requiring much computation.
o Decision trees are able to handle both continuous and categorical variables.
o Decision trees provide a clear indication of which fields are most important
for prediction or classification.

Disadvantages of Decision Tree

o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random
Forest algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.
o Decision trees are less appropriate for estimation tasks where the goal is to
predict the value of a continuous attribute.
o Decision trees are prone to errors in classification problems with many
classes and a relatively small number of training examples.
o Decision trees can be computationally expensive to train: at each node, each
candidate splitting field must be sorted before its best split can be found.
