Unit 5

The document discusses neural networks, focusing on perceptrons and multilayer perceptrons, including their structure, components, and functioning. It highlights the advantages and disadvantages of artificial neural networks, such as parallel processing capability and the challenge of determining proper network structure. Additionally, it explains the workings of single-layer and multi-layer perceptron models, including their training algorithms and applications in solving complex problems.

UNIT V NEURAL NETWORKS

Perceptron - Multilayer perceptron, activation functions, network training – gradient descent optimization – stochastic gradient descent, error backpropagation, from shallow networks to deep networks – Unit saturation (aka the vanishing gradient problem) – ReLU, hyperparameter tuning, batch normalization, regularization, dropout.

5.1 Perceptron, Multilayer Perceptron


5.1.1 Introduction to Neural Networks
The term "Artificial Neural Network" is derived from the biological neural networks that make up the structure of the human brain. Just as the human brain has neurons that are interconnected with one another, an artificial neural network has neurons that are interconnected across the various layers of the network. These neurons are known as nodes.
The given figure illustrates a typical biological neural network.

Scientists discovered that brain cells (neurons) receive input from our senses as electrical signals. The neurons, in turn, use electrical signals to store information and to make decisions based on previous input.
Frank Rosenblatt had the idea that perceptrons could simulate brain principles, with the ability to learn and make decisions. In 1957 he implemented a perceptron program on an IBM 704 computer at the Cornell Aeronautical Laboratory.
5.1.1.1 Definition of Perceptron
A Perceptron is an Artificial Neuron. It is the simplest possible Neural Network.
5.1.1.2 What is Artificial Neuron
An artificial neuron is a mathematical function based on a model of biological neurons,
where each neuron takes inputs, weighs them separately, sums them up and passes this sum
through a nonlinear function to produce output.

CS3491 – AI & ML UNIT 5 Prepared By Mrs. R. Anne Pratheeba, AP/CSE


5.1.1.3 Biological Neuron vs. Artificial Neuron
The biological neuron is analogous to artificial neurons in the following terms:

Biological Neuron        Artificial Neuron
Cell Nucleus (Soma)      Node
Dendrites                Input
Synapse                  Weights or interconnections
Axon                     Output

The typical Artificial Neural Network looks something like the given figure.

5.1.1.4 Layers of Artificial Neural Network


Artificial Neural Network primarily consists of three layers:

1. Input Layer:
As the name suggests, it accepts inputs in several different formats provided by the
programmer.
2. Hidden Layer:
The hidden layer sits between the input and output layers. It performs all the calculations needed to find hidden features and patterns.
3. Output Layer:
The input goes through a series of transformations in the hidden layer, and the final result is conveyed through this layer.
The artificial neural network takes the inputs, computes their weighted sum, and adds a bias. This computation is represented in the form of a transfer function. The weighted total is then passed as input to an activation function to produce the output. Activation functions decide whether a node should fire or not. Only the nodes that fire pass their signal on to the output layer.
5.1.1.5 Advantages of Artificial Neural Network (ANN)
1. Parallel processing capability:
Because their computation is distributed over many nodes, artificial neural networks can perform more than one task simultaneously.
2. Storing data on the entire network:
Unlike traditional programming, where data is stored in a database, information is stored across the whole network. The loss of a few pieces of information in one place does not prevent the network from working.
3. Capability to work with incomplete knowledge:
After training, an ANN may produce output even with incomplete data. The loss of performance depends on the importance of the missing data.
4. Having a memory distribution:
For an ANN to be able to adapt, it is important to choose the examples and to teach the network the desired output by showing it these examples. The success of the network depends directly on the chosen instances; if the event cannot be shown to the network in all its aspects, the network can produce false output.
5. Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating output, and this feature makes the network fault-tolerant.

5.1.1.6 Disadvantages of Artificial Neural Network:
1. Assurance of proper network structure:
There is no particular guideline for determining the structure of artificial neural networks. The appropriate network structure is found through experience and trial and error.
2. Unrecognized behavior of the network:
This is the most significant issue of ANNs. When an ANN produces a solution, it does not provide insight into why and how. This decreases trust in the network.
3. Hardware dependence:
Artificial neural networks need processors with parallel processing power, in keeping with their structure. Their realization therefore depends on the available hardware.
4. Difficulty of showing the problem to the network:
ANNs work with numerical data. Problems must be converted into numerical values before being introduced to the ANN. The representation chosen here directly affects the performance of the network, and it depends on the user's ability.
5. The duration of the network is unknown:
Training stops when the error is reduced to a specific value, and this value does not guarantee optimum results.
5.1.2 Perceptron
The perceptron was introduced by Frank Rosenblatt in 1957, together with the perceptron learning rule. A perceptron is an algorithm for supervised learning of binary classifiers. This algorithm enables neurons to learn, processing the elements of the training set one at a time.
The perceptron is a building block of an artificial neural network. It is also understood as an artificial neuron or neural network unit that performs computations to detect features in the input data, for example in business intelligence. The perceptron model is regarded as one of the simplest types of artificial neural networks.
5.1.2.1 Basic Components of Perceptron
Perceptron is a type of artificial neural network, which is a fundamental concept in
machine learning. The basic components of a perceptron are:

1. Input Layer: The input layer consists of one or more input neurons, which receive input signals from the external world or from other layers of the neural network.
2. Weights: Each input neuron is associated with a weight, which represents the strength of the connection between the input neuron and the output neuron.
3. Bias: A bias term is added to the weighted sum of the inputs to give the perceptron additional flexibility in modeling patterns in the input data.
4. Activation Function: The activation function determines the output of the perceptron based on the weighted sum of the inputs and the bias term. Common activation functions used in perceptrons include the step function, sigmoid function, and ReLU function.
5. Output: The output of the perceptron is a single binary value, either 0 or 1, which indicates the class or category to which the input data belongs.
6. Training Algorithm: The perceptron is typically trained using a supervised learning algorithm such as the perceptron learning algorithm or backpropagation. During training, the weights and biases of the perceptron are adjusted to minimize the error between the predicted output and the true output for a given set of training examples.
5.1.2.2 How does Perceptron work?
The perceptron is considered a single-layer neural network with four main parts: input values (input nodes), weights and bias, net sum, and an activation function.
The perceptron model begins by multiplying each input value by its weight, then adds these products together to create the weighted sum. This weighted sum is then passed to the activation function 'f' to obtain the desired output. In the classic perceptron, this activation function is the step function.

The perceptron model works in two important steps as follows:
Step-1
In the first step, multiply each input value by its corresponding weight and then add the products to determine the weighted sum. Mathematically, we can calculate the weighted sum as follows:
∑wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied to the above-mentioned weighted sum, which gives us an output either in binary form or as a continuous value:
Y = f(∑wi*xi + b)
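The two steps above can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the input, weight, and bias values are hypothetical and are not taken from the notes, and a step function with threshold 0 is assumed for f.

```python
def step(z):
    """Step activation (assumed here): 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0


def perceptron_output(inputs, weights, bias):
    """Compute Y = f(sum(w_i * x_i) + b) with a step activation f."""
    # Step 1: weighted sum of the inputs
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    # Step 2: add the bias and apply the activation function
    return step(weighted_sum + bias)


# Hypothetical values chosen only for illustration.
y = perceptron_output(inputs=[1, 0, 1], weights=[0.5, -0.2, 0.3], bias=-0.6)
print(y)
```

Swapping `step` for another activation (for example a sigmoid) would give a continuous output instead of a binary one.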
5.1.2.3 The Perceptron Algorithm
1. Set a threshold value
2. Multiply each input by its weight
3. Sum all the results
4. Activate the output
5.1.2.3.1 Example:
1. Set a threshold value: Threshold = 1.5
2. Multiply each input by its weight:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:
0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)
4. Activate the output:
Return true if the sum > 1.5 ("Yes I will go to the Concert"). Here 1.6 > 1.5, so the output is true.
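The worked example can be reproduced with a short script. The inputs, weights, and threshold are the ones given in the example; the variable names are illustrative.

```python
# Values taken from the worked example: five binary inputs and their weights.
inputs = [1, 0, 1, 0, 1]
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
threshold = 1.5

# Steps 2 and 3: multiply each input by its weight and sum the results.
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

# Step 4: activate the output by comparing against the threshold.
go_to_concert = weighted_sum > threshold

print(round(weighted_sum, 2), go_to_concert)
```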
5.1.2.4 Types of Perceptron Models
Based on the layers, Perceptron models are divided into two types. These are as follows:
1. Single-layer Perceptron Model
2. Multi-layer Perceptron model

5.1.3 Single Layer Perceptron Model:
This is one of the simplest types of artificial neural network (ANN). A single-layer perceptron model consists of a feed-forward network and includes a threshold transfer function inside the model.
The main objective of the single-layer perceptron model is to classify linearly separable objects with binary outcomes. In a single-layer perceptron model, the algorithm has no previously recorded data, so it begins with randomly allocated values for the weight parameters.
It then sums all the weighted inputs. If the total sum is more than a pre-determined value, the model gets activated and shows the output value as +1.
If the outcome matches the desired value, the performance of the model is considered satisfactory and the weights do not change. However, discrepancies arise when multiple weighted input values are fed into the model. To obtain the desired output and minimize errors, the weights must then be adjusted. "Single-layer perceptron can learn only linearly separable patterns."
5.1.3.1 Perceptron Function
The perceptron function f(x) is obtained by multiplying the input 'x' by the learned weight coefficient 'w'. Mathematically, we can express it as follows:
f(x) = 1 if w.x + b > 0
f(x) = 0 otherwise
where 'w' represents the real-valued weight vector, 'b' represents the bias, and 'x' represents the vector of input values.
5.1.3.2 Characteristics of Perceptron
The perceptron model has the following characteristics.
1. Perceptron is a machine learning algorithm for supervised learning of binary classifiers.
2. In Perceptron, the weight coefficient is automatically learned.
3. Initially, the weights are multiplied by the input features, and a decision is made whether the neuron fires or not.
4. The activation function applies a step rule to check whether the weighted sum is greater than zero.
5. A linear decision boundary is drawn, enabling the distinction between the two linearly separable classes +1 and -1.
6. If the sum of all weighted input values is more than the threshold value, the perceptron produces an output signal; otherwise, no output is shown.
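The notes state that the weights are learned automatically but do not spell out the update. The sketch below uses the standard perceptron learning rule (w ← w + η · (target − prediction) · x) as an assumption, trained on the linearly separable AND gate. The function names, learning rate, and epoch count are illustrative choices, and the classes are encoded as 0 and 1.

```python
def predict(weights, bias, x):
    """Step rule: fire (1) if the weighted sum plus bias is greater than zero."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z > 0 else 0


def train_perceptron(samples, learning_rate=0.1, epochs=20):
    """Standard perceptron learning rule (assumed; not given in the notes)."""
    n_features = len(samples[0][0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0, or +1
            # Weights change only when the prediction is wrong (error != 0).
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# AND gate: a linearly separable problem with binary labels.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(predictions)
```

Running the same loop on XOR would never settle, because no single linear boundary separates its classes.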

5.1.3.3 Limitations of Perceptron Model
A perceptron model has limitations as follows:
• The output of a perceptron can only be a binary number (0 or 1) due to the hard limit transfer function.
• A perceptron can only be used to classify linearly separable sets of input vectors. If the input vectors are not linearly separable, it cannot classify them properly.
5.1.4 Multi-Layered Perceptron Model:
A multi-layer perceptron model has the same basic structure as a single-layer perceptron model but adds one or more hidden layers.
The multi-layer perceptron model is trained with the backpropagation algorithm, which executes in two stages as follows:
1. Forward Stage: Activations are computed starting from the input layer and ending at the output layer.
2. Backward Stage: Weight and bias values are modified as per the model's requirement. In this stage, the error between the actual output and the desired output is propagated backward from the output layer to the input layer.
Hence, a multi-layer perceptron model can be considered as multiple layers of artificial neurons in which the activation function is non-linear.
The activation function can be sigmoid, tanh, ReLU, etc.
A multi-layer perceptron model has greater processing power and can process linear and non-linear patterns. Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT, XNOR, NOR.
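As an illustration of why a hidden layer matters, XOR (which a single-layer perceptron cannot represent) can be computed by a small two-layer network. The weights below are set by hand for the sake of the example rather than learned, and the function names are hypothetical.

```python
def step(z):
    """Step activation: 1 if z is positive, otherwise 0."""
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    """Two hidden units (OR and AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output fires when OR is on and AND is off, which is exactly XOR.
    return step(h_or - h_and - 0.5)


truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```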
Artificial neural networks with multiple layers are also called deep neural networks. The backpropagation algorithm is used to train a multilayer neural network.
5.1.4.1 Multi-layer ANN
A fully connected multi-layer neural network is called a Multilayer Perceptron (MLP). In its simplest form it has 3 layers, including one hidden layer. If it has more than one hidden layer, it is called a deep ANN.
An MLP is a typical example of a feedforward artificial neural network. The number of layers and the number of neurons are hyperparameters of a neural network, and these need tuning. The weights are adjusted during training via backpropagation. Deeper neural networks can model more complex data. However, deeper networks can suffer from the vanishing gradient problem, and special techniques are required to address it.

5.1.4.2 Notations
In the representation below:

a_i^(in) refers to the ith value in the input layer, a_i^(h) refers to the ith unit in the hidden layer, and a_i^(out) refers to the ith unit in the output layer. a_0^(in) is simply the bias unit and is equal to 1; it has the corresponding weight w_0. The weight coefficient from layer l to layer l+1 is represented by w_{k,j}^(l).
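A forward pass using this notation might look like the sketch below. It assumes a 2-3-1 network with sigmoid activations and made-up weights. The indexing convention (row j of a weight matrix holds the weights into unit j of the next layer, with column 0 reserved for the bias unit) is an assumption for illustration, since the notes do not fix the order of the indices k and j.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(activations, weight_matrix):
    """Propagate one layer forward.

    weight_matrix[j][k] is the weight from unit k of the current layer
    (k = 0 is the bias unit) to unit j of the next layer.
    """
    with_bias = [1.0] + list(activations)  # a_0 = 1 is the bias unit
    return [
        sigmoid(sum(w * a for w, a in zip(row, with_bias)))
        for row in weight_matrix
    ]


# Hypothetical weights for a 2-3-1 network (illustration only).
w_hidden = [[0.1, 0.4, -0.6], [-0.3, 0.2, 0.5], [0.0, -0.7, 0.8]]
w_out = [[0.2, 0.5, -0.4, 0.9]]

a_in = [1.0, 0.0]                     # input layer values
a_h = layer_forward(a_in, w_hidden)   # hidden layer activations
a_out = layer_forward(a_h, w_out)     # output layer activation
print(a_h, a_out)
```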
5.1.4.3 Advantages of Multi-Layer Perceptron:
1. A multi-layered perceptron model can be used to solve complex non-linear problems.
2. It works well with both small and large input data.
3. It helps us to obtain quick predictions after the training.
4. It helps to obtain the same accuracy ratio with large as well as small data.
5.1.4.4 Disadvantages of Multi-Layer Perceptron:
1. In a multi-layer perceptron, computations are difficult and time-consuming.
2. In a multi-layer perceptron, it is difficult to determine how much each independent variable affects the dependent variable.
3. The model's performance depends on the quality of the training.

5.2 Activation Function, Network Training
5.2.1 Definition of Activation Function
The activation function decides whether a neuron should be activated or not, based on the weighted sum of its inputs plus the bias. The purpose of the activation function is to introduce non-linearity into the output of a neuron.
In artificial neural networks, an activation function outputs a small value for small inputs and a larger value when its inputs exceed a threshold. An activation function "fires" if the inputs are big enough; otherwise, nothing happens.
An activation function, then, is a gate that checks whether an incoming value is higher than a threshold value. Activation functions are useful because they introduce non-linearities, which enable neural networks to learn powerful operations.
If the activation functions were removed, a feedforward neural network could be reduced to a simple linear function, that is, a single matrix transformation of its input.
5.2.1.1 Explanation:
As we are aware, neurons in neural networks operate according to their weights, biases, and activation functions. Based on the error, the weights and biases of the neurons in the network are updated. This process is known as back-propagation.
Back-propagation is made possible by differentiable activation functions, since their gradients, together with the error, are what is needed to update the weights and biases.
5.2.1.2 Need of Non-linear Activation Functions
A neural network without an activation function is just a linear regression model. The activation function transforms the input non-linearly, allowing the system to learn and perform more challenging tasks.
The activation function is simply the function used to obtain a node's output. It is also known as the transfer function.
The composition of two linear functions is itself a linear function, so no matter how many hidden layers we add to a neural network, without non-linear activations the network behaves like a single linear layer. The neuron cannot learn complex patterns if all it has is a linear model. With a non-linear activation function, it can learn such patterns from the error gradient.
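The collapse of stacked linear layers can be checked numerically. The sketch below uses scalar layers with arbitrary, made-up weights: two linear layers applied in sequence give exactly the same outputs as one linear layer with combined weight w2*w1 and combined bias w2*b1 + b2.

```python
def linear_layer(x, w, b):
    """A layer with no activation function: just w * x + b."""
    return w * x + b


# Arbitrary illustrative parameters for two stacked linear layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)


# The equivalent single layer: w2*(w1*x + b1) + b2 = (w2*w1)*x + (w2*b1 + b2)
w_eq = w2 * w1
b_eq = w2 * b1 + b2


def one_linear_layer(x):
    return linear_layer(x, w_eq, b_eq)


for x in (-2.0, 0.0, 1.5, 10.0):
    print(x, two_linear_layers(x), one_linear_layer(x))
```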

The two main categories of activation functions are:
1. Linear Activation Function
2. Non-linear Activation Functions
5.2.2 Linear Activation Function
As can be observed, the function is a straight line, f(x) = x. Its output is therefore not confined to any range. A linear activation does not help the network handle the complexity of the data that is usually fed to neural networks.

5.2.3 Non-linear Activation Function


The non-linear activation functions are the most widely used activation functions. Non-linearity makes the graph look something like this. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.

The non-linear activation functions are mainly divided on the basis of their range or curves.
5.2.3.1 Types of Non-linear Activation Function
1. Sigmoid or Logistic Activation Function
The Sigmoid Function curve looks like a S-shape.

The main reason we use the sigmoid function is that its output lies between 0 and 1. Therefore, it is especially used for models where we have to predict a probability as the output. Since a probability only takes values between 0 and 1, sigmoid is the right choice.
The function is differentiable. That means we can find the slope of the sigmoid curve at any point.
2. Tanh or hyperbolic tangent Activation Function
Tanh is similar to the logistic sigmoid, but its output is centered on zero. The range of the tanh function is (-1 to 1). tanh is also sigmoidal (S-shaped).

The advantage is that negative inputs are mapped to strongly negative outputs and inputs near zero are mapped to outputs near zero in the tanh graph.
The function is differentiable. The function is monotonic while its derivative is not monotonic. The tanh function is mainly used for classification between two classes.
3. ReLU (Rectified Linear Unit) Activation Function
ReLU is currently the most widely used activation function, since it is used in almost all convolutional neural networks and deep learning models.

The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
Range: [0, infinity). The function and its derivative are both monotonic.
The issue is that all negative values become zero immediately, which decreases the ability of the model to fit or train from the data properly. Any negative input given to the ReLU activation function is mapped to zero, so negative values are not represented appropriately in the output.
4. Leaky ReLU
It is an attempt to solve the dying ReLU problem.

The leak helps to increase the range of the ReLU function: for negative inputs the output is a*z instead of zero. Usually, the value of 'a' is 0.01 or so.
When 'a' is not fixed at 0.01 but chosen randomly, it is called Randomized ReLU. Therefore, the range of the Leaky ReLU is (-infinity to infinity). Both Leaky and Randomized ReLU functions are monotonic in nature. Also, their derivatives are monotonic in nature.
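The four non-linear activation functions described above, together with their derivatives, can be written compactly as follows. This is a minimal sketch using only the standard library; the function names are illustrative, and the derivative of ReLU at exactly zero is set to 0 by convention (it is not defined mathematically at that point).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))          # range (0, 1)


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh(z):
    return math.tanh(z)                         # range (-1, 1)


def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2


def relu(z):
    return max(0.0, z)                          # range [0, infinity)


def relu_grad(z):
    return 1.0 if z > 0 else 0.0                # zero gradient for z <= 0


def leaky_relu(z, a=0.01):
    return z if z >= 0 else a * z               # small slope 'a' for z < 0


def leaky_relu_grad(z, a=0.01):
    return 1.0 if z >= 0 else a                 # gradient never exactly zero


for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z), leaky_relu(z))
```

Comparing `relu_grad(-2.0)` with `leaky_relu_grad(-2.0)` shows the difference that matters for the dying ReLU problem: the first is exactly zero, so no weight update flows through that unit, while the second is small but non-zero.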
5.2.4 Training Network
Neural network training is the process of teaching a neural network to perform a task.
Neural networks learn by initially processing several large sets of labeled or unlabeled data. By
using these examples, they can then process unknown inputs more accurately.
5.2.4.1 Supervised learning
In supervised learning, data scientists give artificial neural networks labeled datasets that
provide the right answer in advance. For example, a deep learning network training in facial
recognition initially processes hundreds of thousands of images of human faces, with various
terms related to ethnic origin, country, or emotion describing each image.
The neural network slowly builds knowledge from these datasets. After the network has been trained, it starts making guesses about the ethnic origin or emotion of a new image of a human face that it has never processed before.
Fitting a neural network involves using a training dataset to update the model weights to
create a good mapping of inputs to outputs.

Training a neural network is an iterative process. In each iteration, we do a forward pass through the model's layers to compute an output for each training example in a batch of data. Then another pass proceeds backward through the layers, propagating how much each parameter affects the final output by computing a gradient with respect to each parameter.
The average gradient for the batch, the parameters, and some per-parameter optimization state are passed to an optimization algorithm, which computes the next iteration's parameters (which should have slightly better performance on your data) and new per-parameter optimization state.
As the training iterates over batches of data, the model evolves to produce increasingly accurate outputs.
Training a neural network can therefore be framed as an optimization problem: an optimization algorithm searches through the space of possible values for the model weights for a set of weights that best maps inputs to outputs on the training dataset.
The problem is hard, not least because the error surface is non-convex, contains local minima and flat spots, and is highly multidimensional.
Stochastic gradient descent is the most widely used general-purpose algorithm for addressing this challenging problem.
5.2.4.2 Backpropagation
Backpropagation is an algorithm used for training feedforward neural networks. It plays an important part in improving the predictions made by neural networks, because it improves the network's output iteratively.
In a feedforward neural network, the input moves forward from the input layer to the output layer. Backpropagation improves the network's output by propagating the error backward from the output layer to the input layer.

5.2.4.2.1 How Does Backpropagation Work?
To understand how backpropagation works, let's first understand how a feedforward network works.
Feed Forward Networks
A feedforward network consists of an input layer, one or more hidden layers, and an output layer. The input layer receives the input into the neural network, and each input has a weight attached to it. The weights associated with each input are numerical values. These weights are an indicator of the importance of the input in predicting the final output.
For example, an input associated with a large weight will have a greater influence on the output than an input associated with a small weight.
When training starts, the neural network is fed with input. Since the neural network isn't trained yet, we don't know which weights to use for each input, and so each input is randomly assigned a weight. Since the weights are randomly assigned, the neural network will likely make wrong predictions and give out incorrect output.

When the neural network gives out the incorrect output, this leads to an output error. This
error is the difference between the actual and predicted outputs. A cost function measures this
error.
The cost function (J) indicates how accurately the model performs. It tells us how far-off our
predicted output values are from our actual values. It is also known as the error. Because the cost
function quantifies the error, we aim to minimize the cost function.
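The notes do not say which cost function J is used. As one common choice, assumed here purely for illustration, the mean squared error averages the squared differences between actual and predicted outputs. The data values below are made up.

```python
def mse_cost(actual, predicted):
    """Mean squared error (an assumed choice for J): average squared difference."""
    n = len(actual)
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n


# Made-up targets and two sets of predictions for comparison.
actual = [1.0, 0.0, 1.0, 1.0]
good_predictions = [0.9, 0.1, 0.8, 0.7]
poor_predictions = [0.2, 0.9, 0.3, 0.4]

cost_good = mse_cost(actual, good_predictions)
cost_poor = mse_cost(actual, poor_predictions)
print(cost_good, cost_poor)
```

The predictions that sit closer to the actual values produce the smaller cost, which is the quantity training tries to minimize.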

What we want is to reduce the output error. Since the weights affect the error, we will
need to readjust the weights. We have to adjust the weights such that we have a combination of
weights that minimizes the cost function. This is where Backpropagation comes in…
Backpropagation allows us to readjust our weights to reduce output error. The error is
propagated backward during backpropagation from the output to the input layer. This error is
then used to calculate the gradient of the cost function with respect to each weight.

Essentially, backpropagation aims to calculate the negative gradient of the cost function. This negative gradient is what helps in adjusting the weights. It tells us how we need to change the weights so that we can reduce the cost function.
Backpropagation uses the chain rule to calculate the gradient of the cost function. This involves computing the partial derivative of the cost with respect to each parameter: we differentiate with respect to one weight while treating the others as constants. Collecting these partial derivatives gives the gradient, which we then use to adjust the weights.
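The chain rule can be made concrete on the smallest possible network: one sigmoid neuron with one weight and one bias. The sketch below assumes a squared-error cost J = 0.5 * (y - target)^2, which the notes do not specify, and uses made-up parameter values. It compares the chain-rule gradient with a numerical estimate obtained by nudging the parameter slightly.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass of a single sigmoid neuron: y = sigmoid(w * x + b)."""
    return sigmoid(w * x + b)


def cost(w, b, x, target):
    """Squared-error cost (assumed): J = 0.5 * (y - target)^2."""
    return 0.5 * (forward(w, b, x) - target) ** 2


def gradients(w, b, x, target):
    """Chain rule: dJ/dw = dJ/dy * dy/dz * dz/dw, and likewise for b."""
    y = forward(w, b, x)
    dJ_dy = y - target            # derivative of the cost w.r.t. the output
    dy_dz = y * (1.0 - y)         # derivative of the sigmoid
    dJ_dz = dJ_dy * dy_dz
    return dJ_dz * x, dJ_dz       # dz/dw = x, dz/db = 1


# Made-up values for illustration.
w, b, x, target = 0.5, -0.2, 1.5, 1.0
dJ_dw, dJ_db = gradients(w, b, x, target)

# Numerical check: central finite differences.
eps = 1e-6
numeric_dw = (cost(w + eps, b, x, target) - cost(w - eps, b, x, target)) / (2 * eps)
numeric_db = (cost(w, b + eps, x, target) - cost(w, b - eps, x, target)) / (2 * eps)
print(dJ_dw, numeric_dw, dJ_db, numeric_db)
```

In a multi-layer network the same product of local derivatives is extended backward one layer at a time, which is where the name backpropagation comes from.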
5.2.4.3 Learning as Optimization
Deep learning neural network models learn to map inputs to outputs given a training
dataset of examples.
The training process involves finding a set of weights in the network that proves to be
good, or good enough, at solving the specific problem.
This training process is iterative, meaning that it progresses step by step with small
updates to the model weights each iteration and, in turn, a change in the performance of the
model each iteration.
The iterative training process of neural networks solves an optimization problem: it searches for parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.

Optimization is a directed search procedure and the optimization problem that we wish to
solve when training a neural network model is very challenging.
5.2.4.4 Challenging Optimization
Training deep learning neural networks is very challenging.
The best general algorithm known for solving this problem is stochastic gradient descent,
where model weights are updated each iteration using the backpropagation of error algorithm.
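A minimal sketch of stochastic gradient descent is shown below. To keep it short it fits a one-input linear model y = w*x + b rather than a full neural network, so the gradient can be written by hand instead of using backpropagation. The data, learning rate, epoch count, and function name are illustrative assumptions.

```python
import random


def sgd_linear_fit(data, learning_rate=0.05, epochs=500, seed=0):
    """Fit y = w*x + b by SGD on the per-example cost 0.5 * (prediction - y)^2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)               # visit examples in random order
        for x, y in samples:
            error = (w * x + b) - y        # prediction error on ONE example
            # Update immediately using the gradient from this single example.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


# Noise-free samples of the line y = 2x + 1 (made-up data).
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0, 4.0)]
w, b = sgd_linear_fit(data)
print(w, b)
```

The defining feature is that the weights change after every example (or small batch) rather than after a full pass over the dataset.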
However, there are several optimization techniques that can be used to improve the
performance of Gradient Descent.
Here are some of the most popular optimization techniques for Gradient Descent:
1. Learning Rate Scheduling
2. Momentum-based Updates
3. Batch Normalization
4. Weight Decay
5. Adaptive Learning Rates
6. Gradient Descent
5.3 Gradient Descent Optimization – Stochastic Gradient Descent
5.3.1 Gradient Descent
Gradient Descent is known as one of the most commonly used optimization algorithms to
train machine learning models by means of minimizing errors between actual and expected
results. Further, gradient descent is also used to train Neural Networks.
The weights are adjusted using a process called gradient descent. Gradient descent is an optimization algorithm that is used to find the weights that minimize the cost function. Minimizing the cost function means getting to the minimum point of the cost function. So, gradient descent aims to find the weights corresponding to the cost function's minimum point. To find these weights, we must navigate down the cost function until we reach its minimum point.
In mathematical terminology, optimization refers to the task of minimizing or maximizing an objective function f(x) parameterized by x.

The main objective of gradient descent is to minimize a cost function (ideally a convex one) through iterative parameter updates.
5.3.1.1 What is Gradient Descent or Steepest Descent?
Gradient Descent is defined as one of the most commonly used iterative optimization
algorithms of machine learning to train the machine learning and deep learning models. It helps in
finding the local minimum of a function.
The local minimum or local maximum of a function can be found with gradient-based steps as follows:
If we move in the direction of the negative gradient, that is, away from the gradient of the function at the current point, we approach a local minimum of that function.
If we move in the direction of the positive gradient, that is, towards the gradient of the function at the current point, we approach a local maximum of that function.
The main objective of using a gradient descent algorithm is to minimize the cost function using iteration.
To achieve this goal, it performs two steps iteratively:
1. Calculate the first-order derivative of the function to obtain the gradient (slope) of
that function at the current point.
2. Move in the direction opposite to the gradient, by a step of alpha times the gradient,
where alpha is the learning rate. The learning rate is a tuning parameter of the
optimization process that determines the length of the steps.
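The two steps above can be sketched in a few lines of Python. This is only a minimal illustration: the cost J(w) = (w - 3)^2, the starting point, the learning rate of 0.1 and the fixed number of steps are assumptions chosen for the example, not values from these notes.

```python
def gradient_descent(grad, w0, learning_rate=0.1, n_steps=100):
    """Repeat the two steps: evaluate the gradient, then step against it."""
    w = w0
    for _ in range(n_steps):
        g = grad(w)                   # step 1: first-order derivative at w
        w = w - learning_rate * g     # step 2: move opposite to the gradient
    return w


# Illustrative cost J(w) = (w - 3)^2, whose derivative is dJ/dw = 2 (w - 3).
w_min = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
print(w_min)  # very close to 3, the minimizer of J
```

A fixed number of steps keeps the sketch short; in practice one would usually stop when the gradient or the change in cost becomes small.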
5.3.1.2 What is Cost-function?
The cost function measures the difference, or error, between actual values and expected
values at the current position, expressed as a single real number.
It improves the efficiency of machine learning by providing feedback to the model so that
it can reduce the error and find a local or global minimum.
Gradient descent iterates along the direction of the negative gradient until the cost
function reaches a minimum, which is ideally close to zero.

The cost function is calculated after making a hypothesis with initial parameters and
modifying these parameters using gradient descent algorithms over known data to reduce the
cost function.
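As a small illustration of a cost function that reduces the error to a single real number, the sketch below uses the mean squared error. The choice of mean squared error and the sample values are assumptions made for the example; the notes do not fix a particular cost function at this point.

```python
def mse_cost(predictions, targets):
    """Mean squared error: average of squared differences, one real number."""
    n = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n


# Hypothetical predictions from some initial parameters, and the known targets.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]
cost = mse_cost(predictions, targets)
print(cost)  # (0.25 + 0.25 + 0.0) / 3
```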
Navigating the cost function consists of adjusting the weights. The weights are adjusted
using the following formula:
$$w_{new} = w_{old} - \alpha \frac{\partial J}{\partial w}$$
where $\alpha$ is the learning rate and $\partial J / \partial w$ is the gradient of the cost
function $J$ with respect to the weight.
This is the formula for gradient descent. As we can see, to obtain the new weight, we use
the gradient, the learning rate, and an initial weight.
Adjusting the weights takes multiple iterations. We take a new step down for each
iteration and calculate a new weight. Using the initial weight, the gradient and the learning
rate, we can determine the subsequent weights.
Let’s consider a graphical example of this:

From the graph of the cost function, we can see that:


• To start descending the cost function, we first initialize a random weight.
• Then, we take a step down and obtain a new weight using the gradient and learning rate.
With the gradient, we know which direction to navigate. With the learning rate, we know
the step size for navigating the cost function.
• We are then able to obtain a new weight using the gradient descent formula.
• We repeat this process until we reach the minimum point of the cost function.
• Once we've reached the minimum point, we find the weights that correspond to the
minimum of the cost function.
5.3.1.3 How does Gradient Descent work?
Before looking at the working principle of gradient descent, we should recall how the
slope of a line is defined in linear regression. The equation for simple linear
regression is given as:
Y=mX+c
Where 'm' represents the slope of the line, and 'c' represents the intercept on the y-axis.

The starting point (shown in the figure above) is just an arbitrary point from which we
begin to evaluate the performance.
At this starting point, we compute the first derivative, or slope, and use a tangent
line to measure the steepness of this slope. This slope then informs the updates to the
parameters (weights and bias).
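To make the link with Y = mX + c concrete, the following sketch fits the slope m and intercept c of a line by gradient descent on the mean squared error. It is a minimal illustration under stated assumptions: the cost (mean squared error), the data, the learning rate and the number of epochs are all chosen for the example and do not come from these notes.

```python
def fit_line(xs, ys, learning_rate=0.05, n_epochs=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0                     # arbitrary starting point
    n = len(xs)
    for _ in range(n_epochs):
        # Prediction errors at the current parameters.
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. m and c.
        grad_m = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_c = (2.0 / n) * sum(errors)
        # The slopes inform the updates to the parameters.
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]          # generated from y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)                              # approaches m = 2, c = 1
```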
The slope is steepest at the starting (arbitrary) point. As new parameters are
generated, the steepness gradually reduces until the curve reaches its lowest point,
which is called the point of convergence.
The main objective of gradient descent is to minimize the cost function, that is, the error
between expected and actual values. To minimize the cost function, two things are required:
5.3.1.4 Direction & Learning Rate
These two factors determine the partial derivative calculations of future iterations and
allow the algorithm to reach the point of convergence, that is, a local or global minimum.
Learning Rate:
The learning rate is used to calculate the step size at every iteration. If the learning rate
is too large, the steps may overshoot the optimum value. If it is too small, many iterations
may be required to reach a local minimum. A good starting point for the learning rate is 0.1,
adjusted as necessary.
It is defined as the step size taken to reach the minimum or lowest point. This is typically
a small value that is evaluated and updated based on the behavior of the cost function.
If the learning rate is high, it results in larger steps but also risks overshooting the
minimum. A low learning rate gives small step sizes, which compromises overall efficiency
but gives the advantage of more precision.
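The effect of the learning rate can be seen on a toy problem. The sketch below runs the same number of steps on J(w) = w^2 with three learning rates; the cost function, the three rates and the step count are assumptions picked only to make the behaviours visible.

```python
def run(learning_rate, w0=1.0, n_steps=50):
    """Gradient descent on J(w) = w^2 (gradient 2w) for a fixed number of steps."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * 2 * w
    return w


too_small = run(0.001)   # tiny steps: still far from the minimum at w = 0
reasonable = run(0.1)    # steady progress towards w = 0
too_large = run(1.1)     # every step overshoots and |w| grows

print(too_small, reasonable, too_large)
```

With the small rate the weight has barely moved after 50 steps, while with the large rate each update lands further from the minimum than the previous one.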

5.3.1.5 Types of Gradient Descent

Based on how much of the training data is used to compute the error for each update, the
Gradient Descent learning algorithm can be divided into
1. Batch gradient descent,
2. Stochastic gradient descent, and
3. Mini-Batch gradient descent.
5.3.2 Batch Gradient Descent:
Batch gradient descent (BGD) computes the error for each point in the training set
and updates the model only after evaluating all training examples. One such pass over the
training set is known as a training epoch. In simple words, it is a greedy approach where we
have to sum over all examples for each update.
5.3.2.1 Advantages of Batch gradient descent:
1. It produces less noise than the other types of gradient descent.
2. It produces stable gradient descent convergence.
3. It is computationally efficient, as all training samples are processed together for each update.
5.3.3 Stochastic gradient descent
Stochastic gradient descent (SGD) is a type of gradient descent that processes one training
example per iteration.
In each training epoch, it goes through every example in the dataset and updates the
parameters after each example, one at a time. As it requires only one training example at a
time, it is easier to fit in allocated memory.
However, it loses some computational efficiency in comparison to batch gradient
descent, because its frequent updates are more costly overall.
Further, due to the frequent updates, its gradient is noisy. However, this noise can sometimes
be helpful in escaping a local minimum and finding the global minimum.
5.3.3.1 Advantages of Stochastic gradient descent:
In Stochastic gradient descent (SGD), learning happens on every example, which gives it a
few advantages over other types of gradient descent:
1. It is easier to fit in allocated memory.
2. Each update is faster to compute than in batch gradient descent.
3. It is more efficient for large datasets.
5.3.4 MiniBatch Gradient Descent:
Mini Batch gradient descent is the combination of both batch gradient descent and
stochastic gradient descent.

It divides the training dataset into small batches and then performs an update on each
batch separately. Splitting the training dataset into smaller batches strikes a balance between
the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
Hence, we obtain a type of gradient descent with higher computational efficiency
and a less noisy gradient.
5.3.4.1 Advantages of Mini Batch gradient descent:
1. It is easier to fit in allocated memory.
2. It is computationally efficient.
3. It produces stable gradient descent convergence.
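The three types differ only in how many examples feed each update, which the sketch below makes explicit through a single `batch_size` parameter: 1 gives stochastic gradient descent, the full dataset size gives batch gradient descent, and anything in between gives mini-batch gradient descent. The function name, the line-fitting task, the mean squared error cost, the data and the hyperparameters are all assumptions made for illustration.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.05, n_epochs=500, seed=0):
    """Fit y = m*x + c, updating once per batch of `batch_size` examples."""
    rng = random.Random(seed)
    m, c = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(indices)                      # new batch split every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            errors = [(m * xs[i] + c) - ys[i] for i in batch]
            # Gradient of the mean squared error over this batch only.
            grad_m = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_c = 2.0 * sum(errors) / len(batch)
            m -= learning_rate * grad_m
            c -= learning_rate * grad_c
    return m, c


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]                    # generated from y = 2x + 1

print(minibatch_gd(xs, ys, batch_size=1))         # stochastic
print(minibatch_gd(xs, ys, batch_size=2))         # mini-batch
print(minibatch_gd(xs, ys, batch_size=len(xs)))   # batch
```

Because this toy data is noise-free, all three settings reach roughly the same line; on real data the smaller batch sizes would show noisier trajectories.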
5.3.5 Challenges with the Gradient Descent
Although Gradient Descent is one of the most popular methods for
optimization problems, it still has some challenges.
There are a few challenges as follows:
1. Local Minima and Saddle Point:
For convex problems, gradient descent can find the global minimum easily, while for
non-convex problems, it is sometimes difficult to find the global minimum, where the
machine learning model achieves the best results.

Whenever the slope of the cost function is zero or close to zero, the model stops
learning. Apart from the global minimum, two other kinds of point can show
this slope: the saddle point and the local minimum.
A local minimum has a shape similar to the global minimum: the slope of the
cost function increases on both sides of the current point.
In contrast, at a saddle point the cost function curves upwards in one direction and downwards
in another, so the point is a local minimum along one direction and a local maximum along the other.
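A saddle point can be illustrated with the function f(x, y) = x^2 - y^2, which has a minimum along x and a maximum along y at the origin. The function, the starting points and the step settings below are assumptions chosen for the example.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 - y^2, which has a saddle point at (0, 0)."""
    return 2 * x, -2 * y


def descend(x, y, learning_rate=0.1, n_steps=50):
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y


# Starting exactly on the x-axis, the y-gradient is always zero, so gradient
# descent slides into the saddle point and stops learning there.
stuck = descend(1.0, 0.0)

# A tiny offset in y is amplified at every step, so the iterate escapes.
escaped = descend(1.0, 1e-6)

print(stuck, escaped)
```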
2. Vanishing Gradient:
In a deep neural network trained with gradient descent and backpropagation, a vanishing
gradient occurs when the gradient becomes smaller than expected. During backpropagation,
the gradient shrinks as it moves backwards through the network, so the earlier layers learn
more slowly than the later layers. Once this happens, the weight updates in the earlier layers
become insignificant.
3. Exploding Gradient:
An exploding gradient is the opposite of a vanishing gradient: it occurs when the
gradient is too large and creates an unstable model. In this scenario, the model weights
grow so large that they are eventually represented as NaN. This problem can be addressed using
dimensionality reduction techniques, which help to minimize complexity within the model.
5.3.6 Stochastic Gradient Descent (SGD):
The word "stochastic" means a system or process linked with a random probability.
Hence, in Stochastic Gradient Descent, a few samples are selected randomly instead of the whole
data set for each iteration. In Gradient Descent, there is a term called "batch" which denotes
the total number of samples from a dataset that is used for calculating the gradient for each
iteration. In typical Gradient Descent optimization, like Batch Gradient Descent, the batch is
taken to be the whole dataset.
Although using the whole dataset is really useful for getting to the minima in a less
noisy and less random manner, the problem arises when our dataset gets big.
Stochastic Gradient Descent (SGD) is a variant of the Gradient Descent algorithm
used for optimizing machine learning models.
In this variant, only one random training example is used to calculate the gradient and
update the parameters at each iteration.
5.3.6.1 Advantages:
1. Speed: Each SGD update is faster than an update in other variants of Gradient Descent
such as Batch Gradient Descent and Mini-Batch Gradient Descent, since it uses only one
example to update the parameters.
2. Memory Efficiency: Since SGD updates the parameters for each training example one at
a time, it is memory-efficient and can handle large datasets that cannot fit into memory.
3. Avoidance of Local Minima: Due to the noisy updates in SGD, it can sometimes
escape from local minima and converge towards a global minimum.
5.3.6.2 Disadvantages:
1. Noisy updates: The updates in SGD are noisy and have a high variance, which can make
the optimization process less stable and lead to oscillations around the minimum.
2. Slow Convergence: SGD may require more iterations to converge to the minimum since
it updates the parameters for each training example one at a time.

3. Sensitivity to Learning Rate: The choice of learning rate can be critical in SGD since
using a high learning rate can cause the algorithm to overshoot the minimum, while a
low learning rate can make the algorithm converge slowly.
4. Less Accurate: Due to the noisy updates, SGD may not converge to the exact global
minimum and can result in a suboptimal solution. This can be mitigated by using
techniques such as learning rate scheduling and momentum-based updates.
5.3.6.3 Stochastic Gradient Descent (SGD) Algorithm
We find out the gradient of the cost function of a single example at each iteration instead
of the sum of the gradient of the cost function of all the examples.
In SGD, since only one sample from the dataset is chosen at random for each iteration, the
path taken by the algorithm to reach the minimum is usually noisier than that of typical
Gradient Descent. But that doesn't matter much: the path taken by the algorithm is
unimportant as long as we reach the minimum, and with a significantly shorter training time.
[Figure: path taken by Batch Gradient Descent]
[Figure: path taken by Stochastic Gradient Descent]
Because SGD is generally noisier than typical Gradient Descent, it usually takes a higher
number of iterations to reach the minimum, owing to the randomness in its descent. Even though
it requires more iterations than typical Gradient Descent, it is still computationally much
less expensive.
Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing
a learning algorithm.

5.3.6.3.1 SGD Algorithm
SGD modifies the batch gradient descent algorithm by calculating the gradient for only
one training example at every iteration.
The steps for performing SGD are as follows:
Step 1: Randomly shuffle the data set of size m
Step 2: Select a learning rate $\alpha$
Step 3: Select initial parameter values $\theta$ as the starting point.
Step 4: Update all parameters from the gradient of a single training example
$(x^{(i)}, y^{(i)})$, i.e. compute
$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta; x^{(i)}, y^{(i)})$$
Step 5: Repeat Step 4 until a local minimum is reached
By calculating the gradient for one data point per iteration, SGD takes a less direct route
towards the local minimum.
However, SGD has the advantage of being able to incrementally update an objective
function at minimal cost when new training data is available.
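The five steps can be sketched as follows for a line y = m*x + c with a squared-error cost per example. Everything specific here is an assumption for illustration: the model, the cost, the data, the learning rate, and the use of a fixed number of passes in place of a convergence test for Step 5. The last line shows the incremental update mentioned above, where one new example is folded in with a single extra step.

```python
import random


def sgd_step(theta, x, y, learning_rate):
    """Step 4: update both parameters from the gradient of one example."""
    m, c = theta
    error = (m * x + c) - y                    # prediction error on this example
    return (m - learning_rate * 2 * error * x,
            c - learning_rate * 2 * error)


def sgd(data, learning_rate=0.05, n_epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(data)
    theta = (0.0, 0.0)                         # Step 3: initial parameters
    for _ in range(n_epochs):                  # Step 5: repeat (fixed passes here)
        rng.shuffle(data)                      # Step 1: shuffle the data set
        for x, y in data:
            theta = sgd_step(theta, x, y, learning_rate)   # Step 2 sets the rate
    return theta


data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # from y = 2x + 1
theta = sgd(data)

# A new training example arrives: one cheap update, no retraining from scratch.
theta = sgd_step(theta, 4.0, 9.0, 0.05)
print(theta)
```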
Gradient descent is an optimization algorithm used to find the weights corresponding to
the minimum of the cost function. It needs to descend the cost function to its minimum point
to find these weights, and it needs the gradient and the learning rate to do so. The gradient
gives the direction for reaching the minimum point of the cost function. The learning rate
determines the speed at which the minimum point is reached. Upon reaching the minimum
point, gradient descent finds the weights corresponding to that point.
5.4 Error BackPropagation
5.4.1 What is Error BackPropagation
The error backpropagation learning algorithm is a tool used during the training of neural
networks. The main goal is to compute the gradient of the loss function (also known as the error
function or cost function). These gradients are required for many optimization routines such as
stochastic gradient descent and its many variants.
5.4.1.1 How does Error Backpropagation Work?
Calculating the gradients relies entirely on the rules of differential calculus. As a neural
network is a series of layers, for each data point the loss function is computed by passing a
labeled data point through the network (feed forward).
Next, the gradients are calculated starting from the final layer, and then, through use of the
chain rule, the gradients are passed backwards to calculate the gradients in the previous layers.
The goal is to get the gradients of the loss function with respect to each model parameter
(weights for each neural node connection as well as the bias weights).

The point of this backwards method of error checking is to calculate the gradient at each
layer more efficiently than the traditional approach of calculating each layer's gradient
separately.
5.4.1.2 What are the Uses of Error Backpropagation?
Backpropagation is especially useful for deep neural networks working on error-prone
projects, such as image or speech recognition.
Taking advantage of the chain and power rules allows backpropagation to function with any
number of outputs and better train all sorts of neural networks.
5.4.1.3 Back Propagation Algorithm in Neural Network
In an artificial neural network, the values of weights and biases are randomly initialized.
Due to random initialization, the neural network probably has errors in giving the correct output.
We need to reduce error values as much as possible. To do so, we need a
mechanism that compares the desired output of the neural network with the network's actual
output, which contains errors, and adjusts the weights and biases so that the network gets
closer to the desired output after each iteration.
For this, we train the network such that it back propagates and updates the weights and
biases. This is the concept of the back propagation algorithm.
Below are the steps that an artificial neural network follows to gain maximum accuracy
and minimize error values:
• Understanding Deep Learning
• Parameter Initialization
• Feedforward Propagation
• Backpropagation

Parameter Initialization
Here the parameters, i.e., the weights and biases associated with each artificial neuron, are
randomly initialized. After receiving the input, the network feeds it forward, combining it
with the weights and biases to give the output. The output associated with those
random values is most probably not correct. So, next, we will see feedforward propagation.
Feed forward propagation
After initialization, when the input is given to the input layer, it is propagated through the
hidden units at each layer. The nodes here do their job without being aware of whether the
results produced are accurate or not (i.e., they don't re-adjust according to the results produced).
Finally, the output is produced at the output layer. This is called feedforward propagation.
Back propagation in Neural Networks
The principle behind the back propagation algorithm is to reduce the error caused by the
randomly initialized weights and biases so that the network produces the correct output.
The system is trained with supervised learning, where the error between the
system's output and a known expected output is presented to the system and used to modify its
internal state. We need to update the weights so that we reach the global loss minimum. This is
how back propagation in neural networks works.
When the gradient is negative, an increase in weight decreases the error. When the
gradient is positive, a decrease in weight decreases the error.
The main feature of Backpropagation is the iterative, recursive and efficient method
through which it calculates the updated weights, improving the network until it is able to
perform the task for which it is being trained. Backpropagation requires the derivatives of the
activation functions to be known at network design time.
5.4.2 How Backpropagation works?
Let us start with an example and work through it mathematically to understand exactly how
the weights are updated using Backpropagation.

Input values
x1=0.05, x2=0.10
Initial weights
w1=0.15, w2=0.20, w3=0.25, w4=0.30, w5=0.40, w6=0.45, w7=0.50, w8=0.55

Bias Values b1=0.35 b2=0.60
Target Values T1=0.01 T2=0.99
Now, we first calculate the values of H1 and H2 by a forward pass.
Step 1: Forward Pass
To find the value of H1, we multiply the input values by the weights and add the bias:
$$H1 = x1 \times w1 + x2 \times w2 + b1 = 0.05 \times 0.15 + 0.10 \times 0.20 + 0.35 = 0.3775$$
To calculate the final result of H1, we apply the sigmoid function:
$$H1_{final} = \frac{1}{1 + e^{-H1}} = \frac{1}{1 + e^{-0.3775}} = 0.593269992$$
We calculate the value of H2 in the same way:
$$H2 = x1 \times w3 + x2 \times w4 + b1 = 0.05 \times 0.25 + 0.10 \times 0.30 + 0.35 = 0.3925$$
To calculate the final result of H2, we apply the sigmoid function:
$$H2_{final} = \frac{1}{1 + e^{-H2}} = \frac{1}{1 + e^{-0.3925}} = 0.596884378$$
Now, we calculate the values of y1 and y2 in the same way as H1 and H2. To find the value
of y1, we multiply the outcomes of H1 and H2 by the weights and add the bias:
$$y1 = H1_{final} \times w5 + H2_{final} \times w6 + b2 = 0.593269992 \times 0.40 + 0.596884378 \times 0.45 + 0.60 = 1.10590597$$
To calculate the final result of y1, we apply the sigmoid function:
$$y1_{final} = \frac{1}{1 + e^{-y1}} = \frac{1}{1 + e^{-1.10590597}} = 0.75136507$$
We calculate the value of y2 in the same way as y1:
$$y2 = H1_{final} \times w7 + H2_{final} \times w8 + b2 = 0.593269992 \times 0.50 + 0.596884378 \times 0.55 + 0.60 = 1.2249214$$
To calculate the final result of y2, we apply the sigmoid function:
$$y2_{final} = \frac{1}{1 + e^{-y2}} = \frac{1}{1 + e^{-1.2249214}} = 0.772928465$$
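The forward pass above can be checked numerically with a short script. The variable names (`h1`, `h2`, `y1`, `y2` for the values after the sigmoid) are chosen for this sketch; the inputs, weights and biases are the ones given in the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Inputs, weights and biases from the worked example.
x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60

# Hidden layer: weighted sum plus bias, then sigmoid.
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)

# Output layer: the hidden outputs are the inputs here.
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

print(h1, h2)   # hidden-layer outputs after the sigmoid
print(y1, y2)   # network outputs after the sigmoid
```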

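Our target values are T1 = 0.01 and T2 = 0.99. The outputs of the forward pass after the
sigmoid, $y1_{final} = 0.75136507$ and $y2_{final} = 0.772928465$, do not match these targets.
Now, we find the total error, which is half the sum of the squared differences between the
target outputs and the actual outputs. The total error is calculated as
$$E_{total} = \frac{1}{2}(T1 - y1_{final})^2 + \frac{1}{2}(T2 - y2_{final})^2$$
So, the total error is
$$E_{total} = \frac{1}{2}(0.01 - 0.75136507)^2 + \frac{1}{2}(0.99 - 0.772928465)^2 = 0.274811083 + 0.023560026 = 0.298371109$$
Now, we will backpropagate this error to update the weights using a backward pass.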
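Step 2: Backward pass at the output layer
To update a weight, we calculate the error corresponding to that weight with the help of the
total error. The error on weight w is calculated by differentiating the total error with
respect to w:
$$Error_w = \frac{\partial E_{total}}{\partial w}$$
We perform the backward process, so first consider the last-layer weight w5:
$$Error_{w5} = \frac{\partial E_{total}}{\partial w5} \quad (1)$$
$$E_{total} = \frac{1}{2}(T1 - y1_{final})^2 + \frac{1}{2}(T2 - y2_{final})^2 \quad (2)$$
From equation (2), it is clear that we cannot differentiate directly with respect to w5,
because w5 does not appear in it explicitly. Using the chain rule, we split equation (1) into
multiple terms so that we can differentiate with respect to w5:
$$\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial y1_{final}} \times \frac{\partial y1_{final}}{\partial y1} \times \frac{\partial y1}{\partial w5} \quad (3)$$
Now, we calculate each term one by one:
$$\frac{\partial E_{total}}{\partial y1_{final}} = -(T1 - y1_{final}) = -(0.01 - 0.75136507) = 0.74136507 \quad (4)$$
$$y1_{final} = \frac{1}{1 + e^{-y1}} \quad \Rightarrow \quad \frac{\partial y1_{final}}{\partial y1} = \frac{e^{-y1}}{(1 + e^{-y1})^2} \quad (5)$$
Putting the value $e^{-y1} = (1 - y1_{final}) / y1_{final}$ in equation (5):
$$\frac{\partial y1_{final}}{\partial y1} = y1_{final} \times (1 - y1_{final}) = 0.75136507 \times (1 - 0.75136507) = 0.186815602$$
$$y1 = H1_{final} \times w5 + H2_{final} \times w6 + b2 \quad \Rightarrow \quad \frac{\partial y1}{\partial w5} = H1_{final} = 0.593269992$$
So, we put these values in equation (3) to find the final result:
$$\frac{\partial E_{total}}{\partial w5} = 0.74136507 \times 0.186815602 \times 0.593269992 = 0.082167041$$
Now, we calculate the updated weight w5new with the help of the following formula, where the
learning rate $\eta = 0.5$ is the value implied by the updated weights:
$$w5_{new} = w5 - \eta \times \frac{\partial E_{total}}{\partial w5} = 0.40 - 0.5 \times 0.082167041 = 0.35891648$$
In the same way, we calculate w6new, w7new, and w8new, which gives us the following values:
w5new=0.35891648, w6new=0.408666186, w7new=0.511301270, w8new=0.561370121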
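The output-layer updates can be reproduced with the script below, which repeats the forward pass and then applies the chain rule to w5 through w8. The learning rate of 0.5 is an assumption: it is not stated in the surviving text, but it is the value that reproduces the updated weights reported above. The names `delta1` and `delta2` are introduced here for the product of the first two chain-rule terms.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
eta = 0.5   # assumed learning rate, implied by the reported updated weights

# Forward pass (values after the sigmoid).
h1 = sigmoid(x1 * w1 + x2 * w2 + b1)
h2 = sigmoid(x1 * w3 + x2 * w4 + b1)
y1 = sigmoid(h1 * w5 + h2 * w6 + b2)
y2 = sigmoid(h1 * w7 + h2 * w8 + b2)

e_total = 0.5 * (t1 - y1) ** 2 + 0.5 * (t2 - y2) ** 2

# dE/dy_final * dy_final/dy for each output unit.
delta1 = -(t1 - y1) * y1 * (1 - y1)
delta2 = -(t2 - y2) * y2 * (1 - y2)

# The last chain-rule term is the hidden output feeding each weight.
w5_new = w5 - eta * delta1 * h1
w6_new = w6 - eta * delta1 * h2
w7_new = w7 - eta * delta2 * h1
w8_new = w8 - eta * delta2 * h2

print(e_total)
print(w5_new, w6_new, w7_new, w8_new)
```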

5.5 From shallow networks to deep networks


5.5.1 Shallow Network
In simple terms, the depth of a neural network refers to the number of layers it contains.
Shallow networks typically have only one or two hidden layers, while deep networks can have
dozens or even hundreds of layers. The increased depth allows deep neural networks to capture
more complex patterns in the input data, leading to improved accuracy and performance.

One of the main advantages of deep neural networks is their ability to perform "end-to-
end learning." This means that the model can learn to perform a complex task, such as object
recognition, directly from raw input data, without the need for manual feature engineering. In
contrast, shallow networks often require handcrafted features to be extracted from the input data
before they can be fed into the model. The below diagram shows the shallow Network

However, there are also some drawbacks to deep neural networks. One challenge is that
they can be more difficult to train than shallow networks, due to the increased number of
parameters and the risk of overfitting. Another challenge is that deep networks can be
computationally expensive to train and require a lot of data to achieve good performance.
Despite these challenges, deep neural networks have become a popular and powerful tool
in research. They have been used to achieve state-of-the-art results on a
wide range of tasks, from image and speech recognition to natural language processing and
game playing. In particular, deep learning has shown great promise for advancing the field of
artificial intelligence and enabling machines to perform increasingly complex tasks.
In conclusion, the difference between deep neural networks and shallow networks lies in
the number of layers they contain. While deep networks offer improved accuracy and the ability
to perform end-to-end learning, they also present challenges in terms of training and
computational resources.
A shallow neural network has only one hidden layer between the input and output
layers. The input layer receives the data, the hidden layer processes it, and the final layer
produces the output.
Shallow neural networks are simpler, more easily trained and computationally more
efficient than deep neural networks, which may have thousands of hidden units in dozens of
layers.
Shallow networks are typically used for simpler tasks such as linear regression, binary
classification, or low dimensional feature extraction.

Strictly, a neural network with more than one hidden layer is no longer considered a shallow
neural network. In some cases, however, a neural network with 2 or 3 hidden layers, each with a
small number of hidden units and simple connectivity between them, may produce
straightforward outputs and may still be considered a "shallow network".
One way to think about it is like a hierarchy of decision-making. Just like how a human
brain makes decisions by processing information in layers, a deep neural network learns to make
decisions by processing information through multiple hidden layers. On the other hand, a
shallow network is like having just one layer of decision-making, which might not be enough to
capture the complexity of the problem at hand. It's like trying to understand a complex book by
just reading the summary on the back cover.
A shallow network might be used for simple tasks like image classification, while a deep
network might be used for more complex tasks like image segmentation or natural language
processing. For instance, a shallow neural network with a single hidden layer can be used to
recognize handwritten digits from the MNIST dataset. But for a more complex task like
identifying different objects in an image, a deep neural network like the popular ResNet
architecture might be used.
5.5.2 Deep Network
Deep neural networks often have a complex hidden layer structure with a wide variety of
different layers, such as convolutional layers, max-pooling layers, dense layers, and other
specialized layers. These additional layers help the model to understand problems better and
provide better solutions to complex projects.
A deep neural network has more layers (more depth) than a basic ANN, and each layer adds
complexity to the model while enabling the model to process the inputs concisely for outputting
the ideal solution.
Deep neural networks have gained a great deal of traction due to their effectiveness
across a wide variety of deep learning projects.

5.5.2.1 Why use a DNN?
Once trained, a well-built deep neural network can achieve the desired results with high
accuracy scores. Deep neural networks are popular in all areas of deep learning, including
computer vision, natural language processing, and transfer learning.
Prominent examples of deep neural networks are object detection models such as YOLO
(You Only Look Once), language models such as BERT (Bidirectional Encoder Representations
from Transformers), and models used for transfer learning in image processing projects, such
as VGG-19, ResNet-50, EfficientNet, and other similar networks.
5.5.3 Deep Nets and Shallow Nets
There is no clear threshold of depth that divides shallow learning from deep learning, but
it is mostly agreed that for deep learning, which has multiple non-linear layers, the depth of
the credit assignment path (CAP) must be greater than two.
The basic node in a neural net is a perceptron, mimicking a neuron in a biological neural
network. Then we have the multi-layered perceptron, or MLP. Each set of inputs is modified by a
set of weights and biases; each edge has a unique weight and each node has a unique bias.
The prediction accuracy of a neural net depends on its weights and biases.
The process of improving the accuracy of a neural network is called training. The output
of a forward propagation pass is compared to the value that is known to be correct.
The cost function or the loss function is the difference between the generated output and
the actual output.
The point of training is to make the cost of training as small as possible across millions
of training examples. To do this, the network tweaks the weights and biases until the prediction
matches the correct output.
Once trained well, a neural net has the potential to make an accurate prediction every
time.
When patterns get complex and you want your computer to recognise them, you have to go
for neural networks. In such complex pattern scenarios, neural networks often outperform
competing algorithms.
There are now GPUs that can train them faster than ever before. Deep neural networks
are already revolutionizing the field of AI.
Computers have proved to be good at performing repetitive calculations and following
detailed instructions but have not been so good at recognising complex patterns.

For the recognition of simple patterns, a support vector machine (SVM)
or a logistic regression classifier can do the job well, but as the complexity of the patterns
increases, deep neural networks become the practical choice.
Therefore, for complex patterns like a human face, shallow neural networks fail and we have
no alternative but to go for deep neural networks with more layers. Deep nets are able to do
their job by breaking down complex patterns into simpler ones. For example, for a human face, a
deep net would use edges to detect parts like lips, nose, eyes, ears and so on, and then
re-combine these to form a human face.
Prediction accuracy has become so high that recently, at a Google
Pattern Recognition Challenge, a deep net beat a human.
This idea of a web of layered perceptrons has been around for some time; in this respect,
deep nets mimic the human brain. One downside is that they take a long time to train, which is
a hardware constraint.
However, recent high-performance GPUs have been able to train such deep nets in under a
week, while fast CPUs could have taken weeks or perhaps months to do the same.
5.5.4 Advantage
The main advantage of a shallow network is that it is computationally less expensive to
train, and can be sufficient for simple tasks. However, it may not be powerful enough to capture
complex patterns in the data. A deep network, on the other hand, can capture more complex
patterns in the data and potentially achieve higher accuracy, but it is more computationally
expensive to train and may require more data to avoid overfitting. Additionally, deep networks
can be more challenging to design and optimize than shallow networks.
5.5.5 Recurrent Neural Networks - RNNs
RNNs are neural networks in which data can flow in any direction. These networks are
used for applications such as language modelling or Natural Language Processing (NLP).
The basic concept underlying RNNs is to utilize sequential information. In a normal
neural network it is assumed that all inputs and outputs are independent of each other.
However, if we want to predict the next word in a sentence, we have to know which words came
before it.
RNNs are called recurrent because they repeat the same task for every element of a sequence,
with the output being based on the previous computations. RNNs can thus be said to have a
"memory" that captures information about what has been previously calculated. In theory, RNNs
can use information in very long sequences, but in reality, they can look back only a few steps.
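The idea of repeating the same computation at every step while carrying a "memory" forward can be sketched with a single recurrent unit. This is a deliberately tiny illustration: the scalar hidden state, the tanh activation and the weight values `w_x`, `w_h` and `b` are assumptions for the example, not a description of any particular RNN library.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run one recurrent unit over a sequence and return every hidden state."""
    h = 0.0                       # the "memory", empty at the start
    states = []
    for x in inputs:
        # The same weights are reused for every element of the sequence;
        # the new state depends on the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states


with_signal = rnn_forward([1.0, 0.0, 0.0])
without_signal = rnn_forward([0.0, 0.0, 0.0])

print(with_signal)      # the first input still influences later states
print(without_signal)   # all zeros: nothing to remember
```

The influence of the first input shrinks at every step, which gives a feel for why simple RNNs in practice look back only a few steps.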

Long short-term memory networks (LSTMs) are the most commonly used RNNs. Together
with convolutional neural networks, RNNs have been used as part of a model to generate
descriptions for unlabelled images, and this works remarkably well.
5.5.6 Convolutional Deep Neural Networks - CNNs
If we increase the number of layers in a neural network to make it deeper, it increases the
complexity of the network and allows us to model functions that are more complicated.
However, the number of weights and biases will exponentially increase. As a matter of fact,
learning such difficult problems can become impossible for normal neural networks. This leads
to a solution, the convolutional neural networks.
CNNs are extensively used in computer vision and have also been applied in acoustic
modelling for automatic speech recognition.
The idea behind convolutional neural networks is that of a "moving filter" which
passes through the image. This moving filter, or convolution, applies to a certain neighbourhood
of nodes, which may for example be pixels, and the filter applied may be, say, 0.5 x the node value.
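The moving filter described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name convolve2d_valid, the 4 x 4 toy image and the 2 x 2 filter filled with 0.5 are all hypothetical, there is no padding, and the stride is 1. Like most deep learning libraries, the sketch does not flip the kernel.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the neighbourhood currently covered by the filter
            total = 0.0
            for a in range(kh):
                for b in range(kw):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        out.append(row)
    return out


# Hypothetical 4 x 4 "image" with pixel values 0..15
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
# Hypothetical filter: 0.5 x each node value in a 2 x 2 neighbourhood
kernel = [[0.5, 0.5], [0.5, 0.5]]

feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
```

Because the same four filter weights are reused at every position, the number of parameters does not grow with the size of the image.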
Noted researcher Yann LeCun pioneered convolutional neural networks. Facebook uses these
nets in its facial recognition software. CNNs have been the go-to solution for machine vision
projects. There are many layers to a convolutional network. In the ImageNet challenge, a machine
was able to beat a human at object recognition in 2015.
In a nutshell, Convolutional Neural Networks (CNNs) are multi-layer neural networks.
They sometimes have 17 or more layers and assume the input data to be images.

CNNs drastically reduce the number of parameters that need to be tuned. So, CNNs efficiently
handle the high dimensionality of raw images.
5.6 Unit saturation (aka the vanishing gradient problem)
5.6.1 What is Vanishing Gradient Problem?
The vanishing gradient problem is a phenomenon that occurs during the training of deep
neural networks, where the gradients that are used to update the network become extremely
small or "vanish" as they are backpropagated from the output layers to the earlier layers.
During the training process of the neural network, the goal is to minimize a loss function
by adjusting the weights of the network. The backpropagation algorithm calculates the gradients
of the loss with respect to the weights by propagating the error from the output layer to the
input layer.
5.6.2 Introduction to Vanishing Gradient Problem
In Machine Learning, the Vanishing Gradient Problem is encountered while training
Neural Networks with gradient-based methods (for example, backpropagation). This problem
makes it hard to learn and tune the parameters of the earlier layers in the network.
The vanishing gradients problem is one example of unstable behaviour that you may
encounter when training a deep neural network.
It describes the situation where a deep multilayer feed-forward network or a recurrent
neural network is unable to propagate useful gradient information from the output end of the
model back to the layers near the input end of the model.
The result is the general inability of models with many layers to learn on a given dataset,
or for models with many layers to prematurely converge to a poor solution.
5.6.3 Methods proposed to overcome vanishing gradient problem
The vanishing gradient problem is caused by the derivative of the activation function
used to create the neural network. The simplest solution to the problem is to replace the
activation function of the network.
Instead of sigmoid, use an activation function such as ReLU. Rectified Linear Units
(ReLU) are activation functions that generate a positive linear output when they are applied to
positive input values. If the input is negative, the function will return zero.
The methods proposed to overcome the problem include:
1. Multi-level hierarchy
2. Long short-term memory (LSTM)
3. Faster hardware
4. Residual neural networks (ResNets)
5. ReLU

Residual neural networks (ResNets)
One of the newest and most effective ways to resolve the vanishing gradient problem is
with residual neural networks, or ResNets (not to be confused with recurrent neural networks). It
was noted before ResNets that a deeper network would have higher training error than the
shallow network.
The weights of a neural network are updated using the backpropagation algorithm. The
backpropagation algorithm makes a small change to each weight in such a way that the loss of
the model decreases. How does this happen? It updates each weight such that it takes a step in
the direction along which the loss decreases. This direction is the negative of the gradient of
the loss with respect to that weight.
Using the chain rule, we can find this gradient for each weight. It is equal to (local
gradient) x (gradient flowing from ahead).
Here comes the problem. As this gradient keeps flowing backwards to the initial layers,
this value keeps getting multiplied by each local gradient. Hence, the gradient becomes smaller
and smaller, making the updates to the initial layers very small, increasing the training time
considerably. We can solve our problem if the local gradient somehow became 1.
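The shrinking product of local gradients can be checked numerically. The sketch below is a simplification under stated assumptions: every layer is a single sigmoid unit, every weight is 1.0, and every pre-activation sits at 0, where the sigmoid derivative takes its largest value of 0.25. The function names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)


def backpropagated_gradient(num_layers, pre_activation=0.0, weight=1.0):
    """Multiply the local gradients of `num_layers` identical sigmoid layers."""
    grad = 1.0  # gradient arriving from the loss
    for _ in range(num_layers):
        grad *= weight * sigmoid_grad(pre_activation)
    return grad


for depth in (1, 5, 10, 20):
    print(depth, backpropagated_gradient(depth))
```

Even in this best case for the sigmoid, the gradient reaching the first layer shrinks by a factor of four per layer, so it is already below one millionth after ten layers.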
5.6.3.1 ResNet
How can the local gradient be 1, i.e., the derivative of which function would always be 1?
The identity function!
A residual block adds its input x back to the output F(x) of its layers through a skip
connection, so it computes F(x) + x. The derivative of the identity term x is 1, so as the
gradient is backpropagated through the skip connection, it does not decrease in value.
It should now be clear how the ResNet architecture avoids the vanishing gradient problem.
ResNet stands for Residual Network.
These skip connections act as gradient superhighways, allowing the gradient to flow
unhindered. This is also why ResNet can come in very deep flavours like ResNet50,
ResNet101 and ResNet152.
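The effect of a skip connection on the gradient can be illustrated with a toy calculation. This is a sketch under strong assumptions, not a real ResNet: each "layer" is a hypothetical scalar function with a very small local gradient of 0.01, and the gradient is estimated by a central finite difference.

```python
def numerical_derivative(f, x, eps=1e-6):
    # Central finite-difference estimate of df/dx
    return (f(x + eps) - f(x - eps)) / (2 * eps)


def layer(x):
    # Hypothetical layer whose local gradient is only 0.01
    return 0.01 * x


def residual_block(x):
    # Skip connection: the input is added back to the layer output, F(x) + x
    return layer(x) + x


def stack(block, depth):
    """Compose `depth` copies of `block` into one function."""
    def f(x):
        for _ in range(depth):
            x = block(x)
        return x
    return f


plain_grad = numerical_derivative(stack(layer, 10), 1.0)
residual_grad = numerical_derivative(stack(residual_block, 10), 1.0)
print(plain_grad, residual_grad)
```

Without skip connections the gradient through ten layers is about 0.01 to the power 10, while with skip connections each block contributes a local gradient of 1 + 0.01, so the overall gradient stays close to 1.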
5.6.3.2 Sigmoid function
Sigmoid functions are used frequently in neural networks to activate neurons. The sigmoid is a
logistic function with a characteristic S shape. The output value of the function is between 0
and 1. The sigmoid function is used for activating the output layers in binary classification
problems. It is calculated as follows:
f(x) = 1 / (1 + e^(-x))
The first derivative of the sigmoid function is a bell curve with values ranging from 0 to 0.25.
5.7 ReLU, Hyperparameter Tuning

5.7.1 What is ReLU Activation Function?


ReLU stands for rectified linear unit and is considered one of the few
milestones in the deep learning revolution. It is simple, yet it works better than its predecessor
activation functions such as sigmoid or tanh. The ReLU activation function formula is
f(x)=max(0,x)
Both the ReLU function and its derivative are monotonic. The function returns 0 if it receives
any negative input, but for any positive value x, it returns that value back. Thus it gives an output
that has a range from 0 to infinity. First, let us define a ReLU function:
def ReLU(x):
    # Pass positive inputs through unchanged; clamp everything else to 0
    if x > 0:
        return x
    else:
        return 0
ReLU is nowadays used as the default activation function, and it is the most commonly
used activation function in neural networks, especially in CNNs.
5.7.1.1 Why is ReLU the best activation function?
As we have seen above, the ReLU function is simple and involves no heavy
computation, as there is no complicated math. The model can, therefore, take less time to train or
run. One more important property that we consider an advantage of using the ReLU activation
function is sparsity.
Usually, a matrix in which most entries are 0 is called a sparse matrix, and similarly, we
desire a property like this in our neural networks, where some of the weights are zero. Sparsity
results in concise models that often have better predictive power and less overfitting/noise. In a
sparse network, it's more likely that neurons are actually processing meaningful aspects of the
problem. For example, in a model detecting human faces in images, there may be a neuron that
can identify ears, which obviously shouldn't be activated if the image is not of a face but of a
ship or mountain.

Since ReLU gives output zero for all negative inputs, it's likely for any given unit to not
activate at all, which causes the network to be sparse. Now let us see how the ReLU activation
function is better than previously popular activation functions such as sigmoid and tanh.
The activation functions that were mostly used before ReLU, such as sigmoid and tanh,
saturate. This means that large values snap to 1.0 and small values snap to
-1 or 0 for tanh and sigmoid respectively.
Further, the functions are only really sensitive to changes around the mid-point of their
input, where the output is 0.5 for sigmoid and 0.0 for tanh. This causes them to suffer from the
vanishing gradient problem.
5.7.2 Leaky ReLU activation function
The Leaky ReLU function is an improved version of the ReLU activation function. For the
ReLU activation function, the gradient is 0 for all input values that are less than zero,
which deactivates the neurons in that region and may cause the dying ReLU problem. Leaky
ReLU is defined to address this problem. Instead of defining the activation function as 0
for negative values of the input x, we define it as an extremely small linear component of x.
Here is the formula for this activation function:
f(x)=max(0.01*x , x).
This function returns x if it receives any positive input, but for any negative value of x, it
returns a really small value which is 0.01 times x. Thus it gives an output for negative values as
well. By making this small modification, the gradient of the left side of the graph comes out to be
a non-zero value. Hence we would no longer encounter dead neurons in that region.
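As a minimal sketch of the formula above, Leaky ReLU and its gradient can be written as follows. The function names and the parameter name alpha are illustrative choices; 0.01 is the slope used in the text, and the gradient at exactly x = 0 is taken from the negative side by assumption.

```python
def leaky_relu(x, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by a small slope
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    # Slope is 1 on the positive side and alpha (never exactly 0) elsewhere
    return 1.0 if x > 0 else alpha


for value in (-2.0, -0.5, 0.0, 3.0):
    print(value, leaky_relu(value), leaky_relu_grad(value))
```

Because the gradient on the negative side is alpha rather than 0, a unit that receives negative inputs still gets a small weight update.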
5.7.2.1 Derivative Of ReLU:
The derivative of an activation function is required when updating the weights during the
backpropagation of the error. The slope of ReLU is 1 for positive values and 0 for negative
values. It becomes non-differentiable when the input x is zero, but it can be safely assumed to be
zero and causes no problem in practice.
5.7.2.2 Advantages of ReLU:
ReLU is used in the hidden layers instead of sigmoid or tanh, as using sigmoid or tanh in
the hidden layers leads to the infamous problem of "Vanishing Gradient". The "Vanishing
Gradient" prevents the earlier layers from learning important information when the network is
backpropagating.
The sigmoid, which is a logistic function, is preferable in regression or
binary classification problems, and even then only in the output layer, as the output of a
sigmoid function ranges from 0 to 1. Also, sigmoid and tanh saturate and are less sensitive.

Some of the advantages of ReLU are:
1. Simpler Computation: The derivative remains constant, i.e. 1, for a positive input, which
reduces the time taken for the model to learn and to minimize the errors.
2. Representational Sparsity: It is capable of outputting a true zero value.
3. Linearity: Linear activation functions are easier to optimize and allow for a smooth
flow. So, it is best suited for supervised tasks on large sets of labelled data.
5.7.2.3 Disadvantages of ReLU:
1. Exploding Gradient: This occurs when the gradients accumulate, causing
large differences in the subsequent weight updates. As a result, convergence to the
global minimum becomes unstable, and so does the learning.
2. Dying ReLU: The problem of "dead neurons" occurs when a neuron gets stuck on the
negative side and constantly outputs zero. Because the gradient of ReLU is also 0 there, it is
unlikely for the neuron to ever recover. This happens when the learning rate is too high or the
negative bias is quite large.
5.7.3 Hyperparameter tuning in deep learning
Hyperparameter tuning consists of finding a set of optimal hyperparameter values for a
learning algorithm while applying this optimized algorithm to any data set.
That combination of hyperparameters maximizes the model's performance, minimizing a
predefined loss function to produce better results with fewer errors.
5.7.3.1 Hyperparameter types
Some important hyperparameters that require tuning in neural networks are:
1. Number of hidden layers: It's a trade-off between keeping our neural network as simple
as possible (fast and generalized) and classifying our input data correctly. We can start with
values of four to six and check our data's prediction accuracy when we increase or
decrease this hyperparameter.
2. Number of nodes/neurons per layer: More isn't always better when determining how
many neurons to use per layer. Increasing neuron count can help, up to a point. But
layers that are too wide may memorize the training dataset, causing the network to be
less accurate on new data.
3. Learning rate: Model parameters are adjusted iteratively, and the learning rate
controls the size of the adjustment at each step. The lower the learning rate, the smaller the
changes to parameter estimates are. This means that it takes a longer time (and more data)
to fit the model, but it also means that it is more likely that we actually find the minimum
loss.
4. Momentum: Momentum helps us avoid falling into local minima by resisting
rapid changes to parameter values. It encourages parameters to keep changing in
the direction they were already changing, which helps prevent zig-zagging on every
iteration. Aim to start with low momentum values and adjust upward as needed.
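How the learning rate and momentum enter the weight update can be sketched with gradient descent on a one-dimensional toy loss. Everything here is an assumption made for illustration: the loss (w - 3)^2, the function names, and the default values 0.1 and 0.9. It uses the classical momentum update, in which a velocity term accumulates past gradients.

```python
def sgd_momentum(grad_fn, w0, learning_rate=0.1, momentum=0.9, steps=200):
    """Gradient descent with classical momentum on a single parameter."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # Keep part of the previous step and add the new gradient step
        velocity = momentum * velocity - learning_rate * grad_fn(w)
        w += velocity
    return w


def grad(w):
    # Gradient of the toy loss (w - 3)^2, whose minimum is at w = 3
    return 2.0 * (w - 3.0)


w_final = sgd_momentum(grad, w0=0.0)
w_no_momentum = sgd_momentum(grad, w0=0.0, momentum=0.0)
print(w_final, w_no_momentum)
```

Setting momentum to 0 recovers plain gradient descent, which makes it easy to compare the two settings on the same problem.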
5.7.3.2 Methods for tuning hyperparameters
Now that we understand what hyperparameters are and the importance of tuning them, we
need to know how to choose their optimal values. We can find these optimal hyperparameter values
using manual or automated methods.
When tuning hyperparameters manually, we typically start using the default
recommended values or rules of thumb, then search through a range of values using trial and
error. But manual tuning is a tedious and time-consuming approach. It isn't practical when there
are many hyperparameters with a wide range.
Automated hyperparameter tuning methods use an algorithm to search for the optimal
values. Some of today's most popular automated methods are grid search, random search, and
Bayesian optimization.
5.7.3.3 Grid search
Grid search is a sort of "brute force" hyperparameter tuning method. We create a grid of
possible discrete hyperparameter values, then fit the model with every possible combination. We
record the model performance for each set, then select the combination that has produced the best
performance.

Grid search is an exhaustive algorithm that can find the best combination of
hyperparameters.
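A minimal sketch of grid search in plain Python is given below. The scoring function toy_score is a hypothetical stand-in for "train the model and return its validation score", and the grid values are made up for illustration.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in `grid` and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(learning_rate, hidden_layers):
    # Hypothetical stand-in for a validation score; peaks at (0.01, 4)
    return -abs(learning_rate - 0.01) - 0.1 * abs(hidden_layers - 4)


grid = {"learning_rate": [0.001, 0.01, 0.1], "hidden_layers": [2, 4, 6]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The number of model fits is the product of the list lengths (here 3 x 3 = 9), which is why grid search becomes expensive as hyperparameters are added.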
5.7.3.4 Random search
The random search method (as its name implies) chooses values randomly rather than
using a predefined set of values like the grid search method.

Random search tries a random combination of hyperparameters in each iteration and
records the model performance. After several iterations, it returns the mix that produced the best
result.
Random search is appropriate when we have several hyperparameters with relatively
large search domains.
The benefit is that random search typically requires less time than grid search to return a
comparable result.
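Random search can be sketched in the same style. As before, toy_score is a hypothetical stand-in for a real validation score, and the sampling ranges (a log-uniform learning rate between 0.0001 and 0.1, and 1 to 8 hidden layers) are assumptions chosen only for illustration.

```python
import random


def random_search(score_fn, sample_fn, n_iter=50, seed=0):
    """Score `n_iter` randomly sampled configurations and return the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = sample_fn(rng)
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def sample_params(rng):
    # Draw one configuration from the (assumed) search domain
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),
        "hidden_layers": rng.randint(1, 8),
    }


def toy_score(learning_rate, hidden_layers):
    # Hypothetical stand-in for a validation score; peaks at (0.01, 4)
    return -abs(learning_rate - 0.01) - 0.1 * abs(hidden_layers - 4)


best_params, best_score = random_search(toy_score, sample_params)
print(best_params, best_score)
```

The number of model fits is fixed by n_iter regardless of how many hyperparameters are searched, which is the main practical difference from grid search.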

5.7.4 What are Hyperparameters? How Do They Differ from Model Parameters?


Some parameters are learned internally from the given dataset and are then used in
making predictions, classifications, etc.
These are called Model Parameters. They vary with the nature of the data, and we
cannot control them directly, since they depend on the data. An example is 'm' and 'c' in the
linear equation y = mx + c, which are the coefficient values learned from the given dataset.
Parameters that are used to control the behaviour of the model/algorithm and
that can be adjusted in order to obtain an improved model with optimal performance are called
Hyperparameters.
Even the best algorithm will only shine with a good choice of hyperparameters. If you
ask what hyperparameters are in simple words, the one-word answer is
Configuration.
A quick example of a hyperparameter is the "Train-Test Split Ratio
(80-20)" in our simple linear regression model.

5.7.4.1 Data Leakage
Data leakage typically shows up as follows: after the model has been trained and then tested
successfully with near-perfect accuracy, it is moved into production. At this moment all is well.
Yet when actual/real-time data is applied to this model in the production environment, you
get poor scores.
At this point, you may wonder why this happened and how to fix it. The cause lies in
the way the data was split into training and testing subsets.
During training, the model had knowledge of the data that it was trying to
predict; this results in inaccurate and bad predictions after the model is deployed into
production.
5.7.4.2 Steps to Perform Hyperparameter Tuning
1. Select the right type of model.
2. Review the list of parameters of the model and build the hyperparameter (HP) space.
3. Find the methods for searching the hyperparameter space.
4. Apply the cross-validation scheme approach.
5. Assess the model score to evaluate the model.
5.7.5 What is the purpose of hyperparameter tuning?
Hyperparameter tuning tests different hyperparameter configurations when training your
model, and it can take advantage of the processing infrastructure of a cloud platform such as
Google Cloud to do so. It can give you optimized values for hyperparameters, which maximize
your model's predictive accuracy.
5.8 Batch Normalization
5.8.1 Introduction to Batch Normalization
Normalization is a data pre-processing tool used to bring numerical data to a
common scale without distorting its shape.
Batch normalization is a process that makes neural networks faster and more stable
by adding extra layers to a deep neural network.
The new layer performs the standardizing and normalizing operations on the input of a
layer coming from a previous layer.
A typical neural network is trained using a collected set of input data called a batch.
Similarly, the normalizing process in batch normalization takes place in batches, not on a single
input.
Batch Normalization is a technique used to improve the training of deep neural networks.
Introduced by Sergey Ioffe and Christian Szegedy in 2015, batch normalization is used to

normalize the inputs of each layer in such a way that they have a mean output activation of zero
and a standard deviation of one.
This normalization process helps to combat issues that deep neural networks face, such
as internal covariate shift, which can slow down training and affect the network's ability to
generalize from the training data. One of the most common problems data science professionals
face is avoiding over-fitting. Have you come across a situation where your model performs very
well on the training data but is unable to predict the test data accurately?

5.8.1.1 How Batch Normalization Works


Batch normalization works by normalizing the output of a previous activation layer by
subtracting the batch mean and dividing by the batch standard deviation. After this step, the
result is then scaled and shifted by two learnable parameters, gamma and beta, which are unique
to each layer. This process allows the model to maintain the mean activation close to 0 and the
activation standard deviation close to 1.
The normalization step is as follows:
1. Calculate the mean and variance of the activations for each feature in a mini-batch.
2. Normalize the activations of each feature by subtracting the mini-batch mean and
dividing by the mini-batch standard deviation.
3. Scale and shift the normalized values using the learnable parameters gamma and beta,
which allow the network to undo the normalization if that is what the learned behavior
requires.
Batch normalization is typically applied before the activation function in a network layer,
although some variations may apply it after the activation function.
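The three steps above can be sketched for a small mini-batch in plain Python. This is a training-time sketch only, under stated assumptions: the function name batch_norm is illustrative, eps is a small constant commonly added for numerical stability (it is not mentioned in the text), and the running statistics that real layers keep for inference are omitted.

```python
import math


def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    n = len(batch)
    num_features = len(batch[0])
    out = [[0.0] * num_features for _ in range(n)]
    for j in range(num_features):
        column = [row[j] for row in batch]
        # Step 1: mini-batch mean and variance of this feature
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        for i in range(n):
            # Step 2: normalize (eps avoids division by zero; an assumption)
            x_hat = (batch[i][j] - mean) / math.sqrt(var + eps)
            # Step 3: scale and shift with the learnable gamma and beta
            out[i][j] = gamma[j] * x_hat + beta[j]
    return out


# Hypothetical mini-batch: 3 examples, 2 features on very different scales
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for row in normalized:
    print(row)
```

With gamma set to 1 and beta set to 0, both features end up on the same scale even though the second one started ten times larger.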

When a model performs very well on the training data but poorly on the test data, the reason
is that the model is overfitting. The solution to such a problem is regularization.
Regularization techniques help to improve a model and allow it to converge faster. We
have several regularization tools at our disposal, among them early stopping, dropout, weight
initialization techniques, and batch normalization. Regularization helps in preventing
over-fitting of the model, and the learning process becomes more efficient.
5.8.1.2 Benefits of Batch Normalization
Batch normalization offers several benefits to the training process of deep neural
networks:
1. Improved Optimization: It allows the use of higher learning rates, speeding up the
training process and reducing the need for careful tuning of parameters.
2. Regularization: It adds a slight noise to the activations, similar to dropout. This can help
to regularize the model and reduce overfitting.
3. Reduced Sensitivity to Initialization: It makes the network less sensitive to the initial
starting weights.
4. Allows Deeper Networks: By reducing internal covariate shift, batch normalization
allows for the training of deeper networks.
5.8.1.3 Example:

Initially, our inputs X1, X2, X3, X4 are in normalized form as they are coming from the
pre-processing stage.
When the input passes through the first layer, it is transformed: a sigmoid function is
applied over the dot product of the input X and the weight matrix W.

Similarly, this transformation takes place for the second layer and continues up to the last
layer L.

Although our input X was normalized, over time the output will no longer be on the same scale.
As the data go through multiple layers of the neural network and L activation functions are
applied, this leads to an internal covariate shift in the data.
5.9 Regularization, Dropout
5.9.1 What is Regularization?

If you've built a neural network before, you know how complex they are. This makes them more
prone to overfitting.

Regularization is a technique which makes slight modifications to the learning
algorithm such that the model generalizes better. This in turn improves the model's
performance on the unseen data as well.

5.9.1.1 How does Regularization help reduce Overfitting?


Let's consider a neural network which is overfitting on the training data as shown
in the image below.

Assume that our regularization coefficient is so high that some of the weight
matrices are nearly equal to zero.

This will result in a much simpler linear network and slight underfitting of the
training data. Such a large value of the regularization coefficient is not that useful. We need
to optimize the value of regularization coefficient in order to obtain a well-fitted model as
shown in the image below.

5.9.1.3 Different Regularization Techniques in Deep Learning
Now that we have an understanding of how regularization helps in reducing
overfitting, we'll learn a few different techniques in order to apply regularization in deep
learning.
5.9.1.4 L2 & L1 regularization
L1 and L2 are the most common types of regularization. These update the general
cost function by adding another term known as the regularization term.
Cost function = Loss (say, binary cross entropy) + Regularization term
Due to the addition of this regularization term, the values of weight matrices
decrease because it assumes that a neural network with smaller weight matrices leads to
simpler models. Therefore, it will also reduce overfitting to quite an extent.
However, this regularization term differs in L1 and L2.
In L2, we have (up to a constant scaling factor, which varies between texts):
Cost function = Loss + λ x Σ ||w||²
Here, lambda (λ) is the regularization parameter. It is the hyperparameter whose value is
optimized for better results.
L2 regularization is also known as weight decay as it forces the weights to decay towards
zero (but not exactly zero).
In L1, we have (again up to a constant scaling factor):
Cost function = Loss + λ x Σ ||w||
In this, we penalize the absolute value of the weights. Unlike L2, the weights may be
reduced to zero here. Hence, it is very useful when we are trying to compress our model.
Otherwise, we usually prefer L2 over it.
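The two penalty terms can be sketched in plain Python as follows. The function names are illustrative, the weights are given as a flat list for simplicity, and the penalty is scaled by lambda alone; many texts additionally divide by the number of training examples, which is an assumption left out here.

```python
def l2_penalty(weights, lam):
    # Sum of squared weights, scaled by the regularization parameter
    return lam * sum(w * w for w in weights)


def l1_penalty(weights, lam):
    # Sum of absolute weights, scaled by the regularization parameter
    return lam * sum(abs(w) for w in weights)


def regularized_cost(loss, weights, lam, kind="l2"):
    """Cost function = Loss + Regularization term."""
    penalty = l2_penalty(weights, lam) if kind == "l2" else l1_penalty(weights, lam)
    return loss + penalty


# Hypothetical values: a data loss of 1.0 and three weights
weights = [0.5, -1.0, 2.0]
print(regularized_cost(1.0, weights, lam=0.1, kind="l2"))
print(regularized_cost(1.0, weights, lam=0.1, kind="l1"))
```

Note how the largest weight (2.0) dominates the L2 penalty because it is squared, whereas in the L1 penalty every weight contributes in proportion to its size.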
5.9.2 Dropout
This is one of the most interesting types of regularization techniques. It also
produces very good results and is consequently the most frequently used regularization
technique in the field of deep learning.
To understand dropout, let's say our neural network structure is akin to the one shown

below:

So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections.

So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine learning.
The probability with which a node is dropped is the
hyperparameter of the dropout function. Dropout can be
applied to both the hidden layers and the input layers.
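Dropout on one layer's activations can be sketched as below. This uses the "inverted dropout" variant, which is an assumption beyond the text: surviving activations are divided by the keep probability during training so that nothing needs to be rescaled at inference time. The function name and arguments are illustrative, and drop_prob must be smaller than 1.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Randomly zero activations with probability `drop_prob` (inverted dropout)."""
    if not training or drop_prob == 0.0:
        # At inference time all nodes are kept and nothing is rescaled
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Keep each node with probability keep_prob; rescale the survivors
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(0)
hidden = [1.0] * 10
print(dropout(hidden, drop_prob=0.5, rng=rng))
print(dropout(hidden, drop_prob=0.5, rng=rng, training=False))
```

Each call in training mode draws a fresh random mask, which corresponds to the statement that each iteration works with a different set of nodes.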
5.9.2.1 Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data.
In machine learning, we are often not able to increase the size of the training data, as labeled
data is too costly.
But now let's consider that we are dealing with images. In this case, there are a few
ways of increasing the size of the training data: rotating the image, flipping, scaling,
shifting, etc. In the below image, some transformation has been done on the handwritten
digits dataset.

This technique is known as data augmentation. This usually provides a big leap in
improving the accuracy of the model.
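A few of the transformations mentioned above can be sketched on an image stored as a nested list of pixel values. The helper names and the tiny 2 x 2 image are hypothetical and serve only as an illustration; real pipelines usually rely on an image library for this.

```python
def flip_horizontal(image):
    # Mirror each row left to right
    return [row[::-1] for row in image]


def rotate_90_clockwise(image):
    # Reverse the row order, then transpose
    return [list(row) for row in zip(*image[::-1])]


def shift_right(image, fill=0):
    # Move every pixel one column to the right, padding with `fill`
    return [[fill] + row[:-1] for row in image]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [image, flip_horizontal(image), rotate_90_clockwise(image), shift_right(image)]


image = [[1, 2],
         [3, 4]]
augmented = augment(image)
for variant in augmented:
    print(variant)
```

One labeled image thus yields four training examples that share the same label, provided the transformation does not change what the image depicts.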
5.9.2.2 Early stopping
Early stopping is a kind of cross-validation strategy where we keep one part of the training
set as the validation set. When we see that the performance on the validation set is getting
worse, we immediately stop the training of the model. This is known as early stopping.
In the above image, we will stop training at the dotted line since after that our model will start
overfitting on the training data.
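The stopping rule can be sketched as a loop over per-epoch validation losses. The list of losses below is made up for illustration, and the patience parameter is an assumption added here: with patience=1 training stops the first time the validation loss fails to improve, which matches the "immediately stop" rule described above.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (best_epoch, best_loss) at the point where training would stop."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Validation performance improved: remember this epoch
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss is getting worse: stop training
    return best_epoch, best_loss


# Hypothetical validation losses, one per epoch
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.58, 0.65]
best_epoch, best_loss = early_stopping_epoch(val_losses)
print(best_epoch, best_loss)
```

In practice the model weights from the best epoch are usually saved so that they can be restored after training stops.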
