Unit 5
Scientists had discovered that brain cells (neurons) receive input from our senses via electrical signals. The neurons then use electrical signals to store information and to make decisions based on previous input.
Frank Rosenblatt had the idea that a Perceptron could simulate these brain principles, with the ability to learn and make decisions. In 1957 he started something really big: he implemented the first Perceptron program on an IBM 704 computer at the Cornell Aeronautical Laboratory.
5.1.1.1 Definition of Perceptron
A Perceptron is an Artificial Neuron. It is the simplest possible Neural Network.
5.1.1.2 What is an Artificial Neuron?
An artificial neuron is a mathematical function based on a model of biological neurons: each neuron takes inputs, weighs them separately, sums them up, and passes this sum through a nonlinear function to produce an output.
In the biological analogy, the dendrites correspond to the inputs and the axon corresponds to the output.
The typical Artificial Neural Network looks something like the given figure.
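To make the weighted-sum-plus-nonlinearity idea concrete, here is a minimal sketch of a single artificial neuron in Python. The input values, weights, and bias shown are illustrative numbers chosen for the example, not values taken from the text:

    import numpy as np

    def artificial_neuron(inputs, weights, bias):
        """Weighted sum of the inputs plus a bias, passed through a nonlinearity."""
        z = np.dot(inputs, weights) + bias      # weigh and sum the inputs
        return 1.0 / (1.0 + np.exp(-z))         # sigmoid nonlinearity produces the output

    # Illustrative example: two inputs, two weights, one bias
    x = np.array([0.5, 0.8])
    w = np.array([0.4, 0.6])
    b = 0.1
    print(artificial_neuron(x, w, b))           # a single scalar output in (0, 1)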
5.1.4.2 Notations
In the representation below:
a_i^(in) refers to the i-th value in the input layer, a_i^(h) refers to the i-th unit in the hidden layer, and a_i^(out) refers to the i-th unit in the output layer. a_0^(in) is simply the bias unit and is equal to 1; it will have the corresponding weight w_0. The weight coefficient from layer l to layer l+1 is represented by w_{k,j}^(l).
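As an illustration of this notation, the following sketch computes the activations a^(in), a^(h), and a^(out) of a one-hidden-layer network. The layer sizes and the randomly initialized weights are assumptions made for the example, not values from the text:

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # a^(in): 3 input features plus the bias unit a_0^(in) = 1
    a_in = np.concatenate(([1.0], rng.random(3)))

    # W^(1): weight coefficients from the input layer to a hidden layer with 4 units
    W1 = rng.normal(size=(4, a_in.size))
    a_h = sigmoid(W1 @ a_in)                    # a^(h): hidden-layer activations

    # add the hidden bias unit, then map to 2 output units with W^(2)
    a_h = np.concatenate(([1.0], a_h))
    W2 = rng.normal(size=(2, a_h.size))
    a_out = sigmoid(W2 @ a_h)                   # a^(out): output-layer activations

    print(a_out)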
5.1.4.3 Advantages of Multi-Layer Perceptron:
1. A multi-layer perceptron model can be used to solve complex non-linear problems.
2. It works well with both small and large input data.
3. It helps us to obtain quick predictions after training.
4. It helps to obtain a comparable accuracy ratio with large as well as small data.
5.1.4.4 Disadvantages of Multi-Layer Perceptron:
1. In a multi-layer perceptron, computations are difficult and time-consuming.
2. In a multi-layer perceptron, it is difficult to estimate how much each independent variable affects the dependent variable.
3. The functioning of the model depends on the quality of the training.
The non-linear activation functions are mainly classified on the basis of their range or curves:
5.2.3.1 Types of Non-linear Activation Function
1. Sigmoid or Logistic Activation Function
The Sigmoid function curve looks like an S-shape. It is defined as σ(x) = 1 / (1 + e^(-x)) and maps any input to the range (0, 1).
2. Tanh (Hyperbolic Tangent) Activation Function
The advantage is that negative inputs will be mapped strongly negative and zero inputs will be mapped near zero in the tanh graph.
The function is differentiable and monotonic, while its derivative is not monotonic. The tanh function is mainly used for classification between two classes.
3. ReLU (Rectified Linear Unit) Activation Function
The ReLU is currently the most widely used activation function, since it is used in almost all convolutional neural networks and deep learning models.
4. Leaky ReLU Activation Function
It is defined as f(x) = max(a·x, x), where a is a small constant. The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01 or so.
When a is not 0.01, it is called Randomized ReLU. Therefore the range of the Leaky ReLU is (-infinity, infinity). Both Leaky and Randomized ReLU functions are monotonic in nature, and their derivatives are also monotonic.
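A minimal sketch of these activation functions in Python (NumPy-based, with a = 0.01 for the leaky variant, as suggested above) is shown below:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))         # range (0, 1)

    def tanh(x):
        return np.tanh(x)                       # range (-1, 1)

    def relu(x):
        return np.maximum(0.0, x)               # range [0, infinity)

    def leaky_relu(x, a=0.01):
        return np.where(x > 0, x, a * x)        # range (-infinity, infinity)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    for f in (sigmoid, tanh, relu, leaky_relu):
        print(f.__name__, f(x))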
5.2.4 Training Network
Neural network training is the process of teaching a neural network to perform a task.
Neural networks learn by initially processing several large sets of labeled or unlabeled data. By
using these examples, they can then process unknown inputs more accurately.
5.2.4.1 Supervised learning
In supervised learning, data scientists give artificial neural networks labeled datasets that
provide the right answer in advance. For example, a deep learning network being trained for facial recognition initially processes hundreds of thousands of images of human faces, with various terms related to ethnic origin, country, or emotion describing each image.
The neural network slowly builds knowledge from these datasets, which provide the right
answer in advance. After the network has been trained, it starts making guesses about the ethnic
origin or emotion of a new image of a human face that it has never processed before.
Fitting a neural network involves using a training dataset to update the model weights to
create a good mapping of inputs to outputs.
When the neural network gives an incorrect output, this leads to an output error. This error is the difference between the actual and predicted outputs, and a cost function measures this error.
The cost function (J) indicates how accurately the model performs. It tells us how far off our predicted output values are from our actual values; it is also known as the error. Because the cost function quantifies the error, we aim to minimize it.
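As an illustration, here is a minimal sketch of one common choice of cost function, the mean squared error. The specific cost function and the numbers used are assumptions for the example; the text does not fix a particular one:

    import numpy as np

    def mse_cost(y_true, y_pred):
        """Mean squared error: average squared difference between actual and predicted outputs."""
        return np.mean((y_true - y_pred) ** 2)

    y_true = np.array([1.0, 0.0, 1.0, 1.0])     # actual values (illustrative)
    y_pred = np.array([0.9, 0.2, 0.8, 0.6])     # predicted values (illustrative)
    print(mse_cost(y_true, y_pred))             # small value means predictions are close to targets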
Essentially, backpropagation aims to calculate the negative gradient of the cost function. This negative gradient is what helps in adjusting the weights: it gives us an idea of how we need to change the weights so that we can reduce the cost function.
Backpropagation uses the chain rule to calculate the gradient of the cost function. The chain rule involves taking derivatives: we calculate the partial derivative of the cost with respect to each parameter, differentiating with respect to one weight while treating the others as constants. As a result, we obtain a gradient, and having calculated the gradients, we can adjust the weights.
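To make the chain rule concrete, here is a minimal sketch for a single sigmoid neuron with a squared-error cost. The input, target, and starting weight are made-up values for illustration:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x, target = 0.5, 1.0            # made-up input and desired output
    w = 0.2                         # single weight we differentiate with respect to

    z = w * x                       # net input
    out = sigmoid(z)                # neuron output
    E = 0.5 * (target - out) ** 2   # squared-error cost

    # chain rule: dE/dw = dE/dout * dout/dz * dz/dw
    dE_dout = -(target - out)
    dout_dz = out * (1.0 - out)
    dz_dw = x
    dE_dw = dE_dout * dout_dz * dz_dw
    print(E, dE_dw)                 # the negative gradient -dE_dw tells us how to adjust w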
5.2.4.3 Learning as Optimization
Deep learning neural network models learn to map inputs to outputs given a training
dataset of examples.
The training process involves finding a set of weights in the network that proves to be
good, or good enough, at solving the specific problem.
This training process is iterative, meaning that it progresses step by step with small
updates to the model weights each iteration and, in turn, a change in the performance of the
model each iteration.
The iterative training process of neural networks solves an optimization problem that finds the parameters (model weights) that result in a minimum error or loss when evaluating the examples in the training dataset.
5.3.1 Gradient Descent
Gradient descent is an optimization algorithm that moves in the direction of the negative gradient of a function at the current point in order to reach a local minimum of that function; conversely, if we move in the direction of the positive gradient at the current point, we will get the local maximum of that function.
The main objective of using a gradient descent algorithm is to minimize the cost function
using iteration.
To achieve this goal, it performs two steps iteratively:
1. Calculates the first-order derivative of the function to compute the gradient or slope of
that function.
2. Moves in the direction opposite to the gradient (i.e., away from the direction in which the slope increases) from the current point by alpha times the gradient, where alpha is defined as the learning rate. It is a tuning parameter in the optimization process which helps to decide the length of the steps.
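The two steps above can be sketched as follows, using a simple one-dimensional cost J(w) = (w - 3)^2 chosen purely for illustration:

    # Gradient descent on J(w) = (w - 3)^2, whose minimum is at w = 3.
    def cost(w):
        return (w - 3.0) ** 2

    def gradient(w):
        return 2.0 * (w - 3.0)          # step 1: first-order derivative of the cost

    w = 0.0                             # arbitrary starting point
    alpha = 0.1                         # learning rate (step length)
    for step in range(50):
        w = w - alpha * gradient(w)     # step 2: move against the gradient
    print(w, cost(w))                   # w approaches 3, cost approaches 0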
5.3.1.2 What is Cost-function?
The cost function is defined as the measurement of the difference, or error, between the actual values and the expected values at the current position, expressed as a single real number.
It helps to improve machine-learning efficiency by providing feedback to the model so that it can minimize the error and find the local or global minimum.
Further, the algorithm continuously iterates along the direction of the negative gradient until the cost function approaches zero.
5.3.1.5 Challenges with Gradient Descent
1. Local Minima and Saddle Points
Whenever the slope of the cost function is at zero or just close to zero, the model stops learning further. Apart from the global minimum, there are some scenarios that can show this slope, namely saddle points and local minima.
Local minima generate a shape similar to the global minimum, where the slope of the cost function increases on both sides of the current point.
In contrast, at a saddle point the negative gradient occurs only on one side of the point, so the point is a local maximum on one side and a local minimum on the other side.
2. Vanishing and Exploding Gradient
In a deep neural network, if the model is trained with gradient descent and backpropagation, a vanishing gradient occurs when the gradient becomes much smaller than expected, so the weights of the earlier layers are barely updated; conversely, an exploding gradient occurs when the gradient becomes excessively large, making training unstable.
Stochastic Gradient Descent (SGD), for its part, may require a higher number of iterations to reach the minima because of the randomness in its descent. Even though it requires a higher number of iterations to reach the minima than typical Gradient Descent, it is still computationally much less expensive than typical Gradient Descent.
Hence, in most scenarios, SGD is preferred over Batch Gradient Descent for optimizing
a learning algorithm.
Parameter Initialization
In this step, the parameters, i.e., the weights and biases associated with the artificial neurons, are randomly initialized. After receiving the input, the network feeds the input forward through the layers to compute the output. For this worked example, the values are:
Input values: X1 = 0.05, X2 = 0.10
Initial weights: W1 = 0.15, W2 = 0.20, W3 = 0.25, W4 = 0.30, W5 = 0.40, W6 = 0.45, W7 = 0.50, W8 = 0.55
Bias values: b1 = 0.35, b2 = 0.60
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2. To find the value of y1, we first multiply the outputs of H1 and H2 by the corresponding weights:
y1 = H1 × w5 + H2 × w6 + b2
y1 = 0.593269992 × 0.40 + 0.596884378 × 0.45 + 0.60
y1 = 1.10590597
To calculate the final result of y1 we apply the sigmoid function:
y1_final = 1 / (1 + e^(-1.10590597)) = 0.75136507
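The forward pass can be checked with a short Python sketch. The value of b2 and the hidden-layer results H1 and H2 are quoted in the text; the value of b1 and the assignment of weights to connections (w1, w2 feed H1; w3, w4 feed H2; w5, w6 feed y1; w7, w8 feed y2) are inferred from those quoted numbers, following the standard layout of this worked example:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x1, x2 = 0.05, 0.10
    w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
    w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
    b1, b2 = 0.35, 0.60       # b2 appears in the text; b1 is implied by the quoted H1 and H2

    # hidden layer
    H1 = sigmoid(x1 * w1 + x2 * w2 + b1)    # 0.593269992
    H2 = sigmoid(x1 * w3 + x2 * w4 + b1)    # 0.596884378

    # output layer
    y1 = sigmoid(H1 * w5 + H2 * w6 + b2)    # 0.75136507
    y2 = sigmoid(H1 * w7 + H2 * w8 + b2)
    print(H1, H2, y1, y2)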
Now, we will backpropagate this error to update the weights using a backward pass.
Step 2: Backward pass at the output layer
To update the weights, we calculate the error corresponding to each weight with the help of the total error. The error on a weight w is calculated by differentiating the total error with respect to w.
From equation (2), it is clear that we cannot partially differentiate it with respect to w5, because w5 does not appear in it. We therefore split equation (1) into multiple terms so that we can easily differentiate it with respect to w5.
Now, we calculate each term one by one to differentiate Etotal with respect to w5, and we put the values of these partial derivatives into equation (3) to find the final result.
Now, we calculate the updated weight w5new with the help of the gradient-descent update rule, w5new = w5 − η × ∂Etotal/∂w5, where η is the learning rate.
In the same way, we calculate w6new, w7new, and w8new, and this gives us the following values:
w5new = 0.35891648, w6new = 0.408666186, w7new = 0.511301270, w8new = 0.561370121
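The backward pass for the output-layer weights can be reproduced with the following sketch. The target values (T1 = 0.01, T2 = 0.99) and the learning rate (η = 0.5) are assumptions: they are the conventional values for this classic worked example and are consistent with the updated weights quoted above, but they are not stated in the text:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # forward-pass values from the text
    H1, H2 = 0.593269992, 0.596884378
    w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
    b2 = 0.60
    y1 = sigmoid(H1 * w5 + H2 * w6 + b2)    # 0.75136507
    y2 = sigmoid(H1 * w7 + H2 * w8 + b2)

    # assumed values (conventional for this example, not quoted in the text)
    T1, T2 = 0.01, 0.99                     # target outputs
    eta = 0.5                               # learning rate

    # chain rule: dEtotal/dw5 = dE/dout_y1 * dout_y1/dnet_y1 * dnet_y1/dw5
    delta1 = (y1 - T1) * y1 * (1 - y1)
    delta2 = (y2 - T2) * y2 * (1 - y2)

    w5_new = w5 - eta * delta1 * H1         # ~0.35891648
    w6_new = w6 - eta * delta1 * H2         # ~0.408666186
    w7_new = w7 - eta * delta2 * H1         # ~0.511301270
    w8_new = w8 - eta * delta2 * H2         # ~0.561370121
    print(w5_new, w6_new, w7_new, w8_new)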
However, there are also some drawbacks to deep neural networks. One challenge is that
they can be more difficult to train than shallow networks, due to the increased number of
parameters and the risk of overfitting. Another challenge is that deep networks can be
computationally expensive to train and require a lot of data to achieve good performance.
Despite these challenges, deep neural networks have become a popular and powerful tool in research and practice. They have been used to achieve state-of-the-art results on a
wide range of tasks, from image and speech recognition to natural language processing and
game playing. In particular, deep learning has shown great promise for advancing the field of
artificial intelligence and enabling machines to perform increasingly complex tasks.
In conclusion, the difference between deep neural networks and shallow networks lies in
the number of layers they contain. While deep networks offer improved accuracy and the ability
to perform end-to-end learning, they also present challenges in terms of training and
computational resources.
A shallow neural network has only one hidden layer between the input and output layers. The input layer receives the data, the hidden layer processes it, and the final layer produces the output.
Shallow neural networks are simpler, more easily trained, and have greater computational efficiency than deep neural networks, which may have thousands of hidden units in dozens of layers.
Shallow networks are typically used for simpler tasks such as linear regression, binary
classification, or low dimensional feature extraction.
Grid search is an exhaustive algorithm that can find the best combination of
hyperparameters.
5.7.3.4 Random search
The random search method (as its name implies) chooses values randomly rather than
using a predefined set of values like the grid search method.
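A minimal sketch of the two strategies over a made-up hyperparameter space is shown below. The parameter names, value ranges, and the train_and_score function are illustrative assumptions standing in for whatever model and evaluation routine is being tuned:

    import itertools
    import random

    def train_and_score(params):
        # hypothetical stand-in for training a model and returning a validation score
        return -(params["learning_rate"] - 0.01) ** 2 - 0.001 * params["batch_size"]

    space = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [16, 32, 64]}

    # Grid search: exhaustively try every combination in the predefined grid.
    grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
    best_grid = max(grid, key=train_and_score)

    # Random search: sample a fixed number of random combinations instead.
    random.seed(0)
    samples = [{k: random.choice(v) for k, v in space.items()} for _ in range(5)]
    best_random = max(samples, key=train_and_score)

    print(best_grid, best_random)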
Similarly, this transformation will take place for the second layer and continue up to the last layer L, as shown in the following image.
Although our input X was normalized, with time the output will no longer be on the same scale. As the data goes through multiple layers of the neural network and L activation functions are applied, this leads to an internal covariate shift in the data.
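A minimal sketch of re-standardizing a layer's activations over a mini-batch (zero mean, unit variance, with a learnable scale gamma and shift beta, as in batch normalization) is shown below. The activation values are made up for illustration:

    import numpy as np

    def batch_normalize(a, gamma=1.0, beta=0.0, eps=1e-5):
        """Standardize activations over the batch dimension, then scale and shift."""
        mean = a.mean(axis=0)
        var = a.var(axis=0)
        a_hat = (a - mean) / np.sqrt(var + eps)
        return gamma * a_hat + beta

    # made-up activations for a mini-batch of 4 examples and 3 hidden units
    a = np.array([[1.0, 50.0, -3.0],
                  [2.0, 60.0, -1.0],
                  [0.5, 55.0, -2.0],
                  [1.5, 65.0, -4.0]])
    print(batch_normalize(a).mean(axis=0))   # ~0 for every unit
    print(batch_normalize(a).std(axis=0))    # ~1 for every unit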
5.9 Regularization, Dropout
5.9.1 What is Regularization?
If you've built a neural network before, you know how complex they are. This makes them more prone to overfitting.
Assume that our regularization coefficient is so high that some of the weight
matrices are nearly equal to zero.
This will result in a much simpler linear network and slight underfitting of the
training data. Such a large value of the regularization coefficient is not that useful. We need
to optimize the value of regularization coefficient in order to obtain a well-fitted model as
shown in the image below.
In L1 regularization, we penalize the absolute value of the weights. Unlike L2, the weights may be reduced to zero here. Hence, it is very useful when we are trying to compress our model. Otherwise, we usually prefer L2 over it.
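A minimal sketch of how the two penalties enter the cost is shown below. The regularization coefficient, the weights, and the unregularized loss are illustrative values:

    import numpy as np

    def l1_penalty(weights, lam):
        # L1: lambda * sum of absolute weights -> tends to drive weights to exactly zero
        return lam * np.sum(np.abs(weights))

    def l2_penalty(weights, lam):
        # L2: lambda * sum of squared weights -> shrinks weights but rarely to exactly zero
        return lam * np.sum(weights ** 2)

    w = np.array([0.5, -0.3, 0.0, 2.0])
    data_loss = 1.25                          # illustrative unregularized cost
    lam = 0.01                                # regularization coefficient
    print(data_loss + l1_penalty(w, lam))     # regularized cost with L1
    print(data_loss + l2_penalty(w, lam))     # regularized cost with L2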
5.9.2 Dropout
This is one of the most interesting types of regularization techniques. It also produces very good results and is consequently the most frequently used regularization technique in the field of deep learning.
To understand dropout, let's say our neural network structure is akin to the one shown below.
So what does dropout do? At every iteration, it randomly selects some nodes and
removes them along with all of their incoming and outgoing connections as shown below.
So each iteration has a different set of nodes and this results in a different set of
outputs. It can also be thought of as an ensemble technique in machine learning.
The probability of dropping a node is a hyperparameter of the dropout function. As seen in the image above, dropout can be applied to both the hidden layers as well as the input layer.
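A minimal sketch of dropout applied to a layer's activations during training is shown below, using the common inverted-dropout scaling. The drop probability and the activation values are illustrative:

    import numpy as np

    def dropout(activations, p_drop, rng):
        """Randomly zero out each unit with probability p_drop and rescale the rest
        (inverted dropout) so the expected activation stays the same."""
        mask = rng.random(activations.shape) >= p_drop
        return activations * mask / (1.0 - p_drop)

    rng = np.random.default_rng(0)
    a = np.array([0.2, 0.9, 0.5, 0.7, 0.1, 0.4])    # made-up layer activations
    print(dropout(a, p_drop=0.5, rng=rng))           # a different subset of nodes drops each call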
5.9.2.1 Data Augmentation
The simplest way to reduce overfitting is to increase the size of the training data.
In machine learning, we were not able to increase the size of the training data, as labeled data was too costly.
But now let's consider we are dealing with images. In this case, there are a few
ways of increasing the size of the training data – rotating the image, flipping, scaling,
shifting, etc. In the below image, some transformation has been done on the handwritten
digits dataset.
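A minimal sketch of such transformations on a single image, represented as a NumPy array, is shown below. The tiny 3x3 "image" is made up purely for illustration:

    import numpy as np

    image = np.array([[0, 1, 2],
                      [3, 4, 5],
                      [6, 7, 8]])               # made-up 3x3 grayscale image

    flipped = np.fliplr(image)                  # horizontal flip
    rotated = np.rot90(image)                   # 90-degree rotation
    shifted = np.roll(image, shift=1, axis=1)   # shift one pixel to the right
    scaled = np.kron(image, np.ones((2, 2), dtype=int))   # crude 2x upscaling

    for name, aug in [("flipped", flipped), ("rotated", rotated),
                      ("shifted", shifted), ("scaled", scaled)]:
        print(name)
        print(aug)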
5.9.2.2 Early Stopping
Early stopping is a kind of cross-validation strategy in which we keep one part of the training set as a validation set; when the performance on the validation set starts getting worse, we immediately stop training the model. This is known as early stopping.
In the above image, we will stop training at the dotted line, since after that our model will start overfitting on the training data.
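A minimal sketch of the early-stopping rule is shown below. The recorded validation errors and the patience value are hypothetical placeholders for whatever training loop and model are being used:

    # Hypothetical validation errors recorded after each epoch.
    validation_errors = [0.90, 0.70, 0.55, 0.48, 0.45, 0.46, 0.49, 0.53]

    patience = 2            # how many worsening epochs we tolerate before stopping
    best_error = float("inf")
    epochs_without_improvement = 0

    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error = err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch}, best validation error {best_error}")
                break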