Module 1
Neural networks are parallel computing devices, which are basically an attempt to make a computer model of the brain. The main objective is to develop a system that performs various computational tasks faster than traditional systems. These tasks include pattern recognition and classification, approximation, optimization, and data clustering.
An Artificial Neural Network (ANN) is an efficient computing system whose central theme is borrowed from the analogy of biological neural networks. ANNs are also known as “artificial neural systems,” “parallel distributed processing systems,” or “connectionist systems.” An ANN consists of a large collection of units that are interconnected in some pattern to allow communication between the units. These units, also referred to as nodes or neurons, are simple processors that operate in parallel.
Every neuron is connected to other neurons through connection links. Each connection link is associated with a weight that carries information about the input signal. This is the most useful information for neurons to solve a particular problem, because the weight usually excites or inhibits the signal that is being communicated. Each neuron has an internal state, which is called an activation signal. Output signals, which are produced by combining the input signals and the activation rule, may be sent to other units.
Biological Neuron
A nerve cell (neuron) is a special biological cell that processes information. According to an estimate, there is a huge number of neurons, approximately 10^11, with numerous interconnections, approximately 10^15.
Schematic Diagram (biological neuron → ANN correspondence)
Soma → Node
Dendrites → Input
Axon → Output
The following table shows the comparison between ANN and BNN based on the criteria mentioned.

Criteria | BNN | ANN
Processing | Massively parallel, slow but superior to ANN | Massively parallel, fast but inferior to BNN
Size | 10^11 neurons and 10^15 interconnections | 10^2 to 10^4 nodes (mainly depends on the type of application and network design)
Learning | Can tolerate ambiguity | Very precise, structured and formatted data is required to tolerate ambiguity
Fault tolerance | Performance degrades with even partial damage | Capable of robust performance, hence has the potential to be fault tolerant
Storage capacity | Stores the information in the synapse | Stores the information in continuous memory locations
The following diagram represents the general model of ANN followed by its processing.
For the above general model of artificial neural network, the net input can be calculated as follows −
y_in = x_1·w_1 + x_2·w_2 + x_3·w_3 + … + x_m·w_m

i.e., net input y_in = ∑ (i = 1 to m) x_i·w_i

The output can be calculated by applying the activation function over the net input.

Y = F(y_in)
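As a concrete illustration of these two formulas, here is a minimal NumPy sketch; the input values, weights, and the binary step used as F are made-up examples, not values given in the text:

```python
import numpy as np

x = np.array([0.5, 0.2, 0.8])   # example inputs x_1..x_m (hypothetical values)
w = np.array([0.4, 0.7, 0.1])   # example weights w_1..w_m (hypothetical values)

y_in = np.dot(x, w)             # net input: sum of x_i * w_i

def F(net, threshold=0.0):
    """Example activation: a simple binary step function."""
    return 1.0 if net >= threshold else 0.0

Y = F(y_in)                     # output after applying the activation function
print(y_in, Y)
```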
Network Topology
A network topology is the arrangement of a network along with its nodes and connecting lines. According
to the topology, ANN can be classified as the following kinds −
Feedforward Network
It is a non-recurrent network having processing units/nodes arranged in layers, and all the nodes in a layer are connected with the nodes of the previous layer. The connections carry different weights. There is no feedback loop, which means the signal can only flow in one direction, from input to output. It may be divided into the following two types (a short sketch in code follows this list) −
• Single layer feedforward network − The concept is of a feedforward ANN having only one weighted layer. In other words, we can say the input layer is fully connected to the output layer.
• Multilayer feedforward network − The concept is of a feedforward ANN having more than one weighted layer. As this network has one or more layers between the input and the output layer, these are called hidden layers.
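The following minimal NumPy sketch contrasts the two topologies by their weighted layers; the layer sizes and the tanh activation are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(4)                    # 4 input features (example)

# Single layer feedforward: the input layer is fully connected to the output layer.
W_out = rng.random((4, 2))           # one weighted layer: 4 inputs -> 2 outputs
y_single = x @ W_out                 # signal flows in one direction only

# Multilayer feedforward: one hidden (weighted) layer between input and output.
W_hidden = rng.random((4, 3))        # 4 inputs -> 3 hidden nodes
W_output = rng.random((3, 2))        # 3 hidden nodes -> 2 outputs
h = np.tanh(x @ W_hidden)            # hidden layer with a nonlinear activation
y_multi = h @ W_output

print(y_single.shape, y_multi.shape) # both networks produce 2 output values
```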
Feedback Network
As the name suggests, a feedback network has feedback paths, which means the signal can flow in both
directions using loops. This makes it a non-linear dynamic system, which changes continuously until it
reaches a state of equilibrium. It may be divided into the following types −
• Recurrent networks − They are feedback networks with closed loops. Following are the
two types of recurrent networks.
• Fully recurrent network − It is the simplest neural network architecture because all nodes are connected to all other nodes and each node works as both input and output.
• Jordan network − It is a closed-loop network in which the output goes to the input again as feedback; a small sketch of this loop follows the list.
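A minimal sketch of the Jordan-style feedback loop, in which the previous output is appended to the next input; the sizes, random weights, and tanh activation are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random(4) - 0.5               # weights for 3 inputs + 1 feedback value

y_prev = 0.0                          # previous output, fed back as an extra input
for t in range(5):                    # run the loop over a short input sequence
    x_t = rng.random(3)               # external inputs at step t (example values)
    x_full = np.append(x_t, y_prev)   # feedback: the last output joins the current input
    y_prev = np.tanh(x_full @ W)      # new output, to be fed back at the next step
    print(f"step {t}: output = {y_prev:.3f}")
```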
Adjustments of Weights or Learning
Learning, in an artificial neural network, is the method of modifying the weights of the connections between the neurons of a specified network. Learning in an ANN can be classified into three categories, namely supervised learning, unsupervised learning, and reinforcement learning.
Supervised Learning
As the name suggests, this type of learning is done under the supervision of a teacher. This learning process is dependent on that supervision.
During the training of an ANN under supervised learning, the input vector is presented to the network, which produces an output vector. This output vector is compared with the desired output vector. An error signal is generated if there is a difference between the actual output and the desired output vector. On the basis of this error signal, the weights are adjusted until the actual output matches the desired output.
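As a concrete illustration of this error-driven weight adjustment, here is a minimal delta-rule sketch; the training data, learning rate, and linear output unit are hypothetical choices, not prescribed by the text:

```python
import numpy as np

# Tiny supervised task: learn weights so that x @ w approximates the desired outputs.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])   # input vectors (example data)
d = np.array([1.0, 1.0, 2.0])                        # desired outputs (example targets)

w = np.zeros(2)
lr = 0.1                                             # learning rate (assumed value)

for epoch in range(100):
    for x, target in zip(X, d):
        y = x @ w                    # actual output of the network
        error = target - y           # error signal = desired output - actual output
        w += lr * error * x          # adjust the weights based on the error signal

print(w)   # approaches [1, 1]: the actual output now matches the desired output
```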
Unsupervised Learning
As the name suggests, this type of learning is done without the supervision of a teacher. This learning process is independent of any such supervision.
During the training of an ANN under unsupervised learning, input vectors of a similar type are combined to form clusters. When a new input pattern is applied, the neural network gives an output response indicating the class to which the input pattern belongs.
There is no feedback from the environment about what the desired output should be or whether it is correct or incorrect. Hence, in this type of learning, the network itself must discover the patterns and features in the input data and the relation between the input data and the output.
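A minimal sketch of this clustering idea using simple winner-take-all (competitive) learning; the synthetic data, two prototypes, and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two groups of similar input vectors (synthetic example data)
data = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(1.0, 0.1, (20, 2))])

prototypes = rng.random((2, 2))        # one prototype (cluster centre) per output unit
lr = 0.2                               # learning rate (assumed value)

for x in data:
    winner = np.argmin(np.linalg.norm(prototypes - x, axis=1))   # nearest prototype
    prototypes[winner] += lr * (x - prototypes[winner])          # pull it toward x

# A new input pattern is assigned to the cluster whose prototype is nearest.
new_pattern = np.array([0.9, 1.1])
print(np.argmin(np.linalg.norm(prototypes - new_pattern, axis=1)))
```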
Reinforcement Learning
As the name suggests, this type of learning is used to reinforce or strengthen the network based on some critic information. This learning process is similar to supervised learning, but we may have much less information.
During the training of a network under reinforcement learning, the network receives some feedback from the environment. This makes it somewhat similar to supervised learning. However, the feedback obtained here is evaluative, not instructive, which means there is no teacher as in supervised learning. After receiving the feedback, the network adjusts its weights so as to obtain better critic information in the future.
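A minimal sketch of evaluative feedback: the network sees only a scalar reward from the environment, so a random weight change is kept only when it improves the reward. The environment, reward function, and noise scale below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.random((50, 3))
true_w = np.array([0.5, -0.2, 0.8])             # hidden "environment" (illustrative)

def reward(w):
    # Evaluative feedback only: a single score, not the desired outputs themselves.
    return -np.mean((X @ w - X @ true_w) ** 2)

w = np.zeros(3)
for step in range(200):
    candidate = w + rng.normal(0.0, 0.1, 3)     # try a small random weight adjustment
    if reward(candidate) > reward(w):           # keep it only if the critic score improves
        w = candidate

print(w)   # drifts toward the weights that earn the best reward
```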
Artificial Neural Networks (ANNs) are computational models inspired by the human brain. In other words, an ANN is a mathematical model of the way the human brain works. The main goal is to provide a result (or output) that is in line with our purpose after some processing. Just as the human brain has billions of neurons, an ANN also has hundreds or thousands of artificial neurons.
ANNs are used for regression or classification problems, and they are built from the following basic components:
1. Inputs: The independent variables (x) that are fed into the network.
2. Weights: Weight parameters (w) control the strength of the connection between inputs and neurons. They can also be said to represent the effect of an independent variable on the result.
3. Bias value (b): A constant value that allows us to control the output value. Also, when all inputs are zero, it ensures that the process can still continue.
4. Activation functions: The activation function (f) defines the output of the neuron according to certain conditions.
5. Output: The dependent variable (y) is the result we want to find. In perceptrons, the result is divided into two classes, 1 and 0.
The Perceptron
• The weighted sum is calculated by first multiplying each input by its weight and then adding the products. The bias value is included in this sum (bias = b = x0 × w0).
• The activation function is applied to the weighted sum (z) and the result is found. Perceptrons use the step function as their activation function.
According to the step function, if the weighted sum z is greater than or equal to zero, the output is 1; otherwise it is 0.
The application area of perceptrons is limited, because perceptrons are usually used only for simple binary classification problems that are linearly separable.
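A minimal perceptron sketch putting these steps together; the training data (the logical AND function) and the learning rate are illustrative assumptions:

```python
import numpy as np

def step(z):
    """Step activation: 1 if the weighted sum is >= 0, else 0."""
    return 1 if z >= 0 else 0

# Linearly separable example: the AND function (illustrative data).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0                        # bias (b = x0 * w0, with x0 fixed to 1)
lr = 0.1                       # learning rate (assumed value)

for epoch in range(10):
    for x_i, target in zip(X, y):
        z = x_i @ w + b        # weighted sum, including the bias
        error = target - step(z)
        w += lr * error * x_i  # perceptron learning rule
        b += lr * error

print([step(x_i @ w + b) for x_i in X])   # [0, 0, 0, 1]
```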
As the name suggests, multi-layer neural networks, or multi-layer perceptrons (MLPs), consist of more than one layer. Unlike single perceptrons, they can be used for non-linearly separable problems. They achieve this with the activation functions they use in their layers. The activation functions make the output of the neurons nonlinear, which enables the network to solve more complex problems. (Without activation functions, an ANN actually becomes a linear regression model.)
• Hidden layers (one or more) → The number of neurons they contain depends on the problem.
Processing proceeds as follows:
1. Each input is multiplied by its weight, and the weighted sum (plus bias) is computed.
2. This sum is transmitted to the related hidden neuron, and then the activation function present in that neuron is applied.
In the next step, the outputs of the hidden layers are transmitted to the output layer. As said before, the number of neurons here also depends on the problem.
The activation functions in the neurons of the output layer also depend on the task.
The main goal is to enable the ANN to learn the most accurate weight values (and so achieve the most accurate result) with the correct number of hidden layers and neurons. We can do this by applying certain processes in our artificial neural network and optimizing it.
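A minimal MLP forward-pass sketch on a non-linearly separable problem (XOR); the hand-picked weights and the sigmoid activation are illustrative assumptions rather than learned values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR is not linearly separable, so a single perceptron cannot solve it,
# but one hidden layer with a nonlinear activation can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked weights (illustrative): hidden unit 1 acts like OR, hidden unit 2 like AND.
W1 = np.array([[20.0, 20.0],
               [20.0, 20.0]])
b1 = np.array([-10.0, -30.0])

W2 = np.array([20.0, -20.0])   # output combines them as OR AND NOT(AND) = XOR
b2 = -10.0

H = sigmoid(X @ W1 + b1)       # hidden layer: weighted sum, then activation
y = sigmoid(H @ W2 + b2)       # output layer

print(np.round(y, 3))          # approximately [0, 1, 1, 0]
```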
Activation Functions
An activation function may be defined as the extra force or effort applied over the input to obtain an exact output. In an ANN, we apply activation functions over the net input to obtain the output.
It is simply a function that you use to get the output of a node. It is also known as a transfer function.
It is used to determine the output of a neural network, for example yes or no. It maps the resulting values into a range such as 0 to 1 or -1 to 1 (depending upon the function).
Linear or Identity Activation Function
The function is a line, i.e., linear. Therefore, the output of the function is not confined to any range.
Equation: f(x) = x
Range: (-infinity, infinity)
It does not help with the complexity or various parameters of the usual data that is fed to neural networks.
Non-linear Activation Function
The nonlinear activation functions are the most used activation functions. Nonlinearity gives the function a curved graph rather than a straight line.
It makes it easy for the model to generalize or adapt to a variety of data and to differentiate between the outputs.
Derivative or differential: the change in the y-axis with respect to the change in the x-axis. It is also known as the slope.
The nonlinear activation functions are mainly divided on the basis of their ranges or curves.
The sigmoid (logistic) function is differentiable, which means we can find the slope of the sigmoid curve at any point.
The logistic sigmoid function can cause a neural network to get stuck during training.
The softmax function is a more generalized logistic activation function which is used for multiclass classification.
tanh is also like logistic sigmoid but better. The range of the tanh function is from (-1 to 1). tanh is also
sigmoidal (s - shaped).
The advantage is that the negative inputs will be mapped strongly negative and the zero inputs will be
mapped near zero in the tanh graph.
Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
The ReLU (Rectified Linear Unit) is the most used activation function in the world right now, since it is used in almost all convolutional neural networks and deep learning models.
The ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is equal to z when z is greater than or equal to zero.
Range: [0, infinity)
The issue is that all the negative values become zero immediately, which decreases the ability of the model to fit or train on the data properly. Any negative input given to the ReLU activation function turns into zero immediately, so negative values are not mapped appropriately.
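A short NumPy sketch of the activation functions discussed above (sigmoid, tanh, ReLU, and softmax), evaluated on a few sample values chosen only for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                      # maps to (-1, 1), zero-centred

def relu(z):
    return np.maximum(0.0, z)              # 0 for z < 0, z otherwise

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract the max for numerical stability
    return e / e.sum()                     # outputs sum to 1 (multiclass probabilities)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])  # example inputs
print(sigmoid(z))
print(tanh(z))
print(relu(z))                             # note: the negatives become exactly zero
print(softmax(z))
```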
Curse of Dimensionality
Regarding the curse of dimensionality — also known as the Hughes Phenomenon — there are
two things to consider. On the one hand, ML excels at analyzing data with many dimensions.
Humans are not good at finding patterns that may be spread out across so many dimensions,
especially if those dimensions are interrelated in counter-intuitive ways. On the other hand, as
we add more dimensions we also increase the processing power we need to analyze the data,
and we also increase the amount of training data required to make meaningful data models.
High dimensional data is data in which the number of features (p) is bigger than the number of observations (N). High dimensional data is the problem that leads to the curse of dimensionality. This condition is usually written as p >> N.
The Hughes Phenomenon shows that as the number of features increases, the classifier’s performance increases as well, until we reach the optimal number of features. Adding more features while keeping the training set the same size will then degrade the classifier’s performance.
An increase in the number of dimensions of a dataset means there are more entries in the vector of features that represents each observation in the corresponding Euclidean space. We measure the distance in a vector space using the Euclidean distance, i.e., the square root of the sum of squared differences between corresponding entries.
Hence, each new dimension adds a non-negative term to this sum, so the distance between distinct vectors increases with the number of dimensions. In other words, as the number of features grows for a given number of observations, the feature space becomes increasingly sparse; that is, less dense or emptier. On the flip side, the lower data density requires more observations to keep the average distance between data points the same.
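A small simulation of this effect, showing how the average pairwise Euclidean distance between random points grows as dimensions are added; the sample sizes and dimensions are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100                                    # fixed number of observations

for p in (2, 10, 100, 1000):               # increasing number of features
    points = rng.random((n, p))            # n points in the unit hypercube
    # Average Euclidean distance over a sample of random pairs of points
    i, j = rng.integers(0, n, 500), rng.integers(0, n, 500)
    dists = np.linalg.norm(points[i] - points[j], axis=1)
    print(f"p = {p:5d}  mean pairwise distance = {dists.mean():.2f}")
```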
When the distance between observations grows, supervised machine learning becomes more difficult because predictions for new samples are less likely to be based on learning from similar training features. The number of possible unique rows grows exponentially as the number of features increases, which makes it much harder to generalize efficiently. The variance of a model also increases, as it gets more opportunity to overfit to noise in the additional dimensions, resulting in poor generalization performance.
Dimensionality reduction techniques help compress the data without losing much of the signal,
and combat the curse of dimensionality.
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model for too long on a noisy dataset. Overfitted models have low bias and high variance; they are typically very complex models, like decision trees, which are prone to overfitting.
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. Underfitted models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Such models are too simple to capture the complex patterns in the data, for example linear and logistic regression.
What is bias?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
What is variance?
Variance is the variability of model predictions for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.
If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias. So we need to find the right/good balance without overfitting or underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can’t be more complex and less complex at the same time.
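A small sketch of this tradeoff using polynomial fits of different complexity; the noisy sine data and the chosen degrees are arbitrary illustration choices:

```python
import numpy as np

rng = np.random.default_rng(5)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)  # noisy samples
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test)                                       # underlying pattern

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)           # fit a polynomial model
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (high bias); degree 9 fits the noise exactly (high variance)
    print(f"degree {degree}  train MSE = {train_mse:.3f}  test MSE = {test_mse:.3f}")
```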
When training a deep neural network with gradient-based learning and backpropagation, we find the partial derivatives by traversing the network from the final layer to the initial layer. Using the chain rule, layers that are deeper into the network go through continuous matrix multiplications in order to compute their derivatives.
In a network of n hidden layers, n derivatives will be multiplied together. If the derivatives are
large then the gradient will increase exponentially as we propagate down the model until they
eventually explode, and this is what we call the problem of exploding gradient. Alternatively,
if the derivatives are small then the gradient will decrease exponentially as we propagate through
the model until it eventually vanishes, and this is the vanishing gradient problem.
In the case of exploding gradients, the accumulation of large derivatives results in a model that is very unstable and incapable of effective learning. The large changes in the model's weights create a very unstable network; at extreme values the weights become so large that they cause overflow, resulting in NaN weight values that can no longer be updated. On the other hand, the accumulation of small gradients results in a model that is incapable of learning meaningful insights, since the weights and biases of the initial layers, which tend to learn the core features from the input data (X), are not updated effectively. In the worst case the gradient becomes 0, which in turn stops the network from training further.
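A quick numerical illustration of why repeatedly multiplying per-layer derivatives makes the gradient vanish or explode; the per-layer derivative values (0.25, roughly the maximum of the sigmoid's derivative, and 1.5) are illustrative assumptions:

```python
import numpy as np

n_layers = 30

small = 0.25   # e.g. the largest value the sigmoid derivative can take
large = 1.5    # e.g. a per-layer derivative slightly greater than 1

grad_vanish = np.prod(np.full(n_layers, small))    # product of 30 small derivatives
grad_explode = np.prod(np.full(n_layers, large))   # product of 30 large derivatives

print(f"product of {n_layers} derivatives of {small}: {grad_vanish:.3e}")   # ~8.7e-19
print(f"product of {n_layers} derivatives of {large}: {grad_explode:.3e}")  # ~1.9e+05
```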
How to know?
Exploding Gradients
There are a few subtle signs that may indicate your model is suffering from the problem of exploding gradients:
• The model is not learning much on the training data, therefore resulting in a poor loss.
• The model shows large changes in loss on each update due to its instability.
When faced with these problems, to confirm whether the problem is due to exploding gradients, there are some much more transparent signs, for instance:
• Model weights grow exponentially and become very large when training the
model.
Vanishing Gradient
There are also ways to detect whether your deep network is suffering from the vanishing gradient problem:
• The model will improve very slowly during the training phase and it is also
possible that training stops very early, meaning that any further training does
not improve the model.
• The weights closer to the output layer of the model would witness more of a
change whereas the layers that occur closer to the input layer would not change
much (if at all).
• Model weights shrink exponentially and become very small when training the
model.
Solutions
There are many approaches to addressing exploding and vanishing gradients; this section lists three approaches that you can use.
1. Reducing the Amount of Layers
This solution can be used in both scenarios (exploding and vanishing gradients). However, by reducing the number of layers in our network, we give up some of our model's complexity, since having more layers makes the network more capable of representing complex mappings.
2. Gradient Clipping
Checking for and limiting the size of the gradients whilst our model trains is another solution; this is commonly known as gradient clipping.
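A minimal sketch of clipping a gradient vector by its norm; the gradient values and the threshold of 1.0 are illustrative assumptions:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm; leave it unchanged otherwise."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

grad = np.array([3.0, -4.0])                 # an exploding gradient (norm = 5)
print(clip_by_norm(grad))                    # rescaled to norm 1: [ 0.6 -0.8]
print(clip_by_norm(np.array([0.1, 0.2])))    # a small gradient passes through unchanged
```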
3. Weight Initialization
A more careful choice of the random initialization for your network tends to be a partial solution, since it does not solve the problem completely.
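A sketch of two widely used initialization schemes, Xavier/Glorot (often paired with tanh or sigmoid) and He (often paired with ReLU); the layer sizes are arbitrary example values:

```python
import numpy as np

rng = np.random.default_rng(6)

def xavier_init(n_in, n_out):
    """Glorot/Xavier: variance scaled by fan-in and fan-out (suits tanh/sigmoid)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He: variance scaled by fan-in (suits ReLU)."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

W1 = xavier_init(256, 128)   # example layer: 256 inputs -> 128 units
W2 = he_init(128, 64)        # example layer: 128 inputs -> 64 units
print(W1.std(), W2.std())    # spreads kept small so gradients neither vanish nor explode
```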
In general deep learning modelling, we formulate a problem using the neurons and layers of the network together with a loss function, and the training of the model treats the weights as parameters. When backpropagation is included in the model, the errors defined by the loss function are propagated back through the network.
Every iteration of training tries to get closer to the point at which the error value is minimized by updating the weights. The model includes a set of weights associated with the loss function, and the main goal of modelling is to find the weight values that minimize the loss at every iteration and over the whole operation.
In simple words, we can say that convergence of a neural network is the point in training after which further changes from learning become small and the errors produced by the model in training come to a minimum. We can also say that a deep learning model has converged when the loss given by the model reaches its minimum. The convergence can be of two types, either global or local. One thing that is noticeable here is that convergence should happen with a decreasing trend of the error. However, in a variety of modelling procedures it is very rare to see a model converge strictly monotonically, but it is common to see the model converge in a roughly convex manner.
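A minimal gradient-descent sketch that declares convergence once the change in loss falls below a tolerance, which is one simple way to operationalise this idea; the toy quadratic loss, learning rate, and tolerance are illustrative assumptions:

```python
def loss(w):
    return (w - 3.0) ** 2            # toy loss with its minimum at w = 3

def grad(w):
    return 2.0 * (w - 3.0)

w, lr, tol = 0.0, 0.1, 1e-6
prev = loss(w)
for iteration in range(1000):
    w -= lr * grad(w)                # update the weight to reduce the error
    current = loss(w)
    if abs(prev - current) < tol:    # changes have become small: call it converged
        print(f"converged at iteration {iteration}: w = {w:.4f}, loss = {current:.6f}")
        break
    prev = current
```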
A typical convergence curve shows the training becoming converged after roughly the 20th iteration: the errors after that point are lower, still decreasing, and confined to a smaller range.
From the above, we can say that convergence is important because, while training, it helps us decide whether to proceed with the model or not. This section focuses on what happens when a neural network fails to converge. Let's take a look at what failing to converge means.
In simple words, we can think of failure to converge as a condition in which we cannot find a convergence point in the learning curve of a neural network. It directly means there is no point in the curve that can be identified as the start of a steadily lower and decreasing error.
For example, the errors may keep decreasing as the iteration count increases, yet we cannot tell from which point the error stays within a smaller range, nor what the global or local minima of the error are. In such a situation, we can say that the neural network has failed to converge. Let's see why this happens.
Most neural networks fail to converge because of an error in the modelling. Suppose the data needs to be transformed within the network, but the number of nodes we have provided is far too small; in such a situation, how can we expect the network to work properly? So in the majority of cases where the network fails to converge, the cause is inaccurate modelling. Some of the reasons behind this are as follows:
• Not providing enough nodes may be a reason behind this issue, because models with too few nodes would need to change their architecture drastically to model the data better, and so they fail to converge.
• The amount of training data is low, or the data we are feeding to the model is corrupted or was not collected with data integrity.
• The activation function we are using with the network often leads to good results from the model, but if the complexity of the problem is higher, the model can fail to converge.
• Inappropriate weight values applied in the network can also cause a failure in convergence. The weights we apply to the network should be chosen carefully according to the activation function.
• The learning rate parameter we have given to the network should be moderate, which means it should be neither much too large nor much too small.
In the above section, we discussed the reasons that can cause a failure in the convergence of neural networks. There are various things we can do to help avoid this failure. Let's take a look at some points that can help us prevent it.
• Implementing momentum: sometimes convergence depends on the data; if the data makes the model produce an oscillating, comb-like error curve, implementing momentum in the neural network can help avoid convergence failure and also helps boost the accuracy and speed of the model (see the sketch after this list).
• Re-initializing the weights of the network can help in avoiding the failure of convergence.
• If the training is stuck in a local minimum and subsequent sessions have exceeded the maximum number of iterations, the session has failed and we will get a higher error. In such a situation, starting another session can be helpful.
• A change in the activation function can be helpful. For example, if we are using a ReLU activation and the neurons of some nodes become biased in a way that causes them to never be activated, changing to another activation function can help.
• When performing classification using neural networks, we can shuffle the training data to avoid failure in convergence.
• The learning rate and the number of epochs should be kept in proportion while modelling a network. With a low learning rate, convergence happens in smaller steps, so too few epochs may end training before convergence appears, while an excessive number of epochs means a long wait for it. An excessively high learning rate should likewise be avoided if we want the neural network to converge reliably.
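A minimal sketch of the momentum update mentioned in the first bullet above; the toy loss, learning rate, and momentum coefficient of 0.9 are illustrative assumptions:

```python
def grad(w):
    return 2.0 * (w - 3.0)           # gradient of a toy loss (w - 3)^2

w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9                  # learning rate and momentum coefficient

for step in range(200):
    velocity = beta * velocity - lr * grad(w)   # accumulate a running average of gradients
    w += velocity                               # momentum smooths oscillating updates

print(round(w, 4))                   # converges close to the minimum at w = 3
```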
Perceptron (P):
Applications:
• Classification.
• Encode Database (Multilayer Perceptron).
• Monitor Access Data (Multilayer Perceptron).
Feed Forward Network (FF):
Applications:
• Data Compression.
• Pattern Recognition.
• Computer Vision.
• Sonar Target Recognition.
• Speech Recognition.
• Handwritten Characters Recognition.
Deep Feed Forward Network (DFF):
Applications:
• Data Compression.
• Pattern Recognition.
• Computer Vision.
• ECG Noise Filtering.
• Financial Prediction.
Recurrent Neural Network (RNN):
Applications:
• Machine Translation.
• Robot Control.
• Time Series Prediction.
• Speech Recognition.
• Speech Synthesis.
• Time Series Anomaly Detection.
• Rhythm Learning.
• Music Composition.
Long Short-Term Memory (LSTM):
Applications:
• Speech Recognition.
• Writing Recognition.
Applications: