DEEP LEARNING

MODULE 1
1.1 INTRODUCTION

🡪Artificial Intelligence – systems that mimic the human brain.

🡪Machine Learning – a branch of Artificial Intelligence. It focuses on the use of data and
algorithms to imitate the way that humans learn, gradually improving accuracy.
🡪Deep Learning – a subset of machine learning that is essentially a neural network
with three or more layers. These neural networks attempt to simulate the behavior of the
human brain, allowing it to "learn" from large amounts of data. While a neural network
with a single layer can still make approximate predictions, additional hidden layers can
help to optimize and refine for accuracy.
NEURON
🡪Neurons are the fundamental unit of the nervous system specialized to transmit
information to different parts of the body.
🡪Neurons are the building blocks of the nervous system.
🡪They receive and transmit signals to different parts of the body.
🡪This is carried out in both chemical and electrical forms.
🡪There are several different types of neurons that facilitate the transmission of
information.

🡪The sensory neurons carry information from the sensory receptor cells present
throughout the body to the brain.
🡪The motor neurons transmit information from the brain to the muscles.
🡪The interneurons transmit information between different neurons in the body.
• All neurons have three different parts – dendrites, cell body and axon.
• Parts of Neuron
• Following are the different parts of a neuron:
• Dendrites
• These are branch-like structures that receive messages from other neurons and
allow the transmission of messages to the cell body.
• Cell Body
• Each neuron has a cell body with a nucleus, Golgi body, endoplasmic reticulum,
mitochondria and other components.
• Axon
• Axon is a tube-like structure that carries electrical impulse from the cell body to
the axon terminals that pass the impulse to another neuron.
Synapse
• It is the chemical junction between the terminal of one neuron and the dendrites of
another neuron.
Artificial Neural Network
🡪An artificial neural network is a computational network inspired by the biological neural
networks that make up the structure of the human brain.
🡪Just as the human brain has neurons interconnected with one another, artificial neural networks
also have neurons that are linked to each other in the various layers of the network.
🡪These neurons are known as nodes.
Biological Neural Network vs. Artificial Neural Network:
• Dendrites → Inputs
• Cell nucleus → Nodes
• Synapse → Weights
• Axon → Output

• Artificial Neural Network primarily consists of three layers:

• Input Layer:
• As the name suggests, it accepts inputs in several different formats provided by the
programmer.
• Hidden Layer:
• The hidden layer presents in-between input and output layers. It performs all the
calculations to find hidden features and patterns.
• Output Layer:
• The input goes through a series of transformations using the hidden layer, which finally
results in output that is conveyed using this layer.
• The artificial neural network takes the inputs and computes the weighted sum of the inputs,
including a bias. This computation is represented in the form of a transfer function.

• The weighted total is then passed as an input to an activation function to produce the
output.
• Activation functions decide whether a node should fire or not.
• Only the nodes that fire make it to the output layer.
• There are distinct activation functions available that can be applied depending on the sort of
task we are performing.

Advantages of Artificial Neural Network (ANN)


1. Parallel processing capability:
Artificial neural networks can perform more than one task simultaneously.
2. Storing data on the entire network:
Information is stored across the entire network rather than in a single database, so the
loss of a few pieces of data in one place does not prevent the network from working.
3. Capability to work with incomplete knowledge:
After training, an ANN may produce output even with inadequate data. The loss of
performance here depends upon the significance of the missing data.
4. Having a memory distribution:
For an ANN to be able to adapt, it is important to determine suitable examples and to
train the network according to the desired output by demonstrating these examples to the
network. If an event cannot be shown to the network in all its aspects, it may produce
false output.
5. Having fault tolerance:
Corruption of one or more cells of an ANN does not prevent it from generating
output, and this feature makes the network fault-tolerant.

Disadvantages of Artificial Neural Network:


1. Assurance of proper network structure:
There is no particular guideline for determining the structure of artificial neural
networks. The appropriate network structure is found through experience, trial,
and error.
2. Unrecognized behavior of the network:
This is the most significant issue with ANNs. When an ANN produces a solution, it
does not provide insight concerning why and how, which decreases trust in the network.
3. Hardware dependence:
Artificial neural networks need processors with parallel processing power, as per
their structure. Therefore, realizing the network depends on suitable hardware.
4. Difficulty of showing the issue to the network:
ANNs work with numerical data, so problems must be converted into numerical
values before being introduced to the ANN. The representation mechanism chosen here
directly impacts the performance of the network and relies on the user's skill.
5. The duration of the network is unknown:
The network is trained until the error is reduced to a specific value, but this value does not
guarantee optimum results, so we cannot tell in advance when training is complete.
How do artificial neural networks work?
🡪An Artificial Neural Network can be best represented as a weighted directed graph, where the
artificial neurons form the nodes.
🡪The associations between neuron outputs and neuron inputs can be viewed as the directed
edges with weights.
🡪The Artificial Neural Network receives the input signal from the external source in the form of
a pattern or image represented as a vector.
🡪These inputs are then mathematically denoted by the notation x(n) for every n inputs.
🡪Afterward, each input is multiplied by its corresponding weight (these weights are the
details utilized by the artificial neural network to solve a specific problem).
🡪In general terms, these weights represent the strength of the interconnections between
neurons inside the artificial neural network.
🡪All the weighted inputs are summed inside the computing unit.
🡪If the weighted sum is equal to zero, a bias is added to make the output non-zero, or
to scale up the system's response.
🡪The bias unit has a fixed input equal to 1 with its own weight.
🡪Here the total of the weighted inputs can range from 0 to positive infinity.
🡪The activation function refers to the set of transfer functions used to achieve the desired output.
🡪There are different kinds of activation functions, primarily either linear or non-linear sets
of functions.
🡪Some of the commonly used activation functions are the binary, linear, and tan
hyperbolic sigmoidal activation functions, as sketched below.
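The computation described above can be sketched in a few lines of NumPy; the input values, weights, and bias below are purely illustrative:

```python
import numpy as np

def neuron_output(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation function."""
    z = np.dot(w, x) + b          # weighted sum + bias
    return activation(z)

# A few of the transfer functions mentioned above
def binary_step(z):
    return 1.0 if z >= 0 else 0.0

def linear(z):
    return z

x = np.array([0.5, -1.2, 3.0])    # example inputs (illustrative values)
w = np.array([0.4,  0.7, -0.2])   # corresponding weights
b = 0.1                           # bias

print(neuron_output(x, w, b, binary_step))
print(neuron_output(x, w, b, np.tanh))   # tan-hyperbolic sigmoidal activation
```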
Types of Artificial Neural Network
• Feedforward ANN
🡪Its flow is uni-directional, meaning that the information in the model flows in only one
direction—forward—from the input nodes, through the hidden nodes (if any) and to the output
nodes, without any cycles or loops.

• Feedback ANN
🡪In this type of ANN, the output is fed back into the network to achieve the best-evolved results
internally.
🡪Feedback networks feed information back into themselves and are well suited to solving
optimization problems.
🡪Internal system error corrections utilize feedback ANNs.

Types of Neural Network Architectures

Neural networks are an efficient way to solve machine learning problems and can be used in
various situations. Neural networks offer precision and accuracy. Finding the correct neural
network for each project can increase efficiency.

Standard neural networks

● Perceptron - A neural network that applies a mathematical operation to an input value,


providing an output variable.
● Feed-Forward Networks - A multi-layered neural network where the information moves
from left to right, or in other words, in a forward direction. The input values pass through
a series of hidden layers on their way to the output layer.
● Residual Networks (ResNet) - A deep feed-forward network with hundreds of layers.
Recurrent neural networks

Recurrent neural networks (RNNs) remember previously learned predictions to help make future
predictions with accuracy.

● Long short term memory network (LSTM) - LSTM adds extra structures, or gates, to an
RNN to improve memory capabilities.
● Echo state network (ESN) - A type of RNN whose hidden layers are sparsely connected.

Convolutional neural networks

Convolutional neural networks (CNNs) are a type of feed-forward network that are used for
image analysis and language processing. There are hidden convolutional layers that form
ConvNets and detect patterns. CNNs use features such as edges, shapes, and textures to detect
patterns. Examples of CNNs include:

● AlexNet - Contains multiple convolutional layers designed for image recognition.


● Visual geometry group (VGG) - VGG is similar to AlexNet, but has more layers of
narrow convolutions.
● Capsule networks - Contain nested capsules (groups of neurons) to create a more
powerful CNN.

Generative adversarial networks

Generative adversarial networks (GAN) are a type of unsupervised learning where data is
generated from patterns that were discovered from the input data. GANs have two main parts
that compete against one another:

● Generator - creates synthetic data from the learning phase of the model. It will take
random datasets and generate a transformed image.
● Discriminator - decides whether or not the images produced are fake or genuine.

GANs are used to help predict what the next frame in a video might be, text to image generation,
or image to image translation.

Transformer neural networks

Unlike RNNs, transformer neural networks do not process inputs sequentially in timesteps. This enables
them to process multiple inputs at once, making them a more efficient way to process data.

PERCEPTRON
• It is one of the oldest and first introduced neural networks.
• It was proposed by Frank Rosenblatt in 1958.
• The perceptron is also known as an artificial neuron and is the simplest form of neural network.
• The perceptron is mainly used to compute logical gates like AND, OR, and NOR, which
have binary inputs and binary outputs.
• The perceptron is a building block of an Artificial Neural Network.
• The perceptron is a linear Machine Learning algorithm used for supervised learning of
various binary classifiers. The algorithm enables neurons to learn elements and processes
them one by one during training.
The main functionality of the perceptron is:
• Takes input from the input layer.
• Weights the inputs and sums them up.
• Passes the sum to the activation function to produce the output.

🡪Activation functions can be anything like sigmoid, tanh, relu

• Input Nodes or Input Layer:


• This is the primary component of Perceptron which accepts the initial data into the
system for further processing. Each input node contains a real numerical value.
• Weight and Bias:
• The weight parameter represents the strength of the connection between units.
• Weight is directly proportional to the strength of the associated input neuron in deciding
the output.
• Bias can be considered as the intercept term in a linear equation.
• Activation Function:
• These are the final and important components that help to determine whether the neuron
will fire or not. Activation Function can be considered primarily as a step function.
• Types of Activation functions:
• Sign function
• Step function, and
• Sigmoid function
How does Perceptron work?
• In Machine Learning, Perceptron is considered as a single-layer neural network that
consists of four main parameters named input values (Input nodes), weights and Bias, net
sum, and an activation function.
• The perceptron model begins with the multiplication of all input values and their weights,
then adds these values together to create the weighted sum.
• Then this weighted sum is applied to the activation function 'f' to obtain the desired
output.
• This activation function is also known as the step function and is represented by 'f'.

Perceptron model works in two important steps as follows:

Step-1
In the first step, multiply all input values with their corresponding weight values and then add
them to determine the weighted sum. Mathematically, we can calculate the weighted sum as
follows:
∑ wi*xi = w1*x1 + w2*x2 + … + wn*xn
Add a special term called bias 'b' to this weighted sum to improve the model's performance.
∑wi*xi + b
Step-2
In the second step, an activation function is applied with the above-mentioned weighted sum,
which gives us output either in binary form or a continuous value as follows:
Y = f(∑wi*xi + b)
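A minimal NumPy sketch of these two steps, using a unit step as the activation f and illustrative weights and bias (w1 = w2 = 1, b = -1.5) that happen to realize an AND gate:

```python
import numpy as np

def step(z):                          # activation function f (unit step)
    return 1 if z >= 0 else 0

def perceptron(x, w, b):
    weighted_sum = np.dot(w, x) + b   # Step 1: sum(wi*xi) + b
    return step(weighted_sum)         # Step 2: Y = f(sum(wi*xi) + b)

# Illustrative weights/bias that realize an AND gate
w, b = np.array([1.0, 1.0]), -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(np.array(x), w, b))   # prints 0, 0, 0, 1
```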
Types of Perceptron Models
• Based on the layers, Perceptron models are divided into two types. These are as follows:
• Single-layer Perceptron Model
• Multi-layer Perceptron model

Single-layer Perceptron Model


• This is one of the simplest types of Artificial Neural Network (ANN).
• A single-layer perceptron model consists of a feed-forward network and also includes a
threshold transfer function inside the model.
• The main objective of the single-layer perceptron model is to analyze linearly
separable objects with binary outcomes.
• In a single-layer perceptron model, the algorithm does not use any prior recorded data, so it
begins with randomly allocated values for the weight parameters.
• Further, it sums up all the weighted inputs.
• After adding all inputs, if the total sum is more than a pre-determined value,
the model gets activated and shows the output value as +1.
• If the outcome matches the pre-determined threshold value, the performance of this
model is considered satisfactory, and the weights are not changed.
• Hence, to find the desired output and minimize errors, some changes to the input
weights are necessary.
• "Single-layer perceptrons can learn only linearly separable patterns."
Q. Explain the limitation of single layer perceptron.
• One of the main disadvantages of using a single-layer perceptron is its limited
expressive power and generalization ability.
• It cannot learn to classify non-linearly separable patterns, such as XOR, circles, or
spirals.
• It is also prone to overfitting and noise, as it tries to fit a straight line to the data.
• It does not have any hidden layers that can introduce non-linearity and flexibility to
the model.
• A "single-layer" perceptron can't implement XOR. The reason is because the classes
in XOR are not linearly separable. You cannot draw a straight line to separate the
points (0,0),(1,1) from the points (0,1),(1,0).
• Truth table of XOR:
  x1  x2 | x1 XOR x2
   0   0 |    0
   0   1 |    1
   1   0 |    1
   1   1 |    0

Here, when we apply w1*x1 + w2*x2 and compare it with a threshold t:
Case 1 (0, 0)
• w1*0 + w2*0 = 0, i.e., the actual output is below the threshold, so the neuron must not fire.
Case 2 (0, 1)
• 0*w1 + 1*w2 >= t, so it must cause a fire.
Case 3 (1, 0)
• 1*w1 + 0*w2 >= t, so it must cause a fire.
Case 4 (1, 1)
• 1*w1 + 1*w2 must not cause a fire, i.e., < t.
Together these require
• w1 >= t
• w2 >= t
• w1 + w2 < t
• Contradiction: no choice of weights and threshold satisfies all four cases, so a single-layer
perceptron cannot implement XOR (see the sketch below).
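A short sketch, assuming the classic perceptron learning rule with a fixed learning rate, that tries to learn XOR; because the classes are not linearly separable, the updates never reach zero misclassifications:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])        # XOR truth table outputs

w, b, lr = np.zeros(2), 0.0, 0.1  # weights, bias (threshold), learning rate

for epoch in range(100):          # classic perceptron learning rule
    errors = 0
    for xi, target in zip(X, y):
        pred = 1 if np.dot(w, xi) + b >= 0 else 0
        update = lr * (target - pred)
        w += update * xi
        b += update
        errors += int(update != 0)
    if errors == 0:               # converges only if the data is linearly separable
        break

print("misclassifications in last epoch:", errors)   # never reaches 0 for XOR
```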

Multi-Layered Perceptron Model:


• Like a single-layer perceptron model, a multi-layer perceptron model also has the same
model structure but has a greater number of hidden layers.
• The multi-layer perceptron model is also known as the Backpropagation algorithm, which
executes in two stages as follows:
• Forward Stage: Activation functions start from the input layer in the forward stage and
terminate on the output layer.
• Backward Stage: In the backward stage, weight and bias values are modified as per the
model's requirements. In this stage, the error between the actual output and the desired output is
propagated backward, starting at the output layer and ending at the input layer.
• Hence, a multi-layer perceptron model is considered as multiple artificial neural
networks having various layers, in which the activation function does not remain linear as in
a single-layer perceptron model. Instead of linear, the activation function can be
sigmoid, TanH, ReLU, etc., for deployment.
• A multi-layer perceptron model has greater processing power and can process linear and
non-linear patterns.
• Further, it can also implement logic gates such as AND, OR, XOR, NAND, NOT,
XNOR, NOR.
Advantages of Multi-Layer Perceptron:
• A multi-layered perceptron model can be used to solve complex non-linear problems.
• It works well with both small and large input data.
• It helps us to obtain quick predictions after the training.
• It helps to obtain the same accuracy ratio with large as well as small data
MULTI LAYER NEURAL NETWORK
• Multilayer neural networks contain more than one computational layer.
• Multilayer neural networks contain multiple computational layers;
• the additional intermediate layers (between input and output) are referred to as hidden
layers because the computations performed are not visible to the user.
The specific architecture of multilayer neural networks is referred to as feed-forward networks,
because successive layers feed into one another in the forward direction from input to output.
• The default architecture of feed-forward networks assumes that all nodes in one layer are
connected to those of the next layer. Therefore, the architecture of the neural network is
almost fully defined, once the number of layers and the number/type of nodes in each
layer have been defined.
• The only remaining detail is the loss function that is optimized in the output layer.
• The loss function is the function that computes the distance between the current
output of the algorithm and the expected output. It’s a method to evaluate how your
algorithm models the data.

• The number of units in each layer is referred to as the dimensionality of that layer.

🡪To be accurate a fully connected Multi-Layered Neural Network is known as Multi-Layer


Perceptron.
🡪A Multi-Layered Neural Network consists of multiple layers of artificial neurons or nodes
• Suppose we have n inputs (x1, x2, …, xn) and a bias unit. Let the weights applied be w1,
w2, …, wn. Then find the summation r by performing the dot product between the inputs
and weights and adding the bias unit:
  r = w1*x1 + w2*x2 + … + wn*xn + b
• On feeding r into the activation function F(r) we find the output of the hidden layer. For
the first hidden layer h1, the neuron output can be calculated as h1 = F(r).

• For all the other hidden layers, repeat the same procedure. Keep repeating the process
until the last weight set is reached, as sketched below.
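A minimal NumPy sketch of this layer-by-layer forward pass; the layer sizes, random weights, and the tanh activation are illustrative choices:

```python
import numpy as np

def F(r):
    return np.tanh(r)             # activation function F(r)

rng = np.random.default_rng(0)
x  = rng.normal(size=4)           # n = 4 inputs
W1 = rng.normal(size=(5, 4))      # weights into hidden layer h1 (5 units)
b1 = rng.normal(size=5)
W2 = rng.normal(size=(3, 5))      # weights into hidden layer h2 (3 units)
b2 = rng.normal(size=3)
W3 = rng.normal(size=(1, 3))      # weights into the output layer
b3 = rng.normal(size=1)

h1 = F(W1 @ x + b1)               # r = W.x + b, then apply F(r)
h2 = F(W2 @ h1 + b2)              # repeat the same procedure layer by layer
out = W3 @ h2 + b3                # the last weight set produces the output
print(out)
```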
ACTIVATION FUNCTION
• It’s a function that we use to get the output of a node. It is also known as a Transfer
Function.
• The primary role of the Activation Function is to transform the summed weighted input
from the node into an output value to be fed to the next hidden layer or as output.
• It is used in neural network to determine the output of neural network like yes or no. It
maps the resulting values in between 0 to 1 or -1 to 1 etc. (depending upon the function).
• The Activation Functions can be basically divided into 2 types-
1. Linear Activation Function
2. Non-linear Activation Functions
• The main terminologies needed to understand for nonlinear functions are:
• Derivative or Differential: Change in the y-axis w.r.t. change in the x-axis. It is also
known as the slope.
• Monotonic function: A function which is either entirely non-increasing or
non-decreasing.
1. Sigmoid or Logistic Activation Function
• The Sigmoid Function curve looks like an S-shape.

• The main reason why we use the sigmoid function is that its output lies between 0 and 1.
• Therefore, it is especially used for models where we have to predict a probability as the
output.
• Since the probability of anything exists only between 0 and 1, sigmoid is the
right choice.
• The function is differentiable. That means we can find the slope of the sigmoid curve at
any point.
• The function is monotonic, but the function's derivative is not.

2. Tanh or hyperbolic tangent Activation Function


tanh is also like the logistic sigmoid but better. The range of the tanh function is (-1 to
1). tanh is also sigmoidal (S-shaped).

• The function is differentiable.


• The function is monotonic while its derivative is not monotonic.
• The tanh function is mainly used for classification between two classes.
• Both tanh and logistic sigmoid activation functions are used in feed-forward nets.
3. ReLU (Rectified Linear Unit) Activation Function
• The ReLU is the most used activation function in the world right now, since it is used in
almost all convolutional neural networks and deep learning models.

• ReLU is half rectified (from the bottom): f(z) is zero when z is less than zero, and f(z) is
equal to z when z is greater than or equal to zero.
• Range: [ 0 to infinity)
• The function and its derivative both are monotonic.
• But the issue is that all negative values become zero immediately, which decreases the
ability of the model to fit or train from the data properly.
• That means any negative input given to the ReLU activation function turns into zero
immediately, which in turn affects the resulting graph by not mapping the negative values
appropriately.
4. Leaky ReLU
• It is an attempt to solve the dying ReLU problem

• The leak helps to increase the range of the ReLU function. Usually, the value of a is 0.01
or so.
• When a is not 0.01 then it is called Randomized ReLU.
• Therefore the range of the Leaky ReLU is (-infinity to infinity).
• Both Leaky and Randomized ReLU functions are monotonic in nature, and their
derivatives are also monotonic in nature.

Leaky ReLU: f(x) = max(0.01x, x)
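Minimal NumPy definitions of the activation functions discussed so far (sigmoid, tanh, ReLU, and Leaky ReLU with a = 0.01); the test values are illustrative:

```python
import numpy as np

def sigmoid(z):                   # range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):                      # range (-1, 1)
    return np.tanh(z)

def relu(z):                      # range [0, inf)
    return np.maximum(0.0, z)

def leaky_relu(z, a=0.01):        # f(x) = max(a*x, x), range (-inf, inf)
    return np.maximum(a * z, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(z))
```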
5. Softmax activation function
• The softmax activation function takes in a vector of raw outputs of the neural network
and returns a vector of probability scores.
• In the vector z of raw outputs, the maximum value is 1.23, which on applying softmax
activation maps to 0.664: the largest entry in the softmax output vector. Likewise, 0.25
and -0.8 map to 0.249 and 0.087: the second and the third largest entries in the softmax
output respectively. Thus, applying softmax preserves the relative ordering of scores.
• All entries in the softmax output vector are between 0 and 1.
• In a multiclass classification problem, where the classes are mutually exclusive, notice
how the entries of the softmax output sum up to 1: 0.664 + 0.249 + 0.087 = 1.
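The numbers above can be verified with a short softmax sketch (subtracting the maximum is a standard numerical-stability step, not part of the definition):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))     # subtract max for numerical stability
    return e / e.sum()

z = np.array([1.23, 0.25, -0.8])  # raw network outputs from the example above
p = softmax(z)
print(np.round(p, 3))             # -> [0.664 0.249 0.087]
print(p.sum())                    # entries sum to 1
```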

6. Hardtanh Activation Function


• Hardtanh is an activation function used for neural networks:

• The hard tanh activation function is a modified version of the tanh function that applies a
threshold to the output to produce an output between -1 and 1.
• The hard tanh function is faster to compute than the tanh function and is commonly used
in embedded systems and real-time applications.
LOSS FUNCTION
🡪Neural networks are a set of algorithms that are designed to recognize
trends/relationships in a given set of training data. These algorithms are based on the way
human neurons process information.
🡪The network's forward-pass equation (a weighted sum of the inputs plus a bias at each layer,
passed through an activation function) represents how a neural network processes the input data
at each layer and eventually produces a predicted output value.

🡪To train (the process by which the model maps the relationship between the training
data and the outputs), the neural network updates its parameters, the weights, wT,
and biases, b, to satisfy the equation above.
🡪Each training input is loaded into the neural network in a process called forward
propagation. Once the model has produced an output, this predicted output is compared
against the given target output in a process called backpropagation; the
parameters of the model are then adjusted so that the model outputs a result closer to
the target output.
• A loss function is a function that compares the target and predicted output values; it
measures how well the neural network models the training data. When training, we aim to
minimize this loss between the predicted and target outputs.
• The parameters are adjusted to minimize the average loss: we find the weights,
wT, and biases, b, that minimize the value of J (the average loss).

• Types of Loss Functions
• In supervised learning, there are two main types of loss functions : regression and
classification loss functions
• Regression Loss Functions — used in regression neural networks; given
an input value, the model predicts a corresponding output value (rather
than pre-selected labels); Ex. Mean Squared Error, Mean Absolute Error
• Classification Loss Functions — used in classification neural networks;
given an input, the neural network produces a vector of probabilities of the
input belonging to various pre-set categories — can then select the
category with the highest probability of belonging; Ex. Binary
Cross-Entropy, Categorical Cross-Entropy
• Mean Squared Error (MSE)
• One of the most popular loss functions, MSE finds the average of the squared differences
between the target and the predicted outputs


• Mean Absolute Error (MAE)
• MAE finds the average of the absolute differences between the target and the predicted
outputs.
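Minimal NumPy sketches of MSE and MAE; the target and predicted values are illustrative:

```python
import numpy as np

def mse(y_true, y_pred):          # mean of the squared differences
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):          # mean of the absolute differences
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # illustrative targets
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # illustrative predictions
print(mse(y_true, y_pred))        # 0.375
print(mae(y_true, y_pred))        # 0.5
```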

• Binary Cross-Entropy/Log Loss


• This is the loss function used in binary classification models — where the model takes in
an input and has to classify it into one of two pre-set categories.
• Categorical Cross-Entropy Loss
• In cases where the number of classes is greater than two, we utilize categorical
cross-entropy — this follows a very similar process to binary cross-entropy.
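Minimal NumPy sketches of the two cross-entropy losses; the labels and predicted probabilities are illustrative, and the clipping only guards against log(0):

```python
import numpy as np

def binary_cross_entropy(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)        # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_true_onehot, p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(p), axis=1))

# Binary case: true labels and predicted probabilities (illustrative)
print(binary_cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.7])))

# Multiclass case: one-hot targets and softmax outputs (illustrative)
y = np.array([[1, 0, 0], [0, 0, 1]])
p = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(categorical_cross_entropy(y, p))
```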

TRAINING A NEURAL NETWORK WITH BACKPROPAGATION


• Backpropagation is the essence of neural network training.
• It is the method of fine-tuning the weights of a neural network based on the error rate
obtained in the previous epoch (i.e., iteration).
• Proper tuning of the weights reduces error rates and makes the model reliable
by increasing its generalization.
• The backpropagation algorithm computes the gradient of the loss
function for a single weight by the chain rule.
• It efficiently computes one layer at a time.
• It contains two main phases, referred to as the forward and backward phases:

• Inputs X arrive through the preconnected path.
• The input is modeled using real weights W. The weights are usually randomly selected.
• Calculate the output for every neuron from the input layer, through the hidden layers, to the
output layer.
• Calculate the error in the output:
• ERROR = TARGET - ACTUAL
• Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased, as in the sketch below.
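A minimal sketch of the forward and backward phases on a one-hidden-layer network trained on XOR; the layer size, learning rate, number of epochs, and the choice of sigmoid activations with a squared-error-style update (ERROR = TARGET - ACTUAL) are all illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)      # targets (XOR)

W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)        # hidden layer (4 units)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)        # output layer
lr = 0.5

for epoch in range(5000):
    # Forward phase: input -> hidden -> output
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)

    # Error in the output: ERROR = TARGET - ACTUAL
    error = T - Y

    # Backward phase: propagate the error and adjust the weights (chain rule)
    dY = error * Y * (1 - Y)                          # sigmoid derivative at the output
    dH = (dY @ W2.T) * H * (1 - H)                    # error pushed back to the hidden layer
    W2 += lr * H.T @ dY;  b2 += lr * dY.sum(axis=0)
    W1 += lr * X.T @ dH;  b1 += lr * dH.sum(axis=0)

print(np.round(Y, 2))             # predictions should approach [0, 1, 1, 0]
```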


Practical Issues in Neural Network Training
• I.The Problem of Overfitting
• II. The Vanishing and Exploding Gradient Problems
• III.Difficulties in Convergence
• IV.Local and Spurious Optima
• V. Computational Challenges

• I.The Problem of Overfitting


• The primary objective in deep learning is to have a network that performs at its best on both
the training data and the test data/new data it hasn't seen before.
• Overfitting and underfitting are common occurrences encountered during training.
• There is always a gap between the training and test data performance, which is
particularly large when the models are complex and the data set is small.
• When the network tries to learn too much or too many details from the training data along
with the noise, the result is poor performance on the unseen or test dataset. When this
happens, the network fails to generalize the features/patterns found in the training data.
• Error vs iteration graph


• Overfitting during training can be spotted when the error on training data decreases to a
very small value but the error on the new data or test data increases to a large value.
• The error vs iteration graph shows how a deep neural network overfits on training data.
• The blue curve indicates the error on training data & the red curve the error on test data.
• The point where the green line intersects is the instance the network begins to overfit.
• As you can see, the error on test data increases sharply while error on training data
decreases.
• A new set of data points will result in the model/network performing poorly, as it has fit
very closely to all the training points, including the noise and outliers.
• The error on the training points is minimum or very small but the error on the new data
points will be high.
• One of the main reasons for the network to overfit is if the size of the training dataset is
small.
• When the network tries to learn from a small dataset it will tend to have greater control
over the dataset & will make sure to satisfy all the datapoints exactly.

• In order to understand this point, consider a simple single-layer neural network on a data
set with five attributes, where we use the identity activation to learn a real-valued target
variable.
• Consider a situation in which the observed target value is real and is always twice the
value of the first attribute, whereas other attributes are completely unrelated to the target.
However, we have only four training instances, which is one less than the number of
features (free parameters). For example, the training instances could be as follows:

🡪The correct parameter vector in this case is W = [2, 0, 0, 0, 0] based on the known relationship
between the first feature and target.
🡪The training data also provides zero error with this solution, although the relationship needs to
be learned from the given instances
🡪However, the problem is that the number of training points is fewer than the number of
parameters, and it is possible to find an infinite number of solutions with zero error.
🡪For example, the parameter set [0, 2, 4, 6, 8] also provides zero error on the training data.
🡪However, if we used this solution on unseen test data, it is likely to provide very poor
performance because the learned parameters are spuriously inferred and are unlikely to
generalize well to new points in which the target is twice the first attribute (and other attributes
are random).
🡪As a result, the solution does not generalize well to unseen test data, as the sketch below demonstrates.
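A small sketch of this situation: 4 random training instances with 5 attributes, and a target equal to twice the first attribute. The minimum-norm least-squares solution is just one of the infinitely many zero-error solutions, and it generalizes worse than W = [2, 0, 0, 0, 0]:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))       # 4 training instances, 5 attributes
y = 2 * X[:, 0]                   # target is always twice the first attribute

# Minimum-norm least-squares solution: zero training error is guaranteed
# because there are fewer training points (4) than parameters (5).
W, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(W, 3))             # generally NOT [2, 0, 0, 0, 0]
print(np.allclose(X @ W, y))      # True: zero error on the training data

# On unseen data this spuriously inferred solution performs worse
X_test = rng.normal(size=(100, 5))
y_test = 2 * X_test[:, 0]
print(np.mean((X_test @ W - y_test) ** 2))   # nonzero test error, unlike [2, 0, 0, 0, 0]
```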

Underfitting
Underfitting happens when the network can model neither the training data nor the test data,
which results in overall bad performance.
Looking at the graph, the model does not cover all the data points and has a high error on both
the training and test data.
The reasons for underfitting can be the limited capacity of the network, a limited number of
features provided as input to the network, noisy data, etc.
• It represents the inability of the model to learn the training data effectively, resulting in poor
performance on both the training and testing data.
• In simple terms, underfit models are inaccurate, especially when applied to new,
unseen examples.
• It mainly happens when we use a very simple model with overly simplified assumptions.
• To address the underfitting problem, we need to use more complex models with
enhanced feature representation and less regularization.
• Note: An underfitting model has high bias and low variance.

Reasons for Underfitting

The model is too simple, so it may not be capable of representing the complexities in the
data.
The input features used to train the model are not adequate representations of the
underlying factors influencing the target variable.
The size of the training dataset used is not enough.
Excessive regularization is used to prevent overfitting, which constrains the model from
capturing the data well.
Features are not scaled.
Techniques to Reduce Underfitting
Increase model complexity.
Increase the number of features by performing feature engineering.
Remove noise from the data.
Increase the number of epochs or increase the duration of training to get better results.
Measures to prevent overfitting
1. Decrease the network complexity
🡪Deep neural networks like CNNs are prone to overfitting because of the millions or billions of
parameters they enclose.
🡪By removing certain layers or decreasing the number of neurons (filters in a CNN), the network
becomes less prone to overfitting, as the neurons contributing to overfitting are removed or
deactivated.
🡪The network also has a reduced number of parameters, because of which it cannot memorize
all the data points and will be forced to generalize.

🡪There is no general rule as to how many layers or neurons should be removed, or how many
neurons a layer should have, before the network stops overfitting.
🡪Popular approaches for reducing network complexity are:
--Grid search can be applied to find the number of neurons and/or layers that reduce or
remove overfitting.
--The overfit model can be pruned (trimmed) by removing nodes or connections until it
reaches suitable performance on test data.
2. Data Augmentation
🡪One of the best strategies to avoid overfitting is to increase the size of the training
dataset.
🡪As discussed, when the size of the training data is small the network tends to have
greater control over the training data.
🡪But in real-world scenarios gathering of large amounts of data is a tedious &
time-consuming task, hence the collection of new data is not a viable option.

🡪Data augmentation provides techniques to increase the size of existing training data
without any external addition.
🡪If our training data consists of images, image augmentation techniques like rotation,
horizontal and vertical flipping, translation, increasing or decreasing the brightness, adding
noise, cutouts, etc. can be applied to the existing training images to increase the number of
instances.
🡪By applying the above-mentioned data augmentation strategies, the network is trained
on multiple instances of the same class of object in different perspectives.
🡪An augmented result of a lion’s photograph will have an instance of a lion being
viewed in a rotated manner, a lion being viewed up-side-down or cutting out the portion of an
image which encloses the mane of a lion.
🡪By applying the last augmentation (cutout) the network learns to associate the feature
that male lions have a mane with its class.
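A minimal NumPy sketch of a few of these augmentations (flips, rotation, brightness shift, cutout) applied to an illustrative stand-in image array:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(np.uint8)  # stand-in image

def augment(img):
    """Return several augmented copies of one training image."""
    out = {
        "h_flip":   np.fliplr(img),                    # horizontal flip
        "v_flip":   np.flipud(img),                    # vertical flip
        "rotate90": np.rot90(img),                     # rotation
        "brighter": np.clip(img.astype(int) + 40, 0, 255).astype(np.uint8),
    }
    cut = img.copy()                                   # cutout: zero a random patch
    y, x = rng.integers(0, 48, size=2)
    cut[y:y + 16, x:x + 16] = 0
    out["cutout"] = cut
    return out

augmented = augment(image)
print({name: arr.shape for name, arr in augmented.items()})
```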
3. Weight Regularization
🡪Weight regularization is a technique which aims to stabilize an overfitted network by
penalizing large weight values in the network.
🡪An overfitted network usually presents problems with large weight values, as a
small change in the input can lead to large changes in the output.
🡪For instance, when the network is given new or test data, it results in incorrect
predictions.
🡪Weight regularization penalizes the network's large weights, forcing the optimization
algorithm to reduce the larger weight values to smaller weights; this leads to stability of the
network and good performance.
🡪In weight regularization, the network configuration remains unchanged; only the values of
the weights are modified.
🡪Weight regularization reduces overfitting by penalizing or adding a constraint to the
loss function.
🡪Regularization terms are constraints the optimization algorithm (like Stochastic
Gradient Descent) must adhere to when minimizing the loss function, apart from minimizing the
error between the predicted value and the actual value.

🡪The two regularized loss functions, L1 and L2, have the form
  L1: Loss = Error(y, ŷ) + λ Σ |wi|
  L2: Loss = Error(y, ŷ) + λ Σ wi²
🡪There are two parts to each equation: the first part is the error between the actual target and
the predicted target (the loss function).
🡪The second part is the weight penalty or the regularization term.
🡪A regression model that uses the L1 regularization technique is called Lasso Regression,
and a model which uses L2 is called Ridge Regression.
🡪The key difference between these two is the penalty term, as the sketch below shows.
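A minimal sketch of how the penalty term is added to the data loss; MSE is used as the error term, and the regularization strength lambda is an illustrative hyperparameter:

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def l1_loss(y_true, y_pred, w, lam=0.01):
    return mse(y_true, y_pred) + lam * np.sum(np.abs(w))    # Lasso-style penalty

def l2_loss(y_true, y_pred, w, lam=0.01):
    return mse(y_true, y_pred) + lam * np.sum(w ** 2)       # Ridge-style penalty

w = np.array([3.0, -2.0, 0.5])            # illustrative weights
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.1, 1.2])
print(l1_loss(y_true, y_pred, w), l2_loss(y_true, y_pred, w))
```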
4. Dropouts

🡪Dropout is a regularization strategy that prevents deep neural networks from overfitting.

🡪It deactivates a certain number of neurons in a layer, preventing them from firing during training.

🡪At each iteration a different set of neurons is deactivated, and this results in a different set of
outputs.
🡪Many deep learning frameworks implement dropout as a layer which receives inputs from the
previous layer; the dropout layer randomly selects neurons which are not fired to the next layer.
🡪By deactivating certain neurons which might contribute to overfitting, the performance of the
network on test data improves.
🡪Dropout reduces overfitting in a variety of problems like image classification, image
segmentation, word embedding, etc., as in the sketch below.
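A minimal sketch of (inverted) dropout applied to one layer's activations during training; the drop rate is illustrative, and the rescaling keeps the expected activation unchanged:

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=np.random.default_rng()):
    """Randomly deactivate a fraction `rate` of neurons during training."""
    if not training or rate == 0.0:
        return activations                        # no dropout at test time
    mask = rng.random(activations.shape) >= rate  # neurons kept this iteration
    return activations * mask / (1.0 - rate)      # rescale so the expected value is unchanged

h = np.array([0.2, 1.5, -0.7, 0.9, 2.1])          # outputs of a hidden layer
print(dropout(h, rate=0.5))                       # a different subset is zeroed on each call
```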
5. Early Stopping
🡪While training a neural network using an optimization algorithm like Gradient Descent, the
model parameters (weights) are updated to reduce the training error.
🡪At the end of each forward propagation, the network parameters are updated to reduce error in
the next iteration.
🡪Too much training can result in network overfitting on the training data.
🡪Early stopping provides guidance as to how many iterations can be run before the network
begins to overfit.

The above graph indicates the point after which the network begins to overfit.
The network parameters at the point of early termination are the best fit for the model.
The test error can be decreased beyond the point of early termination by:
🡪Decreasing the learning rate. Applying a learning rate scheduler algorithm is
recommended.
🡪Applying a different optimization algorithm.
🡪Applying regularization.
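A minimal, runnable sketch of the early-stopping rule applied to a simulated validation-loss curve; the curve shape and the patience value are illustrative:

```python
import numpy as np

def early_stopping(val_losses, patience=5):
    """Return the epoch with the best validation loss, stopping after `patience` epochs without improvement."""
    best_val, best_epoch, wait = float("inf"), 0, 0
    for epoch, val in enumerate(val_losses):
        if val < best_val:                     # validation error still improving
            best_val, best_epoch, wait = val, epoch, 0
        else:                                  # no improvement this epoch
            wait += 1
            if wait >= patience:               # stop before the network overfits further
                break
    return best_epoch, best_val

# Illustrative validation-loss curve: decreases, then rises as overfitting begins
epochs = np.arange(100)
val_losses = 0.5 * np.exp(-epochs / 15) + 0.002 * np.maximum(0, epochs - 40)
stop_epoch, best = early_stopping(val_losses, patience=5)
print(stop_epoch, round(best, 4))
```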
6. Neural Architecture and Parameter Sharing
The most effective way of building a neural network is by constructing the architecture of
the neural network after giving some thought to the underlying data domain.
For example, the successive words in a sentence are often related to one another, whereas
the nearby pixels in an image are typically related.
These types of insights are used to create specialized architectures for text and image
data with fewer parameters.
Furthermore, many of the parameters might be shared. For example, a convolutional
neural network uses the same set of parameters to learn the characteristics of a local block of the
image.
7. Trading Off Breadth for Depth
🡪Networks with more layers (i.e., greater depth) tend to require far fewer units per layer,
because the composition functions created by successive layers make the neural network
more powerful.
🡪Increased depth is a form of regularization, as the features in later layers are forced to
obey a particular type of structure imposed by the earlier layers.
🡪The number of units in each layer can typically be reduced to such an extent that a deep
network often has far fewer parameters, even when added up over the greater
number of layers.
8. Ensemble Methods
🡪 A variety of ensemble methods like bagging are used in order to increase the
generalization power of the model.
🡪These methods are applicable not just to neural networks but to any type of machine
learning algorithm.
🡪However, in recent years, a number of ensemble methods that are specifically focused
on neural networks have also been proposed.
🡪Two such methods are Dropout and DropConnect.
🡪These methods can be combined with many neural network architectures to obtain an
additional accuracy improvement of about 2% in many real settings.
🡪However, the precise improvement depends on the type of data and the nature of the
underlying training.

II.The Vanishing and Exploding Gradient Problems


🡪While increasing depth often reduces the number of parameters of the network, it leads
to different types of practical issues.
🡪Propagating backwards using the chain rule has its drawbacks in networks with a large
number of layers in terms of the stability of the updates.
🡪In particular, the updates in earlier layers can either be negligibly small (vanishing
gradient) or they can be increasingly large (exploding gradient) in certain types of neural
network architectures.
III. Difficulties in Convergence
🡪Sufficiently fast convergence of the optimization process is difficult to achieve with
very deep networks, as depth leads to increased resistance to the training process in terms of
letting the gradients smoothly flow through the network.
🡪This problem is somewhat related to the vanishing gradient problem, but has its own
unique characteristics.
IV. Local and Spurious Optima
When the parameter space is large, and there are many local optima, it makes sense to
spend some effort in picking good initialization points.
One such method for improving neural network initialization is referred to as pretraining.
The basic idea is to use either supervised or unsupervised training on shallow
sub-networks of the original network in order to create the initial weights.
This type of pretraining is done in a greedy and layerwise fashion in which a single layer
of the network is trained at one time in order to learn the initialization points of that layer.
This type of approach provides initialization points that ignore drastically irrelevant parts
of the parameter space to begin with.
Furthermore, unsupervised pretraining often tends to avoid problems associated with
overfitting
V Computational Challenges
A significant challenge in neural network design is the running time required to train the
network.
It is not uncommon to require weeks to train neural networks in the text and image
domains.
In recent years, advances in hardware technology such as Graphics Processor Units
(GPUs) have helped to a significant extent.
GPUs are specialized hardware processors that can significantly speed up the kinds of
operations commonly used in neural networks.
In this sense, some algorithmic frameworks like Torch are particularly convenient
because they have GPU support tightly integrated into the platform.

HYPERPARAMETERS AND VALIDATION SETS


• Hyperparameters in Machine learning are those parameters that are explicitly defined by
the user to control the learning process.
• These hyperparameters are used to improve the learning of the model, and their values
are set before starting the learning process of the model.
• Here the prefix "hyper" suggests that the parameters are top-level parameters that are
used in controlling the learning process.
• The value of the Hyperparameter is selected and set by the machine learning engineer
before the learning algorithm begins training the model.
• Hence, these are external to the model, and their values cannot be changed during the
training process.

Some examples of Hyperparameters in Machine Learning

o The k in kNN or K-Nearest Neighbour algorithm


o Learning rate for training a neural network
o Train-test split ratio
o Batch Size
o Number of Epochs
o Branches in Decision Tree
o Number of clusters in Clustering Algorithm

Model Parameters

o Model parameters are configuration variables that are internal to the model, and a model
learns them on its own. Examples include the weights or coefficients of independent
variables in a linear regression model or an SVM, the weights and biases of a neural
network, and the cluster centroids in clustering.

o They are used by the model for making predictions.


o They are learned by the model from the data itself
o These are usually not set manually.
o These are the part of the model and key to a machine learning Algorithm.

Model Hyperparameters:

Hyperparameters are those parameters that are explicitly defined by the user to control the
learning process. Some key points about model hyperparameters are as follows:

o These are usually defined manually by the machine learning engineer.


o One cannot know the exact best value for hyperparameters for the given problem. The
best value can be determined either by the rule of thumb or by trial and error.
o Some examples of Hyperparameters are the learning rate for training a neural
network, K in the KNN algorithm
Categories of Hyperparameters

Broadly hyperparameters can be divided into two categories, which are given below:

1. Hyperparameter for Optimization


2. Hyperparameter for Specific Models

Hyperparameter for Optimization

The process of selecting the best hyperparameters to use is known as hyperparameter tuning, and
the tuning process is also known as hyperparameter optimization. Optimization parameters are
used for optimizing the model.

• Learning Rate: The learning rate is the hyperparameter in optimization algorithms that
controls how much the model needs to change in response to the estimated error each
time the model's weights are updated. It is one of the crucial parameters when
building a neural network, and it also determines the frequency of cross-checking with
model parameters. Selecting the optimal learning rate is a challenging task because if
the learning rate is too small, it may slow down the training process. On the other
hand, if the learning rate is too large, the model may not be optimized properly.
• Note: The learning rate is a crucial hyperparameter for optimizing the model, so if
there is a requirement to tune only a single hyperparameter, it is suggested to
tune the learning rate.
• Batch Size: To enhance the speed of the learning process, the training set is divided into
different subsets, which are known as batches.
• Number of Epochs: An epoch can be defined as the complete cycle for training the
machine learning model. Epoch represents an iterative learning process. The number of
epochs varies from model to model, and various models are created with more than one
epoch. To determine the right number of epochs, a validation error is taken into account.
• The number of epochs is increased until there is a reduction in a validation error.
If there is no improvement in reduction error for the consecutive epochs, then it
indicates to stop increasing the number of epochs.

Hyperparameter for Specific Models

Hyperparameters that are involved in the structure of the model are known as hyperparameters
for specific models. These are given below:

o Number of Hidden Units: Hidden units are part of neural networks; they refer to the
components comprising the layers of processors between the input and output units in a
neural network.
It is important to specify the number of hidden units for the neural network. It
should be between the size of the input layer and the size of the output layer. A common rule of
thumb is that the number of hidden units should be 2/3 of the size of the input layer plus the size
of the output layer.

For complex functions it is necessary to specify enough hidden units, but not so many that the
model overfits.

o Number of Layers: A neural network is made up of vertically arranged components,


which are called layers. There are mainly input layers, hidden layers, and output
layers. A 3-layered neural network gives a better performance than a 2-layered network.
For a Convolutional Neural network, a greater number of layers make a better model.

Validation Set

● Training set: The data you will use to train your model. This will be fed into an
algorithm that generates a model. It maps inputs to outputs.
● Validation set: This is smaller than the training set, and is used to evaluate the
performance of models with different hyperparameter values. It's also used to detect
overfitting during the training stages.
● Test set: This set is used to get an idea of the final performance of a model after
hyperparameter tuning. It's also useful to get an idea of how different models (SVMs,
Neural Networks, Random forests...) perform against each other.

🡪The validation and test sets are usually much smaller than the training set.

🡪The validation and test sets are put aside at the beginning of the project and are not
used for training.

The validation set is used to fine-tune the hyperparameters of the model and is considered a
part of the training of the model. The model only sees this data for evaluation but does not learn
from this data
Train vs. Validation vs. Test set
For training and testing purposes of our model, we should have our data broken down into three
distinct dataset splits.

The Training Set


It is the set of data that is used to train and make the model learn the hidden features/patterns in
the data.
In each epoch, the same training data is fed to the neural network architecture repeatedly, and the
model continues to learn the features of the data.
The training set should have a diversified set of inputs so that the model is trained in all scenarios
and can predict any unseen data sample that may appear in the future.

The Validation Set


The validation set is a set of data, separate from the training set, that is used to validate our
model performance during training.
This validation process gives information that helps us tune the model’s hyperparameters and
configurations accordingly. It is like a critic telling us whether the training is moving in the right
direction or not.
The model is trained on the training set, and, simultaneously, the model evaluation is performed
on the validation set after every epoch.
The main idea of splitting the dataset into a validation set is to prevent our model from
overfitting i.e., the model becomes really good at classifying the samples in the training set but
cannot generalize and make accurate classifications on the data it has not seen before.

The Test Set


The test set is a separate set of data used to test the model after completing the training.
It provides an unbiased final model performance metric in terms of accuracy, precision, etc. To
put it simply, it answers the question of "How well does the model perform?"
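A minimal NumPy sketch of such a three-way split; the 70/15/15 proportions and the synthetic data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))                  # illustrative dataset
y = rng.integers(0, 2, size=1000)

idx = rng.permutation(len(X))                    # shuffle before splitting
n_train, n_val = int(0.7 * len(X)), int(0.15 * len(X))

train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train + n_val]
test_idx  = idx[n_train + n_val:]

X_train, y_train = X[train_idx], y[train_idx]    # used to fit the model
X_val,   y_val   = X[val_idx],   y[val_idx]      # used to tune hyperparameters
X_test,  y_test  = X[test_idx],  y[test_idx]     # used only for the final evaluation
print(len(X_train), len(X_val), len(X_test))     # 700, 150, 150
```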

Estimators Bias and Variance

BIAS

Bias is the inability of the model to capture the true relationship in the data, because of which
there is some difference or error between the model's predicted value and the actual value.

These differences between the actual or expected values and the predicted values are known as
error, bias error, or error due to bias.

Bias is a systematic error that occurs due to wrong assumptions in the machine learning
process.
Let Y be the true value of a parameter, and let Y' be an estimator of Y based on a sample of
data. Then, the bias of the estimator Y' is given by:

• Bias(Y') = E(Y') - Y
• where E(Y') is the expected value of the estimator Y'. Bias measures how well the model
fits the data.

VARIANCE
Variance is the measure of spread in data from its mean position.
In machine learning, variance is the amount by which the performance of a predictive model
changes when it is trained on different subsets of the training data.
More specifically, variance is the variability of the model: how sensitive it is to another
subset of the training dataset, i.e., how much it adjusts to a new subset of the training
dataset.
Let Y be the actual values of the target variable, and Y' be the predicted values of the
target variable.
Then the variance of a model can be measured as the expected value of the square of the
difference between the predicted values and the expected value of the predicted values:
Variance = E[(Y' - E[Y'])^2]
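A minimal sketch estimating these two quantities for a simple estimator (the sample mean) by simulating many training samples; the true value, noise level, and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 5.0                                 # Y: the true parameter

estimates = []
for _ in range(10_000):                          # many different training samples
    sample = rng.normal(loc=true_value, scale=2.0, size=20)
    estimates.append(sample.mean())              # Y': the estimator on this sample
estimates = np.array(estimates)

bias = estimates.mean() - true_value             # Bias(Y') = E(Y') - Y
variance = np.mean((estimates - estimates.mean()) ** 2)   # E[(Y' - E[Y'])^2]
print(round(bias, 4), round(variance, 4))        # bias ~ 0, variance ~ 2^2 / 20 = 0.2
```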
Ways to Reduce Variance in Machine Learning:
Cross-validation: By splitting the data into training and testing sets multiple times,
cross-validation can help identify if a model is overfitting or underfitting and can be used to tune
hyperparameters to reduce variance.
Feature selection: Choosing only the relevant features decreases the model's complexity
and can reduce the variance error.
Regularization: We can use L1 or L2 regularization to reduce variance in machine
learning models
Ensemble methods: It will combine multiple models to improve generalization
performance. Bagging, boosting, and stacking are common ensemble methods that can help
reduce variance and improve generalization performance.
Simplifying the model: Reducing the complexity of the model, such as decreasing the
number of parameters or layers in a neural network, can also help reduce variance and improve
generalization performance.
Early stopping: Early stopping is a technique used to prevent overfitting by stopping the
training of the deep learning model when the performance on the validation set stops improving.
Deep Learning
Deep learning is a method in artificial intelligence (AI) that teaches computers to process
data in a way that is inspired by the human brain. Deep learning models can recognize complex
patterns in pictures, text, sounds, and other data to produce accurate insights and predictions.

Machine Learning vs. Deep Learning

• Machine learning applies statistical algorithms to learn the hidden patterns and relationships in
the dataset, whereas deep learning uses artificial neural network architectures to learn the hidden
patterns and relationships in the dataset.
• Machine learning can work on a smaller amount of data, whereas deep learning requires a
larger volume of data.
• Machine learning is better for low-label tasks, whereas deep learning is better for complex tasks
like image processing, natural language processing, etc.
• Machine learning takes less time to train the model; deep learning takes more time.
• In machine learning, a model is created from relevant features that are manually extracted from
images to detect an object in the image; in deep learning, relevant features are automatically
extracted from the images, making it an end-to-end learning process.
• Machine learning models are less complex and it is easy to interpret the results; deep learning
models are more complex, work like a black box, and their results are not easy to interpret.
• Machine learning can work on a CPU and requires less computing power; deep learning
requires a high-performance computer with a GPU.
🡪Deep learning is the branch of machine learning which is based on artificial neural network
architecture. An artificial neural network or ANN uses layers of interconnected nodes called
neurons that work together to process and learn from the input data.
🡪Deep Learning is a subfield of Machine Learning that involves the use of neural networks
to model and solve complex problems. Neural networks are modeled after the structure and
function of the human brain and consist of layers of interconnected nodes that process and
transform data.
🡪The key characteristic of Deep Learning is the use of deep neural networks, which have
multiple layers of interconnected nodes. These networks can learn complex representations of
data by discovering hierarchical patterns and features in the data. Deep Learning algorithms can
automatically learn and improve from data without the need for manual feature engineering.
🡪Deep learning has achieved significant success in various fields, including image recognition,
natural language processing, speech recognition, and recommendation systems. Some of the
popular Deep Learning architectures include Convolutional Neural Networks (CNNs), Recurrent
Neural Networks (RNNs), and Deep Belief Networks (DBNs).
🡪Training deep neural networks typically requires a large amount of data and computational
resources. However, the availability of cloud computing and the development of specialized
hardware, such as Graphics Processing Units (GPUs), have made it easier to train deep neural
networks.

Applications of deep learning


🡪Computer vision
🡪Reinforcement Learning
🡪NLP
Challenges in Deep Learning
🡪Data availability: Deep learning requires large amounts of data to learn from, so gathering
enough data for training is a big concern.
🡪Computational resources: Training a deep learning model is computationally
expensive because it requires specialized hardware like GPUs and TPUs.
🡪Time-consuming: When working on sequential data, training can take a very long time,
even days or months, depending on the computational resources.
🡪Interpretability: Deep learning models are complex and hard to interpret.
🡪Overfitting: When the model is trained again and again, it becomes too specialized for the
training data, leading to overfitting and poor performance on new data.
Advantages of Deep Learning:
1. High accuracy: Deep Learning algorithms can achieve state-of-the-art performance in
various tasks, such as image recognition and natural language processing.
2. Automated feature engineering: Deep Learning algorithms can automatically discover
and learn relevant features from data without the need for manual feature engineering.
3. Scalability: Deep Learning models can scale to handle large and complex datasets, and
can learn from massive amounts of data.
4. Flexibility: Deep Learning models can be applied to a wide range of tasks and can handle
various types of data, such as images, text, and speech.
5. Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.
Disadvantages of Deep Learning:
1. High computational requirements: Deep Learning models require large amounts of data
and computational resources to train and optimize.
2. Requires large amounts of labeled data: Deep Learning models often require a large
amount of labeled data for training, which can be expensive and time-consuming to
acquire.
3. Interpretability: Deep Learning models can be challenging to interpret, making it difficult
to understand how they make decisions.
4. Overfitting: Deep Learning models can sometimes overfit to the training data, resulting in
poor performance on new and unseen data.
5. Black-box nature: Deep Learning models are often treated as black boxes, making it
difficult to understand how they work and how they arrived at their predictions.
In summary, while Deep Learning offers many advantages, including high accuracy and
scalability, it also has some disadvantages, such as high computational requirements, the
need for large amounts of labeled data, and interpretability challenges. These limitations
need to be carefully considered when deciding whether to use Deep Learning for a
specific task.

DEEP FEED FORWARD NETWORK (DFF)


In its simplest form, a feed-forward neural network is a single-layer perceptron. A sequence of
inputs enters the layer and is multiplied by the weights in this model. The weighted input values
are then summed together to form a total. If the sum of the values is more than a predetermined
threshold, which is normally set at zero, the output value is usually 1, and if the sum is less than
the threshold, the output value is usually -1. The single-layer perceptron is a popular
feed-forward neural network model that is frequently used for classification. Single-layer
perceptrons can also contain machine learning features.

The neural network can compare the outputs of its nodes with the desired values using a property
known as the delta rule, allowing the network to alter its weights through training to create more
accurate output values. This training and learning procedure results in gradient descent. The
technique of updating weights in multi-layered perceptrons is virtually the same, however, the
process is referred to as back-propagation. In such circumstances, the output values provided by
the final layer are used to alter each hidden layer inside the network.
A Feed Forward Neural Network is an artificial neural network in which the connections
between nodes do not form a cycle. The feed-forward model is the simplest form of neural
network, as information is only processed in one direction. While the data may pass through
multiple hidden nodes, it always moves in one direction and never backwards.
The structure of a DFF is very similar to that of an FF. The major difference between them is the
number of hidden layers. Currently, people refer to a neural network with one hidden layer as a
"shallow" network or simply a feed-forward network.
Feed-forward neural networks perform well when solving basic problems like identifying
simple patterns or classifying information. However, they struggle with more complex tasks.
Deep networks, on the other hand, can process and analyze vast data volumes thanks to
their several hidden layers of abstraction.
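A minimal sketch of a small deep feed-forward network using the Keras Sequential API (assuming TensorFlow is installed); the synthetic data, layer sizes, and training settings are illustrative:

```python
import numpy as np
import tensorflow as tf

# Illustrative data: 1000 samples, 10 features, binary labels
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("float32")

# A deep feed-forward network: several hidden layers, information flows one way
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(10,)),
    tf.keras.layers.Dense(32, activation="relu"),     # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),     # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),   # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.15, verbose=0)
print(model.evaluate(X, y, verbose=0))                # [loss, accuracy]
```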
