DEEP LEARNING
Module 1
Neural Networks: Introduction to neural networks - Single layer
perceptrons, Multi Layer Perceptrons (MLPs), Representation
Power of MLPs, Activation functions - Sigmoid, Tanh, ReLU,
Softmax. Risk minimization, Loss function, Training MLPs with
backpropagation, Practical issues in neural network training - The
Problem of Overfitting, Vanishing and exploding gradient problems,
Difficulties in convergence, Local and spurious optima,
Computational challenges. Applications of neural networks.
Text Books
1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016.
2. Aggarwal, Charu C., Neural Networks and Deep Learning, Springer, 2018.
3. Buduma, Nikhil and Locascio, Nicholas, Fundamentals of Deep Learning: Designing Next-Generation Machine Intelligence Algorithms (1st ed.), O'Reilly Media, Inc., 2017.
Introduction to neural networks
🌸 Artificial neural networks are popular machine learning
techniques that simulate the mechanism of learning in
biological organisms.
🌸 The human nervous system contains cells, which are referred
to as neurons.
🌸 The foundational unit of the human brain is the neuron.
🌸 The neurons are connected to one another with the use of
axons and dendrites, and the connecting regions between
axons and dendrites are referred to as synapses.
🌸 An artificial neural network computes a function of the inputs
by propagating the computed values from the input neurons to
the output neuron(s) and using the weights as intermediate
parameters.
🌸 Learning occurs by changing the weights connecting the
neurons.
SINGLE COMPUTATIONAL LAYER: THE
PERCEPTRON
The simplest neural network is referred to as the perceptron. This
neural network contains a single input layer and an output node.
The basic architecture of a perceptron consists of the following components:
Input Layer:
☆ Accepts multiple input features (e.g., x1,x2,...,xn).
☆ Each input is associated with a weight (w1, w2,..., wn).
Weighted Sum:
☆ Computes the net input as the weighted sum of the inputs plus a bias: Net Input = w1·x1 + w2·x2 + ... + wn·xn + b.
Activation Function:
☆ Applies a step function or another activation function to decide the output: f(Net
Input).
☆ Typical step function: if Net Input > 0, the output is 1; otherwise, it is 0.
Output:
☆ The final prediction of the perceptron (e.g., class 1 or class 0).
Example: with weights w1 = 0.5, w2 = 0.5 and bias b = -0.7, a step-activated perceptron outputs 1 only for the input (1, 1), since 0.5 + 0.5 - 0.7 > 0 while every other input gives a non-positive net input; it therefore implements an AND gate.
Example: learning an OR gate with the Perceptron rule. Start with initial weights w1 = 1, w2 = 1 and bias b = -1.
Row 1
•Passing (x1=0 and x2=0), we get 0+0–1 = -1. Since Wx+b <= 0, y`=0, which is correct for the OR gate.
Row 2
•Passing (x1=0 and x2=1), we get;
0+1–1 = 0
•From the Perceptron rule, if Wx+b <= 0, then y`=0. Therefore, this row is incorrect.
•So we want values that will make inputs x1=0 and x2=1 give y` a value of 1. If we
change w2 to 2, we
have;
0+2–1 = 1
•From the Perceptron rule, this is correct for both row 1 and row 2.
Row 3
•Passing (x1=1 and x2=0), we get;
1+0–1 = 0
•From the Perceptron rule, if Wx+b <= 0, then y`=0. Therefore, this
row is incorrect.
•Since it is similar to that of row 2, we can just change w1 to 2, we
have;
2+0–1 = 1
Row 4
•Passing (x1=1 and x2=1), we get;
2+2–1 = 3
•Again, from the perceptron rule, this is still valid.
Therefore, we can conclude that the model to achieve an OR gate,
using the Perceptron algorithm is;
2x1+2x2–1
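As a quick check, here is a minimal sketch of the learned perceptron (step activation, w1 = 2, w2 = 2, b = -1) evaluated on the OR truth table; the function name is chosen here only for illustration.

def perceptron_or(x1, x2, w1=2, w2=2, b=-1):
    # Step activation: output 1 if the net input is positive, otherwise 0
    net_input = w1 * x1 + w2 * x2 + b
    return 1 if net_input > 0 else 0

# Verify the learned model 2*x1 + 2*x2 - 1 against the OR truth table
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, perceptron_or(x1, x2))   # expected outputs: 0, 1, 1, 1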
Where does a single-layer perceptron fail?
🌸 A Perceptron is often used to classify data into two parts.
🌸 A Perceptron is also known as a Linear Binary Classifier
🌸 In the single layer network, a set of inputs is directly mapped
to an output by using a generalized variation of a linear
function.
🌸 This type of model performs particularly well when the data is
linearly separable.
Linear Separability
The OR function is linearly separable, but the XOR function is not: no single line can separate the two XOR classes.
XOR Gate
A⊕B = A′B + AB′ = (A+B)(AB)′
Example weighted sum: Y = (X1⋅0.3) + (X2⋅0.7) + (X3⋅0.5) + b
For X1=0.6, X2=0.5, X3=0.8 and b=0.1: Y = (0.6⋅0.3) + (0.5⋅0.7) + (0.8⋅0.5) + 0.1 = 1.03
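This weighted sum can be checked in a few lines of NumPy; the variable names below are illustrative only.

import numpy as np

# Inputs, weights and bias from the worked example above
x = np.array([0.6, 0.5, 0.8])
w = np.array([0.3, 0.7, 0.5])
b = 0.1

net_input = np.dot(w, x) + b   # 0.18 + 0.35 + 0.40 + 0.1
print(net_input)               # 1.03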
Loss Function
🌸 A loss function is a mathematical function that measures how well a
model's predictions match the true outcomes.
🌸 The goal of a loss function is to guide optimization algorithms in
adjusting model parameters to reduce this loss over time.
🌸 Loss functions are crucial because:
☆ Guide Model Training: The loss function is the basis for the optimization process.
☆ Measure Performance: By quantifying the difference between predicted and actual
values, the loss function provides a benchmark for evaluating the model's
performance.
☆ Influence Learning Dynamics: The choice of loss function affects the learning
dynamics
🌸 The choice of the loss function is critical in defining the outputs
in a way that is sensitive to the application at hand.
🌸 For example, least-squares regression with numeric outputs requires
a simple squared loss of the form (y − ŷ)² for a single training
instance with target y and prediction ŷ.
🌸 For probabilistic predictions of categorical data, two types of
loss functions are used, depending on whether the prediction
is binary or whether it is multiway.
🌸 Binary Targets:
☆ Assumption: Observed value y∈{−1,+1}.
☆ Prediction (ŷ):
■ Uses a sigmoid activation function.
■ Outputs ŷ∈(0,1)
☆ Loss Function:
■ L = −log |y/2 − 0.5 + ŷ|, the negative logarithm of the probability that the prediction is correct.
🌸 Categorical Targets:
☆ If ŷ1, . . ., ŷk are the probabilities of the k classes, obtained using the softmax activation function,
☆ and r is the ground-truth class,
☆ then the loss function for a single instance is defined as: L = −log(ŷr)
☆ This type of loss function implements multinomial logistic
regression, and it is referred to as the cross-entropy loss.
🌸 Hinge Loss
☆ Hinge loss (also known as max-margin loss) is a loss function commonly used for classification problems.
☆ For y ∈ {−1, +1} and a real-valued prediction ŷ, the hinge loss for a single training example is defined as: L = max{0, 1 − y·ŷ}
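The loss functions above can be written directly in NumPy. The sketch below is illustrative; the function names are chosen here and are not taken from any particular library.

import numpy as np

def squared_loss(y, y_hat):
    # Squared loss (y - y_hat)^2 for a single numeric target
    return (y - y_hat) ** 2

def binary_log_loss(y, y_hat):
    # y in {-1, +1}, y_hat in (0, 1) from a sigmoid output:
    # negative log of |y/2 - 0.5 + y_hat|, the probability of being correct
    return -np.log(np.abs(y / 2 - 0.5 + y_hat))

def cross_entropy_loss(probs, r):
    # probs: softmax probabilities over k classes, r: index of the ground-truth class
    return -np.log(probs[r])

def hinge_loss(y, y_hat):
    # y in {-1, +1}, y_hat a real-valued prediction
    return max(0.0, 1.0 - y * y_hat)

print(squared_loss(0.5, 0.69), binary_log_loss(+1, 0.8),
      cross_entropy_loss(np.array([0.1, 0.7, 0.2]), 1), hinge_loss(-1, 0.3))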
Step 1: Forward Propagation
In forward propagation, the data flows from the input layer to the output layer, passing through any hidden layers. Each
neuron in the hidden layers processes the input as follows:
Weighted Sum: The neuron computes the weighted sum of the inputs: z = Σ wᵢxᵢ + b
Activation Function: The weighted sum z is passed through an activation function to introduce non-linearity.
Step 2: Loss Calculation
Once the network generates an output, the next step is to calculate the loss using a loss function. In supervised learning,
this compares the predicted output to the actual label.
Step 3: Backpropagation
The goal of training an MLP is to minimize the loss function by adjusting the network’s weights and biases. This is
achieved through backpropagation:
1. Gradient Calculation: The gradients of the loss function with respect to each weight and bias are calculated using the
chain rule of calculus.
2. Error Propagation: The error is propagated back through the network, layer by layer.
3. Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: w = w − η⋅∂L/∂w
● Where:
○ w is the weight.
○ η is the learning rate.
○ ∂L/∂w is the gradient of the loss function with respect to the
weight.
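The sketch below illustrates one such gradient-descent update for a tiny 2-2-1 MLP with sigmoid activations and squared loss; the architecture, data and learning rate are assumptions chosen for illustration, not values from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.1])                        # one training input (assumed)
y = 1.0                                         # its target (assumed)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)   # input -> hidden weights and biases
W2, b2 = rng.normal(size=2), 0.0                # hidden -> output weights and bias
eta = 0.1                                       # learning rate (assumed)

# Step 1: forward propagation
h = sigmoid(W1 @ x + b1)                        # hidden activations
y_hat = sigmoid(W2 @ h + b2)                    # network output

# Step 2: loss calculation, L = (y - y_hat)^2
loss = (y - y_hat) ** 2

# Step 3: backpropagation (chain rule) followed by gradient descent w = w - eta * dL/dw
delta_out = -2.0 * (y - y_hat) * y_hat * (1 - y_hat)    # dL/dz at the output
dW2, db2 = delta_out * h, delta_out
delta_hid = delta_out * W2 * h * (1 - h)                # dL/dz at the hidden layer
dW1, db1 = np.outer(delta_hid, x), delta_hid

W1, b1, W2, b2 = W1 - eta * dW1, b1 - eta * db1, W2 - eta * dW2, b2 - eta * db2
print(loss)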
Back Propagation Algorithm
Derivation of Gradients
Example: Training an MLP with backpropagation
Assume that the neurons have a sigmoid activation function,
perform a forward pass and a backward pass on the network.
Assume that the actual output of y is 0.5 and the learning rate is 1.
Solution:
1. Forward Pass
The net input to a neuron is calculated as z = Σ wᵢxᵢ + b, and the sigmoid activation is y = 1/(1 + e^(−z)).
For H3 and H4: compute the net input from the inputs and the corresponding weights, then apply the sigmoid activation function (for H3 this gives y3 = 0.755).
For O5: compute the net input from y3 and y4 using the weights w35 and w45, then apply the sigmoid activation function to O5 to obtain the predicted output y5 = 0.6903.
2. Backward Pass
Calculate the error
The error is the difference between the actual output and the predicted output:
E = Actual Output − Predicted Output
E = 0.5 − 0.6903 = −0.1903
Calculate the gradient for weights connected to O5.
For O5, the error term is δ5 = y5(1 − y5)·E = 0.6903 · (1 − 0.6903) · (−0.1903) ≈ −0.0407.
Update weights w35 and w45 using the delta rule w_new = w_old + η·δ5·(output of the hidden neuron feeding that weight).
Calculate the gradients for weights connected to H3 and H4.
For hidden layers, the gradients depend on the contribution of the error propagated backward (computed with the weights before the update). For example:
δ3 = y3(1 − y3)·w35·δ5
Substituting values: δ3 = 0.755 · (1 − 0.755) · 0.3 · (−0.0407) ≈ −0.0023
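A minimal sketch of this backward-pass arithmetic, using only the values that appear in the example above (the remaining inputs and weights are given in the original network figure and are not assumed here):

# Values taken from the worked example
y_actual = 0.5     # target output
y5 = 0.6903        # sigmoid output of O5 from the forward pass
y3 = 0.755         # sigmoid output of H3 from the forward pass
w35 = 0.3          # weight from H3 to O5
eta = 1.0          # learning rate

E = y_actual - y5                        # error: -0.1903
delta5 = y5 * (1 - y5) * E               # output delta: approximately -0.0407
delta3 = y3 * (1 - y3) * w35 * delta5    # hidden delta: approximately -0.0023
w35_new = w35 + eta * delta5 * y3        # gradient-descent update for w35

print(E, delta5, delta3, w35_new)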
Execute the backpropagation algorithm for 1 epoch
The Problem of Overfitting
🌸 When a network tries to learn from a small dataset, it tends to gain excessive control over the dataset and to satisfy all the data points exactly. The network ends up memorizing every single data point and fails to capture the general trend of the training dataset.
Difficulties in Convergence
1. Weight Initialization
● Problem:
○ If weights are initialized with very small values, it can lead to vanishing gradients, especially
in deep networks.
○ If weights are initialized with very large values, it can cause exploding gradients, leading to
instability.
● Impact:
○ Slow or no convergence during training.
○ The network may get stuck in suboptimal regions of the loss landscape.
● Solution:
○ Xavier Initialization: Used for activation functions like sigmoid or tanh. It ensures weights are
neither too small nor too large by considering the size of the previous layer.
○ He Initialization: Used for ReLU and its variants, as it adjusts for the non-linear behavior of
the activation function.
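A sketch of Xavier and He initialization in NumPy; the layer sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128   # size of the previous layer / current layer (illustrative)

# Xavier (Glorot) initialization: variance scaled by fan_in and fan_out,
# commonly used with sigmoid or tanh activations
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_xavier = rng.uniform(-limit, limit, size=(fan_out, fan_in))

# He initialization: variance scaled by fan_in only,
# commonly used with ReLU and its variants
W_he = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

print(W_xavier.std(), W_he.std())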
2. Learning Rate Challenges
● Problem:
○ A high learning rate causes the optimizer to overshoot the optimal point, leading to
oscillations or divergence.
○ A low learning rate slows down training, making convergence take a long time.
● Impact:
○ Convergence might become unstable, or training may stagnate.
● Solution:
○ Use Learning Rate Scheduling: Gradually reduce the learning rate during training (e.g., Step
Decay, Exponential Decay, or Cosine Annealing).
○ Use Adaptive Optimizers:
■ Adam: Combines momentum and RMSProp to adjust learning rates dynamically.
■ RMSProp: Scales the learning rate for each parameter based on the magnitude of
recent gradients.
■ Adagrad: Adapts learning rates based on past updates.
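A sketch of two of the schedules mentioned above, step decay and exponential decay; the constants are illustrative.

import math

base_lr = 0.1   # initial learning rate (illustrative)

def step_decay(epoch, drop=0.5, epochs_per_drop=10):
    # Halve the learning rate every 10 epochs
    return base_lr * (drop ** (epoch // epochs_per_drop))

def exponential_decay(epoch, k=0.05):
    # Smoothly shrink the learning rate every epoch
    return base_lr * math.exp(-k * epoch)

for epoch in (0, 10, 20, 30):
    print(epoch, step_decay(epoch), exponential_decay(epoch))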
3. Vanishing and Exploding Gradients
● Problem:
○ Vanishing Gradients: Gradients become exceedingly small during
backpropagation, leading to minimal updates to weights in earlier layers.
○ Exploding Gradients: Gradients grow exponentially, causing numerical instability.
● Impact:
○ Training stagnates (in the case of vanishing gradients).
○ Loss function diverges (in the case of exploding gradients).
● Solution:
○ Use activation functions like ReLU (and its variants) instead of sigmoid or tanh.
○ Implement Gradient Clipping to cap gradients within a defined range.
○ Use Batch Normalization to normalize layer inputs and stabilize gradients.
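A sketch of gradient clipping by global norm, one of the remedies listed above; the gradient values are illustrative.

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients together if their combined norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, -4.0]), np.array([1.0])]   # combined norm is about 5.1
print(clip_by_global_norm(grads, max_norm=1.0))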
Vanishing and exploding gradient problems
In the realm of deep learning, the optimization process plays a crucial
role in training neural networks. Gradient descent, a fundamental
optimization algorithm, can sometimes encounter two common issues:
vanishing gradients and exploding gradients.
Vanishing Gradient
As the back propagation algorithm advances downwards(or backward)
from the output layer towards the input layer, the gradients often get
smaller and smaller and approach zero, eventually leaving the weights of
the initial or lower layers nearly unchanged. As a result, the gradient
descent never converges to the optimum. This is known as the vanishing
gradients problem.
Exploding Gradient
On the contrary, the gradients keep getting larger in some cases as the
backpropagation algorithm progresses. This, in turn, causes large weight
updates and causes the gradient descent to diverge. This is known as the
exploding gradient problem.
🌸 For example, a sigmoid activation often encourages the vanishing gradient
problem, because its derivative is less than 0.25 at all values of its
argument, and is extremely small at saturation.
🌸 A ReLU activation unit is known to be less likely to create a vanishing
gradient problem because its derivative is always 1 for positive values of the
argument. The use of adaptive learning rates and conjugate gradient
methods can also help in many cases.
🌸 A recent technique called batch normalization is helpful in addressing some
of these issues.
🌸 Batch Norm is a normalization technique applied between the layers of a
neural network, rather than to the raw input data.
🌸 It is done along mini-batches instead of the full data set. It serves to speed
up training.
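A sketch of what Batch Norm computes for one mini-batch of layer activations; gamma and beta are the learnable scale and shift parameters, shown here with illustrative default values.

import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: mini-batch of activations with shape (batch_size, features)
    mean = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                       # per-feature variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)   # normalize to zero mean, unit variance
    return gamma * x_hat + beta               # learnable scale and shift

batch = np.array([[1.0, 200.0], [2.0, 220.0], [3.0, 180.0]])
print(batch_norm(batch))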
Local Optima
In optimization problems, an optimum is a best possible solution
according to a given criterion. Local optima are solutions that are
better than other solutions in the immediate vicinity but are not
necessarily the best overall solution, which is referred to as the
global optimum.
🌸 Definition: A local optimum is a point in the loss function where the gradient is zero, but the
value of the loss function is higher than the global minimum.
Key Features:
Example:
● Consider a quadratic-like loss surface with multiple dips. A local minimum is one of these
dips, while the global minimum is the deepest point.
Techniques to Avoid Local Optima:
Spurious Optima
Key Features:
● These optima are not necessarily local minima but represent suboptimal solutions in
terms of generalization.
● Commonly occur when the neural network overfits the training data.
● Spurious optima are more prevalent in larger, overparameterized networks.
Example:
● A spurious optimum occurs when a neural network perfectly classifies the training data
but performs poorly on test data because it has overfit irrelevant patterns.
Techniques to Avoid Spurious Optima:
1. Regularization:
○ Add constraints like L1 or L2 regularization to penalize overly complex solutions.
○ Use dropout to prevent overfitting.
2. Batch Normalization: Helps stabilize the learning process and avoids overly sharp
minima.
3. Cross-validation: Regularly evaluate the model on a validation set to ensure it
generalizes well.
4. Early Stopping: Stop training as soon as the performance on the validation set stops
improving.
5. Loss Function Design: Design smooth, well-behaved loss functions that reduce the
likelihood of spurious optima.
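A sketch of two of these techniques, L2 regularization applied to a gradient and a simple early-stopping check; the training-loop values are placeholders.

import numpy as np

def l2_regularized_gradient(grad, weights, lam=1e-3):
    # Gradient of loss + (lam/2) * ||w||^2, penalizing overly complex solutions
    return grad + lam * weights

class EarlyStopping:
    # Stop training when the validation loss has not improved for `patience` epochs
    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
for epoch, val_loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]):
    if stopper.should_stop(val_loss):
        print("early stop at epoch", epoch)
        break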
Comparing Local and Spurious Optima
Key Challenges and Solutions
Computational Challenges
A significant challenge in neural network design is the running
time required to train the network.
In recent years, advances in hardware technology such as Graphics
Processor Units (GPUs) have helped to a significant extent. GPUs
are specialized hardware processors that can significantly speed up
the kinds of operations commonly used in neural networks.
In this sense, some algorithmic frameworks like Torch are particularly
convenient because they have GPU support tightly integrated into the
platform.
One convenient property of neural network models is that most of
the computational heavy lifting is front-loaded during the training phase.
The prediction phase is often computationally efficient, because it
requires a small number of operations (depending on the number of
layers).
This is important because the prediction phase is often far more
time-critical compared to the training phase.
APPLICATIONS OF NEURAL NETWORKS
Neural networks are applied across key sectors including finance, healthcare, and automotive. They can be used for
image recognition, character recognition, stock market prediction, etc.
1. Facial Recognition
Facial recognition systems serve as robust systems of surveillance. A recognition system matches a human face against the digital images in its database.
They are used in offices for selective entry: the system authenticates a human face and matches it against the list of IDs present in its database.
Convolutional Neural Networks (CNNs) are used for facial recognition and image processing.
2. Stock Market Prediction
Investments are subject to market risks. To make successful stock predictions in real time, a Multilayer Perceptron (MLP), a class of feedforward artificial neural network, is employed.
An MLP comprises multiple layers of nodes, each of which is fully connected to the succeeding layer. A stock's past performance, annual returns, and non-profit ratios are considered when building the MLP model.
3. Social Media
Artificial Neural Networks are used to study the behaviours of social media users. Data shared every day via virtual conversations is tracked and analyzed for competitive analysis.
🌸 Neural networks model the behaviours of social media users. After analyzing individuals' behaviour on social media networks, the data can be linked to people's spending habits. Multilayer Perceptron ANNs are used to mine data from social media applications.
4. Aerospace
🌸 Aerospace Engineering is an expansive term that covers developments in
spacecraft and aircraft.
🌸 Fault diagnosis, high performance auto piloting, securing the aircraft
control systems, and modeling key dynamic simulations are some of the
key areas that neural networks have taken over.
🌸 Time-delay neural networks can be employed for modelling non-linear,
time-varying dynamic systems.
🌸 Time Delay Neural Networks are used for position-independent feature
recognition. Algorithms built on time-delay neural networks can thus
recognize patterns.
5. Defence
🌸 Neural Networks also shape the defence operations of technologically
advanced countries. Neural networks are used in logistics, armed attack
analysis, and for object location. They are also used in air patrols,
maritime patrol, and for controlling automated drones.
🌸 Convolutional Neural Networks (CNNs) are employed for determining the
presence of underwater mines, which are a serious threat to naval vessels.
6. Healthcare
🌸 Modern day individuals are leveraging the advantages of technology in the
healthcare sector. Convolutional Neural Networks are actively employed in the
healthcare industry for X-ray detection, CT Scan and ultrasound.
🌸 As CNNs are used in image processing, the medical imaging data retrieved from the
aforementioned tests is analyzed and assessed using neural network models.
Recurrent Neural Networks (RNNs) are also employed for the development of
voice recognition systems.
7. Signature Verification and Handwriting Analysis
🌸 Signature verification, as the self-explanatory term suggests, is used for verifying an
individual's signature. Banks and other financial institutions use signature
verification to cross-check the identity of an individual.
🌸 Artificial Neural Networks are used for verifying signatures. ANNs are
trained to recognize the difference between real and forged signatures, and
can be used for the verification of both offline and online signatures.
8. Weather Forecasting
🌸 Weather forecasting is primarily undertaken to anticipate upcoming weather
conditions in advance. In the modern era, weather forecasts are even
used to predict the possibility of natural disasters.
🌸 Multilayer Perceptron (MLP), Convolutional Neural Network (CNN) and
Recurrent Neural Networks (RNN) are used for weather forecasting.
Traditional ANN multilayer models can also be used to predict climatic
conditions 15 days in advance. A combination of different types of neural
network architecture can be used to predict air temperatures.