Module I

The document provides an introduction to neural networks, focusing on single-layer perceptrons and multi-layer perceptrons (MLPs), including their architecture, activation functions, and training methods. It discusses practical issues in training neural networks, such as overfitting and convergence challenges, and highlights the applications of neural networks. Additionally, it covers various activation functions like Sigmoid, Tanh, and ReLU, emphasizing their importance in enabling neural networks to learn complex patterns.

Uploaded by

devanand272003

CST414

DEEP LEARNING
Module 1
Neural Networks: Introduction to neural networks - Single layer
perceptrons, Multi Layer Perceptrons (MLPs), Representation
Power of MLPs, Activation functions - Sigmoid, Tanh, ReLU,
Softmax. Risk minimization, Loss function, Training MLPs with
backpropagation, Practical issues in neural network training - The
Problem of Overfitting, Vanishing and exploding gradient problems,
Difficulties in convergence, Local and spurious Optima,
Computational Challenges. Applications of neural networks.
Text Books
1. Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT
Press, 2016.
2. Aggarwal, Charu C. Neural Networks and Deep Learning.
3. Buduma, N., and Locascio, N. Fundamentals of Deep Learning: Designing
Next-Generation Machine Intelligence Algorithms (1st ed.).
O'Reilly Media, Inc., 2017.
Introduction to neural networks
🌸 Artificial neural networks are popular machine learning
techniques that simulate the mechanism of learning in
biological organisms.
🌸 The human nervous system contains cells, which are referred
to as neurons.
🌸 The foundational unit of the human brain is the neuron.
🌸 The neurons are connected to one another with the use of
axons and dendrites, and the connecting regions between
axons and dendrites are referred to as synapses
🌸 An artificial neural network computes a function of the inputs
by propagating the computed values from the input neurons to
the output neuron(s) and using the weights as intermediate
parameters.
🌸 Learning occurs by changing the weights connecting the
neurons.
SINGLE COMPUTATIONAL LAYER: THE
PERCEPTRON
The simplest neural network is referred to as the perceptron. This
neural network contains a single input layer and an output node.
The basic architecture of a perceptron consists of the following components:

Input Layer:
☆ Accepts multiple input features (e.g., x1,x2,...,xn).
☆ Each input is associated with a weight (w1, w2,..., wn).
Weighted Sum:

☆ Calculates the weighted sum of the inputs:
Net Input = (x1 * w1) + (x2 * w2) + ... + (xn * wn) + b

Where b is the bias.

Activation Function:

☆ Applies a step function or another activation function to decide the output: f(Net
Input).
☆ Typical step function: If Net Input>0 output is 1; otherwise, it's 0.

Output:

☆ The final decision or prediction from the perceptron.


The Perceptron Algorithm
Example:
🌸 Imagine a perceptron in your brain deciding whether to attend a
concert.
🌸 Questions it considers:
☆ Is the artist good?
☆ Is the weather good?
🌸 Weights:
☆ Assign a weight for each factor based on importance.
■ Weight for the artist's quality: 0.7.
■ Weight for weather conditions: 0.3.

The perceptron uses these weights and inputs to calculate a decision.


Perceptron Example: Deciding Whether to Attend a Concert

Criteria             Input          Weight
Artist is Good       x1 = 0 or 1    w1 = 0.7
Weather is Good      x2 = 0 or 1    w2 = 0.6
Friend will Come     x3 = 0 or 1    w3 = 0.5
Food is Served       x4 = 0 or 1    w4 = 0.3
Alcohol is Served    x5 = 0 or 1    w5 = 0.4


Inputs (x1, x2, x3, x4, x5) = [1, 0, 1, 0, 1]
Weights (w1, w2, w3, w4, w5) = [0.7, 0.6, 0.5, 0.3, 0.4]

1. Set a threshold value:
Threshold = 1.5
2. Multiply each input by its weight:
x1 * w1 = 1 * 0.7 = 0.7
x2 * w2 = 0 * 0.6 = 0
x3 * w3 = 1 * 0.5 = 0.5
x4 * w4 = 0 * 0.3 = 0
x5 * w5 = 1 * 0.4 = 0.4
3. Sum all the results:
0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)
4. Activate the Output:
Return true if the sum > 1.5. Here 1.6 > 1.5, so the output is true ("Yes I will go to the Concert").
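The concert decision can be reproduced in a few lines of Python. This is only a sketch of the arithmetic in the example; the function name perceptron_decision is a hypothetical helper, not part of the original slides.

```python
def perceptron_decision(inputs, weights, threshold):
    """Return (weighted_sum, decision) for a simple threshold perceptron."""
    # Multiply each input by its weight and add up the products.
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    # The perceptron fires only when the sum exceeds the threshold.
    return weighted_sum, weighted_sum > threshold

inputs = [1, 0, 1, 0, 1]               # artist, weather, friend, food, alcohol
weights = [0.7, 0.6, 0.5, 0.3, 0.4]
total, go = perceptron_decision(inputs, weights, threshold=1.5)
print(round(total, 2), go)
```

Changing an input (for example, setting the "friend will come" input to 0) lowers the weighted sum below the threshold and flips the decision.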
AND Gate Solution Using Single-Layer
Perceptron
Inputs (x1, x2):
🌸 Possible combinations: (0, 0), (0, 1), (1, 0), (1, 1)
Desired Output (y):
🌸 AND gate truth table

x1    x2    y
0     0     0
0     1     0
1     0     0
1     1     1

Assign weights and bias:

● w1 = 0.5
● w2 = 0.5
● b = -0.7

Activation Function:

○ If Net Input >= 0: Output = 1
○ If Net Input < 0: Output = 0
Net Input Calculation
Formula: Net Input = (x1 * w1) + (x2 * w2) + b

Compute for each input:

1. (0, 0): (0 * 0.5) + (0 * 0.5) - 0.7 = -0.7


2. (0, 1): (0 * 0.5) + (1 * 0.5) - 0.7 = -0.2
3. (1, 0): (1 * 0.5) + (0 * 0.5) - 0.7 = -0.2
4. (1, 1): (1 * 0.5) + (1 * 0.5) - 0.7 = 0.3
Activation Function and Results
Apply activation function:
☆ If Net Input >= 0: Output = 1
☆ If Net Input < 0: Output = 0
Results:
🌸 (0, 0): Net Input = -0.7 → Output = 0
🌸 (0, 1): Net Input = -0.2 → Output = 0
🌸 (1, 0): Net Input = -0.2 → Output = 0
🌸 (1, 1): Net Input = 0.3 → Output = 1
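As a quick check, the AND gate above can be sketched in Python with the same weights, bias, and activation rule. The helper names step and perceptron are illustrative choices.

```python
def step(net_input):
    # Activation rule from the example: 1 when the net input is >= 0, else 0.
    return 1 if net_input >= 0 else 0

def perceptron(x1, x2, w1, w2, b):
    # Net Input = (x1 * w1) + (x2 * w2) + b, followed by the step activation.
    return step(x1 * w1 + x2 * w2 + b)

rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_outputs = [perceptron(x1, x2, 0.5, 0.5, -0.7) for x1, x2 in rows]
print(and_outputs)
```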
OR Gate Solution Using Single-Layer Perceptron
Row 1 (x1=0, x2=0)
• Starting from w1x1 + w2x2 + b and initializing w1 and w2 to 1 and b to -1, we get:
x1(1) + x2(1) - 1
• Passing the first row of the OR logic table (x1=0, x2=0), we get:
0 + 0 - 1 = -1
• From the Perceptron rule, if Wx + b <= 0, then ŷ = 0. The desired OR output is 0, so this row is
correct.
Row 2 (x1=0, x2=1)
• Passing (x1=0, x2=1), we get:
0 + 1 - 1 = 0
• From the Perceptron rule, if Wx + b <= 0, then ŷ = 0. The desired OR output is 1, so this row is incorrect.
• We want values that make the inputs x1=0 and x2=1 give ŷ a value of 1. If we change w2 to 2, we have:
0 + 2 - 1 = 1
• From the Perceptron rule, this is correct for both rows 1 and 2.
Row 3 (x1=1, x2=0)
• Passing (x1=1, x2=0), we get:
1 + 0 - 1 = 0
• From the Perceptron rule, if Wx + b <= 0, then ŷ = 0. The desired OR output is 1, so this
row is incorrect.
• Since this case mirrors row 2, we can change w1 to 2, which gives:
2 + 0 - 1 = 1
Row 4 (x1=1, x2=1)
• Passing (x1=1, x2=1), we get:
2 + 2 - 1 = 3
• Again, from the Perceptron rule, this is still valid.
Therefore, we can conclude that the model that implements an OR gate
using the Perceptron algorithm is:
2x1 + 2x2 - 1
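The row-by-row reasoning can be checked mechanically. The sketch below, with hypothetical helper names, lists which rows of the OR table are misclassified by the initial weights (1, 1, -1) and by the final weights (2, 2, -1), using the rule that the output is 0 whenever the net input is <= 0.

```python
def predict(x1, x2, w1, w2, b):
    # Perceptron rule from the walkthrough: output 0 when w1*x1 + w2*x2 + b <= 0.
    return 0 if w1 * x1 + w2 * x2 + b <= 0 else 1

# OR truth table as ((x1, x2), desired output) pairs.
OR_TABLE = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

def misclassified_rows(w1, w2, b):
    return [inputs for inputs, target in OR_TABLE
            if predict(inputs[0], inputs[1], w1, w2, b) != target]

initial_errors = misclassified_rows(1, 1, -1)  # starting weights
final_errors = misclassified_rows(2, 2, -1)    # weights after the two corrections
print(initial_errors, final_errors)
```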
Where does the single-layer perceptron fail?
🌸 A Perceptron is often used to classify data into two parts.
🌸 A Perceptron is also known as a Linear Binary Classifier
🌸 In the single layer network, a set of inputs is directly mapped
to an output by using a generalized variation of a linear
function.
🌸 This type of model performs particularly well when the data is
linearly separable. It fails when no single line (or hyperplane) can
separate the two classes, as is the case for XOR.
Linear Separability

[Figure: linear separability of OR vs. XOR]
XOR Gate
A ⊕ B = A'B + AB' = (A + B)(AB)'

We can say that the XOR gate consists of an OR gate (x1 + x2), a NAND gate, and
an AND gate.
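The decomposition (A + B)(AB)' suggests building XOR from three single-layer perceptrons arranged in two layers. The sketch below reuses the OR and AND weights from the earlier examples; the NAND weights (-0.5, -0.5, 0.7) are an assumption chosen for illustration, since the slides do not give them.

```python
def step(net_input):
    # Step activation: 1 when the net input is >= 0, else 0.
    return 1 if net_input >= 0 else 0

def or_gate(x1, x2):
    return step(2 * x1 + 2 * x2 - 1)           # weights from the OR example

def and_gate(x1, x2):
    return step(0.5 * x1 + 0.5 * x2 - 0.7)     # weights from the AND example

def nand_gate(x1, x2):
    return step(-0.5 * x1 - 0.5 * x2 + 0.7)    # assumed weights (negated AND)

def xor_gate(x1, x2):
    # First layer: OR and NAND. Second layer: AND of the two results.
    return and_gate(or_gate(x1, x2), nand_gate(x1, x2))

xor_outputs = [xor_gate(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)
```

No single perceptron can produce this output column, but two layers of perceptrons can, which motivates the multilayer networks discussed next.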
Multilayer Perceptron(MLP)
🌸 These networks contain more than one computational layer.
🌸 The input layer transmits data to the output layer through one or more hidden layers.
🌸 Hidden Layers: In multilayer neural networks, there are
additional intermediate layers between the input and output
layers.
☆ These intermediate layers are known as hidden layers.
☆ The computations performed in hidden layers are not visible to
the user.
🌸 The model is associated with a directed acyclic graph describing
how the functions are composed together.
☆ For example, we might have three functions f^(1), f^(2), and f^(3)
connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))).
🌸 These chain structures are the most commonly used structures of
neural networks.
🌸 In this case, f^(1) is called the first layer of the network, f^(2) is
called the second layer, and so on.
🌸 Multilayer neural networks are referred to as feed-forward networks
because successive layers feed into one another in the forward
direction from input to output.
🌸 All the learning of the weights is done automatically by the
backpropagation algorithm, which uses dynamic programming to
work out the complicated parameter update steps of the
underlying computational graph.
🌸 The length of the chain gives the depth of the model. It is from this
terminology that the name "deep learning" arises.
🌸 The final layer of a feedforward network is called the output
layer.
🌸 A network with a single nonlinear hidden layer and a single linear
output layer can compute almost any "reasonable" function.
🌸 As a result, neural networks are often referred to as universal
function approximators.
🌸 The weights are learned by penalizing the differences between the
observed and predicted output for the i-th training instance using a
loss function Li.
🌸 The weights Wi of the neural network are then updated using gradient
descent to minimize each loss Li.
🌸 Training instances are fed to the neural network one by one to make
these updates.
Example: Learning XOR
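One well-known hand-constructed solution, given in Goodfellow et al. (Deep Learning, Section 6.1), uses a single hidden layer of two ReLU units: f(x) = w · max(0, Wᵀx + c) + b. The sketch below evaluates that network in plain Python. The parameter values are the ones from that textbook example, not values learned here.

```python
def relu(v):
    return max(0.0, v)

# Parameters of the hand-constructed XOR network (textbook example values).
W = [[1.0, 1.0],
     [1.0, 1.0]]      # input-to-hidden weights, W[i][j]: input i -> hidden unit j
c = [0.0, -1.0]       # hidden biases
w = [1.0, -2.0]       # hidden-to-output weights
b = 0.0               # output bias

def xor_mlp(x):
    # Hidden layer: h_j = relu(sum_i W[i][j] * x_i + c_j)
    hidden = [relu(sum(W[i][j] * x[i] for i in range(2)) + c[j]) for j in range(2)]
    # Linear output layer: y = sum_j w_j * h_j + b
    return sum(w[j] * hidden[j] for j in range(2)) + b

xor_values = [xor_mlp(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_values)
```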
Representation Power of MLPs
🌸 Representation power is related to the ability of a neural network to
assign proper labels to a particular instance and create well-defined,
accurate decision boundaries for that class.
🌸 A neural network is a computational graph that performs
compositions of simpler functions to provide a more complex
function.
🌸 The power of deep learning arises from the fact that repeated
composition of multiple nonlinear functions has significant
expressive power.
🌸 The repeated composition of certain types of functions increases the
representation power of the network, and therefore reduces the
parameter space required for learning.
Activation functions
🌸 Activation functions enable the network to learn complex patterns
and map input data to outputs in a highly flexible way.
🌸 By introducing non-linearity, the network can represent a wide
range of functions, making it more powerful for solving complex tasks.
🌸 An activation function Φ(v) in the output layer can control the
nature of the output.
🌸 In multilayer neural networks, activation functions bring
non-linearity into hidden layers, which increases the complexity of
the model.
🌸 A neural network with any number of layers but only linear
activations can be shown to be equivalent to a single-layer network.
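The last point can be verified numerically: two stacked linear layers W2(W1 x + b1) + b2 equal one linear layer with weights W2·W1 and bias W2·b1 + b2. The sketch below uses small, arbitrary illustrative matrices and plain-Python helpers (matvec, matmul, add), all of which are assumptions made for the demonstration.

```python
def matvec(W, x):
    # Matrix-vector product for a matrix stored as a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    # Matrix-matrix product A @ B.
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative parameters: 2 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
b1 = [0.1, -0.2, 0.3]
W2 = [[1.0, -1.0, 2.0]]
b2 = [0.5]

def two_linear_layers(x):
    # Identity (linear) activation in the hidden layer.
    return add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single layer.
W_single = matmul(W2, W1)
b_single = add(matvec(W2, b1), b2)

def single_layer(x):
    return add(matvec(W_single, x), b_single)

print(two_linear_layers([1.0, 2.0]), single_layer([1.0, 2.0]))
```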
🌸 Neural networks consist of neurons that operate using weights,
biases, and activation functions.
🌸 In the learning process, these weights and biases are updated
based on the error produced at the output—a process known
as backpropagation. Activation functions enable
backpropagation by providing gradients that are essential for
updating the weights and biases.
🌸 The choice of activation function is a critical part of neural
network design.
🌸 Different types of nonlinear functions such as the sign, sigmoid,
or hyperbolic tangent are commonly used.
🌸 We use the notation Φ to denote the activation function.
🌸 A single-layer network with column vector W of weights and
input (row) vector X would have a prediction of the following
form:
ŷ = Φ(W · X)
🌸 A neuron really computes two functions within the node, which
is why we have incorporated the summation symbol Σ as well
as the activation symbol Φ within a neuron.
🌸 The break-up of the neuron computations into two separate
values is shown in the figure.
Sigmoid Function
🌸 Mathematically defined as σ(x) = 1 / (1 + e^(−x)).
🌸 The output ranges between 0 and 1, hence it is
useful for binary classification and
probability-based outputs.
🌸 The function exhibits a steep gradient when
x values are between -2 and 2. This
sensitivity means that small changes in
input x can cause significant changes in
output y, which is critical during the training
process.
Advantages of sigmoid
🌸 The main reason to use the sigmoid function is that its output
lies in the range (0, 1). It is therefore especially suited to
models that must predict a probability as an
output, since a probability always lies between
0 and 1.
🌸 In logistic regression, a standard model for binary classification, the
sigmoid function is used to predict the probability of a binary
variable.
Disadvantages of the Sigmoid Activation Function

🌸 Problem of vanishing gradient.


🌸 It saturates and kills gradients.
🌸 It is computationally expensive because of the exponential
term in it.
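A short sketch makes the saturation behaviour concrete. It uses the standard definition σ(x) = 1 / (1 + e^(−x)) and the identity σ'(x) = σ(x)(1 − σ(x)); the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at x = 0 and shrinks toward zero as |x| grows (saturation).
for x in (-10, -2, 0, 2, 10):
    print(x, round(sigmoid(x), 4), round(sigmoid_derivative(x), 6))
```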
Tanh Activation Function
🌸 The tanh function, or hyperbolic tangent function, is a rescaled and
shifted version of the sigmoid, so its outputs stretch across both sides of zero on the y-axis.
It is defined as:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
🌸 Alternatively, it can be expressed using the sigmoid function:
tanh(x) = 2 · sigmoid(2x) − 1
🌸 Value Range: Outputs values from -1 to +1.
🌸 Non-linear: Enables modeling of complex data patterns.
🌸 Use in Hidden Layers: Commonly used in hidden layers
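The relationship between tanh and the sigmoid, tanh(x) = 2·sigmoid(2x) − 1, is a standard identity and can be checked numerically. The function names below are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_exponentials(x):
    # Direct definition: (e^x - e^-x) / (e^x + e^-x)
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def tanh_from_sigmoid(x):
    # Rescaled and shifted sigmoid: 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, tanh_from_exponentials(x), tanh_from_sigmoid(x))
```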
ReLU (Rectified Linear Unit) Function
🌸 Defined by Φ(v) = max(0, v). This means that if the input
v is positive, ReLU returns v; if the input is
negative, it returns 0.
🌸 Value Range: [0,∞), meaning the function only outputs
non-negative values.
🌸 Nature: It is a non-linear activation function, allowing
neural networks to learn complex patterns and making
backpropagation more efficient.
🌸 Advantage over other activations: ReLU is less
computationally expensive than tanh and sigmoid
because it involves simpler mathematical operations. At
any one time only some of the neurons are activated, which makes the
network sparse and therefore efficient to
compute.
Advantages of ReLU
🌸 Models trained with ReLU typically converge quickly and thus
take much less time than models trained with
the sigmoid function.
🌸 It doesn't allow for the activation of all of the neurons at
the same time: if a neuron's input is negative, ReLU converts
it to zero and the neuron is not activated.
🌸 This means that only a few neurons are activated, making
the network easy for computation.
Disadvantages of ReLU
🌸 The drawback of ReLU units is that they cannot learn on
examples for which their activation is zero, because the gradient is also zero there.
🌸 This usually happens if you initialize the entire neural
network with zeros and place ReLU on the hidden
layers.
🌸 Despite this drawback, ReLU is faster to compute than the sigmoid function.
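The sparsity and the zero-gradient drawback can both be seen in a small sketch. The pre-activation values are arbitrary, and treating the derivative at exactly 0 as 0 is a common convention assumed here.

```python
def relu(v):
    return max(0.0, v)

def relu_derivative(v):
    # Gradient is 1 for positive inputs and 0 otherwise (0 at v == 0 by convention).
    return 1.0 if v > 0 else 0.0

pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(v) for v in pre_activations]
gradients = [relu_derivative(v) for v in pre_activations]

# Fraction of units that are active (non-zero output): the rest pass no gradient.
active_fraction = sum(1 for o in outputs if o > 0) / len(outputs)
print(outputs, gradients, active_fraction)
```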
Softmax Function
🌸 Designed to handle multi-class classification
problems.
🌸 It transforms raw output scores from a neural
network into probabilities.
🌸 It works by squashing the output values of each
class into the range of 0 to 1, while ensuring that
the sum of all probabilities equals 1.
🌸 Softmax is a non-linear activation function.
🌸 The Softmax function ensures that each class is
assigned a probability, helping to identify which
class the input belongs to.
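For raw scores v1, ..., vk, the standard softmax is softmax(v)_i = e^(v_i) / Σ_j e^(v_j). A minimal sketch follows; subtracting the maximum score before exponentiating is a common numerical-stability trick assumed here, and it does not change the result. The example scores are arbitrary.

```python
import math

def softmax(scores):
    # Shift by the maximum score so that exp() never overflows.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # raw scores for three hypothetical classes
predicted_class = probs.index(max(probs))
print([round(p, 3) for p in probs], predicted_class)
```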
Step 1: Compute the Weighted Sum

XY = (X1⋅0.3) + (X2⋅0.7) + (X3⋅0.5) + b

Substitute the given values (X1 = 0.6, X2 = 0.5, X3 = 0.8, b = 0.1):

XY = (0.6⋅0.3) + (0.5⋅0.7) + (0.8⋅0.5) + 0.1 = 0.18 + 0.35 + 0.40 + 0.1 = 1.03
Loss Function
🌸 A loss function is a mathematical function that measures how well a
model's predictions match the true outcomes.
🌸 The goal of a loss function is to guide optimization algorithms in
adjusting model parameters to reduce this loss over time.
🌸 Loss functions are crucial because:
☆ Guide Model Training: The loss function is the basis for the optimization process.
☆ Measure Performance: By quantifying the difference between predicted and actual
values, the loss function provides a benchmark for evaluating the model's
performance.
☆ Influence Learning Dynamics: The choice of loss function affects the learning
dynamics
🌸 The choice of the loss function is critical in defining the outputs
in a way that is sensitive to the application at hand.
🌸 For example, least-squares regression with numeric outputs
requires a simple squared loss of the form (y − ŷ)^2 for a single training
instance with target y and prediction ŷ.
🌸 For probabilistic predictions of categorical data, two types of
loss functions are used, depending on whether the prediction
is binary or whether it is multiway.
🌸 Binary Targets:
☆ Assumption: Observed value y∈{−1,+1}.
☆ Prediction (ŷ):
■ Uses a sigmoid activation function.
■ Outputs ŷ∈(0,1), the predicted probability that y is +1.
☆ Loss Function:
■ L = −log∣y/2 − 0.5 + ŷ∣, the negative logarithm of the probability that the
prediction is correct.
🌸 Categorical Targets:
☆ ŷ1 . . . ŷk are the probabilities of the k classes,
obtained using the softmax activation function.

☆ Ground-truth Class: r.
☆ loss function for a single instance is defined as: L = −log(ŷr )
☆ This type of loss function implements multinomial logistic
regression, and it is referred to as the cross-entropy loss.
🌸 Hinge Loss
☆ Hinge loss (also known as max-margin loss) is a loss function
commonly used for classification problems.
☆ The hinge loss function for a single training example is defined
as:
L = max(0, 1 − y · f(x))
☆ y is the true label of the example, taking values in {−1,+1} for
binary classification.
☆ f(x) is the raw output of the model (before applying an
activation like sigmoid), which is often referred to as the decision
value.
🌸 Advantages:
☆ Encourages larger margins between classes, which can lead to
better generalization.
☆ Works well with linear classifiers like SVM.
🌸 Disadvantages:
☆ For deep neural networks, cross-entropy loss is typically
preferred as it provides a smoother gradient for optimization,
especially with softmax outputs.
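The four losses discussed above can be written as short Python functions. This is a sketch for single training instances; the function names are illustrative, and the hinge loss uses the standard form max(0, 1 − y·f(x)).

```python
import math

def squared_loss(y, y_hat):
    # Least-squares regression loss for one instance.
    return (y - y_hat) ** 2

def binary_log_loss(y, y_hat):
    # y in {-1, +1}; y_hat in (0, 1) is the predicted probability of class +1.
    # |y/2 - 0.5 + y_hat| equals y_hat when y = +1 and 1 - y_hat when y = -1.
    return -math.log(abs(y / 2 - 0.5 + y_hat))

def cross_entropy_loss(probs, true_class):
    # probs: softmax probabilities; true_class: index r of the ground-truth class.
    return -math.log(probs[true_class])

def hinge_loss(y, score):
    # y in {-1, +1}; score is the raw decision value f(x).
    return max(0.0, 1.0 - y * score)

print(squared_loss(3.0, 2.5))
print(binary_log_loss(+1, 0.9), binary_log_loss(-1, 0.9))
print(cross_entropy_loss([0.7, 0.2, 0.1], 0))
print(hinge_loss(+1, 2.0), hinge_loss(-1, 0.5))
```

A confident correct prediction gives a small loss in every case, while a confident wrong prediction (for example y = −1 with ŷ = 0.9) is penalized heavily.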
Risk minimization
🌸 Risk minimization in neural networks is a process that aims to
improve the performance of models by minimizing the
expected risk in learning tasks.
🌸 The goal of a machine learning algorithm is to reduce the
expected generalization error given by the equation
J*(θ) = E_{(x,y)∼p_data} [ L(f(x; θ), y) ]
🌸 This quantity is known as the risk.
🌸 Here the expectation is taken over the true underlying distribution
p_data.
🌸 The simplest way to convert a machine learning problem back into an optimization
problem is to minimize the expected loss on the training set.
🌸 This means replacing the true distribution p(x, y) with the
empirical distribution p̂(x, y) defined by the training set.
🌸 We now minimize the empirical risk
E_{(x,y)∼p̂_data} [ L(f(x; θ), y) ] = (1/m) Σ_{i=1..m} L(f(x^(i); θ), y^(i))
🌸 where m is the number of training examples.
🌸 The training process based on minimizing this average training
error is known as empirical risk minimization.
🌸 Empirical risk minimization is prone to overfitting.
🌸 The most effective modern optimization algorithms are based
on gradient descent, but many useful loss functions, such as the 0-1 loss, have no useful derivatives.
🌸 For these reasons, in the context of deep learning we rarely use empirical risk
minimization directly.
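Empirical risk is simply the average loss over the training set. The sketch below assumes a hypothetical one-parameter model f(x; θ) = θ·x, a squared loss, and a made-up three-point training set, all chosen only to illustrate the averaging.

```python
def model(x, theta):
    # Hypothetical one-parameter model f(x; theta) = theta * x.
    return theta * x

def loss(y_hat, y):
    # Squared loss for a single example.
    return (y - y_hat) ** 2

def empirical_risk(theta, training_set):
    # Average loss over the m training examples.
    m = len(training_set)
    return sum(loss(model(x, theta), y) for x, y in training_set) / m

training_set = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]  # made-up (x, y) pairs
risk_at_1 = empirical_risk(1.0, training_set)
risk_at_2 = empirical_risk(2.0, training_set)
print(risk_at_1, risk_at_2)
```

Minimizing this quantity over θ is what training does; it only approximates the true risk, which is why a very low empirical risk does not guarantee good generalization.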
TRAINING MLPS WITH BACKPROPAGATION
🌸 In the single-layer neural network, the training process is relatively
straightforward because the error (or loss function) can be computed as a
direct function of the weights.
🌸 In the case of multi-layer networks, the problem is that the loss is a
complicated composition function of the weights in earlier layers.
🌸 The gradient of a composition function is computed using the
backpropagation algorithm.
🌸 The backpropagation algorithm leverages the chain rule of differential
calculus, which computes the error gradients in terms of summations of
local-gradient products over the various paths from a node to the output.
🌸 The backpropagation algorithm is a direct application of dynamic programming.
🌸 It contains two main phases, referred to as the forward and backward phases,
respectively.
🌸 The forward phase is required to compute the output values and the local
derivatives at various nodes.
🌸 The backward phase is required to accumulate the products of these local values
over all paths from the node to the output.
🌸 Forward phase:
☆ In this phase, the inputs for a training instance are fed into the neural network.
☆ This results in a forward cascade of computations across the layers, using the
current set of weights.
☆ The final predicted output can be compared to that of the training instance and the
derivative of the loss function with respect to the output is computed.
🌸 Backward phase:
☆ The main goal of the backward phase is to learn the gradient of
the loss function with respect to the different weights by using
the chain rule of differential calculus.
☆ These gradients are used to update the weights. Since these
gradients are learned in the backward direction, starting from
the output node, this learning process is referred to as the backward phase.
🌸 A single cycle through all the training points is referred to as an
epoch.
Step 1: Forward Propagation

In forward propagation, the data flows from the input layer to the output layer, passing through any hidden layers. Each
neuron in the hidden layers processes the input as follows:

Weighted Sum: The neuron computes the weighted sum of the inputs: z = Σ_i (w_i * x_i) + b

Activation Function: The weighted sum z is passed through an activation function to introduce non-linearity.

Step 2: Loss Function

Once the network generates an output, the next step is to calculate the loss using a loss function. In supervised learning,
this compares the predicted output to the actual label.

Step 3: Backpropagation

The goal of training an MLP is to minimize the loss function by adjusting the network’s weights and biases. This is
achieved through backpropagation:

Gradient Calculation: The gradients of the loss function with respect to each weight and bias are calculated using the
chain rule of calculus.

Error Propagation: The error is propagated back through the network, layer by layer.

Gradient Descent: The network updates the weights and biases by moving in the
opposite direction of the gradient to reduce the loss: w = w − η⋅∂L/∂w

● Where:
○ w is the weight.
○ η is the learning rate.
○ ∂L/∂w is the gradient of the loss function with respect to the
weight.
Back Propagation Algorithm
Derivation of Gradients
Example- Training MLP with backpropagation
Assume that the neurons have a sigmoid activation function, and
perform a forward pass and a backward pass on the network.
Assume that the target output y is 0.5 and the learning rate is 1.
Solution:
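1. Forward Pass
The net input to a neuron is calculated as the weighted sum of its inputs:
a_j = Σ_i (w_ij * x_i)
For H3 and H4, compute the net input and apply the sigmoid activation
function y = 1 / (1 + e^(−a)).
For O5, compute the net input from the outputs of H3 and H4 and apply the
sigmoid activation function. The predicted output is y5 = 0.6903.

2. Backward Pass
Calculate the error
The error is the difference between the target output and the
predicted output:
E = Target Output − Predicted Output
E = 0.5 − 0.6903 = −0.1903
Calculate the gradient for weights
connected to O5.
For the output neuron O5, the error term is:
δ5 = y5(1 − y5) * E = 0.6903 * (1 − 0.6903) * (−0.1903) = −0.0407
Update weights for w35 and w45:
w_new = w_old + η * δ5 * y_i, where y_i is the output of the hidden neuron feeding that weight.
Calculate the gradients for
weights connected to H3 and H4

For hidden layers, the gradients depend on the contribution of the error
propagated backward. For example:

δ3 = y3(1 − y3) * w35 * δ5

Substitute Values:
0.755 * (1 − 0.755) * 0.3 * (−0.0407) = −0.0023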
Execute the backpropagation algorithm for 1 epoch
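One epoch of this procedure can be sketched in Python for a 2-2-1 network with sigmoid units and no biases. The inputs and initial weights below are hypothetical values (the original figure is not available); they were picked so that the forward pass gives an output close to 0.6903, so the hidden-layer numbers need not match those of the figure. The update rule assumed is w ← w + η·δ_j·(input to that weight), which follows from gradient descent on the squared error ½(target − y5)².

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical inputs and initial weights (not taken from the original figure).
x1, x2 = 0.35, 0.9
w13, w23 = 0.1, 0.8      # inputs -> H3
w14, w24 = 0.4, 0.6      # inputs -> H4
w35, w45 = 0.3, 0.9      # hidden -> O5
target, lr = 0.5, 1.0

def forward(w13, w23, w14, w24, w35, w45):
    y3 = sigmoid(w13 * x1 + w23 * x2)
    y4 = sigmoid(w14 * x1 + w24 * x2)
    y5 = sigmoid(w35 * y3 + w45 * y4)
    return y3, y4, y5

# Forward pass.
y3, y4, y5 = forward(w13, w23, w14, w24, w35, w45)

# Backward pass: error terms (deltas) for the output and hidden neurons.
error = target - y5
delta5 = y5 * (1 - y5) * error
delta3 = y3 * (1 - y3) * w35 * delta5
delta4 = y4 * (1 - y4) * w45 * delta5

# Weight updates: w_new = w_old + lr * delta_of_receiving_neuron * input_to_weight.
w35_new, w45_new = w35 + lr * delta5 * y3, w45 + lr * delta5 * y4
w13_new, w23_new = w13 + lr * delta3 * x1, w23 + lr * delta3 * x2
w14_new, w24_new = w14 + lr * delta4 * x1, w24 + lr * delta4 * x2

# Second forward pass with the updated weights: the error should shrink.
_, _, y5_new = forward(w13_new, w23_new, w14_new, w24_new, w35_new, w45_new)
print(round(y5, 4), round(delta5, 4), round(y5_new, 4))
```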

Reference video : https://www.youtube.com/watch?v=C3EQuy0jBGw


PRACTICAL ISSUES IN NEURAL NETWORK TRAINING

🌸 The challenges are primarily related to several practical problems
associated with training, the most important of which is
overfitting.
🌸 Overfitting means that a neural network will provide excellent
prediction performance on the training data that it is built on, but
will perform poorly on unseen test instances.
🌸 The ability of a learner to provide useful predictions for instances it
has not seen before is referred to as generalization.
🌸 In overfitting, the model tries to learn
too many details in the training data
along with the noise from the training
data. As a result, the model performance
is very poor on unseen or test datasets.
Therefore, the network fails to generalize
the features or patterns present in the
training dataset.
🌸 Overfitting during training can be
spotted when the error on training data
decreases to a very small value but the
error on the new data or test data
increases to a large value.
🌸 The error vs. iteration curve shows how a deep
neural network overfits training data.
In the graph, the blue curve represents the
training error and the red curve represents
the testing error. The point where the
green line crosses is the point at which
the network begins to overfit. As you can
observe, the testing error increases
sharply while the training error
decreases.
Reasons for Overfitting

The size of the training dataset is small

🌸 When the network tries to learn from a small dataset, it tends to have greater control over the dataset and will try
to satisfy all the data points exactly. The network ends up memorizing every single data point and fails to
capture the general trend in the training dataset.

The model tries to make predictions on noisy data

🌸 Overfitting also occurs when the model tries to make predictions on data that is very noisy.
🌸 By removing some layers or reducing the number of neurons,
the network becomes less prone to overfitting, as the neurons
contributing to overfitting are removed or deactivated.
The network then has a smaller number of parameters to
learn, so it cannot memorize all the data points
and is forced to generalize.
Difficulties in Convergence
Convergence in neural network training refers to the process
where the model's parameters (weights and biases) stabilize, and
the loss function approaches a minimum value. However, several
challenges can make convergence difficult:
1. Poor Weight Initialization

● Problem:
○ If weights are initialized with very small values, it can lead to vanishing gradients, especially
in deep networks.
○ If weights are initialized with very large values, it can cause exploding gradients, leading to
instability.
● Impact:
○ Slow or no convergence during training.
○ The network may get stuck in suboptimal regions of the loss landscape.
● Solution:
○ Xavier Initialization: Used for activation functions like sigmoid or tanh. It ensures weights are
neither too small nor too large by considering the size of the previous layer.
○ He Initialization: Used for ReLU and its variants, as it adjusts for the non-linear behavior of
the activation function.
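The two schemes can be sketched in plain Python. The formulas used here are the commonly cited variants, stated as assumptions: Xavier (Glorot) uniform draws from [−limit, limit] with limit = sqrt(6 / (fan_in + fan_out)), and He normal draws from a Gaussian with standard deviation sqrt(2 / fan_in). The layer sizes and function names are illustrative.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    # Glorot/Xavier uniform: limit depends on both the input and output sizes.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    # He normal: variance 2 / fan_in, intended for ReLU-style activations.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

rng = random.Random(0)                   # fixed seed for reproducibility
W_tanh = xavier_uniform(256, 128, rng)   # e.g. for a tanh or sigmoid layer
W_relu = he_normal(256, 128, rng)        # e.g. for a ReLU layer

flat = [w for row in W_relu for w in row]
mean = sum(flat) / len(flat)
sample_std = math.sqrt(sum((w - mean) ** 2 for w in flat) / len(flat))
print(round(sample_std, 4), round(math.sqrt(2.0 / 256), 4))
```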
2. Learning Rate Challenges

● Problem:
○ A high learning rate causes the optimizer to overshoot the optimal point, leading to
oscillations or divergence.
○ A low learning rate slows down training, making convergence take a long time.
● Impact:
○ Convergence might become unstable, or training may stagnate.
● Solution:
○ Use Learning Rate Scheduling: Gradually reduce the learning rate during training (e.g., Step
Decay, Exponential Decay, or Cosine Annealing).
○ Use Adaptive Optimizers:
■ Adam: Combines momentum and RMSProp to adjust learning rates dynamically.
■ RMSProp: Scales the learning rate for each parameter based on the magnitude of
recent gradients.
■ Adagrad: Adapts each parameter's learning rate based on the accumulated squared gradients from past updates.
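The three schedules named above can each be written as a function of the epoch. The sketch uses common textbook forms; the default constants (drop factor, decay rate k, minimum rate) are illustrative assumptions rather than recommended settings.

```python
import math

def step_decay(lr0, epoch, drop=0.5, epochs_per_drop=10):
    # Multiply the rate by `drop` once every `epochs_per_drop` epochs.
    return lr0 * drop ** (epoch // epochs_per_drop)

def exponential_decay(lr0, epoch, k=0.05):
    # Smooth decay: lr = lr0 * exp(-k * epoch).
    return lr0 * math.exp(-k * epoch)

def cosine_annealing(lr0, epoch, total_epochs, lr_min=0.0):
    # Follows half a cosine wave from lr0 down to lr_min over total_epochs.
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + cosine)

for epoch in (0, 10, 25, 50, 100):
    print(epoch,
          round(step_decay(0.1, epoch), 5),
          round(exponential_decay(0.1, epoch), 5),
          round(cosine_annealing(0.1, epoch, total_epochs=100), 5))
```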
3. Vanishing and Exploding Gradients

● Problem:
○ Vanishing Gradients: Gradients become exceedingly small during
backpropagation, leading to minimal updates to weights in earlier layers.
○ Exploding Gradients: Gradients grow exponentially, causing numerical instability.
● Impact:
○ Training stagnates (in the case of vanishing gradients).
○ Loss function diverges (in the case of exploding gradients).
● Solution:
○ Use activation functions like ReLU (and its variants) instead of sigmoid or tanh.
○ Implement Gradient Clipping to cap gradients within a defined range.
○ Use Batch Normalization to normalize layer inputs and stabilize gradients.
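Gradient clipping can be sketched in two common forms: clipping by the overall norm (rescaling the whole gradient vector) and clipping each component to a fixed range. The function names and thresholds are illustrative.

```python
import math

def clip_by_norm(gradients, max_norm):
    # Rescale the gradient vector if its L2 norm exceeds max_norm.
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]

def clip_by_value(gradients, limit):
    # Cap every component to the range [-limit, limit].
    return [max(-limit, min(limit, g)) for g in gradients]

print(clip_by_norm([3.0, 4.0], max_norm=1.0))       # norm 5 is scaled down to norm 1
print(clip_by_value([-5.0, 0.2, 7.0], limit=1.0))
```

Clipping by norm preserves the direction of the update, while clipping by value may change it.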
Vanishing and exploding gradient problems
In the realm of deep learning, the optimization process plays a crucial
role in training neural networks. Gradient descent, a fundamental
optimization algorithm, can sometimes encounter two common issues:
vanishing gradients and exploding gradients.
Vanishing Gradient
As the backpropagation algorithm advances backward
from the output layer towards the input layer, the gradients often get
smaller and smaller and approach zero, eventually leaving the weights of
the initial or lower layers nearly unchanged. As a result, gradient
descent may never converge to the optimum. This is known as the vanishing
gradients problem.
Exploding Gradient
On the contrary, the gradients keep getting larger in some cases as the
backpropagation algorithm progresses. This, in turn, causes large weight
updates and causes the gradient descent to diverge. This is known as the
exploding gradient problem.
🌸 For example, a sigmoid activation often encourages the vanishing gradient
problem, because its derivative is at most 0.25 for all values of its
argument and is extremely small at saturation.
🌸 A ReLU activation unit is known to be less likely to create a vanishing
gradient problem because its derivative is always 1 for positive values of the
argument.
🌸 The use of adaptive learning rates and conjugate gradient
methods can help in many cases.
🌸 A recent technique called batch normalization is helpful in addressing some
of these issues.
🌸 Batch normalization is a normalization technique applied between the layers of a
neural network instead of to the raw data.
🌸 It is done along mini-batches instead of the full data set. It serves to speed
up training.
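The effect of depth can be illustrated by multiplying the activation derivatives along one path of a deep network. The sketch below deliberately ignores the weights (equivalent to assuming every weight on the path is 1) and uses the same arbitrary pre-activation value at each layer, so it is only a simplified illustration of the chain-rule product.

```python
import math

def sigmoid_derivative(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

def backpropagated_factor(derivative, pre_activations):
    # Product of local activation derivatives along one path (weights assumed to be 1).
    factor = 1.0
    for z in pre_activations:
        factor *= derivative(z)
    return factor

depth = 10
pre_activations = [0.5] * depth          # same illustrative value at every layer

sigmoid_factor = backpropagated_factor(sigmoid_derivative, pre_activations)
relu_factor = backpropagated_factor(relu_derivative, pre_activations)
print(sigmoid_factor, relu_factor)
```

With sigmoid units the factor can never exceed 0.25 per layer, so ten layers shrink the gradient by at least a factor of about a million, while the ReLU path with positive pre-activations passes it through unchanged.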
Local Optima
In optimization problems, an optimum is a best possible solution
according to a given criterion. Local optima are solutions that are
better than other solutions in the immediate vicinity but are not
necessarily the best overall solution, which is referred to as the
global optimum.

🌸 Definition: A local optimum is a point in the loss function where the gradient is zero, but the
value of the loss function is higher than the global minimum.

Key Features:

● Found in non-convex optimization problems.


● The optimizer (e.g., Gradient Descent) might get stuck at this point, unable to find a better
solution.
● Local optima are more likely in shallow neural networks or networks with simpler structures.

Example:

● Consider a non-convex loss surface with multiple dips. A local minimum is one of these
dips, while the global minimum is the deepest point.
Techniques to Avoid Local Optima:

1. Momentum-based optimization: Helps the optimizer escape small local
optima by adding inertia to the gradient updates (e.g., SGD with
momentum).
2. Adaptive optimizers: Optimizers like Adam or RMSProp adjust the
learning rate dynamically.
3. Re-initialization: Restart the training with a new set of weights.
4. Stochasticity in SGD: Stochastic Gradient Descent (SGD) introduces
randomness, which can help escape local optima.
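A one-dimensional toy example shows how plain gradient descent can settle in a local minimum and how re-initialization helps. The loss function x^4 − 3x^2 + x is a made-up non-convex function with two dips (a local minimum near x ≈ 1.13 and the global minimum near x ≈ −1.30); the learning rate, step count, and starting points are illustrative assumptions.

```python
def loss(x):
    # Toy non-convex loss with two minima.
    return x ** 4 - 3 * x ** 2 + x

def gradient(x):
    return 4 * x ** 3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=2000):
    # Plain gradient descent from the starting point x.
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

# Re-initialization: run from several starting points and keep the best result.
starts = [2.0, -2.0]
solutions = [gradient_descent(s) for s in starts]
best = min(solutions, key=loss)

for s, x in zip(starts, solutions):
    print(s, round(x, 4), round(loss(x), 4))
```

Starting from x = 2.0 the optimizer stops in the shallower dip, whereas the restart from x = −2.0 reaches the deeper one.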
Spurious Optima
● Definition: A spurious optimum refers to a solution that appears to be optimal within
a specific region of the optimization landscape but does not generalize well to unseen
data.

Key Features:

● These optima are not necessarily local minima but represent suboptimal solutions in
terms of generalization.
● Commonly occur when the neural network overfits the training data.
● Spurious optima are more prevalent in larger, overparameterized networks.

Example:

● A spurious optimum occurs when a neural network perfectly classifies the training data
but performs poorly on test data because it has overfit irrelevant patterns.
Techniques to Avoid Spurious Optima:

1. Regularization:
○ Add constraints like L1 or L2 regularization to penalize overly complex solutions.
○ Use dropout to prevent overfitting.
2. Batch Normalization: Helps stabilize the learning process and avoids overly sharp
minima.
3. Cross-validation: Regularly evaluate the model on a validation set to ensure it
generalizes well.
4. Early Stopping: Stop training as soon as the performance on the validation set stops
improving.
5. Loss Function Design: Design smooth, well-behaved loss functions that reduce the
likelihood of spurious optima.
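Two of the techniques above, L2 regularization and early stopping, are simple enough to sketch in a few lines. The code below is a minimal illustration under stated assumptions: the function names, the `patience` parameter (how many non-improving epochs to tolerate before stopping), and the validation-loss values are hypothetical and do not come from these slides.

```python
def l2_penalized_loss(data_loss, weights, lam):
    """Data loss plus an L2 penalty: lam * sum of squared weights."""
    return data_loss + lam * sum(w * w for w in weights)


def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs; the best epoch is the one to restore.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss, best_epoch, waited = val_loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the last epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses: improving at first, then rising (overfitting).
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
print(best_epoch, stop_epoch)
print(l2_penalized_loss(0.5, [1.0, -2.0], lam=0.1))
```

Setting `patience=1` reproduces the rule exactly as stated above (stop as soon as validation performance stops improving). A larger value tolerates noisy validation curves at the cost of a few extra epochs.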
Comparing Local and Spurious Optima
Key Challenges and Solutions
Computational Challenges
A significant challenge in neural network design is the running
time required to train the network.
In recent years, advances in hardware technology such as Graphics
Processing Units (GPUs) have helped to a significant extent. GPUs
are specialized hardware processors that can significantly speed up
the kinds of operations commonly used in neural networks.
Some algorithmic frameworks, such as Torch, are particularly convenient
in this respect because they have GPU support tightly integrated into the
platform.
One convenient property of neural network models is that most of
the computational heavy lifting is front-loaded into the training phase.
The prediction phase is often computationally efficient because it
requires a small number of operations (depending on the number of
layers).
This is important because the prediction phase is often far more
time-critical than the training phase.
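The gap between training cost and prediction cost can be made concrete with a rough operation count. The sketch below is a back-of-the-envelope estimate, not a benchmark. The layer sizes, dataset size, and epoch count are hypothetical, biases and activation functions are ignored, and the backward pass is assumed to cost about twice the forward pass (a common rule of thumb, not a figure from these slides).

```python
def forward_macs(layer_sizes):
    """Multiply-accumulate operations for one forward pass of a fully connected MLP.

    Each pair of adjacent layers contributes n_in * n_out operations
    (biases and activation functions are ignored).
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def training_macs(layer_sizes, n_examples, n_epochs, backward_factor=2):
    """Rough training cost: forward pass plus an assumed backward cost,
    repeated for every example in every epoch."""
    per_example = (1 + backward_factor) * forward_macs(layer_sizes)
    return per_example * n_examples * n_epochs


# Hypothetical network: 784 inputs, one hidden layer of 128 units, 10 outputs.
layers = [784, 128, 10]
one_prediction = forward_macs(layers)
full_training = training_macs(layers, n_examples=60_000, n_epochs=10)
print(one_prediction, full_training, full_training // one_prediction)
```

Under these assumptions a single prediction costs about a hundred thousand operations, while training costs roughly 1.8 million times as much, which is why the heavy lifting falls in the training phase.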
APPLICATIONS OF NEURAL NETWORKS
Neural networks are used across key sectors including finance, healthcare, and automotive.
They can be used for image recognition, character recognition, stock market prediction, etc.
1. Facial Recognition
Facial recognition systems serve as robust surveillance systems.
They match a human face against stored digital images.
They are used in offices for selective entry: the system authenticates a
human face by matching it against the list of IDs in its database.
Convolutional Neural Networks (CNNs) are used for facial recognition and
image processing.
2. Stock Market Prediction
Investments are subject to market risks. To make stock predictions in real
time, a Multilayer Perceptron (MLP), a class of feedforward artificial neural
network, can be employed.
An MLP comprises multiple layers of nodes, with each layer fully connected to the
next. A stock's past performance, annual returns, and non-profit ratios are
considered when building the MLP model.
3. Social Media
Artificial Neural Networks are used to study the behaviour of social media users. Data
shared every day in virtual conversations is tracked and analyzed for competitive
analysis.
🌸 Neural networks model the behaviour of social media users. After
analysing individuals' behaviour on social networks, the data can
be linked to people's spending habits. Multilayer perceptron ANNs are used
to mine data from social media applications.
4. Aerospace
🌸 Aerospace Engineering is an expansive term that covers developments in
spacecraft and aircraft.
🌸 Fault diagnosis, high-performance autopiloting, securing aircraft
control systems, and modeling key dynamic simulations are some of the
key areas where neural networks are applied.
🌸 Time Delay Neural Networks can be employed for modelling nonlinear
time-dynamic systems.
🌸 Time Delay Neural Networks are used for position-independent feature
recognition, so an algorithm built on them can recognize patterns
regardless of where they occur in the input.
5. Defence
🌸 Neural networks also shape the defence operations of technologically
advanced countries. They are used in logistics, armed attack
analysis, and object location, as well as in air patrols,
maritime patrols, and the control of automated drones.
🌸 Convolutional Neural Networks (CNNs) are employed for detecting
underwater mines, which are explosive devices placed in water to
damage ships and submarines.
6. Healthcare
🌸 The healthcare sector increasingly leverages this technology. Convolutional
Neural Networks are actively employed in the healthcare industry to analyse
X-ray, CT scan, and ultrasound images.
🌸 Because CNNs are used in image processing, the medical imaging data obtained from
these tests is analyzed and assessed using neural network models.
Recurrent Neural Networks (RNNs) are also being employed in the development of
voice recognition systems.
7. Signature Verification and Handwriting Analysis
🌸 Signature verification, as the name suggests, is used for verifying an
individual's signature. Banks and other financial institutions use signature
verification to cross-check the identity of an individual.
🌸 Artificial Neural Networks are used for verifying signatures. ANNs are
trained to recognize the difference between real and forged signatures, and
can be used to verify both offline and online signatures.
8. Weather Forecasting
🌸 Forecasting is primarily undertaken to anticipate upcoming weather
conditions. In the modern era, weather forecasts are even
used to predict the possibility of natural disasters.
🌸 Multilayer Perceptrons (MLPs), Convolutional Neural Networks (CNNs), and
Recurrent Neural Networks (RNNs) are used for weather forecasting.
Traditional multilayer ANN models can also be used to predict climatic
conditions 15 days in advance. A combination of different neural
network architectures can be used to predict air temperatures.
