Fundamentals of Neural Network

The document discusses the fundamentals of neural networks, focusing on key concepts such as forward propagation, loss functions, activation functions, backpropagation, and gradient descent. It explains how forward propagation processes input data through the network, the significance of loss functions in model training, and the role of activation functions in introducing non-linearity. Additionally, it covers the gradient descent optimization algorithm and its variants, which are essential for adjusting model parameters to minimize errors during training.

Uploaded by

henop47759
02
Training, Optimization and Regularization of Deep Neural Network
Forward propagation

• Forward propagation in deep learning refers to the process of passing input
data through the neural network to get the output or prediction.
• It involves a series of computations, where the input data is transformed as
it passes through the layers of the network.
• In short, forward propagation is the process of passing the input forward
through the network via weighted sums, biases, and activation functions.
• The network learns the optimal weights and biases during the training phase
to make accurate predictions.
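The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the 2-2-1 layer sizes, the weight values, and the helper names `dense_forward` and `forward` are all assumptions made for the example, and the sigmoid is used as the activation in every layer.

```python
import math


def sigmoid(z):
    """Logistic activation, squashes z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense_forward(inputs, weights, biases, activation):
    """One layer: weighted sum plus bias for each neuron, then the activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def forward(inputs, layers):
    """Pass the input through every layer in order and return the prediction."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_forward(activations, weights, biases, sigmoid)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.0]),  # hidden layer weights, biases
    ([[1.0, -1.0]], [0.0]),                     # output layer weights, biases
]
prediction = forward([1.0, 1.0], layers)
```

During training, only the numbers inside `layers` change; the forward computation itself stays the same.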
Simple Analogy
• Preparing a recipe (making predictions)
• Ingredients (Input)
• Ingredients importance or preference (Weights and Biases)
• Mixing Ingredients (Weighted Sums)
• Taste recipe (Activation Function)
• Final dish as desired (Output: 0 or 1)
Loss Function
• In deep learning, a loss function is a measure of how well a model's
predictions match the actual target values.
• The goal during the training of a model is to minimize this loss function.
• It quantifies the difference between predicted values and true values,
providing a way to assess how well the model is performing.
• Different types of problems (classification, regression, etc.) and algorithms
use different loss functions.
[Figure: types of loss function, including mean squared and absolute]
Squared Error Loss (Mean Squared Error - MSE):
• Use Case: Typically used for regression problems, where the goal is to
predict a continuous variable.
• Calculation: It calculates the average of the squared differences between
the predicted and actual values.
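• Formula: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the actual
value, $\hat{y}_i$ is the predicted value, and $n$ is the number of samples.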
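As a small worked example of the calculation described above (the average of the squared differences), the following sketch computes MSE for three made-up target and prediction values. The function name and the sample numbers are illustrative assumptions.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n


# Squared errors are 1, 0 and 4, so the mean is 5/3.
mse = mean_squared_error([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
```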
Cross Entropy Loss
(Binary Cross Entropy and Categorical Cross Entropy)
Binary Cross Entropy Loss:
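• Use Case: Commonly used for binary classification problems (two classes).
• Calculation: It averages the negative log of the probability that the model
assigns to the true class of each sample.
• Formula: $\mathrm{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$,
where $y_i \in \{0, 1\}$ is the true label and $\hat{y}_i$ is the predicted probability of class 1.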
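A minimal sketch of binary cross entropy for two made-up samples follows. The function name, the sample values, and the small `eps` clamp (added here only to avoid taking the log of 0) are assumptions for illustration.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# One positive sample predicted at 0.9, one negative sample predicted at 0.2.
bce = binary_cross_entropy([1, 0], [0.9, 0.2])
```

Confident, correct predictions give a loss near 0, while a confident wrong prediction is penalized heavily.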
Categorical Cross Entropy Loss:
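• Use Case: Commonly used for multi-class classification problems (more than
two classes).
• Calculation: It takes the negative log of the probability that the model
assigns to the true class.
• Formula: $\mathrm{CCE} = -\sum_{c=1}^{C} y_c \log(\hat{y}_c)$, where $y_c$ is the one-hot true label
and $\hat{y}_c$ is the predicted probability for class $c$.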
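The following sketch evaluates categorical cross entropy for a single sample with a one-hot label. The function name, the three-class example, and the `eps` guard against log(0) are illustrative assumptions.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Negative log of the probability assigned to the true (one-hot) class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# True class is index 1, and the model gives it probability 0.7.
loss = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
```

Because the label is one-hot, only the term for the true class contributes, so the loss here equals -log(0.7).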
What is activation function?
• The activation function decides whether, and how strongly, a neuron is
activated. It is applied to the neuron's weighted sum of inputs plus bias, and
its purpose is to introduce non-linearity into the output of the neuron.
• In artificial neural networks, a threshold-style activation function outputs
a small value for small inputs and a larger value once its input exceeds a
threshold. In that sense the function "fires" if the input is big enough;
otherwise, the neuron stays largely inactive.
• An activation function can therefore be viewed as a gate that checks whether
an incoming value is higher than a threshold value.
• The activation function is a fundamental component of neural networks that
introduces non-linearity, enabling them to learn complex relationships,
adapt to various data patterns, and make sophisticated decisions.
Why is an activation function needed?
Introducing Non-linearity:
• Without activation functions, the entire neural network would behave like a linear
model.
• The stacking of multiple linear operations would result in a linear combination,
limiting the network's ability to learn and represent complex, non-linear patterns in
the data.

Capturing Complex Relationships:


• Many real-world problems involve intricate and non-linear relationships.
• Activation functions allow the neural network to model and capture these complex
patterns, making it more powerful in representing diverse data.

Enabling Neural Network to Learn:


• The non-linear transformations introduced by activation functions enable the
network to learn and adapt to intricate patterns in the input data during the training
process. This is crucial for the network to generalize well to unseen data.
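The point about stacked linear operations can be checked numerically. In this sketch (scalar layers and weight values chosen arbitrarily for illustration), two linear layers without an activation are exactly equivalent to one linear layer, while inserting a ReLU between them breaks that equivalence.

```python
def linear_layer(x, w, b):
    """A one-dimensional linear layer: w * x + b."""
    return w * x + b


def relu(z):
    return max(0.0, z)


# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    """Two layers with no activation in between."""
    return linear_layer(linear_layer(x, w1, b1), w2, b2)


# The stack collapses to a single linear layer with these parameters.
w_eq = w2 * w1
b_eq = w2 * b1 + b2


def two_layers_with_relu(x):
    """Same two layers, but with a non-linear activation in between."""
    return linear_layer(relu(linear_layer(x, w1, b1)), w2, b2)
```

For any input, `two_linear_layers(x)` matches `linear_layer(x, w_eq, b_eq)`, whereas `two_layers_with_relu(x)` differs for inputs that drive the first layer negative.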
Thresholding and Output Scaling:
• Activation functions often introduce thresholding effects, where the neuron
activates or not based on certain conditions.
• This helps in decision-making and provides a level of abstraction. Additionally,
activation functions like sigmoid and softmax scale the output to represent
probabilities in classification tasks.

Avoiding Vanishing or Exploding Gradients:


• Activation functions play a role in mitigating issues like vanishing or exploding
gradients during backpropagation, especially in deep neural networks.
• Well-designed activation functions help in the stable training of deep networks.
Introducing Sparsity:
• Some activation functions, like ReLU (Rectified Linear Unit) and its variants,
introduce sparsity in the network by setting negative values to zero.
• This can be beneficial in certain scenarios.

Facilitating Backpropagation:
• Activation functions provide derivatives or gradients that are essential for the
backpropagation algorithm, which is used to update the weights of the network
during training.
• This enables the network to learn and improve its performance over time.
Types of activation function
In a perceptron or a neural network, activation functions play a crucial role by
introducing non-linearity to the model.
Here are some common types of activation functions used in perceptrons:

1. Linear activation function
2. Logistic activation function
3. Tanh activation function
4. Softmax activation function
5. ReLU activation function
6. Leaky ReLU activation function
1. Linear Function:
• Description: The linear activation function, also known as the identity
activation function, returns its input unchanged.
• Mathematical Form: $f(x) = x$
• Advantages:
  ◦ Simplicity
  ◦ Ease of interpretation: the output is directly proportional to the input.
  ◦ Compatibility with linear models (well suited to tasks with linear
  relationships)
Disadvantages:
• Limited Expressiveness:
  ◦ Inability to model complex, non-linear relationships.
  ◦ Stacking linear layers still results in a linear model.
• Uninformative Gradient:
  ◦ The derivative is a constant that does not depend on the input, so the
  activation contributes nothing input-specific during backpropagation.
  ◦ Adding more linear layers adds depth without adding learning capacity.
• Not Suitable for Classification Outputs:
  ◦ Challenging for binary classification tasks.
  ◦ Output is not squashed into a specific range such as (0, 1).
• Not Used in Hidden Layers of Deep Networks:
  ◦ Rarely used in deep networks' hidden layers.
  ◦ Non-linear activations are preferred.
2. Logistic Activation Function:
• Also commonly referred to as the sigmoid activation function.
• Description: The sigmoid (logistic) function squashes input values to the
range (0, 1). It is commonly used in the output layer of binary classification
models.
• Mathematical Form: $\sigma(x) = \frac{1}{1 + e^{-x}}$
• Its graph is an 'S'-shaped curve.
• Nature: Non-linear. For x between -2 and 2 the curve is steep, so small
changes in x produce large changes in y.
• Value Range: 0 to 1
• Uses: Usually used in the output layer of a binary classifier, where the
result is either 0 or 1. Because the sigmoid output lies between 0 and 1, the
prediction can be taken as 1 if the value is greater than 0.5 and 0 otherwise.
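A short sketch of the sigmoid and the 0.5 decision rule described above. The two-branch form is one common way to avoid overflow for large negative inputs; the function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + e^-x), written to avoid overflow in exp."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return e / (1.0 + e)


def predict_class(x, threshold=0.5):
    """Binary decision: class 1 if the sigmoid output exceeds the threshold."""
    return 1 if sigmoid(x) > threshold else 0


labels = [predict_class(z) for z in (-2.0, 0.3, 4.0)]
```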
3. Tanh (Hyperbolic Tangent) Function:
• Description: Similar to the sigmoid function, the tanh function maps input
values to the range (-1, 1). It is often used in hidden layers of neural networks.
• Mathematical Form: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
• The tanh (hyperbolic tangent) function often works better than the sigmoid
function in hidden layers. It is a scaled and shifted version of the sigmoid,
$\tanh(x) = 2\sigma(2x) - 1$ where $\sigma$ is the sigmoid function, so each can be derived
from the other.
• Value Range: -1 to +1
• Nature: Non-linear
• Uses: Usually used in hidden layers of a neural network. Because its values
lie between -1 and 1, the mean of the hidden layer's outputs comes out at or
close to 0, which centers the data and makes learning for the next layer easier.
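The relationship between tanh and the sigmoid can be verified numerically. This sketch rebuilds tanh from the sigmoid using the identity tanh(x) = 2·sigmoid(2x) − 1 and compares it with Python's built-in `math.tanh`; the helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    """tanh expressed as a scaled and shifted sigmoid."""
    return 2.0 * sigmoid(2.0 * x) - 1.0


# Largest gap between the rebuilt tanh and math.tanh over a few sample inputs.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
max_gap = max(abs(tanh_from_sigmoid(x) - math.tanh(x)) for x in samples)
```

Unlike the sigmoid, the outputs are symmetric around 0: `math.tanh(-x)` equals `-math.tanh(x)`.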
4. Softmax Function:
• Description: Often used in the output layer of a neural network for multi-class
classification problems. It transforms the raw output scores (logits) into a
probability distribution over multiple classes. The softmax function is
particularly useful when an input can belong to exactly one of several
mutually exclusive classes.
• Mathematical Form: $\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$ for logits $z_1, \dots, z_K$
• The softmax function generalizes the sigmoid function to more than two
classes, which makes it handy for multi-class classification problems.
• Nature: Non-linear
• Uses: Usually used when handling multiple classes; it is commonly found in
the output layer of image classification models. Softmax exponentiates the
output for each class and divides by the sum over all classes, so every output
lies between 0 and 1 and the outputs sum to 1.
• If the output is for binary classification, the sigmoid function is a natural
choice for the output layer.
• If the output is for multi-class classification, softmax is useful for
predicting the probability of each class.
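A minimal softmax sketch follows. Subtracting the largest logit before exponentiating is a standard numerical-stability step that does not change the result; the example logits are made up.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
```

The class with the largest logit always receives the largest probability, so `predicted_class` is 0 here.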
5. Rectified Linear Unit (ReLU):
• Description: ReLU is a popular activation function that outputs the input for
positive values and zero for negative values. It introduces non-linearity and is
computationally efficient.
• Mathematical Form: $f(x) = \max(0, x)$
• ReLU stands for rectified linear unit. It is the most widely used activation
function and is chiefly applied in the hidden layers of neural networks.
• Value Range: [0, inf)
• Nature: Non-linear, yet simple to differentiate, so errors can be
backpropagated easily through multiple layers of neurons activated by ReLU.
• Uses: ReLU is less computationally expensive than tanh and sigmoid because
it involves simpler mathematical operations. At any time only some of the
neurons are active, which makes the network sparse and efficient to compute.
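The sparsity effect can be seen on a handful of made-up pre-activation values: every non-positive input becomes exactly zero after ReLU. The variable names are illustrative.

```python
def relu(x):
    """Rectified linear unit: pass positive values through, zero out the rest."""
    return max(0.0, x)


# Hypothetical pre-activation values for five neurons in one layer.
pre_activations = [-2.0, -0.5, 0.0, 1.5, 3.0]
post_activations = [relu(z) for z in pre_activations]

# Fraction of neurons that are inactive (output exactly zero).
sparsity = post_activations.count(0.0) / len(post_activations)
```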
6. Leaky ReLU (Leaky Rectified Linear Unit) Function:
• Description: Leaky ReLU is an activation function used in artificial neural
networks to introduce non-linearity between layers. It was created to address
the dying ReLU problem, in which neurons using the standard ReLU get stuck
outputting zero and stop learning during training.
• Mathematical Form: $f(x) = x$ if $x > 0$, otherwise $f(x) = \alpha x$, where $\alpha$ is a
small positive slope (for example 0.01)
• With this function, negative inputs are mapped to values close to 0 but not
exactly 0, which avoids the dying ReLU issue.
• By introducing a small slope for negative inputs, Leaky ReLU addresses the
limitations of the standard ReLU in deep neural networks and keeps gradients
flowing through the network during training.
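The difference from standard ReLU shows up most clearly in the gradient for negative inputs. In this sketch the slope `alpha=0.01` is a commonly used default taken as an assumption, and the function names are illustrative.

```python
def relu_grad(x):
    """Derivative of ReLU: zero for negative inputs, so no learning signal."""
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Derivative of Leaky ReLU: never exactly zero."""
    return 1.0 if x > 0 else alpha


negative_input = -2.0
output = leaky_relu(negative_input)          # small negative value, not 0
signal_relu = relu_grad(negative_input)      # 0.0: the neuron cannot update
signal_leaky = leaky_relu_grad(negative_input)  # alpha: a small update remains
```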
Backpropagation
• Backpropagation is a supervised learning algorithm used to train artificial
neural networks.
• It adjusts the network's weights and biases to minimize the error between
predicted and actual outputs.
• It uses the chain rule to compute the gradient of the error with respect to
each weight, then iteratively adjusts the weights for better predictions.
• During the forward pass, input data is fed through the neural network to
produce predictions.
• The loss function quantifies the error between predicted and actual outputs.
• During the backward pass, the gradients of this loss are propagated from the
output layer back toward the input layer and used to update the parameters,
reducing the loss and improving the model's accuracy.
• Forward and backward passes are repeated until the model converges or a
stopping criterion set in advance is met.
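The forward pass, loss, backward pass, and update loop can be sketched for the smallest possible network: one sigmoid neuron with one weight and one bias. The squared-error loss, the starting values, and the learning rate are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, b, x):
    """Forward pass: weighted sum plus bias, then sigmoid."""
    return sigmoid(w * x + b)


def loss(w, b, x, y):
    """Squared error between the prediction and the target."""
    return 0.5 * (forward(w, b, x) - y) ** 2


def backward(w, b, x, y):
    """Backward pass: chain rule from the loss back to w and b."""
    y_hat = forward(w, b, x)
    dloss_dyhat = y_hat - y              # d(loss)/d(y_hat)
    dyhat_dz = y_hat * (1.0 - y_hat)     # derivative of the sigmoid
    grad_w = dloss_dyhat * dyhat_dz * x  # dz/dw = x
    grad_b = dloss_dyhat * dyhat_dz      # dz/db = 1
    return grad_w, grad_b


# Hypothetical single training example and starting parameters.
w, b, x, y = 0.5, -0.2, 1.5, 1.0
learning_rate = 0.1
initial_loss = loss(w, b, x, y)
for _ in range(100):
    grad_w, grad_b = backward(w, b, x, y)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
final_loss = loss(w, b, x, y)
```

In a multi-layer network the same chain-rule step is applied layer by layer, reusing the gradient that arrives from the layer above.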
Gradient Descent
• Gradient Descent is an optimization algorithm used in machine learning and
deep learning for training models/Neural networks and finding the optimal
parameters that minimize a cost function.
• Step-by-step explanation of how Gradient Descent works:
  ◦ Step 1, Initialization: It begins by initializing the model parameters
  randomly or with some predetermined values. These parameters could be the
  weights and biases of a neural network, for example.
  ◦ Step 2, Compute Gradient: At each iteration, the algorithm computes the
  gradient of the cost function with respect to each parameter. The gradient
  points in the direction of steepest ascent of the function.
  ◦ Step 3, Update Parameters: Once the gradient is computed, the algorithm
  updates the parameters by moving them in the opposite direction of the
  gradient, which reduces the cost function. The update rule typically
  multiplies the gradient by a learning rate and subtracts the result from the
  current parameter values.
  ◦ Step 4, Iterate: Steps 2 and 3 are repeated until a stopping criterion is
  met. This could be a predefined number of iterations or the change in the
  cost function falling below a certain threshold.
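The four steps map directly onto a short loop. This sketch minimizes a hypothetical one-parameter cost, (theta − 3)², whose minimum is known to be at theta = 3; the cost function, learning rate, and tolerance are illustrative assumptions.

```python
def cost(theta):
    """Hypothetical cost function with its minimum at theta = 3."""
    return (theta - 3.0) ** 2


def gradient(theta):
    """Derivative of the cost with respect to theta."""
    return 2.0 * (theta - 3.0)


def gradient_descent(theta, learning_rate=0.1, max_iters=1000, tol=1e-10):
    previous_cost = cost(theta)                       # Step 1: initial state
    for _ in range(max_iters):                        # Step 4: iterate
        grad = gradient(theta)                        # Step 2: compute gradient
        theta = theta - learning_rate * grad          # Step 3: update parameter
        current_cost = cost(theta)
        if abs(previous_cost - current_cost) < tol:   # stopping criterion
            break
        previous_cost = current_cost
    return theta


theta_opt = gradient_descent(theta=0.0)
```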
• Example: Finding the Lowest Point in a Valley

Imagine you are blindfolded and placed somewhere in a valley. Your goal is
to find the lowest point in the valley without being able to see the
terrain. Here's how you might proceed:
  ◦ Initial Position: You start at a random location in the valley.
  ◦ Objective: Your objective is to descend to the lowest point in the valley.
  ◦ Sense of Touch: You can feel the slope of the ground beneath your feet,
  giving you an indication of the direction of descent.
  ◦ Movement: You take a step in the steepest downhill direction, relying
  on your sense of touch to guide you.
  ◦ Repetition: You repeat this process, continuously adjusting your direction
  based on the slope of the terrain.
  ◦ Convergence: Eventually, you reach the lowest point in the valley,
  indicating convergence to the optimal solution.
• In this analogy:
  ◦ The blindfolded person represents the optimization algorithm.
  ◦ The sense of touch represents the gradient, providing information about
  the direction of descent.
  ◦ Moving downhill corresponds to updating the parameters in the direction
  that minimizes the objective function.
  ◦ This example illustrates how Gradient Descent works by iteratively
  adjusting parameters to minimize a cost function, much like finding the
  lowest point in a valley by descending along the steepest slope.
There are three different variants of Gradient Descent:
1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini-batch Gradient Descent
Batch Gradient Descent
• Batch Gradient Descent (BGD) is a variant of the Gradient Descent
optimization algorithm used to minimize a cost function in machine learning
and deep learning models.
• It's called "batch" because it processes the entire training dataset in each
iteration to update the model parameters.
  ◦ Optimization algorithm in machine/deep learning
  ◦ Variant of Gradient Descent
  ◦ Minimizes a cost function in ML/DL models
  ◦ Processes the entire training dataset in each iteration
• Working Principle
  ◦ Initialization: Start with an initial guess for the model parameters
  ◦ Compute Gradient: Calculate the gradient of the cost function with respect
  to the parameters using the entire training dataset
  ◦ Update Parameters: Adjust the parameters using the gradients and the
  learning rate
  ◦ Repeat until the convergence criteria are met

• Key Aspects
  ◦ Uses the entire dataset in each iteration, so it is computationally
  expensive for large datasets
  ◦ Accurate estimate of the gradient
  ◦ Converges to the global minimum for convex cost functions (given a suitable
  learning rate); for non-convex cost functions it may settle in a local minimum
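A minimal sketch of batch gradient descent on a made-up dataset where y = 2x, fitting a single weight `w` in the model y = w·x with a mean-squared-error cost. The dataset, learning rate, and epoch count are assumptions for illustration.

```python
# Hypothetical training data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def batch_gradient(w, xs, ys):
    """Gradient of the MSE cost, averaged over the ENTIRE dataset."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


w = 0.0               # initial guess
learning_rate = 0.05
for epoch in range(200):
    # One parameter update per pass over the full dataset.
    w -= learning_rate * batch_gradient(w, xs, ys)
```

Each update touches every training example, which is why the cost per step grows with the size of the dataset.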
Stochastic Gradient Descent
• Stochastic Gradient Descent (SGD) is a variant of the gradient descent
optimization algorithm widely used in machine learning and deep learning.
• Unlike batch gradient descent, which computes the gradient using the entire
dataset, SGD updates the model parameters using a single training example at
a time.
Working Principle:
• Initialization: Start with an initial guess for the model parameters.
• For each epoch:
  ◦ Shuffle the training dataset.
  ◦ Iterate through each training example:
    ◦ Compute the gradient of the cost function with respect to the current
    training example.
    ◦ Update the model parameters using the computed gradient and a
    predefined learning rate.
• Repeat the process for a fixed number of epochs or until convergence criteria
are met.
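The working principle above can be sketched on a made-up dataset where y = 2x, fitting a single weight `w` in the model y = w·x. The data, seed, learning rate, and epoch count are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the shuffling is reproducible

# Hypothetical (x, y) training pairs generated from y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

w = 0.0               # initial guess
learning_rate = 0.01
for epoch in range(100):
    random.shuffle(data)                      # new example order each epoch
    for x, y in data:
        grad = 2.0 * (w * x - y) * x          # gradient from ONE example
        w -= learning_rate * grad             # update immediately
```

There are as many parameter updates per epoch as there are training examples, each one cheap but based on a noisy single-example gradient.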
Mini-batch Gradient Descent
• Mini-batch Gradient Descent (MBGD) strikes a balance between the efficiency
of stochastic gradient descent (SGD) and the stability of batch gradient
descent by updating the model parameters using a small subset of the training
data at each iteration.
• Instead of using the entire training dataset (as in batch gradient descent) or
just one example (as in SGD), MBGD divides the dataset into small batches
and updates the parameters based on the average gradient computed over each
batch.
Working Principle:
• Initialization: Start with an initial guess for the model parameters.
• Choose a mini-batch size (e.g., 32, 64, or 128 examples per batch).
• For each epoch:
  ◦ Shuffle the training dataset so that each epoch sees differently composed
  mini-batches; the resulting randomness can also help the model avoid
  getting stuck in local minima.
  ◦ Divide the shuffled dataset into mini-batches of equal size.
  ◦ Iterate through each mini-batch:
    ◦ Compute the gradient of the cost function with respect to the mini-batch.
    ◦ Update the model parameters using the computed gradient and a
    predefined learning rate.
• Repeat the process for a fixed number of epochs or until convergence criteria
are met.
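A minimal sketch of the mini-batch loop on a made-up dataset where y = 2x, fitting a single weight `w` in the model y = w·x. The helper `minibatches`, the batch size of 4, the seed, and the learning rate are assumptions for illustration.

```python
import random


def minibatches(data, batch_size):
    """Yield consecutive slices of the (already shuffled) data."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def batch_gradient(w, batch):
    """Average MSE gradient over the examples in one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


random.seed(0)  # fixed seed so the shuffling is reproducible

# Hypothetical (x, y) pairs generated from y = 2x for x = 1..8.
data = [(float(x), 2.0 * x) for x in range(1, 9)]

w = 0.0               # initial guess
learning_rate = 0.01
for epoch in range(200):
    random.shuffle(data)
    for batch in minibatches(data, batch_size=4):
        w -= learning_rate * batch_gradient(w, batch)
```

With 8 examples and a batch size of 4, each epoch performs two updates, sitting between the single update of batch gradient descent and the eight updates of SGD.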
Advantages:
• Offers a good compromise between the efficiency of SGD and the
stability of batch gradient descent.
• Well-suited for training on large datasets that do not fit into memory.
Limitation:
• Requires tuning of hyperparameters such as the learning rate and batch
size.
Advanced Optimizers
Optimizers :
• Optimizers adjust model parameters iteratively during training to
minimize a loss function, enabling neural networks to learn from data.
• Choosing an appropriate optimizer for a deep learning model is important
as it can greatly impact its performance. Optimization algorithms have
different strengths and weaknesses and are better suited for certain
problems and architectures.
• Some advanced optimizers used in neural networks:
1. SGD with Momentum
2. Nesterov Accelerated Gradient (NAG)
3. AdaGrad (Adaptive Gradient)
4. Gradient Descent with RMSprop (Root Mean Square Propagation)
5. Adam (Adaptive Moment Estimation)
SGD with Momentum
• Momentum optimization is a popular variant of the gradient descent
optimization algorithm commonly used to train neural networks. It
addresses some of the limitations of basic gradient descent, particularly
slow convergence in the presence of flat or small gradients and
oscillations in the optimization process.

• In momentum optimization, instead of updating the weights based solely
on the current gradient, the optimizer also considers the accumulation of past
gradients to determine the direction of the update.
• This is achieved by introducing a new parameter called the momentum
parameter, denoted by β, which is typically set to a value between 0 and 1.
• The momentum term Vt is an exponentially weighted moving average of
past gradients. (An exponentially weighted moving average smooths time-series
data by assigning exponentially decreasing weights to older observations; it is
widely used for trend analysis, noise reduction, and forecasting.) The momentum
term accelerates the updates in directions where the gradients point
consistently over time and dampens oscillations in directions where the
gradients change direction frequently.
• By incorporating momentum, the optimizer gains inertia, enabling it to
continue moving in the same direction for a longer time and traverse
regions of flat or small gradients more efficiently. This leads to
faster convergence and reduced oscillations during training.
• In short, momentum optimization accelerates gradient descent by
introducing a momentum term that accumulates past gradients, helping
the optimizer navigate complex optimization landscapes more
effectively. It is widely used in practice due to its ability to speed up
training and improve convergence for deep learning models.
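A minimal sketch of momentum, assuming the exponentially weighted form V_t = β·V_{t−1} + (1 − β)·g_t followed by θ ← θ − α·V_t, where g_t is the current gradient and α the learning rate. The one-parameter cost (theta − 3)², the hyperparameter values, and the function names are illustrative assumptions; other texts use slightly different but related update rules.

```python
def gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def sgd_momentum(theta, learning_rate=0.1, beta=0.9, steps=500):
    velocity = 0.0  # V_0: no accumulated gradient yet
    for _ in range(steps):
        grad = gradient(theta)
        # Exponentially weighted moving average of past gradients.
        velocity = beta * velocity + (1.0 - beta) * grad
        # Step along the smoothed direction instead of the raw gradient.
        theta = theta - learning_rate * velocity
    return theta


theta_opt = sgd_momentum(theta=0.0)
```

Setting `beta=0.0` makes the velocity equal to the current gradient, which recovers plain gradient descent.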
Nesterov Accelerated Gradient (NAG)
• Nesterov Accelerated Gradient (NAG) is an optimization algorithm that
builds upon the momentum optimization method. It aims to improve upon
the original momentum approach by addressing the issue of momentum
overshooting, which can occur when the current gradient update is
combined with the accumulated momentum term.

• In Nesterov Accelerated Gradient, instead of evaluating the gradient at the
current position of the parameters, the optimizer evaluates the gradient at an
adjusted position that takes into account the momentum term. This adjustment
is made to anticipate the future position of the parameters based on the
momentum.
• The key difference between Nesterov Accelerated Gradient and traditional
momentum optimization is that the gradient is evaluated at the
"lookahead" position, which anticipates the future position of
the parameters before updating them with the momentum term. This
allows NAG to correct the momentum overshooting problem by
incorporating a more accurate gradient estimate.

• By incorporating Nesterov momentum, the optimizer can adjust the
momentum term more effectively and reduce oscillations, leading to faster
convergence and improved optimization performance compared to
traditional momentum optimization.
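A minimal sketch of the lookahead idea, assuming one common formulation of NAG: the gradient is taken at θ − β·v, then v ← β·v + α·g and θ ← θ − v. This particular form, the one-parameter cost (theta − 3)², and the hyperparameter values are assumptions for illustration, and the scaling of the velocity differs from the exponentially weighted form sometimes used for plain momentum.

```python
def gradient(theta):
    """Gradient of the hypothetical cost (theta - 3)^2."""
    return 2.0 * (theta - 3.0)


def nesterov(theta, learning_rate=0.1, beta=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        # Look ahead to where the accumulated momentum is about to carry us.
        lookahead = theta - beta * velocity
        # Evaluate the gradient at the lookahead position, not at theta.
        grad = gradient(lookahead)
        velocity = beta * velocity + learning_rate * grad
        theta = theta - velocity
    return theta


theta_opt = nesterov(theta=0.0)
```

The only change from classical momentum in this form is the point at which the gradient is evaluated.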
