Unit 2 Introduction to Deep Learning

Department of

Artificial Intelligence and Data Science


AL3502 - DEEP LEARNING FOR
VISION

Dr. Arthi. A
Professor
Department of Artificial Intelligence and
Data Science
Source:
1. Richard Szeliski, Computer Vision: Algorithms and Applications, 2010.
OBJECTIVES

2. To understand the methods and terminology involved in deep neural networks.
SYLLABUS
UNIT - II
INTRODUCTION TO DEEP LEARNING

Deep Feed-Forward Neural Networks – Gradient


Descent – Back-Propagation and Other Differentiation
Algorithms – Vanishing Gradient Problem – Mitigation
– Rectified Linear Unit (ReLU) – Heuristics for
Avoiding Bad Local Minima – Heuristics for Faster
Training – Nesterov's Accelerated Gradient Descent –
Regularization for Deep Learning – Dropout –
Adversarial Training – Optimization for Training Deep
Models.
Deep Feed Forward Networks
Activation Function
Deep Feed Forward Networks
1.Input Layer
• Image as input (grayscale or RGB)
• Pixels flattened into a 1D array (e.g., 28x28 = 784
pixels)
• Neurons in input layer correspond to each pixel
2.Hidden Layers
• Multiple hidden layers for feature extraction
• First layers detect basic features (e.g., edges)
• Deeper layers extract complex patterns (e.g.,
shapes, textures)
• Activation functions (ReLU, sigmoid) introduce non-
linearity
Deep Feed Forward Networks
3. Output Layer
• Neurons represent possible classes (e.g., "cat",
"dog")
• Softmax activation for classification
• Produces probabilities for each class
4. Training Process
• Forward pass: data flows through the network
• Loss calculation: error between predicted and true
labels
• Backpropagation: weights updated to minimize loss
• Optimization: uses gradient descent or Adam
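As a minimal sketch of such a network, the PyTorch model below follows the 28x28 grayscale example above (784 inputs, 10 classes); the hidden-layer sizes are illustrative choices, not fixed by the slides.

import torch.nn as nn

# Minimal deep feed-forward classifier (sizes are illustrative).
model = nn.Sequential(
    nn.Flatten(),          # flatten the 28x28 image into a 784-vector
    nn.Linear(784, 256),   # first hidden layer: basic features
    nn.ReLU(),             # non-linearity
    nn.Linear(256, 128),   # deeper hidden layer: complex patterns
    nn.ReLU(),
    nn.Linear(128, 10),    # output layer: one logit per class
)
# Softmax is usually folded into the loss (nn.CrossEntropyLoss),
# so the network itself outputs raw logits.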
Deep Feed Forward Networks
Training Phase
1. Initialize Weights and Biases
• Randomly initialize weights and biases
• Each neuron has associated weights and biases
2. Forward Pass
• Input image is flattened into a vector
• Pass through each hidden layer (linear transformation + activation)
• Hidden layers extract features (e.g., edges, shapes)
• Output layer produces probabilities for each class
(e.g., cat or dog)
Deep Feed Forward Networks
3. Loss Calculation
• Compare predicted output with true label
• Use cross-entropy loss for classification tasks
• Higher loss indicates a larger error
4. Backpropagation
• Compute the gradients of the loss w.r.t. weights
• Gradients propagate from the output layer back to
the input layer
• Determine how to adjust weights to reduce the loss
Deep Feed Forward Networks
5. New Weight Update
• Use optimization algorithms like Stochastic
Gradient Descent (SGD) or Adam
• Update weights based on gradients
• Learning rate controls the size of weight updates
6. Repeat for Multiple Epochs
• Process entire dataset through multiple epochs
• Use mini-batch gradient descent for faster updates
• Iterate until the model converges
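Steps 1–6 map directly onto a few lines of PyTorch; the sketch below assumes the model defined earlier and a train_loader that yields (image, label) mini-batches.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                         # loss calculation
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate

for epoch in range(10):                  # repeat for multiple epochs
    for images, labels in train_loader:  # mini-batch gradient descent
        logits = model(images)           # forward pass
        loss = criterion(logits, labels) # error vs. true labels
        optimizer.zero_grad()
        loss.backward()                  # backpropagation: compute gradients
        optimizer.step()                 # weight update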
Deep Feed Forward Networks
7. Monitoring Performance
• Track training and validation loss
• Use validation data to check generalization
performance
• Early stopping to prevent overfitting
8. Regularization
• Dropout: Randomly drop neurons to prevent
overfitting
• L2 Regularization: Add penalty for large weights to
simplify the model
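In PyTorch, both regularizers are one-liners; the sketch below is illustrative (the dropout rate and penalty strength shown are typical defaults, not prescribed values).

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly drop half the activations during training
)
# weight_decay adds an L2 penalty on the weights to the update rule
optimizer = torch.optim.SGD(block.parameters(), lr=0.01, weight_decay=1e-4)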
Deep Feed Forward Networks
5. Inference (Prediction)
• New image input passed through the network
• Features extracted in hidden layers
• Output layer produces predicted label (highest
probability)
6. Challenges
• Overfitting: risk of fitting noise in data
• Regularization: techniques like dropout, L2
regularization
• Data Augmentation: enhances model robustness
with varied inputs
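Returning to the inference step (5) above, a minimal prediction sketch, assuming the trained model from earlier and a preprocessed input tensor named image:

import torch

model.eval()                 # switch off dropout etc.
with torch.no_grad():        # no gradients needed at inference
    logits = model(image)
    probs = torch.softmax(logits, dim=1)  # class probabilities
    predicted = probs.argmax(dim=1)       # label with highest probability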
Deep Feed Forward Networks
7. Real-World Applications
• Image classification (e.g., cat vs dog, handwritten
digit recognition)
• Object detection, facial recognition, and medical
image analysis
Gradient based Optimization
• Most deep learning algorithms involve optimization
of some sort.
• Optimization refers to the task of either
minimizing or maximizing some function f (x) by
altering x.
Objective Function
• The function we want to minimize or maximize is
called the objective function or criterion.
• It quantifies how well the model's predictions
match the actual outcomes.
Gradient based Optimization
• We often denote the value that minimizes or
maximizes a function with a superscript ∗. For
example, we might say x∗ = arg min f(x).
• Most optimization problems are framed as
minimization problems.
• If a problem is about maximization, we can
convert it to a minimization problem by
minimizing the negative of the objective
function.
• When we are minimizing it, we may also call it the
cost function, loss function, or error function.
Gradient based Optimization
• Suppose we have a function y = f (x), where both x
and y are real numbers.
• The derivative of this function is denoted as f’(x)
or as dy/dx.
• The derivative f’(x) gives the slope of f (x) at the
point x.
• It shows the rate of change of the function's value with respect to changes in x.
• In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output.
Gradient based Optimization
• This is an iterative optimization technique where
we update the variable x in the direction opposite
to the gradient of the objective function.
• This helps in reducing the value of the function. The update rule is
x ← x − α · f′(x)
where α is a small step size or learning rate.
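A tiny worked example of this update rule on f(x) = x², where f′(x) = 2x (the starting point and learning rate are illustrative):

# Gradient descent on f(x) = x^2, with derivative f'(x) = 2x.
x = 5.0
alpha = 0.1                 # learning rate
for step in range(100):
    grad = 2 * x            # f'(x)
    x = x - alpha * grad    # x <- x - alpha * f'(x)
print(x)                    # approaches the minimum at x = 0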
Gradient based Optimization
Figure: an illustration of how the derivatives of a function can be used to follow the function downhill to a minimum (gradient descent).
Gradient based Optimization
• The derivative is therefore useful for minimizing a
function because it tells us how to change x in order
to make a small improvement in y.
• For example, we know that f(x − ϵ · sign(f′(x))) is less than f(x) for small enough ϵ.
• We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative.
Gradient based Optimization
• When f′(x) = 0, the derivative provides no information about which direction to move.
• Points where f′(x) = 0 are known as critical points or stationary points.
• A local minimum is a point where f (x) is lower than
at all neighboring points, so it is no longer possible
to decrease f(x) by making infinitesimal steps.
• A local maximum is a point where f(x) is higher than at all neighboring points, so it is no longer possible to increase f(x) by making infinitesimal steps.
Gradient based Optimization
Local Minimum:
• A point where the function value is lower than at all
neighboring points.
• It's a point where we can't decrease the function value
by making infinitesimal changes.
Local Maximum:
• A point where the function value is higher than at all
neighboring points.
• It's a point where we can't increase the function value
by making infinitesimal changes.
Saddle Point:
• A critical point that is neither a local minimum nor a
local maximum.
• The function might have a higher value in one
direction and a lower value in another direction,
resembling a saddle.
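A standard worked example of a saddle point is f(x, y) = x² − y². Its gradient ∇f = (2x, −2y) vanishes at (0, 0), so the origin is a critical point; but along the x-axis f(x, 0) = x² ≥ 0 (it looks like a minimum), while along the y-axis f(0, y) = −y² ≤ 0 (it looks like a maximum), so the origin is neither a minimum nor a maximum.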
Gradient based Optimization
• A point that attains the absolute lowest value of f(x) is a global minimum. There may be only one global minimum or multiple global minima of the function.
• It is also possible for there to be local minima that are
not globally optimal.
• In the context of deep learning, we optimize functions
that may have many local minima that are not optimal,
and many saddle points surrounded by very flat
regions.
• All of this makes optimization very difficult, especially
when the input to the function is multidimensional.
We therefore usually settle for finding a value of f that
is very low, but not necessarily minimal in any formal
sense.
Gradient based Optimization

Figure: minimum, maximum, and saddle point of a function.
Back-Propagation
• After a neural network is defined with initial weights,
and a forward pass is performed to generate the
initial prediction,
• there is an error function which defines how far
away the model is from the true prediction.
• There are many possible algorithms that can
minimize the error function—for example, one could
do a brute force search to find the weights that
generate the smallest error.
• However, for large neural networks, a training
algorithm is needed that is very computationally
efficient.
• Backpropagation is that algorithm—it can discover
the optimal weights relatively quickly, even for a
network with millions of weights.
Back-Propagation
Training algorithm of BPNN:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually selected randomly.
3. Calculate the output for every neuron from the
input layer, to the hidden layers, to the output layer.
4. Calculate the error in the outputs
Error = Actual Output − Desired Output
5. Travel back from the output layer to the hidden layers to adjust the weights such that the error is decreased.
6. Keep repeating the process until the desired output is achieved.
Back-Propagation

Architecture of back propagation network:


As shown in the diagram, the architecture of a BPN has three interconnected layers with weights on their connections. The hidden layer and the output layer also have bias units, whose activation is always 1, with trainable bias weights. As the diagram makes clear, a BPN works in two phases: one phase sends the signal forward from the input layer to the output layer, and the other phase back-propagates the error from the output layer to the input layer.
Back-Propagation
1. Forward pass — weights are initialized and inputs from the training set are fed into the network. The forward pass is carried out and the model generates its initial prediction.
2. Error function — the error function is computed by
checking how far away the prediction is from the
known true value.
3. Backpropagation with gradient descent — the backpropagation algorithm calculates how much the output values are affected by each of the weights in the model. To do this, it calculates partial derivatives, going back from the error function to a specific neuron and its weight.
Back-Propagation
• This provides complete traceability from total errors,
back to a specific weight which contributed to that
error. The result of backpropagation is a set of
weights that minimize the error function.
4. Weight update — weights can be updated after every sample in the training set, but this is usually not practical. Typically, a batch of samples is run in one big forward pass, and backpropagation is then performed on the aggregate result.
• The batch size and number of batches used in
training, called iterations, are important
hyperparameters that are tuned to get the best
results. Running the entire training set through the
backpropagation process is called an epoch.
Vanishing Gradient
• Neural networks are trained using backpropagation and gradient-based learning methods.
• During training, we want to reach the optimum values of the weights, resulting in minimum loss.
• Each weight is constantly updated during the training of the algorithm.
• The update is proportional to the partial derivative of the error function with respect to the current weight in each training iteration.
• However, sometimes this update becomes too small, and hence the weight does not get updated.
• This results in very little or practically no training of the network, which is referred to as the vanishing gradient problem.
Vanishing Gradient
• As shown in the figure, the sigmoid function can face the vanishing gradient problem, while with ReLU or Leaky ReLU vanishing gradients are not an issue.
Back-Propagation
• The backpropagation algorithm is a fundamental
concept in training artificial neural networks,
including deep learning models.
• It is used to adjust the network's weights and
biases during the training process to minimize the
error between the predicted and actual outputs.
1.Forward Propagation:
• The process begins with forward propagation.
• Input data is passed through the neural network to
compute the predicted outputs.
• Each neuron in the network calculates a weighted
sum of its inputs and applies an activation function to
produce an output.
Back-Propagation

2.Loss Function:
• A loss function (also known as a cost function or
error function) is used to quantify the error between
the predicted outputs and the actual target values.
• Common loss functions include Mean Squared Error
(MSE) for regression tasks and cross-entropy for
classification tasks.
Back-Propagation

3.Backpropagation:
• The core of the backpropagation algorithm involves
calculating the gradients of the loss function with
respect to the network's parameters, primarily the
weights and biases.
• The gradients represent the sensitivity of the loss to
changes in the parameters. They indicate how much
the loss would change if the parameters were
adjusted.
Back-Propagation

4.Gradient Descent:
• The computed gradients are used to update the
network's weights and biases.
• A common optimization algorithm used with
backpropagation is gradient descent.
• Gradient descent adjusts the weights and biases in
the direction that reduces the loss, allowing the
network to learn from its mistakes.
Back-Propagation
5.Iterative Process:
• The forward propagation, loss calculation, gradient
computation, and weight updates are performed
iteratively for a specified number of epochs or until
convergence.
• During training, the network gradually improves its
ability to make accurate predictions and minimize
the loss.
6.Mini-Batches:
• To improve efficiency, training is often performed
using mini-batches of data rather than the entire
dataset. This approach reduces the computational
load and can lead to faster convergence.
Back-Propagation
7.Activation Functions:
• In deep learning, various activation functions are
used within neural network layers, such as ReLU
(Rectified Linear Unit), sigmoid, and tanh. These
functions introduce non-linearity, which is essential
for the network's ability to learn complex patterns.
8.Backpropagation Through Layers:
• Backpropagation works by computing gradients
layer by layer, starting from the output layer and
moving backward through the hidden layers.
• The chain rule from calculus is used to efficiently
calculate the gradients for each layer.
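In symbols (a standard textbook formulation, with z^(l) the pre-activations of layer l, a^(l) = f(z^(l)) its activations, and δ^(l) its error signal), the chain rule gives:

δ^(L) = ∇_a Loss ⊙ f′(z^(L))                 (output layer)
δ^(l) = ((W^(l+1))ᵀ δ^(l+1)) ⊙ f′(z^(l))     (moving backward through hidden layers)
∂Loss/∂W^(l) = δ^(l) (a^(l−1))ᵀ,  ∂Loss/∂b^(l) = δ^(l)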
Back-Propagation
9.Regularization Techniques:
• To prevent overfitting, regularization techniques like
dropout and weight decay are often employed during
training.

• The backpropagation algorithm is a key component of deep learning, enabling neural networks to learn from data, make predictions, and adapt their parameters to minimize errors. It has been instrumental in the success of various deep learning architectures, such as convolutional neural networks (CNNs) for image processing and recurrent neural networks (RNNs) for sequential data.
Back-Propagation
• Using the Back propagation network, find the new
weights for the net shown below. It is presented with
the input pattern [0,1] and the target output 1. Use a
learning rate of 0.25 and binary sigmoidal activation
function
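Since the figure with the initial weights is not reproduced here, the NumPy sketch below assumes illustrative starting values and carries out one full forward pass, error computation, and weight update for this 2-2-1 binary-sigmoid network:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.0, 1.0])        # input pattern [0, 1]
t = 1.0                         # target output
lr = 0.25                       # learning rate

W1 = np.array([[0.6, -0.1],     # input -> hidden weights (assumed)
               [-0.3, 0.4]])
b1 = np.array([0.3, 0.5])       # hidden bias weights (assumed)
W2 = np.array([0.4, 0.1])       # hidden -> output weights (assumed)
b2 = -0.2                       # output bias weight (assumed)

# Forward pass
h = sigmoid(W1 @ x + b1)        # hidden activations
y = sigmoid(W2 @ h + b2)        # network output

# Backward pass; the binary sigmoid has f'(z) = f(z) * (1 - f(z))
delta_out = (t - y) * y * (1 - y)           # output error term
delta_hid = delta_out * W2 * h * (1 - h)    # hidden error terms

# Weight updates: w_new = w_old + lr * delta * input
W2 = W2 + lr * delta_out * h
b2 = b2 + lr * delta_out
W1 = W1 + lr * np.outer(delta_hid, x)
b1 = b1 + lr * delta_hid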
Activation: ReLU

• ReLU (Rectified Linear Unit): takes a real-valued number and thresholds it at zero:
f(x) = max(0, x), applied elementwise, so it maps ℝⁿ → ℝ₊ⁿ.
• Most modern deep NNs use ReLU activations.
• ReLU is fast to compute
o Compared to sigmoid, tanh
o Simply threshold a matrix at zero
• Accelerates the convergence of gradient descent
o Due to its linear, non-saturating form
• Helps prevent the vanishing gradient problem
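A minimal NumPy sketch of ReLU and its derivative:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)      # elementwise threshold at zero

def relu_grad(x):
    return (x > 0).astype(float)   # gradient: 1 for positive inputs, else 0

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(z))        # [0.  0.  0.  1.5]
print(relu_grad(z))   # [0. 0. 0. 1.]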
Heuristics for Avoiding Bad Local Minima

• Local minima in optimization are points where a function's value is lower than at neighboring points but not necessarily the lowest across the entire domain.
• In machine learning, especially when training deep neural networks, models can sometimes get stuck in these local minima, leading to suboptimal performance.
Heuristics for Avoiding Bad Local Minima
• To avoid bad local minima, various heuristics and techniques are employed:
1. Random Initialization of Weights
2. Momentum-Based Gradient Descent
3. Stochastic Gradient Descent (SGD)
4. Regularization Techniques (L2, Dropout)
5. Ensemble Methods
6. Batch Normalization
7. Adaptive Optimization Algorithms (Adam, RMSprop)
8. Simulated Annealing
9. Learning Rate Annealing and Schedulers
Heuristics for Avoiding Bad Local Minima

10.Noise Injection
11.Escape with Perturbation Methods
12.Using Over-Parameterized Networks

Avoiding bad local minima in deep learning is crucial to achieving good performance. Techniques such as random initialization, momentum, adaptive optimizers, batch normalization, and noise injection provide powerful tools to escape or mitigate the effects of local minima during training. Each of these heuristics addresses a different aspect of the optimization process, helping the model reach a better solution.
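As a concrete instance of heuristic 2 (momentum-based gradient descent), the sketch below runs momentum on f(x) = x²; setting nesterov = True evaluates the gradient at the look-ahead point, which is the idea behind Nesterov's accelerated gradient from the syllabus. The starting point, learning rate, and momentum coefficient are illustrative.

def grad(x):
    return 2 * x            # derivative of f(x) = x^2

x, v = 5.0, 0.0             # position and velocity
lr, mu = 0.1, 0.9           # learning rate and momentum coefficient
nesterov = True
for step in range(200):
    g = grad(x + mu * v) if nesterov else grad(x)  # look-ahead gradient for NAG
    v = mu * v - lr * g     # accumulate velocity
    x = x + v               # take the step
print(x)                    # converges toward the minimum at 0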
Heuristics for Faster Training
Use Pretrained Models
• Fine-tuning a model that is already trained on a similar task saves time compared to training from scratch.
Early Stopping
• Stop training when the performance stops improving
on the validation set to save time.
Weight Initialization
• Properly initialize weights (like He or Xavier
initialization) to avoid slow convergence from poor
starting points.
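For example, PyTorch exposes both schemes directly; the layer sizes below are illustrative:

import torch.nn as nn

relu_layer = nn.Linear(784, 256)
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity='relu')  # He initialization

tanh_layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(tanh_layer.weight)                       # Xavier initialization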
Heuristics for Faster Training
Batch Normalization
• Normalize inputs of each layer to stabilize and speed
up training by allowing higher learning rates.
Adaptive Learning Rate Optimizers (Adam,
RMSprop)
• Automatically adjust the learning rate during training
for faster convergence without needing to fine-tune
manually.
Learning Rate Scheduling
• Start with a high learning rate and reduce it as
training progresses to speed up the early stages and
fine-tune later.
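A minimal sketch with PyTorch's StepLR scheduler (the decay factor and schedule are illustrative; model is assumed from the earlier sketches):

import torch

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # start with a high rate
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one epoch of training with optimizer ...
    scheduler.step()   # decay the learning rate by 10x every 30 epochs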
Heuristics for Faster Training
Data Augmentation
• Generate more data by transforming existing
samples, improving model generalization and
reducing training time.
Parallel Computing (GPUs/TPUs)
• Use GPUs or TPUs to handle computations faster,
especially for large models and datasets.
Use Mini-Batch Gradient Descent
• Instead of using the entire dataset, update the model
with small batches of data to speed up training.
Heuristics for Faster Training
Reduced Precision (Mixed Precision Training)
• Use lower-precision arithmetic (like 16-bit floating-
point) to speed up computation while still
maintaining accuracy.
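A minimal mixed-precision sketch using torch.cuda.amp (requires a CUDA GPU; model, train_loader, criterion, and optimizer are assumed from the earlier sketches):

import torch

scaler = torch.cuda.amp.GradScaler()
for images, labels in train_loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in float16 where safe
        loss = criterion(model(images.cuda()), labels.cuda())
    scaler.scale(loss).backward()     # scale the loss to avoid gradient underflow
    scaler.step(optimizer)
    scaler.update()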

• These heuristics help optimize training time while maintaining or even improving model performance.
Regularization
• A central problem in machine learning is how to
make an algorithm that will perform well not just on
the training data, but also on new inputs. Many
strategies used in machine learning are explicitly
designed to reduce the test error, possibly at the
expense of increased training error. These strategies
are known collectively as regularization.
Regularization
Generalization Error
• Generalization error refers to the difference between
a machine learning model's performance on the
training data and its performance on new, unseen
data. It measures how well the model generalizes to
data it has not been trained on.
• A low generalization error indicates that the model is
not overfitting and performs well on both training
and test data.
• Regularization techniques are often used to reduce
generalization error by preventing overfitting and
improving the model’s ability to handle unseen data.
Regularization

• Regularization in machine learning refers to techniques that reduce a model's generalization error without necessarily reducing its training error, often by introducing constraints or penalties on the model's parameters.
• These methods aim to prevent overfitting, especially
in complex models like deep learning networks,
where the model might otherwise learn irrelevant
details in the training data.
Regularization

• Regularization strategies can involve adding terms to the objective function (like L1 or L2 penalties), imposing soft constraints, or using ensemble methods that combine multiple hypotheses.
• The goal is to strike a balance between bias and
variance, reducing the model's sensitivity to
fluctuations in the training data (variance) without
overly simplifying the model (increasing bias).
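For example, L2 (weight decay) regularization adds a penalty term to the objective:

J̃(w) = J(w) + (λ/2) · ||w||²

so each gradient step both follows the data gradient and shrinks the weights in proportion to λw, discouraging overly large weights without changing the model architecture.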