Deep MLPs
1. The biggest problem in training deep networks is the vanishing gradient problem.
2. We had very little labelled data to train the networks, so they could easily overfit.
3. We had too little compute power, so training took a lot of time.
In the 1980s:
These were the main obstacles. By the time we reached 2010, lots of labelled data had been generated by internet companies.
In 2010:
Modern DL:
I also respect theory.
Dropout layers and regularization:
Deep networks have many weights and therefore overfit easily, so we use dropout layers and regularization.
When we build a random forest, each tree looks at only a small random part of the data, is fully grown, and individually overfits.
The random forest uses this randomization as regularization, which reduces the variance of the model.
The core idea is that randomizing the features acts as regularization.
Can we use randomization as regularization for MLPs?
Dropout layers:
Randomly remove neurons between the input and output layers, independently in each training iteration.
Each neuron is retained with a probability p that lies between 0 and 1; with probability 1 - p it is dropped for that iteration. This is very similar to using a random subset of features in a random forest: because of dropout, the set of inputs feeding a given neuron varies at every iteration.
Dropout randomly deactivates parts of the network, and this acts as regularization.
At test time, the network remains intact (no neurons are dropped) and each weight of the network is multiplied by p.
When given a query point, we therefore use the weights scaled by p to compute the prediction.
If we have many more weights than data points, the chances of overfitting are high, so we keep p small, maybe 0.1 or 0.2. Here p is a hyperparameter, and like other hyperparameters it can be tuned by grid search.
A dropout layer is applied after a layer in the neural network; it randomly passes only a subset of that layer's outputs on to the next layer.
People call such a network a dropout network.
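A minimal NumPy sketch of this dropout scheme (keep each neuron with probability p during training, multiply by p at test time); the layer size, seed, and p value here are just illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                      # keep probability (hyperparameter, assumed value)
h = rng.normal(size=(4, 8))  # activations of some hidden layer: 4 samples, 8 neurons

# Training time: randomly deactivate each neuron with probability 1 - p.
mask = rng.random(h.shape) < p   # True = keep the neuron, False = drop it
h_train = h * mask               # dropped neurons output zero for this iteration

# Test time: keep the full network, but scale by p so the expected input
# to the next layer matches what it saw during training.
h_test = h * p
```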
ReLU (Rectified Linear Unit):
ReLU is the default activation function implemented in many modern networks.
The slope of ReLU is always either 1 (for positive inputs) or 0 (for negative inputs), just like the hinge function in SVMs. The derivative at zero is not defined.
(In the reference plot, the solid line is a model trained using ReLU and the dotted line one trained with tanh.)
Because ReLU largely avoids the vanishing/exploding gradient problem, the network converges faster than with tanh.
There is a smooth approximation to ReLU called the softplus function; its derivative is the logistic function, but it is not widely used. For ReLU itself, computing the derivative is also much simpler.
If z is negative, the derivative becomes zero, which makes the whole chain-rule product zero.
The weights then stop changing whenever the input z stays negative, which is not what we want. This is called the dead activation (dead neuron) state.
The input to the NN is always normalized.
The fix for this problem is to give a small non-zero slope to negative inputs instead of making them exactly zero.
Typically people use ReLU; if we find many dead neurons in the NN, we tend to switch to Leaky ReLU.
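A small NumPy sketch of ReLU and Leaky ReLU and their derivatives; the leak slope of 0.01 is just a commonly used assumption, not something fixed by the notes above:

```python
import numpy as np

def relu(z):
    # max(0, z): slope 1 for z > 0, slope 0 for z < 0 (undefined exactly at 0)
    return np.maximum(0.0, z)

def relu_grad(z):
    # zero gradient for z <= 0: this is what causes dead neurons
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # small slope alpha for negative z, so the gradient never becomes exactly zero
    return np.where(z > 0, z, alpha * z)

def leaky_relu_grad(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), relu_grad(z))
print(leaky_relu(z), leaky_relu_grad(z))
```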
Advantages of ReLU:
Weight Initialization:
For logistic regression trained with SGD, we initialize the weights randomly, drawing them from a Gaussian (normal) distribution.
What can go wrong?
1. We want asymmetry in the NN, so that different neurons learn different things.
In ensembles, the more different the base models are, the better the overall output.
If all weights start with the same value, every neuron learns the same thing, which we do not want.
2. If the weights are initialized to negative values, there is the problem of dead neurons in the case of ReLU and similar activation functions.
Idea - 1:
Uniform initialization:
The weights are drawn from a uniform distribution whose bounds are a function of fan-in and fan-out.
Idea - 2:
This also works fairly well for sigmoid.
Idea - 3:
Xavier/Glorot initialization:
There are two variations of Xavier initialization: the weights are picked from either a normal distribution or a uniform distribution.
Idea - 4:
He - initialization:
This also has normal and uniform variants. It works well with ReLU and Leaky ReLU.
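A sketch of these initializers using their standard formulas (Xavier/Glorot scales by fan-in + fan-out, He scales by fan-in); the layer sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128   # illustrative layer sizes

# Xavier/Glorot: suited to tanh/sigmoid activations.
W_glorot_normal  = rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), (fan_in, fan_out))
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot_uniform = rng.uniform(-limit, limit, (fan_in, fan_out))

# He: suited to ReLU / Leaky ReLU activations.
W_he_normal  = rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out))
limit = np.sqrt(6.0 / fan_in)
W_he_uniform = rng.uniform(-limit, limit, (fan_in, fan_out))
```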
Batch Normalization:
One of the pre-processing steps for a NN is data normalization.
If the input changes slightly, that small change can produce a large change further inside the network.
Example:
Because the input data is normalized, there is little difference between batches at the input, so the lower layers are not affected very much; but the layers deeper in the network are affected much more.
Ideally we want the inputs to every layer to be normalized, since every layer sees a different distribution. Otherwise the activations at each layer can drift wildly; this problem is called internal covariate shift.
The solution is to add a new layer called a Batch Normalization layer: whenever we get a batch of inputs, we normalize using only that batch.
We are explicitly normalizing the inputs to each layer, so it works even for layers deep in the network.
The batch-norm layer sits between two layers. It has two learnable parameters, gamma (scale) and beta (shift), and we learn these parameters as part of backpropagation.
BN also acts as a regularizer, and with BN we can train deeper neural networks.
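A minimal NumPy sketch of the batch-norm forward pass for one mini-batch: normalize with that batch's mean and variance, then scale and shift with the learnable gamma and beta (the backpropagation through this layer is not shown; the toy batch is an assumption):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch_size, num_features) activations coming out of the previous layer
    mu  = x.mean(axis=0)                    # per-feature mean over this mini-batch only
    var = x.var(axis=0)                     # per-feature variance over this mini-batch only
    x_hat = (x - mu) / np.sqrt(var + eps)   # zero-mean, unit-variance activations
    return gamma * x_hat + beta             # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))      # a badly scaled batch of activations
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```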
In the case of deep learning, plain batch and mini-batch SGD do not work very well, so we use other optimizers, designed primarily for deep learning, that help escape regions where the derivative is close to zero.
Because the loss is a non-convex function, depending on the initial weights we can end up in a different minimum.
Gradient descent always moves towards a minimum, using all n points to compute each gradient step.
Using SGD:
If we run lots of iterations of SGD, we can still reach the minimum. However, SGD only estimates the gradient from a small random sample, so each update depends on a noisy derivative; the updates are much noisier than in full gradient descent.
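A toy sketch contrasting the two update rules on a squared loss for a linear model; the data, batch size, and learning rate are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)   # toy data
w, lr = np.zeros(3), 0.01

def grad(Xb, yb, w):
    # gradient of mean squared error for a linear model, on the given (mini)batch
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

# Gradient descent: one update uses all n points (exact but expensive gradient).
w_gd = w - lr * grad(X, y, w)

# Mini-batch SGD: one update uses a small random sample (noisy but cheap gradient).
idx = rng.choice(len(y), size=32, replace=False)
w_sgd = w - lr * grad(X[idx], y[idx], w)
```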
Adagrad:
In SGD the learning rate is set to a small value and is the same for every weight.
Each weight has a different learning rate in Adagrad.
In real datasets, some features are dense and some are sparse, so a single shared learning rate is not ideal.
The accumulated sum of squared gradients keeps growing, so the effective per-weight learning rate (alpha) can become very small; then the weights barely change and convergence becomes very slow. This problem is fixed by the next algorithms.
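A sketch of the Adagrad update for one weight vector: accumulate squared gradients per weight and divide the learning rate by the square root of that accumulator, which also shows why the updates can shrink towards zero over time; the values are illustrative:

```python
import numpy as np

def adagrad_step(w, g, accum, lr=0.01, eps=1e-8):
    # accum keeps the running sum of squared gradients, per weight
    accum = accum + g * g
    # effective learning rate shrinks as accum grows -> updates can stall over time
    w = w - lr * g / (np.sqrt(accum) + eps)
    return w, accum

w, accum = np.zeros(3), np.zeros(3)
for g in [np.array([0.5, 0.0, 1.0]), np.array([0.4, 0.0, 0.9])]:  # toy gradients
    w, accum = adagrad_step(w, g, accum)
```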
Adadelta and RMSProp:
AdaDelta:
Instead of accumulating the sum of all past squared gradients (as in Adagrad), we take an exponentially decaying average. In a nutshell: take an exponentially weighted average of gradient^2, rather than the sum of squares of all past gradients.
Adadelta therefore has faster convergence.
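A sketch of the RMSProp-style update described above: an exponentially decaying average of squared gradients instead of Adagrad's growing sum; the decay value 0.9 is a common default, assumed here:

```python
import numpy as np

def rmsprop_step(w, g, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    # exponentially decaying average of gradient^2 instead of a running sum
    avg_sq = decay * avg_sq + (1.0 - decay) * g * g
    w = w - lr * g / (np.sqrt(avg_sq) + eps)
    return w, avg_sq
```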
Adam:
It is the most popular optimization algorithm.
What if we also store an exponentially weighted average of the gradients themselves, in addition to the average of the squared gradients? That is the idea behind Adam.
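A sketch of the Adam update: an exponentially weighted average of the gradients (first moment) and of the squared gradients (second moment), with the usual bias correction; the beta values are the commonly used defaults:

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g          # EWA of the gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # EWA of the squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```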
Softmax Classifier:
Here y_i belongs to one of K classes, and the predicted probabilities over all K classes sum to 1.
In the case of a softmax classifier, the output of each of the K output neurons is calculated as follows:
Formulation:
P(y_i = k | x_i) = exp(z_k) / sum_j exp(z_j), where z_k is the raw score (logit) for class k.
This satisfies our requirement: the outputs are non-negative and sum to 1.
Softmax is the generalization of logistic regression to the multi-class setting.
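A small sketch of the softmax computation on the raw scores (logits), shifted by the max for numerical stability; the scores are made up:

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the outputs sum to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # raw scores for K = 3 classes
p = softmax(z)                  # class probabilities, largest for class 0
```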
It is very important to always monitor the gradients and to apply gradient clipping when they explode.
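A sketch of gradient clipping by L2 norm; the threshold of 5.0 is an assumed value:

```python
import numpy as np

def clip_by_norm(g, max_norm=5.0):
    # rescale the gradient vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g
```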
Autoencoders: These perform dimensionality reduction using a NN.
For example, suppose we want a three-dimensional representation of a six-dimensional input.
The output we train the network to predict is X itself; if it can reconstruct X well, we can conclude that the middle layer preserves the information in the input.
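A minimal sketch of the 6 -> 3 -> 6 autoencoder described above, assuming TensorFlow/Keras is available; the data here is random and only for illustration:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

inputs  = keras.Input(shape=(6,))
code    = layers.Dense(3, activation="relu")(inputs)    # 3-D bottleneck (the representation)
outputs = layers.Dense(6, activation="linear")(code)    # reconstruct the 6-D input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.rand(1000, 6)                             # made-up data for illustration
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)  # the target is X itself
```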
Autoencoders reference:
https://en.wikipedia.org/wiki/Autoencoder
People intentionally add noise to the input; the network learns the underlying data and ignores the noise, which gives a robust, noise-free (denoising) encoder.
Sparse AE:
We apply a loss function with L1 regularization (typically on the hidden activations); adding this L1 regularization gives a sparse autoencoder.
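A sketch of how the L1 penalty on the bottleneck activations could be added in Keras via an activity regularizer; the penalty weight 1e-4 is an assumption:

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

inputs  = keras.Input(shape=(6,))
# L1 penalty on the hidden activations pushes most of them towards zero (sparsity)
code    = layers.Dense(3, activation="relu",
                       activity_regularizer=regularizers.l1(1e-4))(inputs)
outputs = layers.Dense(6, activation="linear")(code)

sparse_ae = keras.Model(inputs, outputs)
sparse_ae.compile(optimizer="adam", loss="mse")
```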
Autoencoders are used for better, unsupervised feature representations and for extracting the important features in the data.