Deep Learning Notes
AM11 Om Nagvekar


Artificial Neural Network (ANN)

Activation Function:
• Introduces non-linearity into the output of a neuron.
• If all hidden layers in a neural network are linear, the network collapses to a single linear layer unless non-linearity is introduced.
• The primary role of the activation function is to transform the summed weighted input of a node into an output value to be fed to the next hidden layer or used as the output.
• It is generally also applied at the output layer.
• Activation functions: linear, ReLU, sigmoid, tanh, softmax, etc.


Linear activation function:
• For a linear activation function the derivative is constant, so the model does not learn anything useful from it.
• f(x) = x
  f'(x) = 1

Sigmoid activation function:
• Output values lie between 0 and 1.
• f(x) = 1 / (1 + e^(-x))
• In deep learning, the gradient of the sigmoid activation function is used to update the weights and biases of a neural network.


Softmax activation function:
• It is often used as the last activation function of a neural network, to normalize the output of the network into a probability distribution over the predicted output classes.
• Softmax is an activation function that scales numbers/logits into probabilities.
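A minimal NumPy sketch of the activation functions named above (sigmoid, tanh, ReLU, softmax); the function names here are illustrative, not from any particular library.

    import numpy as np

    def sigmoid(x):
        # Squashes values into (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def relu(x):
        # Zero for negative inputs, identity for positive inputs
        return np.maximum(0.0, x)

    def softmax(logits):
        # Subtract the max for numerical stability, then normalize to probabilities
        exps = np.exp(logits - np.max(logits))
        return exps / np.sum(exps)

    x = np.array([-2.0, 0.0, 3.0])
    print(sigmoid(x))    # values between 0 and 1
    print(np.tanh(x))    # values between -1 and 1
    print(relu(x))       # [0. 0. 3.]
    print(softmax(x))    # sums to 1.0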


Loss Function :
• A loss function calculates the error between the actual value and the predicted value, so the model can adjust its weights in the next iteration.
• Gradually, with the help of an optimization algorithm, the model learns to reduce the loss.
• It is used in CNNs, ANNs, RNNs, DNNs, etc.

Cost Function :
• When the loss is computed over multiple training examples, it is called the cost function.
• The cost function is the overall loss over the model's training data.
Regression Loss :
• It is used in regression problems such as linear regression.
• E.g. MSE (Mean Squared Error, also known as L2 loss), Mean Absolute Error (MAE), Mean Bias Error, Epsilon Error.
• MAE is often preferred over MSE because it reduces the impact of outliers on the model, as illustrated in the sketch below.
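A small NumPy sketch comparing MSE and MAE on the same predictions; the sample numbers are made up purely to show the effect of one outlier.

    import numpy as np

    y_true = np.array([3.0, 5.0, 7.0, 100.0])   # last value is an outlier
    y_pred = np.array([2.5, 5.5, 6.5, 10.0])

    mse = np.mean((y_true - y_pred) ** 2)   # L2 loss: squares the outlier's error
    mae = np.mean(np.abs(y_true - y_pred))  # L1 loss: outlier contributes linearly

    print(f"MSE = {mse:.2f}")   # dominated by the outlier
    print(f"MAE = {mae:.2f}")   # much less affected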
Classification Loss :
• Binary Cross-Entropy, Balanced Cross-Entropy, Hinge Loss, Softmax Loss, Active Contour Loss, etc.
• Sigmoid Cross-Entropy (log-likelihood loss) is a sigmoid activation followed by a cross-entropy loss.
• Weighted Cross-Entropy is used when the classes are imbalanced.
• Balanced Cross-Entropy is essentially the same as weighted cross-entropy.
• Categorical Cross-Entropy is also called softmax loss.
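A minimal sketch of binary cross-entropy plus a weighted variant for imbalanced classes, written directly in NumPy (the helper names and pos_weight value are illustrative, not from any framework).

    import numpy as np

    def binary_cross_entropy(y_true, y_prob, eps=1e-12):
        # y_prob are sigmoid outputs in (0, 1); clip to avoid log(0)
        y_prob = np.clip(y_prob, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

    def weighted_cross_entropy(y_true, y_prob, pos_weight, eps=1e-12):
        # pos_weight > 1 boosts the rare positive class (used for imbalanced data)
        y_prob = np.clip(y_prob, eps, 1.0 - eps)
        return -np.mean(pos_weight * y_true * np.log(y_prob)
                        + (1 - y_true) * np.log(1 - y_prob))

    y_true = np.array([1, 0, 0, 0, 1])
    y_prob = np.array([0.8, 0.1, 0.3, 0.2, 0.6])
    print(binary_cross_entropy(y_true, y_prob))
    print(weighted_cross_entropy(y_true, y_prob, pos_weight=3.0))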


How do the matrix dimensions change at each layer?

The weight matrix of layer l has dimensions n^[l] * n^[l-1]
where n^[l] is the number of neurons in layer l
and n^[l-1] is the number of neurons in the previous layer (l-1).
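A quick NumPy check of these shapes for a hypothetical network with 4 inputs, 5 hidden neurons, and 3 outputs (the layer sizes are arbitrary examples).

    import numpy as np

    n = [4, 5, 3]          # n[0] inputs, n[1] hidden neurons, n[2] outputs

    # W[l] has shape (n[l], n[l-1]); b[l] has shape (n[l], 1)
    W1 = np.random.randn(n[1], n[0])
    b1 = np.zeros((n[1], 1))
    W2 = np.random.randn(n[2], n[1])
    b2 = np.zeros((n[2], 1))

    x = np.random.randn(n[0], 1)   # one input column vector
    a1 = np.tanh(W1 @ x + b1)      # shape (5, 1)
    a2 = W2 @ a1 + b2              # shape (3, 1)
    print(W1.shape, a1.shape, W2.shape, a2.shape)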


Gradient Descent:

• How to adjust the weights when the loss is high:
  W = W - alpha * dJ/dW
  where W is the weight,
  alpha is the learning rate,
  dJ/dW is the slope (gradient) of the loss at that point.
• alpha is commonly between 0.1 and 0.001.
• alpha is also called a hyperparameter.
• This update rule is what the stochastic gradient descent (SGD) optimizer uses.
• The main types are SGD, batch gradient descent, and mini-batch gradient descent.
• Batch size is normally chosen as a power of 2 (e.g. 2^8 = 256).
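A bare-bones sketch of the update rule W = W - alpha * dJ/dW, fitting a single weight by gradient descent on a toy squared-error objective (all the numbers here are made up).

    # Toy objective: J(w) = (w * x - y)^2 for one sample
    x, y = 2.0, 8.0           # so the ideal weight is 4.0
    w = 0.0
    alpha = 0.01              # learning rate (hyperparameter)

    for step in range(200):
        pred = w * x
        grad = 2 * (pred - y) * x   # dJ/dw
        w = w - alpha * grad        # gradient descent update
    print(w)   # approaches 4.0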
Chain Rule:
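Backpropagation applies the chain rule layer by layer. A tiny sketch for one neuron with a sigmoid activation and squared-error loss; the variable names are illustrative.

    import numpy as np

    x, y = 1.5, 1.0          # input and target
    w, b = 0.4, 0.1          # parameters

    # Forward pass
    z = w * x + b            # pre-activation
    a = 1 / (1 + np.exp(-z)) # sigmoid activation
    L = (a - y) ** 2         # squared-error loss

    # Backward pass via the chain rule: dL/dw = dL/da * da/dz * dz/dw
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)      # derivative of sigmoid
    dz_dw = x
    dL_dw = dL_da * da_dz * dz_dw
    print(dL_dw)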


Vanishing Gradient problem :


• The vanishing gradient problem is a phenomenon that occurs during the
training of deep neural networks. It happens when the gradients used to
update the network become too small or "vanish" as they are
backpropagated from the output layers to the earlier layers.
• The exploding gradient problem occurs when gradients become very large
during backpropagation. This can lead to a rapid increase in values as they
are propagated backward through the layers.
• Solutions:
  o Reduce the model complexity
  o Careful weight initialization
  o Residual paths (skip connections)
  o Use the ReLU or Leaky ReLU activation function
  o Batch normalization
  o Gradient clipping (see the sketch below)
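A minimal sketch of one of the listed solutions, gradient clipping by global norm, in plain NumPy (the max_norm threshold and helper name are arbitrary choices for illustration).

    import numpy as np

    def clip_by_global_norm(grads, max_norm=1.0):
        # Rescale all gradients together if their combined L2 norm explodes
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            scale = max_norm / (total_norm + 1e-12)
            grads = [g * scale for g in grads]
        return grads

    grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
    clipped = clip_by_global_norm(grads, max_norm=1.0)
    print(clipped)   # same directions, combined norm scaled down to ~1.0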
Hyperparameters :
o Learning rate, number of epochs, optimizer (Adam, SGD, RMSProp), activation function, loss function, number of neurons, number of layers, batch size, train/validation data split, dropout.
Variance And Bias:
Regularization:
Regularization adds an extra penalty component to the loss function and minimizes the combined objective.
Regularization refers to techniques used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting.
1. L2 Regularization:
o L2 regularization is a regularization technique used in deep learning to prevent neural networks from overfitting on training data. It is also known as weight decay or Ridge Regression.
o L2 regularization prevents weights from becoming too large.
o L2 regularization ensures that the important components in the weight
vector are larger than the other components.
o L2 regularization adds the square of the weights to the loss function. This
tends to evenly distribute the importance across all features, reducing the
magnitude of weights and preventing them from growing too large.


Cost function = Loss + λ * ∑ ||w||^2

In the cost function, the penalty term is controlled by lambda (λ). By changing the value of λ we control the strength of the penalty: the higher the penalty, the smaller the magnitude of the coefficients. It shrinks the parameters, so it helps prevent multicollinearity and reduces model complexity by coefficient shrinkage.

For example, let Loss = 0 (considering the two points on the line), λ = 1, w = 1.4.
Then, Cost function = 0 + 1 x (1.4)^2 = 1.96
For Ridge Regression, let us assume Loss = 0.13, λ = 1, w = 0.7.
Then, Cost function = 0.13 + 1 x (0.7)^2 = 0.62
Note : For L1 regularization, Cost function = Loss + λ * ∑ ||w||
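The two worked numbers above can be reproduced with a short sketch; the helper name l2_cost is just illustrative.

    import numpy as np

    def l2_cost(loss, weights, lam):
        # Cost = Loss + lambda * sum(w^2)
        return loss + lam * np.sum(np.square(weights))

    print(l2_cost(0.0,  np.array([1.4]), lam=1.0))   # 0 + 1 * 1.4^2 = 1.96
    print(l2_cost(0.13, np.array([0.7]), lam=1.0))   # 0.13 + 1 * 0.7^2 = 0.62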

Exponentially weighted Moving Average:
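A minimal sketch of the exponentially weighted moving average v_t = beta * v_{t-1} + (1 - beta) * x_t; this running average is the building block used by momentum, RMSProp, and Adam below. The sample sequence and beta value are arbitrary.

    import numpy as np

    def ewma(values, beta=0.9):
        # v_t = beta * v_{t-1} + (1 - beta) * x_t
        v, out = 0.0, []
        for x in values:
            v = beta * v + (1 - beta) * x
            out.append(v)
        return np.array(out)

    noisy = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0])
    print(ewma(noisy, beta=0.9))   # smoothed version of the sequence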


RMSProp :
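A minimal sketch of one RMSProp step, assuming the standard form that keeps a moving average of the squared gradients; the hyperparameter values are common defaults, not taken from the notes.

    import numpy as np

    def rmsprop_step(w, grad, s, alpha=0.001, beta=0.9, eps=1e-8):
        # Moving average of the squared gradient, then scale the step by it
        s = beta * s + (1 - beta) * grad ** 2
        w = w - alpha * grad / (np.sqrt(s) + eps)
        return w, s

    w, s = np.array([1.0, -2.0]), np.zeros(2)
    grad = np.array([0.5, -0.1])
    for _ in range(5):
        w, s = rmsprop_step(w, grad, s)
    print(w)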


Adam Optimizer:

where
alpha is the learning rate,
beta1 controls the momentum term (exponentially weighted average of the gradients dW),
beta2 controls the RMSProp term (exponentially weighted average of the squared gradients dW^2).
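A minimal sketch of one Adam step, combining the momentum average (beta1, over dW) and the RMSProp average (beta2, over dW^2) with bias correction; the default values shown are the commonly used ones, assumed here for illustration.

    import numpy as np

    def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad          # momentum: average of dW
        v = beta2 * v + (1 - beta2) * grad ** 2     # RMSProp: average of dW^2
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
        return w, m, v

    w = np.array([0.5, -1.0])
    m, v = np.zeros(2), np.zeros(2)
    grad = np.array([0.2, -0.3])
    for t in range(1, 6):
        w, m, v = adam_step(w, grad, m, v, t)
    print(w)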


AdaGrad :

There is a chance that after some epochs learning effectively stops, because AdaGrad has no decay factor (beta): the accumulated sum of squared gradients keeps growing, so the effective learning rate shrinks toward zero.
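A minimal sketch of AdaGrad showing why learning can stall: the squared-gradient accumulator only grows, so the effective step size keeps shrinking (the numbers are arbitrary).

    import numpy as np

    def adagrad_step(w, grad, acc, alpha=0.1, eps=1e-8):
        acc = acc + grad ** 2                       # accumulates forever (no decay factor)
        w = w - alpha * grad / (np.sqrt(acc) + eps)
        return w, acc

    w, acc = np.array([1.0]), np.zeros(1)
    grad = np.array([0.5])
    for step in range(1, 6):
        w, acc = adagrad_step(w, grad, acc)
        effective_lr = 0.1 / (np.sqrt(acc) + 1e-8)
        print(step, w, effective_lr)   # the effective learning rate keeps shrinking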

Internal Covariate shift :


• The term internal covariate shift comes from the paper Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.
• We define Internal Covariate Shift as the change in the distribution of
network activations due to the change in network parameters during
training.
• In neural networks, the output of the first layer feeds into the second
layer, the output of the second layer feeds into the third, and so on. When
the parameters of a layer change, so does the distribution of inputs to
subsequent layers.
• These shifts in input distributions can be problematic for neural networks,
especially deep neural networks that could have a large number of layers.
• Batch normalization is a method intended to mitigate internal covariate
shift for neural networks.
• In batch normalization, subtract the batch mean and divide by the square root of the batch variance (i.e. the standard deviation).
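A minimal sketch of the batch-normalization transform described above, with the learnable scale gamma and shift beta included; eps is the usual small constant for numerical stability, and the sample sizes are arbitrary.

    import numpy as np

    def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
        # x: (batch_size, features); normalize each feature over the batch
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance
        return gamma * x_hat + beta               # learnable scale and shift

    x = np.random.randn(32, 4) * 10 + 5           # badly scaled activations
    out = batch_norm(x)
    print(out.mean(axis=0))   # ~0 per feature
    print(out.std(axis=0))    # ~1 per feature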


Convolutional Neural Network (CNN):


• One advantage of a CNN over an ANN is weight sharing, so it requires far fewer weights; in a fully connected ANN, connecting one layer of 100 neurons to the next layer of 100 neurons already requires 100*100 weights.
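A rough parameter count illustrating the weight-sharing point: a fully connected layer from 100 units to 100 units versus a small convolutional layer whose kernel is reused at every spatial position (the channel and kernel sizes here are arbitrary examples).

    # Fully connected: every input unit connects to every output unit
    dense_params = 100 * 100 + 100            # weights + biases = 10100

    # Convolution: one 3x3 kernel per (input-channel, output-channel) pair,
    # reused at every spatial position of the feature map
    in_channels, out_channels, k = 3, 16, 3
    conv_params = out_channels * (in_channels * k * k) + out_channels   # = 448

    print(dense_params, conv_params)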


LeNet-5:

AlexNet :


GoogLeNet:


ResNet:


GAN (Generative Adversarial Network) :

