02 Neural Networks
Deep Neural Networks
[Figure: a fully connected network with inputs x1, x2, x3, hidden units z1, z2, z3, and an output unit z]
Definitions
● Deep neural networks have several layers in their configuration.
● The type of neural network we have been describing thus far is also known as a Multilayer Perceptron (MLP).
● Each neuron is fully connected to all units of the neighbouring layers, so it is also known as a fully connected neural network or dense network.
● Because of the vast number of parameters involved, these models tend to overfit quite easily.
More about activation functions
● Each unit in a neural network has an activation function attached to it.
● The choice of activation depends on factors such as the unit’s location in the neural network (hidden layer vs. output layer).
● Let’s now look further into the most widely used activation functions.
Most widely used activation functions
Sigmoid
● The Sigmoid function is used for making binary predictions on a dataset.
● It is given by:
σ(x) = 1 / (1 + e^(−x))
● And its derivative is:
σ′(x) = σ(x) · (1 − σ(x))
● This function’s range is (0, 1).
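As a quick sanity check, here is a minimal NumPy sketch (function names are ours, not from the slides) that implements the sigmoid and verifies the derivative σ′(x) = σ(x)(1 − σ(x)) against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    # derivative: sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6  # central difference
print(np.allclose(d_sigmoid(x), numeric))                 # True
```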
Hyperbolic tangent tanh
● The tanh function is a popular activation function for neurons in hidden layers.
● It is given by:
tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))
● And its derivative is:
tanh′(x) = 1 − tanh²(x)
● This function’s range is (−1, 1).
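A similar hedged sketch for tanh, which also checks the identity tanh(x) = 2σ(2x) − 1 relating it to the sigmoid:

```python
import numpy as np

def d_tanh(x):
    # derivative: 1 - tanh^2(x)
    return 1.0 - np.tanh(x) ** 2

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.linspace(-3, 3, 13)

# tanh is a rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
# and its derivative matches a central finite difference
numeric = (np.tanh(x + 1e-6) - np.tanh(x - 1e-6)) / 2e-6
print(np.allclose(d_tanh(x), numeric))                        # True
```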
SoftPlus
● As an alternative to tanh, a Softplus function can be used for neurons in hidden layers.
● It is given by:
softplus(x) = ln(1 + e^x)
● And its derivative is:
d/dx softplus(x) = 1 / (1 + e^(−x)) = σ(x), i.e. the sigmoid.
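A small sketch (names are ours) illustrating numerically that the derivative of Softplus is exactly the sigmoid:

```python
import numpy as np

def softplus(x):
    # softplus(x) = ln(1 + e^x), computed in a numerically stable way
    return np.logaddexp(0.0, x)

x = np.linspace(-4, 4, 9)
numeric = (softplus(x + 1e-6) - softplus(x - 1e-6)) / 2e-6  # finite difference
sigmoid = 1.0 / (1.0 + np.exp(-x))                          # analytic derivative
print(np.allclose(numeric, sigmoid))                        # True
```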
ReLU (Rectified Linear Unit)
● The ReLU function is used as an activation for neurons located in hidden layers.
● It is given by:
ReLU(x) = max(0, x)
● And its derivative is:
1 if x > 0, 0 if x < 0 (undefined at x = 0, where 0 is used by convention).
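A minimal sketch for ReLU and its derivative, using the convention that the derivative at x = 0 is taken to be 0:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), element-wise
    return np.maximum(0.0, x)

def d_relu(x):
    # 1 where x > 0, 0 elsewhere (conventional choice at x = 0)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))    # [0.  0.  0.  0.5 2. ]
print(d_relu(x))  # [0. 0. 0. 1. 1.]
```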
Leaky ReLU (Leaky Rectified Linear Unit)
● In addition, the Leaky ReLU function can be used for better stability, since it avoids zero gradients for negative inputs.
● It is given by:
f(x) = x if x > 0, and αx otherwise, with a small slope α (e.g. 0.01)
● And its derivative is:
1 if x > 0, and α otherwise.
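And a sketch for Leaky ReLU, assuming a slope of α = 0.01 for negative inputs (the exact value of α is a hyperparameter and is not fixed by the slides):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # x for x > 0, alpha * x otherwise
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    # 1 for x > 0, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu(x))    # [-0.02  -0.005  0.5    2.   ]
print(d_leaky_relu(x))  # [0.01 0.01 1.   1.  ]
```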
Exercise
● Find the derivatives of the following functions:
Forward propagation
● Forward propagation is the process of passing data from the input layer through to the output of a neural network.
● It involves evaluating the logits and activation functions of every neuron in the network.
● Let us consider a popular example: derive the forward propagation equations for the XOR neural network given by:
Forward propagation
● XOR forward propagation
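The XOR network itself appears only as a figure on the original slide, so the following is a hedged sketch of the forward pass assuming the usual 2-2-1 architecture with sigmoid activations; the hand-picked weights are illustrative, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1        # hidden-layer logits, shape (N, 2)
    A1 = sigmoid(Z1)        # hidden-layer activations
    Z2 = A1 @ W2 + b2       # output logit, shape (N, 1)
    A2 = sigmoid(Z2)        # prediction
    return Z1, A1, Z2, A2

# The four XOR input patterns, one per row
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# Illustrative hand-picked weights that solve XOR (an assumption, not from the slides):
# hidden unit 1 acts like OR, hidden unit 2 like AND, the output like "OR and not AND"
W1 = np.array([[20., 20.], [20., 20.]]);  b1 = np.array([-10., -30.])
W2 = np.array([[20.], [-20.]]);           b2 = np.array([-10.])

_, _, _, y_hat = forward(X, W1, b1, W2, b2)
print(np.round(y_hat.ravel()))  # [0. 1. 1. 0.] -- the XOR truth table
```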
Backpropagation
● Backpropagation (backprop) is the process used to train a neural network.
● It involves updating the values of all of its parameters.
● Backpropagation requires us to calculate the derivatives of the error function with respect to the parameters of every layer.
● Let’s now analyze the XOR case in more detail.
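The step-by-step derivation appears only on the original slides; continuing the forward-pass sketch above (it reuses forward, X, W1, b1, W2, b2 from that block), here is a hedged version of the corresponding gradient computation, assuming a mean squared error loss and variable names of our own:

```python
def backward(X, Y, W2, cache):
    # cache = (Z1, A1, Z2, A2) from forward(); error E = (1/2N) * sum((A2 - Y)^2)
    Z1, A1, Z2, A2 = cache
    N = X.shape[0]

    # Output layer: dE/dZ2 = (A2 - Y)/N * sigmoid'(Z2)
    dZ2 = (A2 - Y) / N * A2 * (1.0 - A2)
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)

    # Hidden layer: propagate the error back through W2 and the sigmoid
    dZ1 = (dZ2 @ W2.T) * A1 * (1.0 - A1)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2

Y = np.array([[0.], [1.], [1.], [0.]])          # XOR targets
cache = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, Y, W2, cache)
# One gradient-descent step would then be, e.g.:  W1 -= lr * dW1, b1 -= lr * db1, ...
```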
Forward propagation (Matrix form)
Backward propagation (Matrix form)
● Where N_b is the size of each batch.
● Batch sizes are typically powers of 2, for instance 16, 32, 64, etc.
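The matrix-form equations themselves are on the original slides; the sketch below only illustrates, under the same two-layer sigmoid assumption as before, how a whole mini-batch of N_b rows is propagated at once and where the 1/N_b averaging enters (shapes and names are ours):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
rng = np.random.default_rng(0)

Nb = 32                               # batch size, typically a power of 2
X = rng.random((Nb, 2))               # mini-batch: one example per row
Y = rng.integers(0, 2, size=(Nb, 1)).astype(float)

W1 = rng.normal(size=(2, 2)); b1 = np.zeros(2)
W2 = rng.normal(size=(2, 1)); b2 = np.zeros(1)

# Forward pass in matrix form: the whole batch is processed at once
A1 = sigmoid(X @ W1 + b1)             # (Nb, 2)
A2 = sigmoid(A1 @ W2 + b2)            # (Nb, 1)

# Backward pass in matrix form: gradients are averaged over the Nb examples,
# which is where the 1/Nb factor in the matrix-form equations comes from
dZ2 = (A2 - Y) * A2 * (1 - A2) / Nb   # (Nb, 1)
dW2 = A1.T @ dZ2                      # (2, 1)
dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)    # (Nb, 2)
dW1 = X.T @ dZ1                       # (2, 2)
print(dW1.shape, dW2.shape)           # (2, 2) (2, 1)
```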
Stochastic Gradient Descent
● Instead of visiting the whole dataset or parts of it, we can randomly sample a point or a batch of points in our dataset and then use it to update the parameters.
● This is exactly what Stochastic Gradient Descent does.
● Stochastic Gradient Descent can be defined as:
θ ← θ − η · ∇_θ J(θ; x^(i), y^(i))
where (x^(i), y^(i)) is the randomly sampled example (or mini-batch) and η is the learning rate.
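A minimal sketch of one such update, assuming a user-supplied gradient function grad(theta, X, Y) (our name, not from the slides):

```python
import numpy as np

def sgd_step(theta, grad, X, Y, lr=0.01, batch_size=32):
    """One SGD step: sample a random mini-batch and step against its gradient."""
    idx = np.random.randint(0, len(X), size=batch_size)  # random sample of indices
    g = grad(theta, X[idx], Y[idx])                      # gradient on the mini-batch only
    return theta - lr * g                                # theta <- theta - eta * g
```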
Comparison of Gradient Descent variants
AdaGrad
● It is an optimization method that controls the learning rate by summing up the squared gradients up to the current iteration:
G_t = Σ_(τ=1..t) g_τ²  (element-wise), with g_t = ∇_θ J(θ_t)
● Therefore, the main formula is:
θ_(t+1) = θ_t − (η / √(G_t + ε)) ⊙ g_t
● Where θ is the set of parameters of the neural network and ε is a small constant added for numerical stability.
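A hedged sketch of the AdaGrad update for a parameter vector theta (variable names are ours):

```python
import numpy as np

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients in G and scale each step element-wise."""
    G = G + g ** 2                                 # running sum of squared gradients
    theta = theta - lr * g / (np.sqrt(G) + eps)    # per-parameter effective learning rate
    return theta, G
```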
RMSProp
● Adadelta, which we covered before, and RMSProp were developed independently; RMSProp is essentially the same as Adadelta, but with a predefined decay (momentum-like) value:
E[g²]_t = 0.9 · E[g²]_(t−1) + 0.1 · g_t²
θ_(t+1) = θ_t − (η / √(E[g²]_t + ε)) · g_t
● And the learning rate η is to be chosen carefully. A value of 0.001 would be considered good.
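A corresponding sketch of the RMSProp update with the defaults mentioned above (γ = 0.9, η = 0.001):

```python
import numpy as np

def rmsprop_step(theta, g, Eg2, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSProp: exponentially decaying average of squared gradients."""
    Eg2 = gamma * Eg2 + (1.0 - gamma) * g ** 2
    theta = theta - lr * g / np.sqrt(Eg2 + eps)
    return theta, Eg2
```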
Adaptive Moment Estimation (Adam)
● Besides storing a decaying average of past squared gradients, v_t, Adam also keeps a decaying average of past gradients, m_t, given by:
m_t = β1 · m_(t−1) + (1 − β1) · g_t
v_t = β2 · v_(t−1) + (1 − β2) · g_t²
● Where m_t and v_t are the estimates of the first and second moment of the gradients (mean and uncentered variance). These are typically initialized as zero vectors. To counteract the resulting bias towards zero, the following bias-corrected estimates are defined:
m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)
and the parameter update becomes θ_(t+1) = θ_t − η · m̂_t / (√(v̂_t) + ε).
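A hedged sketch of the Adam update using the defaults from the original paper (β1 = 0.9, β2 = 0.999, ε = 1e-8); t is the 1-based step counter used for the bias correction:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: decaying averages of gradients (m) and squared gradients (v)."""
    m = beta1 * m + (1.0 - beta1) * g              # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2         # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                 # bias-corrected (t is 1-based)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```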
Comparison of optimization methods
Miscellaneous characteristics: Dropout
Miscellaneous characteristics: Batch normalization