
Unit 2

Deep Neural
Networks

Oscar Contreras Carrasco


UNIVALLE
2024
Contents

General definitions

Feed-forward neural network architecture

Activation functions

Neural network training

Optimization methods

Special layers
Definitions

At this point, we have learned about linear models and their characteristics, so we are ready to delve into neural networks.

First of all, we will provide some motivation regarding the current state of the art in neural networks.

Let’s get started
General structure of a deep neural network

A deep neural network is composed of several hidden layers, each of which contains several units.

We can depict a deep neural network in this way:

[Diagram: a feed-forward network with inputs x1, x2, x3, hidden units z1, z2, z3, and a single output z]
Definitions

Deep neural networks have several layers in their configuration

The type of neural network we have been describing thus far is also
known as Multilayer Perceptron

Each neuron is fully connected to every unit in the adjacent layers, so this architecture is also known as a fully connected neural network or dense network.

Because of the vast number of parameters involved, these models
tend to overfit quite easily.
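To make the last point concrete, here is a minimal Python sketch (the layer sizes are made up purely for illustration) that counts the weights and biases of a small fully connected network:

```python
# Hypothetical layer sizes, chosen only for illustration.
layer_sizes = [784, 512, 256, 10]  # input, two hidden layers, output

total_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    weights = n_in * n_out   # one weight per connection (fully connected)
    biases = n_out           # one bias per unit
    total_params += weights + biases

print(total_params)  # 535,818 parameters for this small example
```

Even this modest network has over half a million parameters, which helps explain why dense networks overfit so easily.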
More about activation functions

Each unit in a neural network has an activation function attached to
it.

The choice of activation function depends on factors such as:

Location in the neural network (hidden layer vs. output layer)

Let’s now look further into activation functions
Most widely used activation functions
Sigmoid

The Sigmoid function is used for making binary predictions on a dataset.

It is given by:

σ(x) = 1 / (1 + e^(−x))

And its derivative is:

σ'(x) = σ(x) (1 − σ(x))

This function’s range is (0, 1)
Hyperbolic tangent (tanh)

The tanh function is a popular activation function for neurons in
hidden layers

It is given by:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

And its derivative is:

tanh'(x) = 1 − tanh²(x)

This function’s range is (−1, 1)
SoftPlus

As an alternative to tanh, the Softplus function can be used for neurons in hidden layers.

It is given by:

f(x) = ln(1 + e^x)

And its derivative is:

f'(x) = 1 / (1 + e^(−x)) = σ(x), i.e. the Sigmoid function
ReLU (Rectified Linear Unit)

The ReLU function is used as an activation for neurons located in
hidden layers

It is given by:

f(x) = max(0, x)

And its derivative is:

f'(x) = 1 if x > 0, and 0 if x < 0 (conventionally taken as 0 at x = 0)
Leaky ReLU (Leaky Rectified Linear Unit)

In addition, the Leaky ReLU function can be used to avoid the dying-ReLU problem, in which units get stuck with zero output and zero gradient.

It is given by:

f(x) = x if x > 0, and αx otherwise, with a small slope α (e.g. 0.01)

And its derivative is:

f'(x) = 1 if x > 0, and α otherwise
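As a quick reference, here is a minimal NumPy sketch of the activation functions above and their derivatives (the Leaky ReLU slope of 0.01 is an assumed default, not taken from the slides):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2         # tanh itself is np.tanh

def softplus(x):
    return np.log1p(np.exp(x))

def d_softplus(x):
    return sigmoid(x)                     # the derivative of Softplus is the Sigmoid

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return np.where(x > 0, 1.0, 0.0)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)
```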
Exercise

Find the derivatives of the following functions:
Forward propagation

Forward propagation is the process of passing data from the input layer through to the output of a neural network.

It involves evaluating the logits and activation functions of every neuron in the network.

Let us consider a popular example: Derive the forward propagation equations
for the XOR neural network given by:
Forward propagation

XOR forward propagation
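The network diagram is not reproduced above, so the following is a minimal NumPy sketch of XOR forward propagation under an assumed 2-2-1 architecture (two inputs, two hidden sigmoid units, one sigmoid output); the weights are random placeholders rather than values from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed 2-2-1 architecture with illustrative (random) parameters.
W1 = np.random.randn(2, 2)   # input-to-hidden weights
b1 = np.zeros((1, 2))        # hidden biases
W2 = np.random.randn(2, 1)   # hidden-to-output weights
b2 = np.zeros((1, 1))        # output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # the four XOR inputs

def forward(X):
    z1 = X @ W1 + b1         # hidden-layer logits
    a1 = sigmoid(z1)         # hidden-layer activations
    z2 = a1 @ W2 + b2        # output logit
    a2 = sigmoid(z2)         # predicted probability
    return z1, a1, z2, a2

_, _, _, y_hat = forward(X)
print(y_hat.ravel())
```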
Backpropagation

Backpropagation is the procedure used to train a neural network.

It involves updating the values of all of its parameters.

Backpropagation requires us to calculate the derivatives of the error function with respect to the parameters of every layer.

Let’s now analyze the XOR case in more detail
Backpropagation
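Continuing the forward-propagation sketch above (same variables), and assuming a cross-entropy loss with sigmoid activations, since those choices are not stated on the slides, a minimal backpropagation loop for the XOR network could look like this:

```python
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR targets
alpha = 0.5                                      # assumed learning rate
N = X.shape[0]

for step in range(5000):
    z1, a1, z2, a2 = forward(X)

    # Output layer: with a sigmoid output and cross-entropy loss,
    # the error term simplifies to (prediction - target).
    dz2 = (a2 - y) / N
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0, keepdims=True)

    # Hidden layer: propagate the error back through W2 and the sigmoid.
    dz1 = (dz2 @ W2.T) * a1 * (1.0 - a1)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0, keepdims=True)

    # Gradient-descent parameter updates.
    W2 -= alpha * dW2; b2 -= alpha * db2
    W1 -= alpha * dW1; b1 -= alpha * db1
```

Note that the bias gradients are column sums of the error terms, which is where the N × 1 vector of ones u appears in the matrix form of backpropagation.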
Forward propagation (Matrix form)
Backward propagation (Matrix form)

Where u is an N × 1 vector of ones


More on optimization methods

Gradient descent is not an efficient optimization method, especially
when dealing with high volumes of data.

We will now cover some optimization methods of interest we can
use in different scenarios.
Vanilla Gradient Descent

All methods we will describe are derived from Gradient Descent.

Gradient Descent takes all data from the dataset and uses it to
update the parameters.

We used it extensively when working with the linear models covered previously.

Its main expression is:

θ ← θ − α ∇θ J(θ)

Where α is the learning rate and J(θ) is the cost function evaluated over the whole dataset.
Minibatch Gradient Descent

The main difference between Gradient Descent and Minibatch Gradient Descent is that the latter uses small batches of data to update the parameters, instead of loading the whole dataset into memory.

It is defined by:

θ ← θ − α ∇θ J(θ; x(i:i+Nb), y(i:i+Nb))

Where Nb is the size of each batch.

Batch sizes are typically powers of 2, for instance 16, 32, 64, etc.
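A minimal sketch of a minibatch gradient-descent training loop; compute_gradients is a hypothetical user-supplied function returning the gradient of the loss over one batch, and the default values of alpha and batch_size are assumptions:

```python
import numpy as np

def minibatch_gradient_descent(params, X, y, compute_gradients,
                               alpha=0.01, batch_size=32, epochs=10):
    """Update `params` with minibatches of size `batch_size`."""
    N = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(N)          # shuffle once per epoch
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            grads = compute_gradients(params, X[idx], y[idx])
            params = params - alpha * grads      # one gradient-descent step
    return params
```

Setting batch_size equal to the dataset size recovers vanilla Gradient Descent, while batch_size = 1 corresponds to Stochastic Gradient Descent.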
Stochastic Gradient Descent

Instead of visiting the whole dataset or parts of it, we can randomly sample a point (or a small batch of points) from our dataset and then use it to update the parameters.

This is exactly what Stochastic Gradient Descent does.

Stochastic Gradient Descent can be defined as:

θ ← θ − α ∇θ J(θ; x(i), y(i))

Where (x(i), y(i)) is a randomly sampled training example.
Comparison of Gradient Descent variants
AdaGrad

It is an optimization method that adapts the learning rate by summing up the squared gradients up to the current iteration.

Therefore, the main formula is:

G_t = G_(t−1) + g_t²
θ_(t+1) = θ_t − α / √(G_t + ε) · g_t

Where θ is the set of parameters of the neural network, g_t is the gradient at iteration t, and ε is a small constant that prevents division by zero.
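A minimal sketch of a single AdaGrad update, with assumed default values for the learning rate and ε:

```python
import numpy as np

def adagrad_update(theta, grad, G, alpha=0.01, eps=1e-8):
    """One AdaGrad step: accumulate squared gradients in G and use them
    to scale the learning rate for each parameter individually."""
    G = G + grad ** 2                            # running sum of squared gradients
    theta = theta - alpha * grad / np.sqrt(G + eps)
    return theta, G
```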
RMSProp

Adadelta, which we covered before, and RMSProp were developed independently; RMSProp is essentially the same idea, but with a predefined decay rate for the running average of squared gradients:

E[g²]_t = 0.9 E[g²]_(t−1) + 0.1 g_t²
θ_(t+1) = θ_t − α / √(E[g²]_t + ε) · g_t

The learning rate still has to be chosen carefully; a value of 0.001 is considered a good default.
Adaptive Moment Estimation (Adam)

Besides storing a decaying average of past squared gradients, Adam also keeps a decaying average of past gradients, given by:

m_t = β1 · m_(t−1) + (1 − β1) · g_t
v_t = β2 · v_(t−1) + (1 − β2) · g_t²

Where m_t and v_t are the estimates of the first and second moments of the gradients (mean and uncentered variance). These are typically initialized as zero vectors. To counteract the bias towards zero, the following bias-corrected estimates are defined:

m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)

The resulting parameter update is θ_(t+1) = θ_t − α · m̂_t / (√v̂_t + ε).
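A minimal sketch of a single Adam update, using the commonly quoted defaults β1 = 0.9, β2 = 0.999 and ε = 1e-8 (assumed here, not taken from the slides):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, alpha=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step; `t` is the iteration counter starting at 1."""
    m = beta1 * m + (1 - beta1) * grad           # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction, first moment
    v_hat = v / (1 - beta2 ** t)                 # bias correction, second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```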
Comparison of optimization methods
Miscellaneous characteristics: Dropout
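As a brief illustration, here is a minimal sketch of inverted dropout (keep_prob = 0.8 is just an example value): during training, activations are randomly zeroed and the survivors are rescaled so that no rescaling is needed at inference time.

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    during training and rescale the survivors by 1 / keep_prob."""
    if not training:
        return a                                   # dropout is disabled at inference
    mask = np.random.rand(*a.shape) < keep_prob    # Bernoulli keep mask
    return a * mask / keep_prob
```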
Miscellaneous characteristics: Batch normalization
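Similarly, a minimal sketch of the batch normalization forward pass; the running statistics needed at inference time are omitted to keep it short:

```python
import numpy as np

def batch_norm_forward(z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift with the
    learnable parameters gamma and beta."""
    mu = z.mean(axis=0)                    # per-feature batch mean
    var = z.var(axis=0)                    # per-feature batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)
    return gamma * z_hat + beta
```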
