6COM1044 Deep Learning 1

Neural Networks and Deep Learning 1

Dr. Shabnam N. Kadir

University of Hertfordshire

March 16, 2022

References

https://www.deeplearningbook.org/
https://machinelearningmastery.com/inspirational-applications-deep-learning/
https://d2l.ai/

Introduction to TensorFlow

https://www.deeplearningbook.org/
We shall be using Python and TensorFlow for implementation
Jupyter notebooks
https://colab.research.google.com/notebooks/welcome.ipynb
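As a quick orientation, here is a minimal sketch (not part of the original slides) of the kind of TensorFlow code we will run in a Jupyter or Colab notebook; the tensor values are placeholders chosen only for illustration.

import tensorflow as tf

print(tf.__version__)               # confirm TensorFlow is available, e.g. inside Colab

# A toy computation: TensorFlow builds and evaluates tensor operations for us.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
w = tf.constant([[0.5], [-0.5]])
y = tf.matmul(x, w)                 # matrix multiplication, the core operation of a neural-network layer
print(y.numpy())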

Machine Learning Algorithms

An ML algorithm is an algorithm that is able to learn from data.


Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
ML enables us to tackle tasks that are too difficult to solve
with fixed programs written and designed by humans.

What is deep learning?

MIT notes by A. Amini 2019

Recall: Neurons and the Perceptron

Deep Learning

Deep learning is an ML technique that employs deep neural networks.
A deep neural network is a multi-layered neural network that contains two or more hidden layers.
The weights of this neural network need to be adjusted so that a loss/cost/error function is minimised.

Feed-forward Neural Network Architecture

No feedback connections: the outputs of the network are never fed back into it.
Important application: object recognition. Convolutional neural networks (CNNs) (next week's lecture) are a specialized type of FNN inspired by the visual system of the brain.
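For concreteness, a minimal Keras sketch of a feed-forward network with two hidden layers; the input size, layer widths, activations and optimiser below are illustrative assumptions, not part of the slides.

import tensorflow as tf
from tensorflow.keras import layers

# A feed-forward network (no feedback connections) with two hidden layers,
# i.e. "deep" in the sense used on the previous slide.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),            # e.g. a flattened 28x28 image (assumed input size)
    layers.Dense(128, activation="relu"),    # hidden layer 1
    layers.Dense(64, activation="relu"),     # hidden layer 2
    layers.Dense(10, activation="softmax"),  # output layer for a 10-class problem
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()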
Deep Neural Networks

Why now?

The algorithms used to train deep neural networks have been around for decades!

Why now?

Universal Approximation Theorem

Theorem (Hornik 1989; Cybenko 1989):
"A feedforward neural network with a single hidden layer is sufficient to approximate, to arbitrary precision, any continuous function." (Hornik, K., et al., "Multilayer feedforward networks are universal approximators", 1989)

Universal Approximation Theorem:

The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function.
We are not guaranteed, however, that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons.
First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function.
Second, the training algorithm might choose the wrong function as a result of overfitting.
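As an illustration only (a sketch, not from the slides): a network with a single hidden layer can be trained to approximate a simple continuous function such as sin(x). Whether training actually reaches a good approximation depends on the optimiser and the data, exactly as the caveats above warn; the layer width, epochs and function here are arbitrary choices.

import numpy as np
import tensorflow as tf

# Toy demonstration: approximate sin(x) on [-pi, pi] with one hidden layer.
x = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1).astype("float32")
y = np.sin(x)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(50, activation="tanh"),   # single hidden layer
    tf.keras.layers.Dense(1),                       # linear output
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=200, verbose=0)              # training may or may not reach the desired accuracy
print(model.evaluate(x, y, verbose=0))               # final mean squared error on the training points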

The unreasonable effectiveness of deep learning

Quality of a neural network

Expressibility: What class of functions can the neural network express?
Efficiency: How many resources (neurons, parameters, etc.) does the neural network require to approximate a given function?
Learnability: How rapidly can the neural network learn good parameters for approximating a function?

The unreasonable effectiveness of deep learning

To express the same function, deeper neural networks often require exponentially fewer neurons than shallow networks.

The unreasonable effectiveness of deep learning

Activation Functions

We cannot use a step function as the activation function, as we did for the single perceptron.
The composite function produced by the interconnected perceptrons with a discontinuous activation function will also be discontinuous, as will the loss function. A differentiable activation function makes the function computed by a neural network differentiable.
Linear: g(h) = h

Why we need non-linear activation functions

Choice of activation function

The original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, e.g. the sigmoid.
Universal approximation theorems have also been proved for a wider class of activation functions, which includes the rectified linear unit (ReLU) (Leshno et al. 1993).

Leshno et al. 1993

Activation Functions

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function or the hyperbolic tangent:
Sigmoid: g(h) = \frac{1}{1 + e^{-h}}
Tanh: g(h) = \frac{e^{h} - e^{-h}}{e^{h} + e^{-h}}
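A small NumPy sketch of these activation functions (illustrative only; the test values are arbitrary):

import numpy as np

def linear(h):
    return h                              # identity / linear activation

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))       # logistic sigmoid, saturates at 0 and 1

def tanh(h):
    return np.tanh(h)                     # equals (e^h - e^-h) / (e^h + e^-h)

def relu(h):
    return np.maximum(0.0, h)             # rectified linear unit

h = np.array([-2.0, 0.0, 2.0])
print(sigmoid(h), tanh(h), relu(h))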

Choice of Activation Functions

Softmax

Softmax: y_i = \frac{e^{h_i}}{\sum_j e^{h_j}}
Often used in the final layer of a neural network-based classifier.
https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e
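A NumPy sketch of the softmax; subtracting the maximum before exponentiating is a common numerical-stability trick (an implementation detail, not stated on the slide):

import numpy as np

def softmax(h):
    # Subtract the max so that exp() cannot overflow; mathematically the result is unchanged.
    e = np.exp(h - np.max(h))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, 0.1])))   # outputs sum to 1 and can be read as class probabilities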

Example of a Neural network

Composition of functions in Deep Neural Network

Loss/Cost/Error functions
Target output t_i, actual output from network y_i. There are p training examples.
The sum of squared error (sometimes called J or L):
E = \frac{1}{2} \sum_{i=1}^{p} (t_i - y_i)^2
Cross entropy:
E = -\sum_{i=1}^{p} \left( t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right)
(The cross entropy of two discrete probability distributions p and q is:
H(p, q) = -\sum_{x \in X} p(x) \log q(x).)
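The same quantities written as a NumPy sketch; the targets t and outputs y below are made-up example values:

import numpy as np

t = np.array([1.0, 0.0, 1.0])     # target outputs t_i (example values)
y = np.array([0.9, 0.2, 0.7])     # network outputs y_i (example values)

sse = 0.5 * np.sum((t - y) ** 2)                           # sum of squared error
ce = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))      # (binary) cross entropy
print(sse, ce)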

Example: Cross-entropy loss function

Picture from https://towardsdatascience.com/cross-entropy-loss-function-f38c4ec8643e

Backpropagation

The backpropagation algorithm looks for the minimum of the loss function in weight space using the method of gradient descent.
The combination of weights that minimizes the loss function is considered to be a solution of the learning problem.
Since this method requires computation of the gradient of the loss function at each iteration step, the loss function needs to be continuous and differentiable.

Backpropagation

The algorithm can be decomposed into the following steps after the initialization of weights (e.g. randomly):
1. Feed-forward computation
2. Backpropagation to the output layer
3. Backpropagation to each hidden layer
4. Weight updates
The algorithm is stopped when the value of the error function has become sufficiently small.
As always, at the very least, divide your data into a test set and a training set (additionally a validation set).
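A compact NumPy sketch of these four steps for a network with one hidden layer; the data, layer sizes and learning rate are illustrative assumptions, and a squared-error loss with sigmoid units is used for simplicity.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # toy inputs
t = (X[:, :1] * X[:, 1:] > 0).astype(float)      # toy binary targets

# Initialise weights to small random values, biases to zero.
W1 = rng.normal(scale=0.1, size=(2, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
eta = 0.5                                        # learning rate (illustrative)

for epoch in range(2000):
    # 1. Feed-forward computation
    a1 = sigmoid(X @ W1 + b1)
    y = sigmoid(a1 @ W2 + b2)
    # 2. Backpropagation to the output layer (squared-error loss, sigmoid output)
    delta2 = (y - t) * y * (1 - y)
    # 3. Backpropagation to the hidden layer
    delta1 = (delta2 @ W2.T) * a1 * (1 - a1)
    # 4. Weight updates (gradient descent)
    W2 -= eta * a1.T @ delta2 / len(X); b2 -= eta * delta2.mean(axis=0)
    W1 -= eta * X.T @ delta1 / len(X); b1 -= eta * delta1.mean(axis=0)

print(((y > 0.5) == t).mean())                   # training accuracy of the toy network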

Chain Rule

If g is differentiable at x and f is differentiable at g(x), then the derivative of the composition can be found using the Chain Rule:
\frac{d}{dx}\,[(f \circ g)(x)] \equiv \frac{d}{dx}\,[f(g(x))] = f'(g(x))\, g'(x).
If y = f(u) and u = g(x), then y = f(g(x)), and the chain rule can be written:
\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}
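Automatic differentiation in TensorFlow applies exactly this rule; a small sketch (the composed function chosen here is arbitrary):

import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    u = tf.square(x)     # u = g(x) = x^2
    y = tf.sin(u)        # y = f(u) = sin(u)

# dy/dx = dy/du * du/dx = cos(x^2) * 2x, computed for us by the tape.
print(tape.gradient(y, x).numpy())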

Backpropagation: Chain Rule

Input sum of neuron k in layer l:
z_k^l = \sum_j w_{kj}^l a_j^{l-1} + b_k^l
Activation function:
a_k^l = f(z_k^l),
where f could be \sigma.
z_m^{l+1} = \sum_k w_{mk}^{l+1} a_k^l + b_m^{l+1}
Now apply the Chain Rule, a lot, to compute
\frac{\partial E}{\partial w_{kj}^l}

Generalised delta rule
Let f be a transfer function, i.e. O_i^p = f(u_i), where u_i = \sum_j w_{ij} x_j. Then
\Delta w_{ij}(t) = -\eta \frac{\partial E}{\partial w_{ij}}
= -\eta \frac{\partial}{\partial w_{ij}} \left[ \frac{1}{2} \sum_{i,p} \Big( y_i^p - f\Big(\sum_j w_{ij} x_j\Big) \Big)^2 \right]
= \eta \sum_p (y_i^p - O_i^p)\, f'(u_i)\, x_j,
where f'(u_i) = \frac{df}{du_i} is the derivative of f(u_i) with respect to u_i. If we update weights example by example,
\Delta w_{ij}(t) = \eta (y_i^p - O_i^p)\, f'(u_i)\, x_j = \eta \delta_i^p x_j,
where \delta_i^p = (y_i^p - O_i^p) f'(u_i). This is known as the generalised delta rule. The need for f' is a key reason why we need continuous transfer functions.
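A per-example NumPy sketch of this update for a single sigmoid unit; the input values, target and learning rate are illustrative assumptions:

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

x = np.array([0.5, -1.0, 2.0])    # one training example (made-up values)
y_target = 1.0                    # target output y_i^p
w = np.zeros(3)                   # weights w_ij
eta = 0.1                         # learning rate

u = w @ x                         # u_i = sum_j w_ij x_j
o = sigmoid(u)                    # O_i^p = f(u_i)
f_prime = o * (1 - o)             # derivative of the sigmoid at u
delta = (y_target - o) * f_prime  # delta_i^p
w += eta * delta * x              # generalised delta rule: w_ij += eta * delta_i^p * x_j
print(w)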
Gradient descent

From https://www.deeplearningbook.org/

Gradient descent

From https://www.deeplearningbook.org/

Gradient descent

The non-linearity of activation functions causes most interesting loss functions to become non-convex.

Gradient descent

From https://www.deeplearningbook.org/

Adaptive Learning Rules

Learning rates are no longer fixed.
They can be adjusted according to factors such as:
1. The size of the gradient
2. The size of particular weights
3. How fast learning is occurring
4. etc.
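In Keras this is usually handled by choosing an optimiser with an adaptive rule, such as Adam; a minimal sketch (the tiny model and learning rate here are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])

# Plain SGD uses a fixed learning rate; Adam adapts the effective step size per weight
# using running estimates of the gradient's first and second moments.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")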

Batches, Epochs

Batch size is a hyperparameter which defines the number of data samples to work through before updating weights.
Batch Gradient Descent: Batch size = Size of training set
Stochastic Gradient Descent (SGD): Batch size = 1
Mini-batch Gradient Descent: 1 < Batch size < Size of training set
An epoch has passed when all data samples in the training set have been fed to the neural network during the training process.
https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/
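In Keras, the batch size and the number of epochs are passed to fit(); a sketch with made-up data and an assumed model shape:

import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")   # 1000 training samples (made up)
y = np.random.randint(0, 2, size=(1000, 1))      # binary labels (made up)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")

# batch_size=32  -> mini-batch gradient descent
# batch_size=1   -> stochastic gradient descent
# batch_size=1000 (the whole training set) -> batch gradient descent
# One epoch = one full pass over all 1000 samples.
model.fit(X, y, batch_size=32, epochs=5, verbose=0)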

SGD and non-convexity

Convex optimization algorithms with global convergence guarantees are used to train logistic regression or SVMs.
Stochastic gradient descent (SGD) applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.
SGD is only guaranteed to converge to a local minimum.
This may not be as bad as it sounds (overfitting).

Goodfellow 2017

SGD initialization of weights

For feedforward neural networks, it is important to initialize all weights to small random values.
The biases may be initialized to zero or to small positive values.
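In Keras this corresponds to the layer initialisers; a sketch in which the layer width and the scale 0.01 are illustrative choices:

import tensorflow as tf

layer = tf.keras.layers.Dense(
    64,
    kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01),  # small random weights
    bias_initializer="zeros",                                            # biases start at zero
)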

SGD sequential vs batch modes

The sequential mode of training is also known as on-line, pattern, or stochastic mode. In this mode, weights are updated after the presentation of each example.
The term "stochastic" comes from the fact that the gradient based on a single training sample is a "stochastic approximation" of the "true" cost gradient.
In batch mode, weights are updated only after the complete presentation of all examples in the training set, i.e., only after each sweep or epoch. Here your error function is typically the total sum of the errors obtained for each example in the dataset. Impractical for a very large dataset.

Regularization

Regularization, in the context of machine learning, refers to the process of modifying a learning algorithm so as to prevent overfitting.

Dropout regularization
Randomly drop units (along with their connections) from the neural network during training.
Dropout is a training strategy which ignores a proportion (e.g. half) of the hidden neurons, randomly, when training the weights (not updating their weights) and setting their activation to zero.
A different set of neurons is dropped on each iteration.
(Figure from Srivastava et al. 2014)
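In Keras, dropout is a layer inserted between the Dense layers; the rate 0.5 below mirrors the "e.g. half" on the slide, while the rest of the architecture is an illustrative sketch:

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),     # randomly zero ~half of these activations, training time only
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),     # a different random subset is dropped on every training step
    layers.Dense(10, activation="softmax"),
])
# Dropout is automatically disabled at inference time; all units are used for prediction.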
Benefits of Dropout regularization

Forces networks not to rely on any one node (discourages memorization).
Robustness: randomly ignoring nodes prevents excessive inter-dependencies from emerging between nodes (i.e. nodes do not learn functions which rely on specific input values from another node); this allows the network to learn a more robust relationship.
Akin to a brain losing a few neurons but still being able to do a task.
Implementing dropout has much the same effect as taking the average from a committee of networks; however, the cost is significantly less in both time and storage required.

Regularization: Early stopping

MIT notes by A. Amini 2019
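In Keras, early stopping is available as a callback that monitors the validation loss; a sketch in which the patience value is an assumption and model, X_train and y_train are hypothetical placeholders:

import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss
    patience=5,                  # stop if it has not improved for 5 consecutive epochs
    restore_best_weights=True,   # roll back to the best weights seen so far
)

# Usage (model, X_train and y_train are placeholders, not defined here):
# model.fit(X_train, y_train, validation_split=0.2, epochs=100, callbacks=[early_stop])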
