
7 Deep Learning

The document discusses machine learning and deep learning. It explains that deep learning uses neural networks with multiple layers to learn representations of data and find patterns in large amounts of information. Convolutional neural networks are introduced as a type of deep learning that is inspired by the human visual system. Convolutional layers contain filters that are convolved across input data to learn features, with shared weights and parameters across different parts of the input.

Machine Learning

Deep Learning
Neural Network (Deep Learning)

[Figure: feedforward neural network with inputs x1 to x4, Layers L1 to L4, and an Output node.]
What is Deep Learning (DL)
A machine learning subfield of learning representations of data.
Exceptionally effective at learning patterns.
Deep learning algorithms attempt to learn (multiple levels of) representation by using a hierarchy of multiple layers.
If you provide the system tons of information, it begins to understand it and respond in useful ways.

https://www.xenonstack.com/blog/static/public/uploads/media/machine-learning-vs-deep-learning.png
Why is DL useful?
o Manually designed features are often over-specified, incomplete, and take a long time to design and validate.
o Learned features are easy to adapt and fast to learn.
o Deep learning provides a very flexible, (almost) universal, learnable framework for representing world, visual and linguistic information.
o Effective end-to-end joint system learning.
o Can utilize large amounts of training data.
What exactly is deep learning?

Why is it generally better than other methods on image, speech, and certain other types of data?

The short answers

1. "Deep learning" means using a neural network with several layers of nodes between input and output.

2. The series of layers between input and output do feature identification and processing in a series of stages, just as our brains seem to.
Theoretical Advantages of Deep Architectures
• Some complicated functions cannot be efficiently represented (in terms of number of tunable elements) by architectures that are too shallow.
• Deep architectures might be able to represent some functions otherwise not efficiently representable.
• More formally: functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture.
• The consequences are:
  o Computational: we don't need exponentially many elements in the layers.
  o Statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.


Data and machine learning

[Figure: Performance (vertical axis) versus Amount of data (horizontal axis), with one curve for new AI methods (deep learning) and one for most learning algorithms.]
Things we want to do with data

Images: label image
Audio: speech recognition
Text: web search

Why is computer vision hard?

The camera sees: [Figure]

Computer vision
[Figure: image → learning algorithm]

Computer vision
[Figure: image → feature representation → learning algorithm]
Features for vision

SIFT, GIST, Shape context

Features for machine learning

Images: image → vision features → detection
Audio: audio → audio features → speaker ID
Text: text → text features → web search, …
Why is speech recognition hard?
Microphone recording: "Please find the coffee mug"

Features for audio

Spectrogram, MFCC, Flux

Features for text

Parser, Named entity, Stemming

The idea:

Build learning algorithms that mimic the brain.

Most of human intelligence may be due to one learning algorithm.
But, until very recently, our weight-learning algorithms simply did not work on multi-layer architectures.
Along came deep learning …
The new way to train multi-layer NNs…

Train this layer first,
then this layer,
then this layer,
then this layer,
finally this layer.

The new way to train multi-layer NNs…

Basically, each layer is forced to learn good features that describe what comes from the previous layer.
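
The slides do not say how each individual layer is trained. One common reading of this layer-by-layer scheme is greedy pretraining with autoencoders: each layer learns to reconstruct its own input, and its hidden activations then become the input of the next layer. The sketch below assumes that reading. The function names, the sigmoid units, the squared-error loss and all hyperparameters are illustrative choices, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(data, n_hidden, epochs=200, lr=0.5, seed=0):
    """Train one sigmoid autoencoder on `data`; return encoder weights, bias, loss history."""
    rng = np.random.default_rng(seed)
    n_samples, n_in = data.shape
    W_enc = rng.normal(0.0, 0.1, size=(n_in, n_hidden))
    b_enc = np.zeros(n_hidden)
    W_dec = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    b_dec = np.zeros(n_in)
    losses = []
    for _ in range(epochs):
        hidden = sigmoid(data @ W_enc + b_enc)
        recon = sigmoid(hidden @ W_dec + b_dec)
        err = recon - data
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the squared reconstruction error through both sigmoid layers.
        d_recon = err * recon * (1.0 - recon)
        d_hidden = (d_recon @ W_dec.T) * hidden * (1.0 - hidden)
        W_dec -= lr * hidden.T @ d_recon / n_samples
        b_dec -= lr * d_recon.mean(axis=0)
        W_enc -= lr * data.T @ d_hidden / n_samples
        b_enc -= lr * d_hidden.mean(axis=0)
    return W_enc, b_enc, losses

def pretrain_stack(data, layer_sizes):
    """Train this layer first, then the next one on its output, and so on."""
    layers, features = [], data
    for i, n_hidden in enumerate(layer_sizes):
        W, b, losses = train_autoencoder_layer(features, n_hidden, seed=i)
        layers.append((W, b, losses))
        features = sigmoid(features @ W + b)  # input for the next layer
    return layers, features

# Toy data (hypothetical): 64 sparse binary vectors with 8 features each.
rng = np.random.default_rng(42)
toy_data = (rng.random((64, 8)) < 0.2).astype(float)
layers, top_features = pretrain_stack(toy_data, layer_sizes=[5, 3])
```

After such pretraining, the whole stack would normally be fine-tuned with ordinary backpropagation on the actual task; that step is left out here.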
Deep Learning: Main Problems

Deep learning multilayer perceptrons suffer from two main problems: overfitting and vanishing gradient.
• Overfitting results mainly from the increased number of adaptable parameters.
• Weight decay prevents large weights and thus an overly precise adaptation.
• Sparsity constraints help to avoid overfitting:
  o there is only a restricted number of neurons in the hidden layers, or
  o only few of the neurons in the hidden layers should be active (on average).
  This may be achieved by adding a regularization term to the error function (it compares the observed number of active neurons with the desired number and pushes adaptations in a direction that tries to match these numbers).
• Furthermore, a training method called dropout training may be applied: some units are randomly omitted from the input/hidden layers during training.
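
Two of the remedies above, weight decay and dropout, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function names are made up for this example, weight decay is written as an L2 penalty added to the gradient, and the rescaling by the keep probability ("inverted dropout") is a common convention that the slides do not mention.

```python
import numpy as np

def weight_decay_step(weights, grad, lr, decay):
    """One gradient step on error + (decay / 2) * ||weights||^2 (assumed L2 form)."""
    return weights - lr * (grad + decay * weights)

def dropout(activations, drop_prob, rng):
    """Randomly omit units during training.

    Surviving units are rescaled by 1 / keep_prob (inverted dropout, an assumption)
    so the expected activation stays the same.
    """
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask

rng = np.random.default_rng(0)
hidden = np.ones((4, 10))
dropped, mask = dropout(hidden, drop_prob=0.5, rng=rng)

# With a zero error gradient, weight decay alone shrinks the weights toward zero.
shrunk = weight_decay_step(np.array([1.0, -2.0]), grad=np.zeros(2), lr=0.1, decay=0.5)
```

At test time no units are dropped; with the inverted-dropout convention used here the activations then need no further rescaling.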
Deep Learning: Vanishing Gradient

If a logistic activation function is used (shown on left), the weight changes are proportional to f(x) · (1 − f(x)), the derivative of the logistic function.
This factor is also propagated back, but cannot be larger than ¼ (see right).
⇒ The gradient tends to vanish if many layers are backpropagated through.
Learning in the early hidden layers can become very slow [Hochreiter 1991].
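
A quick numerical check of the ¼ bound, and of how fast a product of such factors shrinks, might look like this. It is a simplified sketch: the connection weights that also enter the backpropagated product are ignored, and the variable names are illustrative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_derivative(x):
    f = logistic(x)
    return f * (1.0 - f)

# The derivative peaks at x = 0, where f(x) = 1/2 and f(x) * (1 - f(x)) = 1/4.
xs = np.linspace(-10.0, 10.0, 2001)
max_derivative = float(logistic_derivative(xs).max())

# Upper bound on the product of n derivative factors, one per layer passed through.
upper_bounds = {n: 0.25 ** n for n in (1, 5, 10, 20)}
```

Under this simplification, ten layers already bound the factor by 0.25 ** 10, which is below one millionth.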
Deep Learning: Vanishing Gradient
The logistic activation function is a contracting function:
(Obvious from the fact that its derivative is always < 1; actually ≤ 1/4.)
• If several logistic functions are chained, these contractions combine and yield an even stronger contraction of the input range.
• As a consequence, a rather large change of the input values will produce only a rather small change in the output values, and the more so, the more logistic functions are chained together.
• Therefore the function that maps the inputs of a multilayer perceptron to its outputs usually becomes flatter the more layers the multilayer perceptron has.
• Consequently the gradient in the first hidden layer (where the inputs are processed) becomes smaller.
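
The contraction can be observed directly by chaining bare logistic functions and measuring how far apart two very different inputs end up. This sketch deliberately leaves out weights and biases, which a real multilayer perceptron would have between the activations; the names are illustrative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def chained_logistic(x, depth):
    """Apply the logistic function `depth` times in a row."""
    for _ in range(depth):
        x = logistic(x)
    return x

inputs = np.array([-5.0, 5.0])  # two inputs that are 10 apart
depths = (1, 2, 3, 4, 5)
output_ranges = []
for depth in depths:
    out = chained_logistic(inputs, depth)
    output_ranges.append(float(out.max() - out.min()))
```

Because the derivative never exceeds 1/4, each extra logistic function shrinks the distance between the two outputs by at least a factor of four.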
Convolutional Neural Networks (CNN)
• We know it is good to learn a small model.
• From this fully connected model, do we really need all the connections?
• Can some of these be shared?
Deep Learning: Convolutional Neural Networks (CNNs)
It is advantageous that the features constructed in hidden layers are not localized to a specific part of the image.
• Special form of deep learning multi-layer perceptron: convolutional neural network.
• Inspired by the human retina, where sensory neurons have a receptive field, that is, a limited region in which they respond to a (visual) stimulus.
• Each neuron of the (first) hidden layer is connected to a small number of input neurons that refer to a contiguous region of the input image (left).
• Connection weights are shared; the same network is evaluated at different locations. The input field is "moved" step by step over the whole image (right).
• Equivalent to a convolution with a small kernel.
Convolutional Neural Networks (CNNs)

[Figure: an input matrix convolved with a 3x3 filter]

Consider learning an image:
• Some patterns are much smaller than the whole image.

A small region can be represented with fewer parameters, e.g. a "beak" detector.

The same pattern appears in different places, so the detectors can be compressed!
What about training a lot of such "small" detectors, where each detector must "move around"?

An "upper-left beak" detector and a "middle beak" detector can be compressed to the same parameters.
A convolutional layer

A CNN is a neural network with some convolutional layers (and some other layers). A convolutional layer has a number of filters that perform the convolution operation.

[Figure: a filter acting as a beak detector]

Convolution
The filter values are the network parameters to be learned. Each filter detects a small pattern (3 x 3).

6 x 6 image:
  1 0 0 0 0 1
  0 1 0 0 1 0
  0 0 1 1 0 0
  1 0 0 0 1 0
  0 1 0 0 1 0
  0 0 1 0 1 0

Filter 1:
   1 -1 -1
  -1  1 -1
  -1 -1  1

Filter 2:
  -1  1 -1
  -1  1 -1
  -1  1 -1
Convolution (Filter 1, stride=1)

Place Filter 1 on the top-left 3 x 3 patch of the 6 x 6 image and take the dot product, then move one column to the right and repeat. The first two dot products are 3 and -1.
Convolution (Filter 1, if stride=2)

With stride=2 the filter moves two columns at a time over the 6 x 6 image. The first two dot products are 3 and -3.
Convolution (Filter 1, stride=1)

Sliding Filter 1 over the whole 6 x 6 image gives a 4 x 4 output:
   3 -1 -3 -1
  -3  1  0 -3
  -3 -3  0  1
   3 -2 -2 -1
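
The sliding dot product can be written out directly. The sketch below is one straightforward way to do it in NumPy; the function name `conv2d` and its signature are choices made for this example. Like most "convolutions" in deep learning, it does not flip the kernel, so strictly speaking it computes a cross-correlation. It uses the 6 x 6 image and Filter 1 from the slides.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and take the dot product at each position."""
    k_rows, k_cols = kernel.shape
    out_rows = (image.shape[0] - k_rows) // stride + 1
    out_cols = (image.shape[1] - k_cols) // stride + 1
    out = np.zeros((out_rows, out_cols), dtype=image.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            r, c = i * stride, j * stride
            patch = image[r:r + k_rows, c:c + k_cols]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter_1 = np.array([
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
])

feature_map = conv2d(image, filter_1, stride=1)  # 4 x 4 output
strided_map = conv2d(image, filter_1, stride=2)  # 2 x 2 output
```

With stride=1 this reproduces the 4 x 4 output on the slide; with stride=2 the first row of the result is 3, -3, matching the stride=2 slide.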
Convolution (Filter 2, stride=1)

Repeat this for each filter. Filter 2 gives a second 4 x 4 output:
  -1 -1 -1 -1
  -1 -1 -2  1
  -1 -1 -2  1
  -1  0 -4  3

Feature Map: the two 4 x 4 images together form a 2 x 4 x 4 matrix.
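
Applying every filter to the same image and stacking the results gives the 2 x 4 x 4 feature map. A small sketch, again with an illustrative `conv2d` helper and a channels-first layout chosen for this example:

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1 sliding dot product without padding (no kernel flip)."""
    k = kernel.shape[0]
    size = image.shape[0] - k + 1
    out = np.zeros((size, size), dtype=image.dtype)
    for i in range(size):
        for j in range(size):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filters = np.array([
    [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]],   # Filter 1: diagonal pattern
    [[-1, 1, -1], [-1, 1, -1], [-1, 1, -1]],   # Filter 2: vertical pattern
])

# One 4 x 4 map per filter, stacked along a new leading (channel) axis.
feature_maps = np.stack([conv2d(image, f) for f in filters])
```

`feature_maps.shape` is `(2, 4, 4)`: one channel per filter.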
Color image: RGB 3 channels

[Figure: a color image drawn as three stacked 6 x 6 matrices, one per RGB channel, with Filter 1 and Filter 2 each drawn as a matching stack of 3 x 3 matrices.]

https://www.researchgate.net/post/How_will_channels_RGB_effect_convolutional_neural_network
Convolution v.s. Fully Connected

[Figure: top, the 6 x 6 image convolved with Filter 1 and Filter 2; bottom, the same image flattened into 36 inputs x1, x2, …, x36 of a fully-connected layer.]
[Figure: Filter 1 applied to the 6 x 6 image, with the image flattened into inputs numbered 1 to 36.]

The output 3 (Filter 1 at the top-left position) connects only to inputs 1, 2, 3, 7, 8, 9, 13, 14 and 15. It connects to only 9 inputs, not fully connected: fewer parameters!

The next output, -1, connects to inputs 2, 3, 4, 8, 9, 10, 14, 15 and 16 through the same 9 filter weights. Shared weights: even fewer parameters.
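
One way to make the comparison concrete is to write the convolution as a 16 x 36 weight matrix acting on the flattened image. Most entries are zero (fewer connections), and the non-zero entries are copies of the same nine filter values (shared weights). The sketch below builds that matrix; the variable names are illustrative, and the fully connected count assumes a dense layer with the same 16 outputs, which the slides do not state.

```python
import numpy as np

image = np.array([
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
])
filter_1 = np.array([
    [1, -1, -1],
    [-1, 1, -1],
    [-1, -1, 1],
])

n_in, n_out, k = 6, 4, 3  # image side, output side, filter side
weights = np.zeros((n_out * n_out, n_in * n_in), dtype=int)
connected = np.zeros_like(weights, dtype=bool)
for i in range(n_out):
    for j in range(n_out):
        row = i * n_out + j                   # index of the output neuron
        for di in range(k):
            for dj in range(k):
                col = (i + di) * n_in + (j + dj)  # index of the input pixel
                weights[row, col] = filter_1[di, dj]
                connected[row, col] = True

feature_map = (weights @ image.reshape(-1)).reshape(n_out, n_out)

fully_connected_params = weights.size        # 16 outputs x 36 inputs
local_connections = int(connected.sum())     # 16 outputs x 9 inputs each
shared_params = filter_1.size                # the same 9 weights reused everywhere
first_output_inputs = (np.flatnonzero(connected[0]) + 1).tolist()  # 1-based, as on the slide
```

The three counts come out as 576, 144 and 9, and the first output neuron is wired to inputs 1, 2, 3, 7, 8, 9, 13, 14, 15.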

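The whole CNN

[Figure: pipeline from the input image to class scores (cat, dog, …): Convolution → Max Pooling → Convolution → Max Pooling → Flattened → Fully Connected Feedforward network. The Convolution and Max Pooling pair can repeat many times.]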
Max Pooling

Filter 1:
   1 -1 -1
  -1  1 -1
  -1 -1  1

Filter 2:
  -1  1 -1
  -1  1 -1
  -1  1 -1

Output of Filter 1:
   3 -1 -3 -1
  -3  1  0 -3
  -3 -3  0  1
   3 -2 -2 -1

Output of Filter 2:
  -1 -1 -1 -1
  -1 -1 -2  1
  -1 -1 -2  1
  -1  0 -4  3
Why Pooling
• Subsampling pixels will not change the object.

[Figure: an image of a bird and its subsampled version, both labeled "bird"]

We can subsample the pixels to make the image smaller: fewer parameters to characterize the image.

A CNN compresses a fully connected network in two ways:
• Reducing the number of connections
• Shared weights on the edges
Max pooling further reduces the complexity.
Max Pooling

[Figure: 6 x 6 image → Conv → Max Pooling → new image, but smaller (2 x 2).]

After convolution and max pooling, the 6 x 6 image becomes a new but smaller 2 x 2 image. Each filter is a channel.

Channel from Filter 1:
  3 0
  3 1

Channel from Filter 2:
  -1 1
   0 3
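
The 2 x 2 results can be reproduced by taking the maximum over each non-overlapping 2 x 2 block of the two 4 x 4 convolution outputs. A minimal sketch, assuming the map's height and width are divisible by the pool size; the function name is illustrative.

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Maximum over non-overlapping size x size blocks (dimensions must divide evenly)."""
    rows, cols = feature_map.shape
    blocks = feature_map.reshape(rows // size, size, cols // size, size)
    return blocks.max(axis=(1, 3))

map_filter_1 = np.array([
    [3, -1, -3, -1],
    [-3, 1, 0, -3],
    [-3, -3, 0, 1],
    [3, -2, -2, -1],
])
map_filter_2 = np.array([
    [-1, -1, -1, -1],
    [-1, -1, -2, 1],
    [-1, -1, -2, 1],
    [-1, 0, -4, 3],
])

pooled_1 = max_pool(map_filter_1)
pooled_2 = max_pool(map_filter_2)
pooled = np.stack([pooled_1, pooled_2])  # one channel per filter
```

The two pooled channels are [[3, 0], [3, 1]] and [[-1, 1], [0, 3]], so `pooled.shape` is `(2, 2, 2)`.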
The whole CNN

[Figure: Convolution → Max Pooling produces a new image, smaller than the original image, here with the two 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3). Convolution and Max Pooling can repeat many times.]

The number of channels is the number of filters.
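The whole CNN

[Figure: the same pipeline ending in class scores (cat, dog, …): Convolution → Max Pooling → a new image → Convolution → Max Pooling → a new image → Flattened → Fully Connected Feedforward network.]

Flattening

[Figure: the two 2 x 2 channels (3 0 / 3 1 and -1 1 / 0 3) are flattened into a single vector, which is the input to the Fully Connected Feedforward network.]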
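
Flattening is just a reshape of the pooled channels into one vector, which then feeds an ordinary fully connected layer. In the sketch below the row-major flattening order, the two output classes, the random weights and the softmax are all assumptions made for illustration; only the pooled values come from the slides.

```python
import numpy as np

# The two 2 x 2 channels produced by convolution and max pooling.
pooled = np.array([
    [[3, 0], [3, 1]],
    [[-1, 1], [0, 3]],
], dtype=float)

flattened = pooled.reshape(-1)  # 8 values, channel by channel, row by row

# Hypothetical fully connected output layer with two classes (e.g. cat, dog).
rng = np.random.default_rng(0)
n_classes = 2
weights = rng.normal(0.0, 0.1, size=(n_classes, flattened.size))
bias = np.zeros(n_classes)

scores = weights @ flattened + bias
probs = np.exp(scores - scores.max())  # softmax, shifted for numerical stability
probs /= probs.sum()
```

In a trained network the weights would of course be learned rather than drawn at random.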
CNN in speech recognition

[Figure: a spectrogram, with Time and Frequency axes, treated as an image and fed to a CNN.]

The filters move in the frequency direction.
Presentation

• How a CNN can be used for text analysis

CNN in text classification

Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf
Recurrent Neural Network
Architectures
The fundamental feature of a Recurrent Neural Network (RNN) is that the network contains
at least one feed-back connection, so the activations can flow round in a loop.
That enables the networks to do temporal processing and learn sequences, e.g., perform
sequence recognition/reproduction or temporal association/prediction.
Recurrent neural network architectures can have many different forms. One common type
consists of a standard Multi-Layer Perceptron (MLP) plus added loops. These can exploit the
powerful non-linear mapping capabilities of the MLP, and also have some form of memory.
Others have more uniform structures, potentially with every neuron connected to all the
others, and may also have stochastic activation functions.
For simple architectures and deterministic activation functions, learning can be achieved
using similar gradient descent procedures to those leading to the back-propagation algorithm
for feed-forward networks. When the activations are stochastic, simulated annealing
approaches may be more appropriate.
A Fully Recurrent Network

• The simplest form of fully recurrent neural network is an MLP with the previous set of hidden unit activations feeding back into the network along with the inputs.
• Note that the time t has to be discretized, with the activations updated at each time step. The time scale might correspond to the operation of real neurons, or for artificial systems any time step size appropriate for the given problem can be used. A delay unit needs to be introduced to hold activations until they are processed at the next time step.
Why sequence models
Sequence models like RNNs and LSTMs have greatly transformed learning on sequences in the past few years.

Examples of sequence data in applications:
• Speech recognition (sequence to sequence):
  o X: wave sequence
  o Y: text sequence
• Music generation (one to sequence):
  o X: nothing or an integer
  o Y: wave sequence
• Sentiment classification (sequence to one):
  o X: text sequence
  o Y: integer rating from one to five
• DNA sequence analysis (sequence to sequence):
  o X: DNA sequence
  o Y: DNA labels
• Machine translation (sequence to sequence):
  o X: text sequence (in one language)
  o Y: text sequence (in other language)
• Video activity recognition (sequence to one):
  o X: video frames
  o Y: label (activity)
• Named entity recognition (sequence to sequence):
  o X: text sequence
  o Y: label sequence
  o Can be used by search engines to index different types of words inside a text.
RNN
• In this problem Tx = Ty. In other problems where they aren't equal, the RNN architecture may be different.
• a<0> is usually initialized with zeros, but it may be initialized randomly in some cases.
• There are three weight matrices here: Wax, Waa, and Wya, with shapes:
  o Wax: (NoOfHiddenNeurons, nx)
  o Waa: (NoOfHiddenNeurons, NoOfHiddenNeurons)
  o Wya: (ny, NoOfHiddenNeurons)
• The weight matrix Waa is the memory the RNN is trying to maintain from the previous time steps (the previous layers of the unrolled network).
RNN: computational graph

The forward propagation equations on the discussed architecture are: [equations shown as an image on the slide; not recovered]

RNN

• Notice: the same function and the same set of parameters are used at every time step.
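
The equations themselves did not survive extraction, so the following is a sketch of the usual textbook form rather than a transcription of the slide: a<t> = tanh(Waa a<t-1> + Wax x<t> + ba) and ŷ<t> = softmax(Wya a<t> + by). The matrix names and shapes follow the slides; the bias vectors `ba` and `by`, the tanh and softmax activations, and the function names are assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_forward(x_seq, a0, Wax, Waa, Wya, ba, by):
    """Run a simple RNN over x_seq, reusing the same parameters at every time step."""
    a_prev = a0
    activations, outputs = [], []
    for x_t in x_seq:
        a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)  # assumed hidden activation
        y_t = softmax(Wya @ a_t + by)                 # assumed output activation
        activations.append(a_t)
        outputs.append(y_t)
        a_prev = a_t
    return np.array(activations), np.array(outputs)

rng = np.random.default_rng(0)
n_x, n_a, n_y, T = 3, 5, 2, 4              # input size, hidden neurons, output size, steps
Wax = rng.normal(0.0, 0.1, size=(n_a, n_x))  # (NoOfHiddenNeurons, nx)
Waa = rng.normal(0.0, 0.1, size=(n_a, n_a))  # (NoOfHiddenNeurons, NoOfHiddenNeurons)
Wya = rng.normal(0.0, 0.1, size=(n_y, n_a))  # (ny, NoOfHiddenNeurons)
ba, by = np.zeros(n_a), np.zeros(n_y)

x_seq = rng.normal(size=(T, n_x))
a0 = np.zeros(n_a)                          # a<0> initialized with zeros
activations, outputs = rnn_forward(x_seq, a0, Wax, Waa, Wya, ba, by)
```

Here Tx = Ty = 4, so there is one output per input step; `activations` has shape `(4, 5)` and `outputs` has shape `(4, 2)`.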
One to many
e.g. Music Generation: genre -> sequence of music
e.g. Image Captioning: image -> sequence of words

Many to one
e.g. Sentiment Classification (movie review): sequence of words -> sentiment

Many to many (Encoder, Decoder)
e.g. Machine Translation: seq of words -> seq of words

Many to many
e.g. Video classification on frame level
RNN forward pass
Basic RNN backward pass
Bidirectional RNN
• In the discussed RNN architecture, the current output ŷ<t> depends on the previous inputs and activations.
• Consider the example 'He said, "Teddy Roosevelt was a great president"'. Here Teddy is a person's name, but we know that from the word "president", which comes after Teddy, not from "He" and "said", which come before it.
• So a limitation of the discussed architecture is that it cannot learn from elements later in the sequence. Bidirectional RNNs (BRNNs) were introduced to address this problem.

[Figure: BRNN unrolled over time steps 1, 2, …, t-2, t-1, t, with inputs x(1) … x(t) at the bottom, two rows of hidden states h(1) … h(t) and g(1) … g(t) in the middle, and outputs y(1) … y(t) at the top.]
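
A bidirectional RNN can be sketched as two ordinary RNNs run over the same sequence in opposite directions, with their states combined at each time step. In the sketch below it is an assumption that h is the forward chain and g the backward chain (the figure labels alone do not say which is which); the tanh units, the random weights and the function names are illustrative too.

```python
import numpy as np

def run_rnn(x_seq, W_in, W_rec):
    """Run a simple tanh RNN over x_seq and return the state at every time step."""
    state = np.zeros(W_rec.shape[0])
    states = []
    for x_t in x_seq:
        state = np.tanh(W_in @ x_t + W_rec @ state)
        states.append(state)
    return np.array(states)

def brnn_states(x_seq, forward_params, backward_params):
    """Concatenate forward states h(k) and backward states g(k) for every step k."""
    h = run_rnn(x_seq, *forward_params)               # reads x(1) ... x(t)
    g = run_rnn(x_seq[::-1], *backward_params)[::-1]  # reads x(t) ... x(1), then re-aligned
    return np.concatenate([h, g], axis=1)

rng = np.random.default_rng(0)
n_x, n_h, T = 2, 3, 4
forward_params = (rng.normal(0, 0.5, (n_h, n_x)), rng.normal(0, 0.5, (n_h, n_h)))
backward_params = (rng.normal(0, 0.5, (n_h, n_x)), rng.normal(0, 0.5, (n_h, n_h)))

x_seq = rng.normal(size=(T, n_x))
states = brnn_states(x_seq, forward_params, backward_params)

# Change only the LAST input and look at the combined state of the FIRST step.
x_changed = x_seq.copy()
x_changed[-1] += 1.0
changed = brnn_states(x_changed, forward_params, backward_params)
```

The forward half of the first step's state is unaffected by the change, while the backward half is not, which is how a word such as "president" can inform the label of "Teddy". An output y(k) would then be computed from the concatenated state of step k.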


Credits

• Andrew Ng, Stanford University
• David Corne, Heriot-Watt University, Edinburgh
• Artificial Neural Networks and Deep Learning by Christian Borgelt, University of Konstanz, Germany
• Deep Learning by Ming Li, Canada Research Chair in Bioinformatics, University of Waterloo, Waterloo
• Deeplearning.ai: Deep Learning Specialization, Sequence Modeling, by Andrew Ng
• Stanford Convolutional Neural Networks for Visual Recognition course by Fei-Fei Li
• https://hackernoon.com/rnn-or-recurrent-neural-network-for-noobs-a9afbb00e860
