Unit 03 - Neural Networks - MD
Neural networks are composed of very simple units that are akin to linear classification or regression methods.
8.2. Objectives
At the end of this lecture, you will be able to
- Compute the output of a simple neural network, possibly with hidden layers, given the weights and activation functions.
- Determine whether data after transformation by some layers is linearly separable, draw the decision boundaries given by the weight vectors, and use them to help understand the behavior of the network.
8.3. Motivation
The topic of feedforward neural networks is split into two parts: this lecture covers the model and its representation power, while the next lecture covers how to actually learn its parameters.
The key difference between neural networks and the methods we saw in Unit 2 is that, there, the mapping from x to ϕ(x) is not part of the data analysis: it is done ex-ante (even if, with the choice of a particular kernel, only implicitly), and we then optimise once ϕ has been chosen.
Note that we have a bit of a chicken-and-egg problem here: in order to do a good classification (understand and learn the parameters θ for the classification decision) we would need to know what the feature representation is. But, on the other hand, in order to understand what a good feature representation would be, we would need to know how that feature representation is exercised in the ultimate classification task.
So far, the ways we have performed non-linear classification involve either first
mapping x explicitly into some feature vectors ϕ(x), whose coordinates involve
non-linear functions of x, or in order to increase computational efficiency,
rewriting the decision rule in terms of a chosen kernel, i.e. the dot product of
feature vectors, and then using the training data to learn a transformed
classification parameter.
However, in both cases, the feature vectors are chosen. They are not learned in
order to improve performance of the classification problem at hand.
Neural networks, on the other hand, are models in which the feature
representation is learned jointly with the classifier to improve classification
performance.
A biological neuron:
- takes input signals through dendrites that connect the neuron to roughly 10^3 to 10^4 other neurons;
- the signal potential increases in the neuron's body;
- once a threshold is reached, the neuron in turn emits a signal that, through its axon, reaches roughly 100 other neurons.
A bit of terminology:
- linear: typically used at the very end, just before measuring the loss of the predictor;
- relu ("rectified linear unit"): $f(z) = \max\{0, z\}$;
- tanh ("hyperbolic tangent"): it mimics the sign function, but in a soft way: it is a sigmoid curve spanning from -1 (at $z = -\infty$) to +1 (at $z = +\infty$), with a value of 0 at $z = 0$. Its smoothness is useful, since we have to propagate the training signal through the network in order to adjust all the parameters in the model. $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 1 - \frac{2}{e^{2z} + 1}$. Note that $\tanh(-z) = -\tanh(z)$.
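As a quick illustration, here is a minimal NumPy sketch of these activation functions (the function names are just illustrative):

```python
import numpy as np

def linear(z):
    # identity activation, typically used at the output layer
    return z

def relu(z):
    # rectified linear unit: max{0, z}, applied elementwise
    return np.maximum(0.0, z)

def tanh(z):
    # hyperbolic tangent: (e^z - e^-z) / (e^z + e^-z), a soft sign function
    return np.tanh(z)

z = np.linspace(-3, 3, 7)
print(relu(z))   # zero for negative inputs, identity for positive ones
print(tanh(z))   # squashes inputs into (-1, +1); odd: tanh(-z) = -tanh(z)
```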
Our objective will be to learn the weights that make each such node, in the context of the whole network, behave well.
Overall architecture
In deep feedforward neural networks, the units are arranged in layers: from the input layer, where each unit holds an input coordinate, through various hidden-layer transformations, to the actual output of the model:
In this layer-wise computation, each unit in a particular layer takes input from all the units of the preceding layer, and has its own parameters that are adjusted to perform the overall computation; parameters therefore differ even between units of the same layer. A deep (feedforward) neural network hence refers to a neural network that contains not only the input and output layers, but also hidden layers in between.
For example, the input could be an image: the input vector holds the individual pixel contents, from which the first layer tries to detect edges; these are then recombined into simple parts (in subsequent layers), then parts, objects, and finally the characterisation of a scene: edges -> simple parts -> parts -> objects -> scenes
One of the main advantages of deep neural networks is that in many cases, they
can learn to extract very complex and sophisticated features from just the raw
features presented to them as their input. For instance, in the context of image
recognition, neural networks can extract the features that differentiate a cat
from a dog based only on the raw pixel data presented to them from images.
The initial few layers of a neural network typically capture the simpler and smaller features, whereas the later layers use information from these low-level features to identify more complex and sophisticated features.
Note: it is interesting to note that a neural network can represent any given
binary function.
Subject areas
Deep learning has overtaken a number of academic disciplines in just a few
years:
- personalized/automated medicine
- chemistry, robotics, materials science, etc.
- contrary to small, rigid models, large, richer and more flexible models can be successfully estimated even with simple gradient-based learning algorithms like stochastic gradient descent.
We can think about the notions of width (the number of units in a layer) and depth (the number of layers). Small models (low width and low depth) are quite rigid and don't allow for a good abstraction of reality, i.e. learning the underlying structures based on observations. Large models can use more width and more depth to generate improved abstractions of reality, i.e. improved learning.
See also the conclusion of the video on the next segment: “Introducing
redundancy will make the optimization problem that we have to solve easier.”
We can see each hidden layer (one in this case) as a "box" that takes x as input and returns an output f, mediated through its weights W, and the output layer as another box that takes f as input and returns the final output, mediated through its weights W′.
What are these hidden units actually doing? Since they are like linear classifiers (they take a linear combination of the inputs and pass it through a nonlinear function), we can also visualize them as linear classifiers whose decision boundary has normal vector w.
The difference is that instead of having a binary output (like in sign(w ⋅ x)) we now have f(w ⋅ x). For the hidden layers we will study, f typically takes the form tanh(w ⋅ x), whose output range is (-1, +1). The farther the point is from the boundary, the closer tanh() gets to the extremes, at a rate proportional to the norm of w.
If, as is normal, we have multiple units per layer, we can think of a series of linear classifiers acting on the same space (the output space of layer i − 1), each identified by its own weight vector $w_1, w_2, \ldots, w_{d_i}$, where $d_i$ is the width of layer i.
Given a neural network with one hidden layer for classification, we can view the
hidden layer as a feature representation, and the output layer as a classifier
using the learned feature representation.
There’re also other parameters that will affect the learning process and the
performance of the model, such as the learning rate and parameters that control
the network architecture (e.g. number of hidden units/layers) etc. These are
often called hyper-parameters.
2-D Example
Let’s consider as example a case in 2-D where we have a cloud of negative
points (red) in the bottom-left and top-right corners and a cloud of positive
points (blue) in the center, like in the following chart (left side):
The chart on the right depicts the same points in the space resulting from the
application of the two classifiers, using just a linear activation, i.e. the dot
product between w and x.
The fact that the transformed points line up exactly on a line derives from the two planes being parallel; the general point is that we still have a problem that is not separable, as any linear transformation of the feature space of a linearly inseparable classification problem remains linearly inseparable.
This is really where the power of the hidden layer lies. It gives us a
transformation of the input signal into a representation that makes the problem
easier to solve.
Finally we can try the ReLU activation (f(z) = max{0, z}): in this case the
output is not actually strictly linearly separable.
So this highlights the difficulty of learning these models. It’s not always the case
that the same non-linear transformation casts them as linearly separable.
However, if we flip the planes' directions, both the tanh and the ReLU activations result in outputs that become linearly separable (with the ReLU, all positive points get mapped to the (0,0) point).
What if we chose the two planes at random (i.e. $w_1$ and $w_2$ are random)? The resulting output would likely not be linearly separable. However, if we introduce redundancy, e.g. using 10 hidden units for an originally two-dimensional problem, the problem would likely become linearly separable even if these 10 planes are chosen at random. Notice that this is quite similar to the systematic expansion in terms of polynomial features that we did earlier.
So introducing redundancy here is actually helpful. And we’ll see how this is
helpful also when we are actually learning these hidden unit representations
from data. Introducing redundancy will make the optimization problem that we
have to solve easier.
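As an illustration of this redundancy effect, here is a small NumPy sketch (the toy data and the perceptron-based separability check are made up for the example): it maps a 2-D dataset like the one above through random tanh hidden units and measures how well a linear classifier can then separate it.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy data: negative clouds bottom-left and top-right, positive cloud in the center
neg = np.vstack([rng.normal([-2, -2], 0.3, (20, 2)),
                 rng.normal([ 2,  2], 0.3, (20, 2))])
pos = rng.normal([0, 0], 0.3, (20, 2))
X = np.vstack([neg, pos])
y = np.hstack([-np.ones(40), np.ones(20)])

def hidden(X, m):
    # m random hidden units: tanh(X W + b)
    W = rng.normal(size=(2, m))
    b = rng.normal(size=m)
    return np.tanh(X @ W + b)

def perceptron_accuracy(H, y, epochs=200):
    # train a simple perceptron; accuracy 1.0 indicates linear separability
    w = np.zeros(H.shape[1]); b = 0.0
    for _ in range(epochs):
        for h, t in zip(H, y):
            if t * (h @ w + b) <= 0:
                w += t * h; b += t
    return np.mean(np.sign(H @ w + b) == y)

for m in (2, 10):
    print(m, "random hidden units ->", perceptron_accuracy(hidden(X, m), y))
```

With 10 random units the perceptron typically reaches accuracy 1.0, while with only 2 it usually cannot.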
Summary
- Units in neural networks are linear classifiers, just with a different output non-linearity
- The units in feed-forward neural networks are arranged in layers (input, one or more hidden, output)
- By learning the parameters associated with the hidden-layer units, we learn how to represent examples (as hidden-layer activations)
- The representations in neural networks are learned directly to facilitate the end-to-end task
- A simple classifier (output unit) suffices to solve complex classification tasks if it operates on the hidden-layer representations
The output layer produces the prediction that we actually want. The role of the hidden layers is really to adjust their transformation, their computation, in such a way that the output layer has an easier task in solving the problem.
The next lecture will deal with actually learning these representations together
with the final classifier.
9.1. Objectives
At the end of this lecture, you will be able to
$$w^{l}_{i,j} \leftarrow w^{l}_{i,j} - \eta \cdot \frac{\partial \mathcal{L}(f(X; W), Y)}{\partial w^{l}_{i,j}}$$
The question now becomes how to evaluate such a gradient, as the mapping from a weight to the final output of the network can be very complicated (all the more so for the weights in the first layers!).
Let’s take as example a deep neural network with a single unit per node, with
both input and outputs as scalars, as in the following diagram:
Let’s also assume that the activation function is tanh(z), also in the last layer (in
reality the last unit is often a linear function, so that the prediction is in R and
not just in (−1, +1)), that there is no offset parameter and that the specific loss
function for each individual example is Loss = 12 (y − fL )2 .
Then, for such network, we can write z1 = xw1 and, more in general, for
i = 2, . . . , L: zi = fi−1 wi where fi−1 = f(zi−1 ).
$$\frac{\partial \text{Loss}}{\partial w_1} = \left[(1 - \tanh^2(x w_1))\, x\right] \cdot \left[(1 - \tanh^2(f_1 w_2))\, w_2\right] \cdot \left[(1 - \tanh^2(f_2 w_3))\, w_3\right] \cdots \left[(1 - \tanh^2(f_{L-1} w_L))\, w_L\right] \cdot (f_L - y)$$
The nature of this calculation, where we evaluate the loss at the very output and then multiply by these Jacobians (the derivatives of the layer-wise mappings), also highlights how it can go wrong. If these Jacobians are very small, the gradient vanishes very quickly as the depth of the architecture increases. If these derivatives are large, the gradients can instead explode.
So there are issues that we need to deal with when the architecture is deep.
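A tiny NumPy sketch of this chain rule for the scalar network above (input, target and weights are made-up values) makes the vanishing effect concrete:

```python
import numpy as np

def loss_grad_w1(x, w, y):
    # forward pass through the chain of scalar tanh units: z_i = f_{i-1} w_i
    f = [np.tanh(x * w[0])]
    for wi in w[1:]:
        f.append(np.tanh(f[-1] * wi))
    # backward pass: product of the layer-wise derivatives (chain rule)
    grad = (1 - np.tanh(x * w[0])**2) * x
    for fi, wi in zip(f[:-1], w[1:]):
        grad *= (1 - np.tanh(fi * wi)**2) * wi
    return grad * (f[-1] - y)          # final factor: dLoss/df_L = f_L - y

x, y = 1.0, 0.5
for L in (2, 10, 50):
    w = np.full(L, 0.5)                # small weights -> each factor < 1
    print(L, "layers -> grad:", loss_grad_w1(x, w, y))
```

The printed gradient shrinks rapidly as L grows, since each factor of the product is smaller than 1 in magnitude.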
When the problem is complex, however, a neural network with just the capacity (in terms of number of units) that would theoretically suffice to represent a separating solution (classifying all the examples correctly) may not actually reach such an optimal classification through training. More generally, for multi-layer neural networks, stochastic gradient descent (SGD) is not guaranteed to reach a global optimum (but it can find a locally optimal solution, which is typically quite good).
Using overcapacity, however, can lead to artefacts in the classifiers. To limit these artefacts and obtain good models even with overcapacity, one can use two tricks:

We will later talk about regularization: how to squeeze the capacity of these models a little while, in terms of units, still giving them overcapacity.
Summary
- Neural networks can be learned with SGD, similarly to linear classifiers
- The derivatives necessary for SGD can be evaluated effectively via back-propagation
- Multi-layer neural network models are complicated: we are no longer guaranteed to reach a global (only a local) optimum with SGD
- Larger models tend to be easier to learn, because their units only need to be adjusted so that they are, collectively, sufficient to solve the task
10.1. Objective
- Introduction to recurrent neural networks (RNNs)
- Understand the process of encoding sequences with RNNs when modeling sequences.
This lecture introduces the topic: the problem of modelling sequences, what RNNs are, how they relate to the feedforward neural networks we saw in the previous lectures, and how to encode sequences into vector representations. The next lecture will focus on how to decode such vectors so that we can use them to predict properties of the sequences, or what comes next in a sequence.
We already saw how to solve this kind of problem with linear predictors or feedforward neural networks. In both cases the first task is to compile a feature vector, for example the values of an exchange rate at various times, say at times t − 1 to t − 4, for a prediction at time t.
Which features we need to compile depends on the task at hand. For example, sentiment analysis, language translation, and next-word suggestion all require a different feature representation, as they focus on different parts of the sentence: while sentiment analysis focuses on the holistic meaning of a sentence, translation or next-word suggestion focus instead more on individual words.
Note that very different types of objects can be encoded in feature vectors, like images, events, words or videos, and, once encoded, all these different kinds of objects can be used together.
In other words, RNNs can not only process single data points (such as images),
but also entire sequences of data (such as speech or video).
While this lecture deals with encoding, the mapping of a sequence to a feature
vector, the next lecture deals with decoding, the mapping of a feature vector to
a sequence.
One simple implementation of each layer transformation (which here we can see as an "update") is hence (omitting the offset parameters):

$$s_t = \tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)$$

(a small code sketch follows the list below), where:
- $s_{t-1}$ is an m × 1 vector holding the old "context" or "state" (the data coming from the previous step);
- $W^{s,s}$ is an m × m matrix of the weights associated with the existing state, whose role is to decide which part of the previous information should be kept (and note that it does not change from step to step). It can also be interpreted as describing how the state would evolve in the absence of any new information;
- $x_t$ is a d × 1 feature representation of the new information (e.g. a new word);
- $W^{s,x}$ is an m × d weight matrix whose role is to decide how to take the new information into account, so that the result of the multiplication $W^{s,x} x_t$ is specific to each piece of new information arriving;
- $\tanh(\cdot)$ is the activation function (applied elementwise);
- $s_t$ is an m × 1 vector holding the new "context" or "state" (the updated state, with the new information taken into account).
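A minimal NumPy sketch of this update step (the dimensions and the random initialisation are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 4, 3                                  # state and input dimensions (illustrative)
W_ss = rng.normal(scale=0.5, size=(m, m))    # state-to-state weights, shared across steps
W_sx = rng.normal(scale=0.5, size=(m, d))    # input-to-state weights, shared across steps

def rnn_step(s_prev, x_t):
    # s_t = tanh(W_ss s_{t-1} + W_sx x_t), omitting offsets as in the notes
    return np.tanh(W_ss @ s_prev + W_sx @ x_t)

s = np.zeros(m)                              # initial state s_0 is a vector of m zeros
for x_t in rng.normal(size=(5, d)):          # a sequence of 5 feature vectors
    s = rnn_step(s, x_t)
print(s)                                     # the final state encodes the whole sequence
```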
An RNN hence has a number of layers (steps) equal to the length of the sequence, e.g. the number of words in a sentence. So there is a single, evolving network applied along the whole sequence, rather than a different network for each element of the sequence. The initial state ($s_0$) is a vector of m zeros.
Note that this parametric approach lets the way we introduce new data ($W^{s,x}$) adjust to the way we use the network, i.e. be learned according to the specific problem at hand.
Learning RNNs
Learning an RNN is similar to learning a feedforward neural network: the first task is to define a loss function, and then to find the weights that minimise it, for example using an SGD algorithm where the gradient with respect to the weights is computed using back-propagation. The fact that the parameters are shared in RNNs means that we add up the contributions of the suggested modifications to a parameter from each of the positions where the transformation is applied. One problem of simple RNN models is that the gradient can vanish or explode, even more than in feedforward NNs, as the sequences can be quite long and we apply the transformation repeatedly. In real cases, the design of the transformation can be improved to counter this issue.
$$g_t = \mathrm{sigmoid}(W^{g,s} s_{t-1} + W^{g,x} x_t) = \frac{1}{1 + e^{-(W^{g,s} s_{t-1} + W^{g,x} x_t)}}$$

$$s_t = (1 - g_t) \odot s_{t-1} + g_t \odot \tanh(W^{s,s} s_{t-1} + W^{s,x} x_t)$$

Here the first equation defines a gate, responsible for filtering in the new information (through its own parameters, to be learned as well), which results in a continuous value between 0 and 1.
The second equation then uses the gate to define, for each individual element of the state vector (the ⊙ symbol stands for element-wise multiplication), how much to retain and how much to update with the new information. For example, if $g_t[2]$ (the second element of the gate at step t) is 0 (an extreme value!), it means that in that step we keep the old value of the state for the second element, ignoring the transformation deriving from the new information.
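Extending the earlier sketch with such a gate (again with illustrative dimensions and random, untrained weights):

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 4, 3
W_gs, W_gx = rng.normal(size=(m, m)), rng.normal(size=(m, d))  # gate weights
W_ss, W_sx = rng.normal(size=(m, m)), rng.normal(size=(m, d))  # state weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_step(s_prev, x_t):
    g = sigmoid(W_gs @ s_prev + W_gx @ x_t)          # gate in (0, 1), elementwise
    candidate = np.tanh(W_ss @ s_prev + W_sx @ x_t)  # proposed new state
    return (1 - g) * s_prev + g * candidate          # g ~ 0 keeps the old value

s = np.zeros(m)
for x_t in rng.normal(size=(5, d)):
    s = gated_step(s, x_t)
print(s)
```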
Long Short Term Memory neural networks
Gated RNNs used in practice are even more complicated. In particular, Long Short-Term Memory networks (abbreviated LSTM) are well-suited to classifying, processing and making predictions based on time-series data, since there can be lags of unknown duration between important events in a time series. They have the following gates defined:
The input, forget, and output gates control respectively how to read information
into the memory cell, how to forget information that we’ve had previously, and
how to output information from the memory cell into a visible form.
The memory-cell update is helpful to retain information over longer sequences. But we keep this memory cell hidden, and instead only reveal the visible portion of the state. It is this $h_t$ at the end of the whole sequence, after applying this box along the sequence, that we will use as the vector representation of the sequence.
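The notes describe these gates only in words; as a reference, here is a sketch of a single step in the standard LSTM formulation (the exact notation and parametrisation may differ from the course's):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 4, 3
# one (state, input) weight pair per gate and for the memory candidate
W = {k: (rng.normal(size=(m, m)), rng.normal(size=(m, d)))
     for k in ("f", "i", "o", "c")}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(h_prev, c_prev, x_t):
    f = sigmoid(W["f"][0] @ h_prev + W["f"][1] @ x_t)   # forget gate
    i = sigmoid(W["i"][0] @ h_prev + W["i"][1] @ x_t)   # input gate
    o = sigmoid(W["o"][0] @ h_prev + W["o"][1] @ x_t)   # output gate
    c = f * c_prev + i * np.tanh(W["c"][0] @ h_prev + W["c"][1] @ x_t)  # memory cell
    h = o * np.tanh(c)                                  # visible state
    return h, c

h, c = np.zeros(m), np.zeros(m)
for x_t in rng.normal(size=(5, d)):
    h, c = lstm_step(h, c, x_t)
print(h)   # the final h_t is the vector representation of the sequence
```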
On the LSTM topic one could also look at these external resources:
http://blog.echen.me/2017/05/30/exploring-lstms/
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
https://machinelearningmastery.com/handle-long-sequences-long-short-term-
memory-recurrent-neural-networks/
Key things
- Neural networks for sequences: encoding
- RNNs, unfolded
  - state evolution, gates
  - relation to feed-forward neural networks
  - back-propagation (conceptually)
- Issues: vanishing/exploding gradient
- LSTM (operationally)
11.1. Objective
From Markov model to recurrent neural networks (RNNs)
Outline
- Modeling sequences: language models
- Markov models
  - as neural networks
  - hidden state, Recurrent Neural Networks (RNNs)
- Example: decoding images into sentences
While in the last lesson we saw how to transform a sentence into a vector, in a parametrized way that can be optimized for what we want the vector to do, today we are going to talk about how to generate sequences using recurrent neural networks (decoding): for example, in the translation domain, how to unravel the output vector into a sentence in another language.
Markov models
One way to implement prediction in a sequence is to first define probabilities for every possible combination by looking at existing data (for example, in next-word prediction, "language modelling", look at all two-word combinations in a series of texts) and then, once we observe a case, sample the next element from this conditional (discrete) probability distribution. This is a first-order Markov model.
We first learn the probability of any pair of words from data (a "corpus"). For practical reasons, we consider a vocabulary V of the n most frequent words (and symbols), labelling all the others as UNK (a catch-all for "unknown"). To this vocabulary we also add two special symbols, <beg> and <end>, to mark respectively the beginning and the end of a sentence. So a pair (<beg>, w) would represent a word w ∈ V starting the sentence, and (w, <end>) would represent the word w ending the sentence.
We can now estimate, for all words $w, w' \in V$, the probability that $w'$ follows $w$, that is, the conditional probability that the next word is $w'$ given that the previous one was $w$, by a normalised count of the successive occurrences of the pair $(w, w')$ (matching statistics):

$$\hat{P}(w' \mid w) = \frac{\text{count}(w, w')}{\sum_{w_i \in V} \text{count}(w, w_i)}$$
(we could be a bit smarter and build the denominator only from the pairs whose first element is $w$, rather than scanning all $n^2$ pairs)
For a bigram we would obtain a probability table like the following one:
At this point we can generate a sentence by repeatedly sampling from the conditional probability mass function of the last observed word, starting with <beg> (first row in the table) and continuing until we sample an <end>.
We can also use the table to define the probability of any given sentence, by just using the probability multiplication rule. The probability of any N-word sentence (including <beg> and <end>) is then $P = \prod_{i=2}^{N} P(w_i \mid w_{i-1})$. For example, given the above table, the probability of the sentence "Course is great" would be 0.1 · 0.7 · 0.6 · 0.2.
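A small sketch of such a bigram model (the probability table here is hypothetical, merely shaped like the one above):

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical bigram table: P(next | previous), each row sums to 1
P = {
    "<beg>":  {"Course": 0.1, "is": 0.3, "great": 0.3, "<end>": 0.3},
    "Course": {"Course": 0.1, "is": 0.7, "great": 0.1, "<end>": 0.1},
    "is":     {"Course": 0.1, "is": 0.1, "great": 0.6, "<end>": 0.2},
    "great":  {"Course": 0.3, "is": 0.3, "great": 0.2, "<end>": 0.2},
}

def sentence_probability(words):
    # product rule over successive pairs, including <beg> and <end>
    seq = ["<beg>"] + words + ["<end>"]
    p = 1.0
    for prev, nxt in zip(seq, seq[1:]):
        p *= P[prev][nxt]
    return p

def sample_sentence():
    # sample from the conditional PMF of the last word until <end>
    word, out = "<beg>", []
    while True:
        nxts, probs = zip(*P[word].items())
        word = rng.choice(nxts, p=probs)
        if word == "<end>":
            return out
        out.append(word)

print(sentence_probability(["Course", "is", "great"]))  # 0.1 * 0.7 * 0.6 * 0.2
print(sample_sentence())
```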
Note that, using the counting approach, the model maximises the probability that the generated sequences correspond to the observed sequences in the corpus, i.e. counting corresponds here to the maximum likelihood estimation of the conditional probabilities. Also, the same approach can be used to model words character by character rather than sentences word by word.
Outline detailed
In this segment we considered language modelling, using the very simple sequence model called the Markov model. In the next segments of this lecture we are going to turn those Markov models into neural network models, first as feed-forward neural network models. Then we will add a hidden state and turn them into recurrent neural network models. And finally, we will briefly consider an example of unraveling a vector representation, in this case coming from an image, into a sequence.
We start with a one-hot encoding of the words, i.e. each input word activates one unique node of the input layer of the network (x). In other words, if the vocabulary is composed of K words, the input layer of the neural network has width K, and it is filled with zeros except for the node of the specific word, encoded as 1.
As output of the neural network we want the PMF conditional on the specific word in input. So the output layer too has one unit for each possible word, each node returning the probability that the next word is that specific one, given that the previous one was the one encoded in x: $P_k = P(w_i = k \mid w_{i-1})$
Given the weights W of this neural network (not to be confused with the words w), the argument of the activation function of each node of the output layer is $z_k = x \cdot W_k + W_{0,k}$.
These z are real numbers. To transform them into probabilities (all positive, summing to 1) we use the non-linear softmax function as activation:

$$P_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
We can use as input of the neural network a vector x composed of the one-hot encoding of the previous word concatenated with the one-hot encoding of the second-previous word, obtaining the probability that the next word is $w_k$ conditional on the two preceding words (roughly similar to a second-order Markov model).
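A sketch of this feedforward language model (the vocabulary size and the weights are illustrative, and untrained):

```python
import numpy as np

rng = np.random.default_rng(4)
K = 5                                   # vocabulary size (illustrative)

def one_hot(k, K):
    x = np.zeros(K); x[k] = 1.0
    return x

def softmax(z):
    z = z - z.max()                     # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# trigram-style input: previous word and second-previous word, concatenated
W = rng.normal(size=(2 * K, K))         # the 2K x K weight matrix
b = rng.normal(size=K)                  # the offsets W_{0,k}

x = np.concatenate([one_hot(2, K), one_hot(0, K)])   # the two preceding words
p = softmax(x @ W + b)                  # PMF over the K possible next words
print(p, p.sum())                       # probabilities summing to 1
```

Note how the weight matrix has 2K × K entries, which is the parameter count discussed in the Parsimony paragraph below.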
Further, we could also insert a hidden layer between the input and output layers, in order to look at more complex combinations of the preceding two words, in terms of how they are mapped to the probability values over the next word.
Parsimony
For the tri-gram model above, the neural network would need a weight matrix W of 2K × K entries, i.e. a total of 2K² parameters.
The framework is similar to the RNN we saw in the previous lesson, with a state (initially set to a vector of zeros) that is updated, at each new piece of information, with a function of the previous state and the newly observed word (typically something like $s_t = \tanh(W^{s,s} s_{t-1} + W^{s,w} x_t)$). Note that $x_t$ is the one-hot encoding of the newly observed word.
The difference is that, in addition, at each step we also have an output that transforms the state into probabilities (representing the new conditional PMF): $p_t = \mathrm{softmax}(W^{0} s_t)$.
Note that the state here retains information from the whole history of the sentence, hence the probability is conditional on all the previous history of the sentence. In other words, while the role of $W^{s,s}$ and $W^{s,w}$ is to select and encode the relevant features from the previous history and the new data respectively, the role of $W^{0}$ is to extract the relevant features from the memorised state with the aim of making a prediction.
Finally, we can have more complex structures, like LSTM networks, with forget, input and output gates, and the state divided into a memory cell and a visible state. Also in this case we would have a further transformation that outputs the conditional PMF as a function of the visible state ($p_t = \mathrm{softmax}(W^{0} h_t)$).
Note that the training phase computes the average loss at the level of sentences, i.e. our labels are the full sentences, and it is these that are compared, in the loss function, with those obtained by sampling the PMFs produced by the RNN. While in training we feed the true words as input for the next time step, in testing we instead let the neural network predict the sentence on its own, using the sampled output at one time step as the input for the next step.
The only difference is that the initial state is not a vector of zeros, but the "encoded sentence". We just start with that state and the <beg> symbol as x, then let the RNN produce a PMF, sample from it, use the sampled word as the new x for the next step, and so on until we sample an <end>.
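A sketch of this decoding loop, reusing the illustrative pieces above (the vocabulary and weights are made up and untrained, so the output is gibberish; the point is the mechanics):

```python
import numpy as np

rng = np.random.default_rng(5)
vocab = ["<beg>", "Course", "is", "great", "<end>"]
K, m = len(vocab), 8
W_ss, W_sw = rng.normal(size=(m, m)), rng.normal(size=(m, K))
W_0 = rng.normal(size=(K, m))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(s, max_len=20):
    # s is the "encoded sentence" vector; start from the <beg> symbol
    x = np.zeros(K); x[vocab.index("<beg>")] = 1.0
    out = []
    for _ in range(max_len):
        s = np.tanh(W_ss @ s + W_sw @ x)      # state update
        p = softmax(W_0 @ s)                  # conditional PMF over the next word
        k = rng.choice(K, p=p)                # sample the next word
        if vocab[k] == "<end>":
            break
        out.append(vocab[k])
        x = np.zeros(K); x[k] = 1.0           # sampled word becomes the next input
    return out

print(decode(rng.normal(size=m)))             # decode a random "encoding"
```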
We don’t just have to take a vector from a sentence and translate it into another
sentence. We can take a vector of representation of an image that we have
learned or are learning in the same process and translate that into a sentence.
Key summary
- Markov models for sequences
  - how to formulate, estimate, and sample sequences from them
- RNNs for generating (decoding) sequences
  - relation to Markov models (how to translate Markov models into RNNs)
  - evolving hidden state
  - sampling from the RNN at each point
- Decoding vectors into sequences (once we have this architecture that can generate sequences, such as sentences, we can start with any vector, coming from a sentence, an image, or another context, and unravel it as a sequence, e.g. as a natural-language sentence)
Homework 4
12.1. Objectives
At the end of this lecture, you will be able to
In these networks, layer l is obtained by sliding over the image at layer l − 1 a small filter (or kernel), with a step of 1 pixel (typically) or more at a time. The step is called the stride, while the whole process of sliding the filter can be mathematically seen as a convolution.
So, as we slide the filter, at each location the output is the dot product between the values of the filter and the corresponding patch of the image (both vectorised), where the values of the filter are the weights that we want to learn, and they remain constant across the sliding. If our filter is a 10 × 10 matrix, we have only 100 weights to learn per layer (plus one for the offset). As in a feedforward neural network, the dot product then undergoes an activation function, here typically the ReLU function (max(0, x)).
For example, given an image

$$x = \begin{bmatrix} 1 & 1 & 2 & 1 & 1 \\ 3 & 1 & 4 & 1 & 1 \\ 1 & 3 & 1 & 2 & 2 \\ 1 & 2 & 1 & 1 & 1 \\ 1 & 1 & 2 & 1 & 1 \end{bmatrix}$$

and filter weights

$$w = \begin{bmatrix} 1 & -2 & 0 \\ 1 & 0 & 1 \\ -1 & 1 & 0 \end{bmatrix},$$

the output of the filter z would be

$$z = \begin{bmatrix} 8 & -3 & 6 \\ 4 & -3 & 5 \\ -3 & 5 & -2 \end{bmatrix}.$$
For example, the element $z_{2,3} = 5$ of this matrix is the sum of the element-wise multiplication between the corresponding image patch $x' = \begin{bmatrix} 4 & 1 & 1 \\ 1 & 2 & 2 \\ 1 & 1 & 1 \end{bmatrix}$ and $w$.
Finally, the output of the layer would be (using ReLU) $\begin{bmatrix} 8 & 0 & 6 \\ 4 & 0 & 5 \\ 0 & 5 & 0 \end{bmatrix}$.
You can notice that applying the filter results in a dimensionality reduction, which depends on the dimension of the filter and the stride (the sliding step). To avoid this, a padding of zeros can be applied to the image so as to keep the same dimensions in the output (in the above example, a padding of one zero on each side, in both dimensions, would suffice).
Because the weights of the filter are the same everywhere, it doesn't really matter where in the image an object is learned. The lecture explains this with a mushroom example: if the mushroom is in a different place, in a feed-forward neural network the weight-matrix parameters at that location would need to learn to recognize the mushroom anew. With convolutional layers we have translational invariance, as the same filter is passed over the entire image; it will therefore detect the mushroom regardless of its location.
Keeping the output of the above example as input, a (max-)pooling layer with a 2 × 2 filter and a stride of 1 would result in $\begin{bmatrix} 8 & 6 \\ 5 & 5 \end{bmatrix}$.
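A small NumPy sketch reproducing this worked example (convolution as a sliding dot product, then ReLU, then 2 × 2 max-pooling with stride 1):

```python
import numpy as np

x = np.array([[1, 1, 2, 1, 1],
              [3, 1, 4, 1, 1],
              [1, 3, 1, 2, 2],
              [1, 2, 1, 1, 1],
              [1, 1, 2, 1, 1]])
w = np.array([[ 1, -2, 0],
              [ 1,  0, 1],
              [-1,  1, 0]])

def slide(a, k, op):
    # apply op(patch) at every location of a k x k window, stride 1
    n = a.shape[0] - k + 1
    return np.array([[op(a[i:i+k, j:j+k]) for j in range(n)] for i in range(n)])

z = slide(x, 3, lambda patch: np.sum(patch * w))   # filter output
a = np.maximum(0, z)                               # ReLU activation
p = slide(a, 2, np.max)                            # 2x2 max-pooling, stride 1

print(z)   # [[ 8 -3  6] [ 4 -3  5] [-3  5 -2]]
print(a)   # [[ 8  0  6] [ 4  0  5] [ 0  5  0]]
print(p)   # [[8 6] [5 5]]
```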
In a typical CNN, these convolutional and pooling layers are repeated several times, where the initial few layers typically capture the simpler and smaller features, whereas the later layers use information from these low-level features to identify more complex and sophisticated features, like characterisations of a scene. The learned weights hence specialise across the layers in a sequence like edges -> simple parts -> parts -> objects -> scenes.
Concerning the topic of CNNs, see also the superb lecture by Andrej Karpathy on YouTube (here or here).
Typically one single layer is formed by applying multiple filters, not just one. This is because we want to learn different kinds of features: for example, one filter may activate on vertical lines in the image, another on oblique ones, and maybe another on different colours. And, by the way, the image and the filters normally have a further dimension to account for colour (typically of size 3):
[Figure: 96 convolutional filters learned on the first layer (filters of size 11x11x3, applied across input images of size 224x224x3). They have all been learned from random initialisation.]
So in each layer we map the input image into multiple feature maps, where each feature map is generated by a little weight matrix, the filter, which defines the little classifier that is run across the original image to produce the associated feature map. Each of these feature maps defines a channel for information.
We can then combine these convolutions, looking for features, with pooling, which compresses the image a little, forgetting the information about where things are but maintaining what is there.
These layers are finally followed by some "normal", "fully connected" layers (à la feed-forward neural network) and a final softmax layer indicating the probability that the image represents each of the possible categories (there could be thousands of them).
The best network implementations are tested in so-called "competitions", like the yearly ImageNet contest.
Note that we can train these networks exactly as we do feedforward NNs: defining a loss function and finding the weights that minimise it. In particular we can apply the stochastic gradient descent algorithm (with a few tricks, based on feeding pairs of an image and the corresponding label), where the gradient with respect to the various parameters (weights) is obtained by backpropagation.
Take home
- Understand what a convolution is
- Understand pooling, which tries to generate a slightly more compressed image, forgetting where things are but maintaining information about what's there, what was activated
Cross-correlation: $(f * g)(t) := \int_{-\infty}^{\infty} f(\tau)\, g(t + \tau)\, d\tau$ (i.e. the g function is not reversed)
For example, cross-correlating the (zero-padded) sequence f = (1, 2, 3) with the filter g = (1, 2), sliding g across f:

f: ... 0 1 2 3 0 ...

h(t=0) = 0·1 + 1·2 = 2
h(t=1) = 1·1 + 2·2 = 5
h(t=2) = 2·1 + 3·2 = 8
h(t=3) = 3·1 + 0·2 = 3
h(t=4) = 0
Graphically it is the sliding of the filter (first row, left to right, second row, left to
right, …) that we saw in the previous segment.
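This matches what NumPy computes (a quick check; np.convolve is shown for contrast, since convolution does reverse the filter):

```python
import numpy as np

f = [1, 2, 3]
g = [1, 2]
print(np.correlate(f, g, mode="full"))  # [2 5 8 3] -- the h(t) values above
print(np.convolve(f, g))                # [1 4 7 6] -- the filter is reversed
```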