Convolutional Neural Networks

Alasdair Newson
LTCI, Télécom Paris, IP Paris

A. Newson 1
Introduction

Neural networks provide a highly flexible way to model complex dependencies and patterns in data

In the previous lessons, we saw the following elements :
MLPs : fully connected layers, biases
Activation functions : sigmoid, softmax, ReLU
Optimisation : gradient descent, stochastic gradient descent
Regularisation : weight decay, dropout, batch normalisation
RNNs : for sequential data

[Figure: fully connected layer followed by a non-linearity]

A. Newson 2
Introduction

In MLPs, every layer of the network is fully connected
Unfortunately, there are great drawbacks with such an approach

[Figure: fully connected layer mapping a 256 × 256 input to 1000 hidden units]

Each hidden unit is connected to each input unit
There is high redundancy in these weights :
In the above example, around 65 million weights are required (256 × 256 × 1000)
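
The weight count quoted above can be checked with a few lines of arithmetic. This is only a sketch: the 3 × 3 filter used for comparison is an assumption taken from the filter sizes discussed later in the lesson.

```python
# Weight count for one fully connected layer versus one small convolution filter.
# Sizes follow the slide's example: a 256 x 256 input and 1000 hidden units.
height, width, hidden_units = 256, 256, 1000

# One weight per (input pixel, hidden unit) pair.
dense_weights = height * width * hidden_units

# A single 3 x 3 filter (assumed size), shared over all image positions.
conv_weights = 3 * 3

print(dense_weights)  # 65536000
print(conv_weights)   # 9
```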

A. Newson 3
Introduction

For many types of data with grid-like topological structures (e.g. images), it is not necessary to have so many weights
For these data, the convolution operation is often extremely useful
Reduces the number of parameters to train
Training is faster
Convergence is easier : smaller parameter space

A neural network with convolution operations is known as a Convolutional Neural Network (CNN)
A. Newson 4
Introduction - some history

“Neocognitron” of Fukushima∗ : first to incorporate the notion of receptive field into a neural network, based on the work of Hubel and Wiesel† on animal perception
Yann LeCun was the first to propose back-propagation for training convolutional neural networks‡
Automatic learning of parameters instead of hand-crafted weights
However, training was very long : it required 3 days (in 1990)

∗ Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position, Fukushima, K., Biological Cybernetics, 1980
† Receptive fields and functional architecture of monkey striate cortex, Hubel, D. H. and Wiesel, T. N., 1968
‡ Backpropagation Applied to Handwritten Zip Code Recognition, LeCun, Y. et al., AT&T Bell Laboratories

A. Newson 5
Introduction - some history

In the years 1998-2012, research continued on shallow and deep neural networks, but other machine learning approaches were preferred (GMMs, SVMs etc.)

In 2012, Alex Krizhevsky et al. used Graphics Processing Units (GPUs) to carry out backpropagation on a very deep convolutional neural network
Greatly outperformed classic approaches in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
GPUs turned out to be very efficient for training neural nets (lots of parallel computations)
Signalled the beginning of the deep learning revolution

A. Newson 6
Introduction - some history

Since 2012, CNNs have completely revolutionised many domains
CNNs produce competitive/best results for most problems in image processing and computer vision

Applications of deep learning :
Image classification
Image style transfer (A Neural Algorithm of Artistic Style, Gatys et al., CVPR 2015)
Computer graphics (from AtlasNet, Groueix et al., CVPR, 2018)
Image restoration
Medical imaging (Medical Image Classification with Convolutional Neural Network, Li et al., ICARCV, 2014)
Automatic speech recognition

Being applied to an ever-increasing number of problems

A. Newson 7
Summary

1 Introduction, notation
2 Convolutional Layers
3 Down-sampling and the receptive field
4 CNN details and variants
5 CNNs in practice
6 Image datasets, well-known CNNs, and applications
Applications of CNNs
7 Interpreting CNNs
Visualising CNNs
Adversarial examples

A. Newson 8
Introduction - some notation

Notations
$x \in \mathbb{R}^n$ : input vector
$y \in \mathbb{R}^q$ : output vector
$u^{\ell}$ : feature vector at layer $\ell$
$\theta^{\ell}$ : network parameters at layer $\ell$

Neural network with L layers

A. Newson 9
Introduction

A “Convolutional Neural Network” (CNN) is simply a concatenation of :
1 Convolutions (filters)
2 Additive biases
3 Down-sampling (“Max-Pooling” etc.)
4 Non-linearities

In this lesson, we will mainly concentrate on convolutional and down-sampling layers

A. Newson 10
Convolutional Layers

Convolution operator
Let $f$ and $g$ be two integrable functions. The convolution operator $*$ takes as its input two such functions, and outputs another function $h = f * g$, which is defined at any point $t \in \mathbb{R}$ as :
$$h(t) = (f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau.$$

Intuitively, $h(t)$ is the inner product between $f$ and a flipped, shifted version of $g$

A. Newson 12
Convolutional Layers

In many practical applications, in particular for CNNs, we use the discrete convolution operator, which acts on discretised functions

Discrete convolution operator
Let $f_n$ and $g_n$ be two summable series, with $n \in \mathbb{Z}$. The discrete convolution operator is defined as :
$$(f * g)(n) = \sum_{i=-\infty}^{+\infty} f(i)\, g(n - i)$$

Intuitively, $(f * g)(n)$ is the inner product between $f$ and a flipped, shifted version of $g$
In practice, the filter has a small spatial support, around 3 × 3 or 5 × 5
Therefore, only a small number of parameters need to be trained (9 or 25 for these filters)
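
As a minimal sketch of the discrete definition above, the sum can be evaluated directly for finite sequences, treating both as zero outside their index range. The function name `conv1d_full` and the example values are illustrative assumptions, not part of the lesson.

```python
def conv1d_full(f, g):
    """Discrete convolution (f * g)(n) = sum_i f(i) g(n - i) for finite sequences.

    Both sequences are treated as zero outside their index range, so the
    result has len(f) + len(g) - 1 entries.
    """
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            # This is the term f(i) g(n - i) with n = i + j.
            out[i + j] += fi * gj
    return out


signal = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 0.0, -1.0]  # a 3-tap filter: only 3 parameters
result = conv1d_full(signal, kernel)
print(result)  # [1.0, 2.0, 2.0, 2.0, -3.0, -4.0]
```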
A. Newson 13
Convolutional Layers

Properties of convolution
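1 Associativity : $(f * g) * h = f * (g * h)$

2 Commutativity : $f * g = g * f$

3 Bilinearity : $(\alpha f) * (\beta g) = \alpha \beta (f * g)$, for $(\alpha, \beta) \in \mathbb{R} \times \mathbb{R}$

4 Equivariance to translation : $(f * (g + \tau))(t) = (f * g)(t + \tau)$, where $g + \tau$ denotes $g$ shifted by $\tau$, i.e. $(g + \tau)(t) = g(t + \tau)$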

A. Newson 14
Convolutional Layers

Associativity, commutativity
Associativity + commutativity implies that we can carry out convolutions in any order
There is no point in having two or more consecutive convolutions with nothing in between : their composition is itself a single convolution
This is in fact true for any linear map

Equivariance to translation
Equivariance implies that the convolution of any shifted input, (f + τ) ∗ g, contains the same information as f ∗ g †
This is useful, since we want to detect objects anywhere in the image

† if we forget about border conditions for a moment
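
The properties above can be checked numerically on small integer sequences, where the arithmetic is exact. This is a sketch under the assumption that finite sequences are zero outside their support; `conv1d_full` and `shift_right` are illustrative helper names.

```python
def conv1d_full(f, g):
    """Full discrete convolution of two finite sequences."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out


def shift_right(f, tau):
    """Delay a finite sequence by tau samples (zero-filled on the left)."""
    return [0] * tau + list(f)


f = [1, 2, 3]
g = [0, 1, -1]
h = [2, 1]

commutative = conv1d_full(f, g) == conv1d_full(g, f)
associative = conv1d_full(conv1d_full(f, g), h) == conv1d_full(f, conv1d_full(g, h))
# Shifting the input by tau shifts the output by the same tau.
equivariant = conv1d_full(shift_right(f, 2), g) == shift_right(conv1d_full(f, g), 2)

print(commutative, associative, equivariant)  # True True True
```

Associativity is also what makes two consecutive convolutions collapse into one: convolving with `g` and then `h` gives the same result as a single convolution with `conv1d_full(g, h)`.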

A. Newson 15
Convolutional Layers - 2D Convolution

Most often, we are going to be working with images
Therefore, we require a 2D convolution operator : this is defined in a very similar manner to 1D convolution :

2D convolution operator
$$(f * g)(s, t) = \sum_{i=-\infty}^{+\infty} \sum_{j=-\infty}^{+\infty} f(i, j)\, g(s - i, t - j)$$

Important remarks for the rest of the lesson!
We are going to denote the filters with $w$
For lighter notation, we write $w(i) =: w_i$ (and the same for $x_i$ etc.)

A. Newson 16
Convolutional Layers : Visual Illustration

A. Newson 17
Convolutional Layers

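The filter weights $w_i$ determine what type of “feature” can be detected by convolutional layers
Example, Sobel filters :

Horizontal edge :
$$\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$

Vertical edge :
$$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$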
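
To see these filters respond to an edge, the sketch below applies the standard Sobel kernels to a tiny synthetic image containing one vertical edge. The helper `conv2d_valid` and the test image are assumptions made for illustration; the convolution is evaluated only where the filter fits entirely inside the image.

```python
def conv2d_valid(image, kernel):
    """2D convolution, evaluated only where the kernel lies fully inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for s in range(out_h):
        for t in range(out_w):
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    # True convolution: the kernel is flipped relative to the image patch.
                    acc += kernel[a][b] * image[s + kh - 1 - a][t + kw - 1 - b]
            out[s][t] = acc
    return out


sobel_horizontal_edge = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
sobel_vertical_edge = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# Dark on the left, bright on the right: a single vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(4)]

vertical_response = conv2d_valid(image, sobel_vertical_edge)
horizontal_response = conv2d_valid(image, sobel_horizontal_edge)

print(vertical_response)    # [[-4, -4, 0], [-4, -4, 0]]
print(horizontal_response)  # [[0, 0, 0], [0, 0, 0]]
```

The vertical-edge filter responds (with non-zero magnitude) only near the edge, while the horizontal-edge filter stays at zero everywhere on this image.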

A. Newson 30
Convolutional Layers

Convolutional filters can also act as low-pass/smoothing filters

Input image Low-pass filtered image

A. Newson 31
Convolutional Layers

We can also write convolution as a matrix/vector product, as in the case of fully connected layers
Example : discrete Laplacian operator
$$w = \begin{pmatrix} 0 & -1 & 0 \\ -1 & 4 & -1 \\ 0 & -1 & 0 \end{pmatrix} \;\rightarrow\; A_w = \begin{pmatrix} 4 & -1 & \cdots & -1 & \cdots & & \\ -1 & 4 & -1 & \cdots & -1 & \cdots & \\ 0 & -1 & 4 & -1 & \cdots & -1 & \cdots \\ \vdots & & \ddots & \ddots & \ddots & & \vdots \\ -1 & \cdots & -1 & \cdots & & -1 & 4 \end{pmatrix}$$
$A_w$ has $K$ rows and $n$ columns; all entries not shown are zero

This further illustrates the drastic reduction in weight parameters (9 instead of Kn)
Can be useful to view convolution in this manner (we will see this later)
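
The matrix view can be sketched in 1D, where the convolution matrix is a simple banded matrix. The 1D filter `[-1, 2, -1]` (a 1D analogue of the Laplacian) and the helper names are assumptions for illustration; the slide's example is the 2D case.

```python
def conv1d_full(x, w):
    """Full discrete convolution of two finite sequences."""
    out = [0] * (len(x) + len(w) - 1)
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            out[i + j] += xi * wj
    return out


def conv_matrix(w, n):
    """Matrix A_w with A_w[i][j] = w[i - j], so that A_w x equals the full convolution x * w."""
    k = len(w)
    return [[w[i - j] if 0 <= i - j < k else 0 for j in range(n)]
            for i in range(n + k - 1)]


def matvec(a, x):
    """Plain matrix/vector product."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


w = [-1, 2, -1]
x = [1, 4, 9, 16]
A_w = conv_matrix(w, len(x))

for row in A_w:
    print(row)

same_result = matvec(A_w, x) == conv1d_full(x, w)
print(same_result)  # True
```

Every row of `A_w` contains the same three filter values, only shifted: the matrix has 6 × 4 entries but just 3 free parameters.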

A. Newson 32
Convolutional Layers

At this point, it is good to have a more “neural network”-based


illustration of CNNs

...
...

We can see two of the main justifications for CNNs


1 Sparse connectivity
2 Weight sharing

A. Newson 33
Convolutional Layers

Now that we understand convolution, how do we optimize a neural network with convolutional layers ? Back-propagation
Consider a layer with just a convolution with $w$

We have the derivatives $\frac{\partial L}{\partial y_i}$ available
We want to calculate the following quantities :
$\frac{\partial L}{\partial x_k}$ (for further back-propagation) and $\frac{\partial L}{\partial w_k}$
We shall use the abbreviation $\frac{\partial L}{\partial y_i} =: dy_i$

A. Newson 34
Convolutional Layers

Before considering the general case, let’s take an example from the illustration above

Say we want to calculate $dx_1 := \frac{\partial L}{\partial x_1}$

A. Newson 35
Convolutional Layers

Each element $y_i$ depends on the inputs $x_\cdot$ and the weights $w_\cdot$
Therefore, we can consider that the loss is a function of several variables :

$$L = f(x_1, \dots, x_n, w_1, \dots, w_K, y_1(x_\cdot, w_\cdot), \dots, y_m(x_\cdot, w_\cdot))$$

We use the multi-variate chain rule

$$dx_1 = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_1}$$

A. Newson 36
Convolutional Layers

...
...

dx1 =???

A. Newson 37
Convolutional Layers

...
...

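$$dx_1 = dy_0 \frac{\partial y_0}{\partial x_1} + dy_1 \frac{\partial y_1}{\partial x_1} + dy_2 \frac{\partial y_2}{\partial x_1} = dy_0\, c + dy_1\, b + dy_2\, a$$
(using $y_i = a x_{i-1} + b x_i + c x_{i+1}$)

As we can see, the order of the weights is flipped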


A. Newson 38
Convolutional Layers
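Now, let us calculate $\frac{\partial L}{\partial x_k}$ for any $k$

$$\begin{aligned}
\frac{\partial L}{\partial x_k} &= \sum_i dy_i \frac{\partial y_i}{\partial x_k} && \text{(multi-variate chain rule)} \\
&= \sum_i dy_i \frac{\partial (x * w)_i}{\partial x_k} \\
&= \sum_i dy_i \frac{\partial \left( \sum_j x_j w_{i-j} \right)}{\partial x_k} \\
&= \sum_i dy_i\, w_{i-k} = \sum_i dy_i\, w_{-(k-i)}
\end{aligned}$$

More compactly : $dx_k = (dy * \mathrm{flip}(w))_k$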


A. Newson 39
Convolutional Layers

Recall that the convolution operator can be written $y = A_w x$, with $A_w$ the convolution matrix
The flipping of the weights corresponds to a transpose of $A_w$

$$dx = A_w^{\top}\, dy \qquad (1)$$

This gives an easy method of backpropagation in convolutional layers
Although you will not actually have to implement this
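
Even if it never has to be implemented by hand, the flipped-filter rule is easy to verify numerically. The sketch below uses the three-weight example $y_i = a x_{i-1} + b x_i + c x_{i+1}$ and compares the analytic gradient with central finite differences. The loss $L = \sum_i dy_i\, y_i$, the numeric values and the helper names are assumptions chosen so that $\partial L / \partial y_i = dy_i$ holds by construction.

```python
def forward(x, a, b, c):
    """y_i = a x_{i-1} + b x_i + c x_{i+1}, for the positions where all three inputs exist."""
    return [a * x[i - 1] + b * x[i] + c * x[i + 1] for i in range(1, len(x) - 1)]


def backward_input(dy, n, a, b, c):
    """dx_k = c dy_{k-1} + b dy_k + a dy_{k+1}: the same filter with its weights flipped."""
    g = [0.0] + list(dy) + [0.0]   # g[i] = dL/dy_i, zero where y_i does not exist
    padded = [0.0] + g + [0.0]     # padded[k + 1] = g[k]
    return [c * padded[k] + b * padded[k + 1] + a * padded[k + 2] for k in range(n)]


def numerical_grad(x, dy, a, b, c, h=1e-3):
    """Central finite differences of L(x) = sum_i dy_i y_i(x)."""
    def loss(v):
        return sum(g * y for g, y in zip(dy, forward(v, a, b, c)))

    grad = []
    for k in range(len(x)):
        plus = list(x)
        plus[k] += h
        minus = list(x)
        minus[k] -= h
        grad.append((loss(plus) - loss(minus)) / (2 * h))
    return grad


x = [1.0, -2.0, 0.5, 3.0, 4.0]
a, b, c = 0.5, -1.0, 2.0
dy = [1.0, -1.0, 2.0]  # upstream gradients for y_1, y_2, y_3

analytic = backward_input(dy, len(x), a, b, c)
numeric = numerical_grad(x, dy, a, b, c)
max_error = max(abs(p - q) for p, q in zip(analytic, numeric))

print(analytic)  # [0.5, -1.5, 4.0, -4.0, 4.0]
print(max_error < 1e-6)  # True
```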

A. Newson 40
Convolutional Layers
Now for the second part : $\frac{\partial L}{\partial w_k}$

Again, we use the chain rule. For example $da = \sum_i \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial a}$

A. Newson 41
Convolutional Layers

...
...

We have $y_i = a x_{i-1} + b x_i + c x_{i+1}$

$$da = \sum_i dy_i\, x_{i-1}$$

A. Newson 42
Convolutional Layers

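In the general case, we have:

$$\begin{aligned}
\frac{\partial L}{\partial w_k} &= \sum_i dy_i \frac{\partial y_i}{\partial w_k} && \text{(multi-variate chain rule)} \\
&= \sum_i dy_i \frac{\partial (x * w)_i}{\partial w_k} \\
&= \sum_i dy_i \frac{\partial \left( \sum_j x_j w_{i-j} \right)}{\partial w_k} \\
&= \sum_i dy_i\, x_{i-k} = \sum_i dy_i\, x_{-(k-i)} && \text{(with } k = i - j \text{)}
\end{aligned}$$

More compactly : $dw_k = (dy * \mathrm{flip}(x))_k$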
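
For the three-weight example $y_i = a x_{i-1} + b x_i + c x_{i+1}$, the weight gradients are sums over all output positions. A minimal sketch, with illustrative values and helper names that are assumptions:

```python
def backward_weights(x, dy):
    """Gradients (da, db, dc) for y_i = a x_{i-1} + b x_i + c x_{i+1}.

    dy[0] is the upstream gradient for y_1, dy[1] for y_2, and so on.
    Every output position contributes to every weight.
    """
    da = sum(g * x[i - 1] for i, g in enumerate(dy, start=1))
    db = sum(g * x[i] for i, g in enumerate(dy, start=1))
    dc = sum(g * x[i + 1] for i, g in enumerate(dy, start=1))
    return [da, db, dc]


x = [1.0, -2.0, 0.5, 3.0, 4.0]
dy = [1.0, -1.0, 2.0]

weight_grads = backward_weights(x, dy)
print(weight_grads)  # [4.0, 3.5, 5.5]
```

Each of the three gradients depends on the whole input `x`, which is the weight sharing discussed next.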

A. Newson 43
Convolutional Layers

Note : optimisation of the loss w.r.t. one parameter $w_k$ involves the entire image
Weights are “shared” across the entire image
This notion of weight sharing is one of the main justifications of using CNNs

In practice, we do not calculate $dw_k$ and $dx_k$ ourselves, we use the automatic differentiation tools of TensorFlow, PyTorch etc.

A. Newson 44
Convolutional Layers - border conditions

The convolution operator poses a problem at the borders

Theoretically, we consider functions defined over an infinite domain,


but which have compact support

In reality, we only have finite vectors/matrices to work on

A. Newson 45
Convolutional Layers - border conditions

Two common approaches to border conditions

“VALID” approach
Only take shift/dot products that do not extend beyond Supp(u)
Output size : m − |w| + 1

“SAME” approach
Keep output size m
Need to choose values outside of Supp(u) : zero-padding
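
The two border conventions can be sketched in 1D as follows. The function names and example values are assumptions, and the “SAME” version below assumes an odd filter length so that the padding is symmetric.

```python
def conv1d_valid(u, w):
    """Convolution restricted to positions where the filter lies inside Supp(u)."""
    k = len(w)
    return [sum(w[j] * u[n + k - 1 - j] for j in range(k))
            for n in range(len(u) - k + 1)]


def conv1d_same(u, w):
    """Zero-pad the input so that the output has the same size as the input."""
    pad = (len(w) - 1) // 2  # assumes an odd filter length
    padded = [0] * pad + list(u) + [0] * pad
    return conv1d_valid(padded, w)


u = [1, 2, 3, 4, 5]
w = [1, 1, 1]

valid = conv1d_valid(u, w)
same = conv1d_same(u, w)

print(valid)  # [6, 9, 12]          size m - |w| + 1 = 3
print(same)   # [3, 6, 9, 12, 9]    size m = 5
```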

A. Newson 46
2D+feature convolution

Several filters are used per layer, let us say K filters : $\{w_1, \dots, w_K\}$
The resulting vectors/images are then stacked together to produce the next layer’s input $u^{\ell+1} \in \mathbb{R}^{m \times n \times K}$

$$u^{\ell+1} = [u^{\ell} * w_1, \dots, u^{\ell} * w_K]$$

Therefore, the next layer’s weights must have a depth of K. The 2D convolution with an image of depth K is defined as
$$(u * w)_{y,x} = \sum_{i,j,k} u(i, j, k)\, w(y - i, x - j, k)$$

Useful explanation : https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215
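
The stacking of filter responses described above can be sketched with nested lists: an input of shape H × W × C is convolved with K filters, each of depth C, and the K responses become the channels of the output. The function name, the “valid” border handling and the toy values are assumptions for illustration.

```python
def conv2d_multichannel(u, filters):
    """Valid 2D convolution of an H x W x C input with K filters of size kh x kw x C.

    Returns an array of shape (H - kh + 1) x (W - kw + 1) x K.
    """
    height, width, channels = len(u), len(u[0]), len(u[0][0])
    kh, kw = len(filters[0]), len(filters[0][0])
    out = []
    for y in range(height - kh + 1):
        row = []
        for x in range(width - kw + 1):
            responses = []
            for w in filters:  # one output channel per filter
                acc = 0
                for i in range(kh):
                    for j in range(kw):
                        for c in range(channels):  # sum over the input depth
                            acc += w[i][j][c] * u[y + kh - 1 - i][x + kw - 1 - j][c]
                responses.append(acc)
            row.append(responses)
        out.append(row)
    return out


# 3 x 3 input with 2 channels: channel 0 is all ones, channel 1 is all twos.
u = [[[1, 2] for _ in range(3)] for _ in range(3)]

all_ones = [[[1, 1], [1, 1]], [[1, 1], [1, 1]]]       # looks at both channels
first_channel = [[[1, 0], [1, 0]], [[1, 0], [1, 0]]]  # ignores channel 1

out = conv2d_multichannel(u, [all_ones, first_channel])
print(out[0][0])  # [12, 4]
```

With K = 2 filters the output has depth 2, so a filter in the following layer would itself need a depth of 2.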

A. Newson 47
Convolutional layers

Illustration of several consecutive convolutional layers with different numbers of filters

Each layer contains an “image” with a depth, where each channel corresponds to a different filter response
Each layer is a concatenation of several features : rich information

Useful explanation : https://towardsdatascience.com/a-comprehensive-introduction-to-different-types-of-convolutions-in-deep-learning-669281e58215

A. Newson 48
Convolutional layers - a note on Biases

A note on biases in neural networks : each output channel (i.e. each filter) is associated with one bias
There is not one bias per pixel
This is coherent with the idea of weight sharing (bias sharing)

A. Newson 49
Convolutional Layers

In many cases, we are primarily interested in detection;


We would like to detect objects wherever they are in the image

Formally, we would like to have some shift invariance property;


This is done in CNNs by using subsampling, or some variant :
Strided convolutions
Max pooling
We explain these now

A. Newson 50
Down-sampling and the
receptive field

A. Newson 52
The Receptive Field

Neural networks were initially inspired by the brain’s functioning
Hubel and Wiesel† showed that the visual cortex of cats and monkeys contained cells which individually responded to different small regions of the visual field
The region which an individual cell responds to is known as the “receptive field” of that cell

† Receptive fields and functional architecture of monkey striate cortex, Hubel, D. H.; Wiesel, T. N., 1968
Illustration from : http://www.yorku.ca/eye/cortfld.htm

A. Newson 53
The Receptive Field

This idea was imitated in convolutional neural networks by adding down-sampling operations

[Figure: convolution + subsampling]

Illustration from : Applied Deep Learning, Andrei Bursuc, https://www.di.ens.fr/~lelarge/dldiy/slides/lecture_7/

A. Newson 54
Strided convolution

Strided convolution is simply convolution, followed by subsampling

Subsampling operator (for 1D case)
Let $x \in \mathbb{R}^n$. We define the subsampling step as $\delta > 1$, and the subsampling operator $S_\delta : \mathbb{R}^n \to \mathbb{R}^{n/\delta}$, applied to $x$, as
$$S_\delta(x)(t) = x(\delta t), \quad \text{for } t = 0 \dots \frac{n}{\delta} - 1$$
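
A minimal sketch of the subsampling operator, and of strided convolution as convolution followed by subsampling. The helper names and values are assumptions; when n is not a multiple of δ, this sketch simply drops the trailing samples.

```python
def subsample(x, delta):
    """S_delta(x)(t) = x(delta * t), for t = 0 .. n/delta - 1."""
    return [x[delta * t] for t in range(len(x) // delta)]


def conv1d_valid(u, w):
    """Convolution restricted to positions where the filter lies inside the input."""
    k = len(w)
    return [sum(w[j] * u[n + k - 1 - j] for j in range(k))
            for n in range(len(u) - k + 1)]


def strided_conv1d(u, w, delta):
    """Strided convolution: convolve, then keep one output out of every delta."""
    return subsample(conv1d_valid(u, w), delta)


u = [1, 2, 3, 4, 5, 6, 7, 8]
w = [1, 1, 1]

print(subsample(u, 2))          # [1, 3, 5, 7]
print(conv1d_valid(u, w))       # [6, 9, 12, 15, 18, 21]
print(strided_conv1d(u, w, 2))  # [6, 12, 18]
```

Real frameworks compute only the outputs that are kept, rather than computing all of them and discarding some, but the result is the same.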

A. Newson 55
Max pooling

Max pooling subsampling consists in taking the maximum value over a certain region
This maximum value is the new subsampled value
We will indicate the max pooling operator with $S_m$

[Figure: max pooling over a region]
A. Newson 56
Max pooling

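Back propagation of max pooling only passes the gradient through the maximum

[Figure: max pooling maps the 2 × 2 block (10, 80 ; 15, 30) to 80; back propagation sends the gradient to the position of the 80 and zero to the other three positions]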
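
The forward and backward passes of max pooling can be sketched as follows, using the 2 × 2 block from the figure. The function names are assumptions, and the sketch assumes non-overlapping 2 × 2 windows on an input with even height and width.

```python
def max_pool_2x2(x):
    """Non-overlapping 2 x 2 max pooling; also records where each maximum came from."""
    pooled, argmax = [], []
    for r in range(0, len(x), 2):
        pooled_row, argmax_row = [], []
        for c in range(0, len(x[0]), 2):
            window = [(x[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window)
            pooled_row.append(value)
            argmax_row.append(position)
        pooled.append(pooled_row)
        argmax.append(argmax_row)
    return pooled, argmax


def max_pool_backward(d_pooled, argmax, shape):
    """Route each upstream gradient to the position of the maximum; zero elsewhere."""
    rows, cols = shape
    dx = [[0] * cols for _ in range(rows)]
    for r, argmax_row in enumerate(argmax):
        for c, (i, j) in enumerate(argmax_row):
            dx[i][j] += d_pooled[r][c]
    return dx


x = [[10, 80],
     [15, 30]]

pooled, argmax = max_pool_2x2(x)
dx = max_pool_backward([[1]], argmax, (2, 2))

print(pooled)  # [[80]]
print(dx)      # [[0, 1], [0, 0]]
```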

A. Newson 57
Down-sampling

Conclusion : a cascade of convolutions, non-linearities and subsampling produces shift-invariant classification/detection
We can detect Roger wherever he is in the image !

[Figure: convolution + non-linearity + max pooling applied to several images; the detection succeeds in each case]

A. Newson 58
Dilated Convolution

There is a variant of convolution called dilated convolution∗
Increase spatial extent of convolution without adding parameters
Add a space D between each point in the convolution

[Figure: filter support for D = 1, D = 2, D = 3]

$$(u * v)(y, x) = \sum_{i,j,k} u(i, j, k)\, v(y - Di, x - Dj, k) \qquad (2)$$

∗ Multi-Scale Context Aggregation by Dilated Convolutions, Yu, F. and Koltun, V., ICLR 2016
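
A 1D sketch of dilated convolution: the filter keeps its three weights, but its taps are spread D samples apart, so it covers a wider part of the input. The function name, the restriction to in-range positions and the example values are assumptions for illustration.

```python
def dilated_conv1d(x, w, dilation):
    """out(n) = sum_i w(i) x(n - D*i), kept only where every index is in range."""
    span = dilation * (len(w) - 1)  # distance between the first and last filter tap
    return [sum(w[i] * x[n - dilation * i] for i in range(len(w)))
            for n in range(span, len(x))]


x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]  # 3 parameters, whatever the dilation

print(dilated_conv1d(x, w, 1))  # [2, 2, 2, 2, 2]   each output spans 3 input samples
print(dilated_conv1d(x, w, 2))  # [4, 4, 4]         each output spans 5 input samples
```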

A. Newson 60
Locally connected layers / unshared convolution

We might wish for a mix of a dense layer and a convolutional layer


One possibility : locally-connected layers (sometimes called
“unshared convolution”)
Local connectivity but no weight sharing

...
...

Number of weights increases linearly with the number of pixels, rather


than quadratically (for MLPs)
A. Newson 61
How to build your CNN ?

We have looked at the following operations : convolutions, additive biases, non-linearities
All of these elements make up convolutional neural networks

However, how do we put these together to create our own CNN ?
Architecture ?
Programming tools ?
Datasets ?

A. Newson 63
Architecture : vanilla CNN

A simple classification CNN architecture often consists of a feature learning section
Convolution → biases → non-linearities → subsampling
This continues until a fixed subsampling is achieved

After this, a classification section is used
Fully connected layer → non-linearity

A. Newson 64
Architecture

Central question : how to choose number of layers ?

Complicated, very little theoretical understanding, currently a hot


topic of research

However : there are a few rules of thumb to follow


Receptive field of the deepest layer should encompass what we
consider to be a fundamental brick of the objects we are analysing

convolution,
subsampling etc.

Set number of layers and subsampling factors according to the problem

A. Newson 65
CNN programming frameworks

Caffe
Open source, developed by the University of California, Berkeley
Network created in separate specific files
Somewhat laborious to use, less used than other frameworks

Theano
Open source, created by the Université de Montréal
Unfortunately, to be discontinued due to strong competition

TensorFlow
Open source, developed by Google
Implements a wide range of deep learning functionalities, widely used

PyTorch
Open source, developed by Facebook
Implements a wide range of deep learning functionalities, widely used

A. Newson 66
MNIST dataset

MNIST is a dataset of 28 × 28 pixel grey-level images containing hand-written digits (60,000 training images, plus 10,000 test images)
The digits are centred in the images and scaled to have roughly the same size
Although quite a “simple” dataset, still used to display performance of modern CNNs

A. Newson 68
Caltech 101

Produced in 2003, first major object recognition dataset


9,146 images, 101 object categories, each category contains between
40 and 800 images
Annotations exist for each image : bounding box for the object and a
human-drawn outline

A. Newson 69
ImageNet dataset

Dataset created in 2009 by researchers from Princeton University
Very large dataset : 14,197,122 images, hand-annotated
Used for the ImageNet Large Scale Visual Recognition Challenge, an annual benchmark competition for object recognition algorithms

A. Newson 70
LeNet (1989/1998)

Created by Yann LeCun in 1989, goal : to recognise handwritten digits
Able to classify digits with 98.9% accuracy, used by the U.S. government to automatically read digits

Illustration from : Gradient-based Learning Applied to Document Recognition, LeCun, Y., Bottou, L., Bengio, Y. and Haffner, P., Proceedings of the IEEE, 1998

A. Newson 71
AlexNet (2012)

AlexNet : created by Alex Krizhevsky in 2012
Reduced the error rate in the ImageNet Large Scale Visual Recognition Challenge competition by around 10 percentage points (to 16.4%)
First truly deep neural network
Signalled the beginning of the dominance of deep learning in image processing and computer vision

Illustration from : Imagenet classification with deep convolutional neural networks, Krizhevsky, A., Sutskever, I. and Hinton, G. E., NIPS, 2012

A. Newson 72
GoogLeNet (2015)

In 2014/2015, Google introduced the “Inception” architecture/module


Major attempt at reducing total number of parameters
No fully connected layers, only convolutional
2 million instead of 60 million for AlexNet
Novel idea : have variable receptive field sizes in one layer

Going deeper with convolutions, Szegedy et al, CVPR, 2015

A. Newson 73
GoogLeNet (2015)

Created by Google in 2014, GoogLeNet is a specific implementation of


the “inception” architecture
6.6% test error rate on ImageNet (human error rate 5%)

Going deeper with convolutions, Szegedy et al, CVPR, 2015

A. Newson 74
VGG16 (2015)

VGG16 is a 16-layer network, with small receptive fields (3 × 3 filters, with less subsampling)
Around 7.5% test error on ILSVRC

Very Deep Convolutional Networks for Large-Scale Image Recognition, Simonyan, K. and Zisserman, A., ICLR, 2015
Illustration from Mathieu Cord, https://blog.heuritech.com/2016/02/29/a-brief-report-of-the-heuritech-deep-learning-meetup-5/
A. Newson 75
Summary of advances in CNNs
Network      LeNet (1998)   AlexNet (2012)   GoogLeNet (2014)   VGG16 (2015)
Image size   28 × 28        256 × 256 × 3    256 × 256 × 3      224 × 224 × 3
Layers       3              8                22                 16
Parameters   60,000         60 million       2 million          138 million

Evolution of CNN performance
[Figure: bar chart of the error on ILSVRC (vertical axis from 0 to 25) for SVM (2011), AlexNet (2012), AlexNet, bis (2013), VGG16 (2015), GoogLeNet (2014) and deep ResNets (2015)]

A. Newson 76
Image classification

As we mentioned before, CNNs make sense for data with grid-like


structures

In particular, images are most often the target of CNNs

Arguably the most common application of CNNs is to image


classification

Why is image classification important ? Closely linked to :


Object detection
Tracking
Image search (in large databases for example)

In recent years, the best performing classification algorithms have been


using neural networks

A. Newson 77
Image classification

Why is image classification difficult ?


Images can vary in size, shape, position
We need to deal with variable lighting conditions, occlusions etc.

Let us look at a standard CNN classification network

A. Newson 78
Image classification

We have input datapoints x, which we wish to classify into several,


predefined classes {ci }, i = 1 . . . K, where K is the number of classes

As we have seen, convolution, non-linearities, subsampling allow for


robust classification that is invariant to many perturbations

Vast majority of CNN classification networks follow this general


architecture

A. Newson 79
Image classification

Residual architectures : ResNet
ResNet∗ (2016) uses skip connections to mitigate the vanishing gradient problem
Similar to LSTM, except that information propagates through network layers, rather than time
Residual mechanism used in many subsequent architectures
Latest residual architecture gives 87.54% accuracy on ImageNet

∗ Deep Residual Learning for Image Recognition, He, K. et al., CVPR, 2016
Illustration from https://becominghuman.ai/resnet-convolution-neural-network-e10921245d3d

A. Newson 80
Image classification

Attention mechanism in image networks
Recall the attention mechanism in RNNs : addresses problem of long range dependency
Networks exist with attention only : transformer∗
Also used in image network architectures (usually self-attention)

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}(Q K^{\top})\, V \qquad (3)$$

Q, queries: what is the importance of these elements
K, keys: we use these elements for comparison (weighting)
V, values: we use these to “reconstruct” the queries
Often Q, K, V are the same, image patches

∗ Attention is all you need, Vaswani et al., NIPS, 2017

A. Newson 81
Image classification

This equation says that the attention is a weighted version of V
The weights are given by a softmax of the dot products between patches in Q and those in K
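
The attention formula can be sketched directly on small lists of vectors. This follows the equation as written on the slide, without the $1/\sqrt{d_k}$ scaling factor used in the original transformer paper; the function names and the toy values of Q, K and V are assumptions.

```python
import math


def softmax(row):
    """Softmax of a list of scores."""
    shift = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Softmax(Q K^T) V, computed one query at a time."""
    out = []
    for q in Q:
        # Dot product of the query with every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        weights = softmax(scores)
        # Weighted combination of the values.
        out.append([sum(wt * v[d] for wt, v in zip(weights, V))
                    for d in range(len(V[0]))])
    return out


Q = [[1.0, 0.0]]              # one query
K = [[1.0, 0.0], [0.0, 1.0]]  # two keys
V = [[1.0, 0.0], [0.0, 1.0]]  # two values

out = attention(Q, K, V)
print(out)
```

The query matches the first key better than the second, so the output is closer to the first value; the two weights are positive and sum to one.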


Attention is all you need, Vaswani et al, NIPS, 2017

A. Newson 82
Image classification

Attention mechanism in image networks

[Figure: a query patch Q compared with key patches K in the image]

Attention is all you need, Vaswani et al, NIPS, 2017

A. Newson 83
Image classification

Attention mechanism in image networks


Combined attention/convolution architectures present the best accuracies on ImageNet (to date∗)
CoAtNet-7: 90.88% accuracy on ImageNet

∗ https://paperswithcode.com/sota/image-classification-on-imagenet

A. Newson 84
Image classification

We can also detect the position of objects in images
R-CNN∗ proposes a simple approach :
1 Propose a list of bounding boxes in the image
2 Pass the resized sub-images through a powerful classification network
3 Classify each sub-image with your favourite classifier

Many variants on this work (Fast R-CNN, Faster R-CNN etc.)

∗ Rich feature hierarchies for accurate object detection and semantic segmentation, Girshick, R. et al., CVPR 2014

A. Newson 85
Motion estimation

Motion estimation is a central task for many image processing and computer vision problems : tracking, video editing
Optical flow involves estimating a vector field $(u, v) : \mathbb{R}^2 \to \mathbb{R}^2$ where each vector gives the displacement of pixel $(x, y)$ from an image $I_1$ to $I_2$

$$I_1(x, y) = I_2(x + u(x, y),\, y + v(x, y))$$

[Figure: optical flow]

Illustration from : BriefMatch: Dense binary feature matching for real-time optical flow estimation, Eilertsen, G., Forssén, P-E., Unger, J., Scandinavian Conference on Image Analysis, 2017

A. Newson 86
Motion estimation with CNNs

A major challenge of optical flow estimation is to handle both fine and


large-scale motions
This is difficult to do with classical, variational approaches
CNNs have this multi-scale architecture already built in
Example : FlowNet∗ uses this, first extracting meaningful features
from the images (in parallel) and then combining them to create the
optical flow


FlowNet: Learning Optical Flow with Convolutional Networks, Fischer et al, ICCV 2015

A. Newson 87
Super-resolution

Image super-resolution : go from a low-resolution image to a higher-resolution one
Relatively straightforward approach with a CNN∗

Drawback : highly dependent on the degradation used to create the lower-resolution images in the database

∗ Learning a deep convolutional network for image super-resolution, Dong, C. et al., ECCV 2014

A. Newson 88
Point clouds

CNNs require regular grids. Point cloud data are not in this format
Nevertheless, ways have been found to deal with this

3D ShapeNets∗ splits a volume up into sub-regions that are processed by CNNs
Each voxel is modelled as a Bernoulli random variable, giving the probability that this voxel belongs to a shape
This general approach (using voxels) is followed in many other approaches

∗ 3D ShapeNets: A deep representation for volumetric shapes, Wu, Z. et al., CVPR, 2015

A. Newson 89
Adversarial examples

As is often the case in deep learning, it is very difficult to understand


what is going on in CNNs

Much research is being dedicated to understanding these networks


Explainable AI (XAI) Darpa project∗

We discuss two topics related to interpretability


Visualising CNNs
Adversarial examples


https: // www. darpa. mil/ program/ explainable-artificial-intelligence

A. Newson 91
Visualising CNNs

We would like to understand what CNNs are learning


Unfortunately filters are difficult to interpret (especially deeper layers)

Layer 3 filters
Layer 1 filters
Therefore, much research has been dedicated to visualising CNNs

A. Newson 92
Visualising CNNs
Idea : “invert” CNN, find x to maximise the output of a certain layer
Understand what this layer is “seeing”
This is possible due to backpropagation

Basic CNN visualisation algorithm
Choose a layer $\ell$ to visualise
$x_0 \sim \mathcal{N}(0, 1)$
For $i = 1 \dots N$
    $x_i = x_{i-1} + \lambda \nabla_x \| u^{\ell}(x_{i-1}) \|$    (gradient ascent)
Return $x_N$
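
The loop above can be sketched on a toy problem. Everything specific here is an assumption made to keep the example self-contained: the “layer” is a small fixed linear map `W`, the objective is the squared norm $\|Wx\|^2$ (whose gradient $2 W^{\top} W x$ can be written by hand), and the step size and iteration count are arbitrary. With a real CNN the gradient would come from automatic differentiation.

```python
import random


def matvec(W, x):
    """Plain matrix/vector product."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def norm_sq(v):
    """Squared Euclidean norm."""
    return sum(v_i * v_i for v_i in v)


def grad_norm_sq(W, x):
    """Gradient of ||W x||^2 with respect to x, i.e. 2 W^T (W x)."""
    u = matvec(W, x)
    return [2 * sum(W[i][j] * u[i] for i in range(len(W))) for j in range(len(x))]


random.seed(0)
W = [[1.0, 2.0, 0.0],
     [0.0, 1.0, -1.0]]  # hypothetical toy "layer"
x = [random.gauss(0.0, 1.0) for _ in range(3)]  # x_0 ~ N(0, 1)
step = 0.01                                      # lambda in the algorithm above

initial_response = norm_sq(matvec(W, x))
for _ in range(50):
    g = grad_norm_sq(W, x)
    x = [x_j + step * g_j for x_j, g_j in zip(x, g)]  # gradient ascent update
final_response = norm_sq(matvec(W, x))

print(final_response > initial_response)  # True
```

Without a constraint on the norm of `x` the response keeps growing with the number of iterations, which is one motivation for constraining or regularising the input.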

A. Newson 93
Visualising features

Generalisation: maximise the response of a given filter
Choose layer $\ell$, filter $k$ and element (“pixel”) $(i, j)$
Random initialisation $x_0$, constrain norm of solution $x$

$$\hat{x} = \arg\max_x\, u^{\ell}_{i,j,k} \quad \text{with } \|x\| = \rho$$

Optimisation : gradient ascent

Erhan, Bengio, Courville, Vincent, Visualizing Higher-Layer Features of a Deep Network, University of Montreal, 2009

A. Newson 94
Visualising CNNs

More sophisticated approach: standard inverse problem with regularisation
$$\hat{x} = \arg\min_x \|f(x) - f_0\|_2^2 + \lambda \|x\|_2^2 + \mu \|\nabla x\|_2^2 \qquad (5)$$

Mahendran and Vedaldi, Understanding Deep Image Representations by Inverting Them, Conference on Computer Vision and Pattern Recognition, 2014

A. Newson 95
Visualising CNNs

Layer 1 Layer 2 Layer 3 Layer 4


Maximisation of different activations applied to MNIST dataset


Erhan, Bengio, Courville, Vincent, Visualizing Higher-Layer Features of a Deep Network, University of Montreal, 2009

A. Newson 96
Visualising CNNs

Another approach of Simonyan et al.† proposes to see what images correspond to what classes
Choose a class $c$, maximise the response of this class

$$\hat{x} = \arg\max_x\, f(x)_c - \lambda \|x\|_2^2$$

Find an $L_2$-regularised image which maximises the score for a given class $c$
Initialise with random input image $x_0$

† Simonyan, Vedaldi, Zisserman, Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, arXiv preprint arXiv:1312.6034, 2013

A. Newson 97
Visualising CNNs

Class model visualisation



Simonyan, Vedaldi, Zisserman Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency
Maps, arXiv preprint arXiv:1312.6034, 2013

A. Newson 98
Visualising CNNs

Similar idea with Inception architecture of Google : “Deep Dream”


Maximise a class from input image

Input image Maximising “dogs” category

Deepdream - a code example for visualizing neural networks, Mordvintsev, A., Olah, C. and Tyka, M., Google Research, 2015

A. Newson 99
Adversarial examples

We often get the impression that CNNs are the end all and be all of AI
Consistently produce state-of-the-art results on images
However, CNNs are not infallible : adversarial examples† !

How was this image created ???


Intriguing properties of neural networks, Szegedy, C. et al, arXiv preprint arXiv:1312.6199, 2013

A. Newson 100
Adversarial examples

Szegedy et al.† propose adding a small perturbation $r$ that fools the classifier network $f$ into choosing the wrong class $c$ for $\hat{x} = x + r$

$$\arg\min_r \|r\|_2^2, \quad \text{s.t. } f(x + r) = c, \quad x + r \in [0, 1]^n$$

$\hat{x}$ is the closest example to $x$ s.t. $\hat{x}$ is classified as in class $c$
Minimisation with box-constrained L-BFGS algorithm

† Intriguing properties of neural networks, Szegedy, C. et al., arXiv preprint arXiv:1312.6199, 2013

A. Newson 101
Adversarial examples

Common explanation : the space of images is very high-dimensional,


and contains many areas that are unexplored during training time

Example of loss surfaces in commonly used networks (Res-Nets)

Illustration from Visualizing the Loss Landscape of Neural Nets, Li, H et al, NIPS, 2018

A. Newson 102
Adversarial examples

Many approaches to adversarial examples exist. Goodfellow et al.† propose a principled way of creating these

Consider the output of a fully connected layer $\langle w, \hat{x} \rangle = \langle w, x \rangle + \langle w, r \rangle$

Let us set $r = \mathrm{sign}(w)$. What happens to $\langle w, \hat{x} \rangle$ ?
It increases by $nm$, which grows with the dimension $n$ ($m$ is the average absolute value of the entries of $w$)
However, $\|r\|_\infty$ does not increase with $n$

Conclusion : we can add a small vector $r$ that increases the output response $\langle w, \hat{x} \rangle$

† Explaining and Harnessing Adversarial Examples, Goodfellow, I. J., Shlens, J. and Szegedy, C., ICLR 2015

A. Newson 103
Adversarial examples

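Goodfellow et al. consider a local linearisation of the network’s loss around $\theta$
$$L(x_0) \approx f(x_0) + w \nabla_x L(\theta, x_0, y_0)$$

Thus, the perturbed image $\hat{x}$ is set to

$$\hat{x} = x + \epsilon\, \mathrm{sign}(\nabla_x L(\theta, x, y))$$
with $\epsilon > 0$ a small step size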


Explaining and Harnessing Adversarial Examples, Goodfellow, I.J, Shlens, J. and Szegedy, C., ICLR 2015
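
The sign-of-the-gradient perturbation can be sketched on a model small enough to differentiate by hand. The setting is an assumption chosen for illustration: a linear classifier with logistic loss, labels in {−1, +1}, and a step size `epsilon` that scales the perturbation as in Goodfellow et al.; the weights and input are toy values.

```python
import math


def loss(w, x, y):
    """Logistic loss log(1 + exp(-y <w, x>)) for a label y in {-1, +1}."""
    margin = y * sum(w_i * x_i for w_i, x_i in zip(w, x))
    return math.log1p(math.exp(-margin))


def grad_x(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * sum(w_i * x_i for w_i, x_i in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * w_i for w_i in w]


def sign(v):
    """Sign of a number: -1, 0 or 1."""
    return (v > 0) - (v < 0)


def fgsm(w, x, y, epsilon):
    """x_adv = x + epsilon * sign(grad_x L): every coordinate moves by at most epsilon."""
    g = grad_x(w, x, y)
    return [x_i + epsilon * sign(g_i) for x_i, g_i in zip(x, g)]


w = [0.5, -1.0, 2.0]  # hypothetical trained weights
x = [1.0, 0.5, 0.25]
y = 1

x_adv = fgsm(w, x, y, epsilon=0.1)

print(loss(w, x, y) < loss(w, x_adv, y))  # True: the loss went up
print(max(abs(p - q) for p, q in zip(x_adv, x)))
```

No coordinate of the input changes by more than `epsilon`, yet all the small changes push the loss in the same direction, which is the effect described above for high-dimensional inputs.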

A. Newson 104
Adversarial examples

Even worse, it is possible to create universal adversarial examples†
Perturbations that fool a network for any image class

Simple algorithm : initialise perturbation $r$, go through database adding specific perturbations to $r$, project onto set $\{ r,\ \|r\| < \varepsilon \}$
What do these perturbations look like ?

† Universal adversarial perturbations, Moosavi-Dezfooli, S-M. et al., arXiv preprint (2017)

A. Newson 105
Adversarial examples


Universal adversarial perturbations, Moosavi-Dezfooli, S-M, et al arXiv preprint (2017)

A. Newson 106
Adversarial examples

Conclusion : CNNs are not necessarily robust


Adversarial examples are a significant problem :
Even printed photos of adversarial examples work†

Explaining and resisting adversarial examples is currently a hot


research topic


Adversarial Examples in the Physical World, Kurakin, A., Goodfellow, I. J, Bengio, S. et al. ICLR workshop, 2017

A. Newson 107
Summary

CNNs represent the state of the art in many different domains/problems

If you have an unsolved problem, there is a good chance CNNs will produce a good/excellent result

However : theoretical understanding is still relatively limited


This leads to problems such as adversarial examples
It is not clear whether CNNs are truly robust/generalisable
This is a hot research topic, important if CNNs are to be used in
industrial applications
21/10/2021 : last lab work, on CNNs

A. Newson 108
