0% found this document useful (0 votes)
30 views164 pages

Neural Networks1

Uploaded by

madi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views164 pages

Neural Networks1

Uploaded by

madi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 164

NEURAL NETWORKS

Emmanuel Vincent, Inria Nancy – Grand Est

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing
Outline

For students in cognitive science and NLP:


 Sep 27: introduction, feedforward networks, impact of depth
 Oct 4: design choices, stochastic gradient descent
 Oct 11: convolutional networks, links with human vision
 Nov 22: advanced training, computer vision applications,
visualization

For students in NLP only:


 Dec 7: recurrent networks
 Dec 12: generative networks, NLP applications

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 2
Grading

For students in cognitive science:


 60%: written exam
 40%: lab work

For students in NLP:


 30%: written exam
 30%: lab work
 20%: multiple choice exam (theory only)
 20%: software project (group of 3 people)

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 3
Course material

This course is strongly inspired and borrows material from:

 Ian Goodfellow, Yoshua Bengio, Aaron Courville


Deep Learning
https://www.deeplearningbook.org/

 Hugo Larochelle
Online Course on Neural Networks
http://info.usherbrooke.ca/hlarochelle/neural_
networks/

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 4
Prerequisites

The course will use the following concepts from linear algebra:
 scalars (a), vectors (a), matrices (A), tensors
 vector norm
 matrix multiplication, determinant
 eigendecomposition
and the following concepts from statistics:
 random variable, discrete/continuous probability distribution
 Bernoulli, categorical, Gaussian distributions
 expectation, mean, variance, covariance
 joint, marginal, conditional probability, chain rule

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 5
INTRODUCTION
Artificial intelligence & Machine learning

Artificial intelligence (AI) = creating machines that assist humans.

Early days of AI: solve problems defined by formal rules, e.g.,


IBM’s Deep Blue chess-playing system.

Machine learning (ML) = learn from data, no hardcoded rules.

Raw data (e.g., image = pixel values) correlates poorly with the
desired output. Hence conventional ML:
 define relevant features of the data
 map them to the desired output.

Performance depends heavily on the chosen features.

Introduction 7
Impact of chosen features
PTER 1. INTRODUCTION
Example: separate two categories of data by drawing a line

 

 

e 1.1: Example of different representations: suppose we want to separate


Introduction 8
How to define relevant features?

For many tasks, manually defining relevant features is difficult.

Example: car detection in images.

Building a feature indicating the presence of a wheel is hard due to


variable shape, illumination, etc.

Also, the wheel may be hidden by an obstacle, or simply missing.

Introduction 9
Example data variabilities
6 Olga Russakovsky* et al.

Introduction 10
More data variabilities

Fig. 1 The diversity of data in the ILSVRC image classification and single-object localization tasks. For each of the eight
dimensions, we showIntroduction
example object categories along the range of that property. Object scale, number of instances and image
11
Combined variabilities Olga Russakovsky* e

PASCAL ILSVRC
birds

···
cats

···
dogs

···

g. 2 The ILSVRC dataset contains many more fine-grained classes compared to the standard PASCAL VOC benchm
example, instead of the PASCAL “dog” category there are 120 different breeds of dogs in ILSVRC2012-2014 classifica
Introduction
d single-object localization tasks. 12
Representation learning & Deep learning

Representation learning = learn optimal features from data.

Deep learning achieves this goal by learning complex features


expressed in terms of simpler features.

Each feature is computed by means of a simple parameterizable


function called neuron.

The neurons may be arranged into layers.

The set of all interconnected neurons forms a neural network.

Introduction 13
CHAPTER 1. INTRODUCTION

Deep learning
Output
CAR PERSON ANIMAL
(object identity)

3rd hidden layer


(object parts)

2nd hidden layer


(corners and
contours)

1st hidden layer


(edges)

Visible layer
(input pixels)

Figure 1.2: Illustration


Introduction
of a deep learning model. It is difficult for a computer to understand 14
the meaning of raw sensory input data, such as this image represented as a collection
CHAPTER 1. INTRODUCTION

Overview
Output

Mapping from
Output Output
features

Additional
Mapping from Mapping from layers of more
Output
features features abstract
features

Hand- Hand-
Simple
designed designed Features
features
program features

Input Input Input Input

Deep
Classic learning
Rule-based
machine
systems Representation
learning
learning

Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each
Introduction 15
other within different AI disciplines. Shaded boxes indicate components that are able to
Brief history of neural networks

0.000250
Frequency of Word or Phrase

cybernetics
0.000200
(connectionism + neural networks)
0.000150

0.000100

0.000050

0.000000
1940 1950 1960 1970 1980 1990 2000
Year

Figure 1.7: The figure shows (deep


twolearning not historical
of the three shown here)waves of artificial neural nets
research, as measured by the frequency of the phrases “cybernetics” and “connectionism” or
“neural networks” according to Google Books (the third wave is too recent to appear). The
first wave started with cybernetics in the 1940s–1960s, with the development of theories
Introduction 16
Cybernetics (1940–1970)
Linear neuron = take input features x1 ,. . . ,xN and compute

f (x) = w1 x1 + · · · + wN xN

Weights w1 , . . . , wN fixed manually or learned from data by


stochastic gradient descent (detailed later).

Limited representation capability, e.g., cannot represent XOR:

f ([0, 1]) = 1
f ([1, 0]) = 1
f ([1, 1]) = 0
f ([0, 0]) = 0

Such limitations caused a first backlash against neural networks.

Introduction 17
Connectionism (1980–2000)
Cognitive scientists moved from symbolic reasoning to models of
cognition that could be grounded in neural implementations.

Central idea: many simple computational units can achieve various


intelligent behaviors when networked together.

Key concepts:
 distributed representation: each data is represented by many
features and vice-versa, e.g., use separate neurons for car
make and color rather than neurons for every combination,
 backpropagation algorithm to train neural networks (detailed
later), limited to shallow networks in practice.
Unrealistic commercial claims and advances in other areas (SVM,
graphical models) led to a second backlash.

Introduction 18
Deep learning (2006–?)
In 2006, Geoffrey Hinton (University of Toronto) proposed an
efficient training strategy called greedy layerwise pretraining.

His group and those of Yoshua Bengio (University of Montreal)


and Yann LeCun (New York University) showed that it helped
training deeper neural networks than had been possible before.

This third wave of popularity is still ongoing.

In hindsight, the success of deep learning is not due to greedy


layerwise pretraining which is not used anymore but to
 increased dataset sizes
 increased model sizes
 increased computational power.

Introduction 19
Increasing dataset sizes

The learning algorithms reaching human performance on complex


tasks today are nearly identical to those which struggled to solve
toy problems in the 1980s.

Big data has made machine learning easier. With larger datasets:
 the amount of human expertise required to avoid overfitting
reduces
 accuracy improves (think “law of large numbers”).

As a rule of thumb, supervised deep learning works well with


 10 million labeled training examples
 5,000 labeled examples per category.

Introduction 20
Increasing dataset sizes

10 9
Dataset size (number examples)

10 8 Canadian Hansard
WMT Sports-1M
10 7 ImageNet10k
10 6 Public SVHN
10 5 Criminals ImageNet ILSVRC 2014
10 4
MNIST CIFAR-10
10 3
10 2 T vs. G vs. F Rotated T vs. C
10 1 Iris

10 0
1900 1950 1985 2000 2015
Year
ure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statist
died datasets using hundreds or thousands of manually compiled measurements (G
0; Gosset, 1908 ; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the
Introduction 21 pi
Increasing model sizes

Biological neurons are not densely connected.

Artificial neurons already reached a similar number of connections.

Introduction 22
Increasing model sizes

10 4 Human

6 Cat
Connections per neuron

9 7
4
10 3 Mouse
2
10
5
8
10 2 Fruit fly
3
1

10 1
1950 1985 2000 2015

(1 to 10 = various neural networks)


Year
gure 1.10: Initially, the number of connections between neurons in artificial n
tworks was limited by hardware capabilities. Today, the number of connections be
urons is mostly a design consideration. Some artificial neural networks have nea
Introduction 23
any connections per neuron as a cat, and it is quite common for other neural net
Increasing model sizes

Rather than the number of connections, the total number of


neurons is key (think “brain”).

Larger networks can represent deeper, more complex concepts.

Artificial neural networks have doubled in size roughly every 2.4


years, but have remained small until quite recently.

This growth is driven by


 computers with larger memory
 suitable libraries (Theano, Torch, Caffe, Keras, TensorFlow,
PyTorch. . . ) exploiting graphical processing units (GPU)
 the availability of larger datasets.

Introduction 24
Increasing model sizes
Number of neurons (logarithmic scale)

1011 Human
1010
17 20
109 16 19 Octopus
108 14 18
107 11 Frog
106 8
105 3 Bee
Ant
104
103 Leech
102 13
101 1 2 12 15 Roundworm
100 6 9
5 10
10−1 4 7
10−2 Sponge
1950 1985 2000 2015 2056
Year
(1 to 20 =
ure 1.11: Since the introduction of various neural artificial
hidden units, networks)neural networks have do
size roughly every 2.4 years. Biological neural network sizes from Wikipedia (20
Introduction 25
1. Perceptron (Rosenblatt, 1958, 1962)
Example image classification performance
ImageNet Large Scale Visual Recognition Challenge (1,000 classes)

Introduction 26
Example speech recognition performance

Human error rate in phone conversations (Switchboard): 5 to 10%

Introduction 27
FEEDFORWARD NETWORKS
Artificial neuron

Neuron: multiple-input single-output parametric nonlinear function.

Also known as perceptron or elementary computing unit.

Typically:
 parametric linear transform
 then fixed scalar nonlinear function.

x1 !
X
x2 h h=g wn xn + b = g(w T x + b)
n
x3

Feedforward networks 29
Terminology

x1 !
X
x2 h h=g wn xn + b = g(w T x + b)
n
x3

x = [x1 , . . . , xN ]T : inputs
w = [w1 , . . . , wN ]T : weights
b: bias
a(x) = w T x + b: pre-activation
g(.): activation function
h = g(a(x)): (output) activation

Feedforward networks 30
Linear activation function
5

0
h

-5
-5 0 5
a

g(a) = a
unbounded, but limited to linear relations

Feedforward networks 31
Sigmoid activation function
1.5

0.5
h

-0.5
-5 0 5
a
1
g(a) = σ(a) =
1 + exp(−a)
bounded between 0 and 1, “squashes” small/large values of a

Feedforward networks 32
Rectified linear activation function
5

0
h

-5
-5 0 5
a

g(a) = max(0, a)
lower-bounded, “squashes” a below 0, favors sparse activations
resulting neuron call rectified linear unit (ReLU)

Feedforward networks 33
P
• a(x) = b + w i xi = b + w > x
Topics: connection weights, bias, activationP
i function
•w
Graphical representation of •
• h(x) = g(a(x)) = g(b +
aPsingle
a(x)neuron
= bi
i w i xi )
• h(x)
• h(x) = g(a(x)) = gP
= g(b i
i
•wia(x)
xi ==bbi

• w
• w
• { y1 • w x2
range

•{
1
ange determined • { -1
•by g(·)determined
b
by g(.)
0 1
-1 • g(·)
bias b on
0
changes th
-1
0 biais
bias b only
position
.5 o
-1
changes the
1 the riff
x1 position of
the riff
(from Pascal Vincent’s slides)
Feedforward networks 34
Feedforward neural network a.k.a. multilayer perceptron
Input layer Hidden Hidden Output
(features) layer #1 layer #2 layer

 DEEP LEARNING FOR


BASED
 (1)   (2) 
h1 h1
 
DISTANT-MICROPHONE
SPEECH ENHANCEMENT SPEECH
x1  (1)   (2) 
h  h  y
x = x2  h =  (1)  h =  (2) 
(1) 2 (2) 2 ŷ = 1
x3 ENHANCEMENT AND RECOGNITION —
h3  h3  y2
(1) (2)
h4 h4
EXPECTED AND UNEXPECTED RESULTS
Feedforward networks 35
Feedforward neural network a.k.a. multilayer perceptron

Every neuron has different parameters w and b.

We index them as follows:


 wji : weight of input j for neuron i at layer l
(l)

 bi : bias for neuron i at layer l


(l)

Feedforward networks 36
First hidden layer
Hidden layer #1: in scalar/vector notation

(1)
X (1) (1) (1) T (1)
ai = wji xj + bi = wi x + bi
j
(1) (1)
hi = g (1) (ai )

In matrix notation (note: g (1) applied to each entry of a (1) ):


T
a (1) = W (1) x + b (1)
h (1) = g (1) (a (1) )

T
Overall: h (1) = g (1) (W (1) x + b (1) ).

Feedforward networks 37
Other hidden layers
Hidden layer #l: in scalar/vector notation

(l)
X (l) (l−1) (l) (l) T (l)
ai = wji hj + bi = wi h (l−1) + bi
j
(l) (l)
hi = g (l) (ai )

In matrix notation (note: g (l) applied to each entry of a (l) ):


T
a (l) = W (l) h (l−1) + b (l)
h (l) = g (l) (a (l) )

T
Overall: h (l) = g (l) (W (l) h (l−1) + b (l) ).

Feedforward networks 38
Output layer
Output layer: in scalar/vector notation

(L+1)
X (L+1) (L) (L+1) (L+1) T (L+1)
ai = wji hj + bi = wi h (L) + bi
j
(L+1)
ŷi = o(ai )

with o(.) output activation function.

In matrix notation (note: o applied to each entry of a (L+1) ):


T
a (L+1) = W (L+1) h (L) + b (L+1)
ŷ = o(a (L+1) )

T
Overall: ŷ = o(W (L+1) h (L) + b (L+1) ).
Feedforward networks 39
Function represented by the neural network

Feedforward neural network with L hidden layers:


T
Hidden layer #1: h (1) = g (1) (W (1) x + b (1) )
T
Hidden layer #2: h (2) = g (2) (W (2) h (1) + b (2) )

...
T
Hidden layer #l: h (l) = g (l) (W (l) h (l−1) + b (l) )

...
T
Output layer: ŷ = o(W (L+1) h (L) + b (L+1) )

Feedforward networks 40
Function represented by the neural network
Feedforward neural network with 1 hidden layer:
ŷ = f (x)
T
= o(W (2) h (1) + b (2) )
T T
= o(W (2) g (1) (W (1) x + b (1) ) + b (2) )

Feedforward neural network with 2 hidden layers:


ŷ = f (x)
T
= o(W (3) h (2) + b (3) )
T T
= o(W (3) g (2) (W (2) h (1) + b (2) ) + b (3) )
T T T
= o(W (3) g (2) (W (2) g (1) (W (1) x + b (1) ) + b (2) ) + b (3) )

And so on.
Feedforward networks 41
IMPACT OF DEPTH
P
• a(x) = b + w i xi = b + w > x
Topics: connection weights, bias, activationP
i function
•w
Graphical representation of •
• h(x) = g(a(x)) = g(b +
aPsingle
a(x)neuron
= bi
i w i xi )
• h(x)
• h(x) = g(a(x)) = gP
= g(b i
i
•wia(x)
xi ==bbi

• w
• w
• { y1 • w x2
range

•{
1
ange determined • { -1
•by g(·)determined
b
by g(.)
0 1
-1 • g(·)
bias b on
0
changes th
-1
0 biais
bias b only
position
.5 o
-1
changes the
1 the riff
x1 position of
the riff
(from Pascal Vincent’s slides)
Impact of depth 43
• g(a) = tanh(a) = exp(a)+exp( a) = exp(2a)+

Capacity
decision of a single
boundary of neuron
neuron 7
Réseaux de neurones
• g(a) = max(0, a)
A single neuron can• perform
classification: g(a) =binary classification.
reclin(a) = max(0, a)
terpret For instance,
neuron sigmoid can be
• p(y
as estimating seen as estimating P(y = yes|x).
= 1|x)
Also known as logistic regression classifier.
eic regression
expressive classifier des réseaux
• g(·) b de neurones
decision boundary is linear
(1) (1) linear decision
(2) boundary
(2)
x
• Wi,j2 bi xj h(x)i wi b
if y ≥ 0.5, predict• h(x) = g(a(x))
class yes ⇣ P
R(1) (1)
uches • a(x) = b 1 + W(1) x a(x)i = bi + j W
⇣ ⌘
if y < 0.5, predict• f (x) = o b(2)R+ (2) >
2
w x
x1
class no x2
• p(y = c|x) x1
x2 h
(from Pascal Vincent’s slides) Pexp(a1 )
Impact of depth • o(a) = softmax(a) = . . . Pexp(a
44
C
c exp(ac ) c exp(
ARTIFICIAL NEURON
Capacity of a single neuron
Topics: capacity of single neuron
• Can solvelinearly
Can solve linearly separable
separable problems
problems 21

OR (x1 , x2 ) AND (x1 , x2 ) AND (x1 , x2 )


1 1 1
, x2 )

, x2 )

, x2 )
0 0 0

0 1 0 1 0 1
(x1 (x1 (x1
XOR (x1 , x2 ) XOR (x1 , x2 )
(x1 , x2 )

1 1
2)

Impact of depth 45
ARTIFICIAL NEURON
, x2

, x2

, x2
Capacity
0
Topics: of a single
capacity neuron
0
of single neuron 0

• Can’t solve
0 non 1linearly separable
0 problems...
1 0 1
Can’t solve non linearly separable problems.
(x (x .. (x1
1 1

XOR (x1 , x2 ) XOR (x1 , x2 )

AND (x1 , x2 )
1 1
, x2 )

0
?
0

0 1 0 1
(x1 AND (x1 , x2 )

• .... unless
. . unlessthe
theinput
input is
is transformed
transformed in in
a better representation
a better representation
Figure 1.8 – Exemple de modélisation de XOR par un réseau à une couche cachée. E
haut, de gauche à droite, illustration des fonctions booléennes OR(x1 , x2 ), AND (x1 , x
, x2 ).ofEn
et AND (x1Impact bas, on présente l’illustration de la fonction XOR(x1 , x2 ) en
depth 46 fon
opics: single hidden layer neural network
Capacity of a R
single hidden layer
éseaux neural network
de neurones
z x2
1

0 1
-1
0
-1
0
-1
1
x1
zk

sortie k
y1 x2 y2
x2
y2 wkj
1 -1 -.4 1

0
y1 .7
1 0 1
-1
0 -1.5 cachée j -1
0
-1
0
-1
biais .5
-1
0
-1
1 1 wji 1
x1 1 1 1
x1

entrée i
x1 x2
x2
Impact of depth (from Pascal Vincent’s slides) 47
• La puissance expressive des réseaux de neurones
Capacitysingle
Topics: of a single
hiddenhidden
layer layer
neuralneural network
network
z1 x2

x1
y2 z1 y3

y1
y1 y2 y3 y4
y4

x1 x2

(from Pascal Vincent’s slides)


Impact of depth 48
Topics:
Capacitysingle hidden
of a single layer layer
hidden neuralneural
network
network
R2
x1 x2
x1
x2

trois couches R1

... R2
R2
R1
x1 x2
x1

(from Pascal Vincent’s slides)


Impact of depth 49
Universal approximation theorem

A linear model can represent linear functions only.

Which class of functions can a neural network represent? Does it


depend on the chosen activation functions?

Universal approximation theorem [Hornik 1991]: “A single hidden


layer neural network with any “squashing” activation function and
with a linear output unit can approximate any continuous function
arbitrarily well, given enough hidden units.”

In other words, regardless of what function we are trying to learn, a


large single hidden layer network will be able to represent it.

How large is this network (i.e., how many hidden units)?

Impact of depth 50
Universal approximation theorem
[Barron, 1993] provided some bounds on the size of a single-layer
network needed to approximate a broad class of functions.

In the worst case, an exponential number of hidden units (possibly


with one hidden unit corresponding to each input configuration
that needs to be distinguished) may be required.

Example: the number of possible binary functions on vectors


v ∈ {0, 1}n is 22 . Representing one such function requires 2n bits,
n

which will in general require in the order of 2n parameters, i.e., the


model capacity is linear in the model size.

In summary, a network with a single layer is sufficient to represent


any function, but this network can generally not be learned due to
its huge size (too many parameters) which results in overfitting.

Impact of depth 51
Exponential advantage of depth

[Montufar, 2014]: the number of linear regions modeled by a


piecewise linear network (i.e., a network with ReLU neurons) with
d inputs, l layers, and n units per layer is in the order of
 d(l−1)
n
nd
d

i.e., the model capacity is exponential in the depth l (hence in the


model size).

These regions are defined by all possible products of individual


columns of W (1) , W (2) . . .

Impact of depth 52
Exponential advantage of depth
D1 D1D2

(a) (b)
D1D2D3

(c)

Figure 6. An ML-CSC model trained


(a): W trained on MNIST (each column = one image)
(1) on the MNIST data set. (a) The local filters of the dictionary D 1 . (b) The local filters of the effective dictionary
D (2) = D 1 D 2 . (c) Some of the 1,024 local atoms of the effective dictionary D (3), which are global atoms of size 28 # 28.

(b): products of columns of W (1) and W (2)


Definition (c): some products of columnsasof W (1)The, W
molecules. same signaland
(2)
can beW
(3)
described this way using
The set of K-layered ML-CSC signals of cardinalities {k 1, D 1 D 2 D 3, and so on, all the way to D 1 D 2 gD K C K .
k 2, f, k K } over the convolutional dictionaries {D 1, D 2, f, D K } Perhaps the following explanation could help in giving intu-
"" D i ,Ki =of1,depth
is defined as MSImpact " k i ,Ki = 1 ,. A signal X belongs to this ition to this wealth of descriptions of X. A human-being body
53can
Exponential advantage of depth

To sum up:
 for single hidden layer networks: in the worst case, the model
capacity is linear in the model size;
 for deeper networks: the model capacity is exponential in the
model size.

Deeper networks make it possible to learn more complex functions


without overfitting, provided that that these functions involve the
composition of several simpler functions.

Impact of depth 54
DESIGN CHOICES
Supervised training
Needs:

 labeled training set = set of examples xt and associated


targets yt related to the task (classification, detection,
regression, etc.)

 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.

 loss function L(fθ (x), y)

Training = find θ that minimizes the total loss on the training set

1 X
T
min L(fθ (xt ), yt )
θ T
t=1

Design choices 56
Unsupervised training

Needs:

 unlabeled training set = set of examples xt without targets


 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.
 loss function L(fθ (x))

1 X
T
Training = find θ that minimizes the total loss min L(fθ (xt ))
θ T
t=1

Less common in deep learning until recently. Typically used to:


 extract an embedding fθ (x) used as input to a supervised task,
 or pretrain a model fθ (.) on a large unlabeled dataset that will
be fine-tuned for a supervised task on a smaller labeled set,
 or train a generative model for synthetic data generation.
Design choices 57
Other forms of training

Variants (not studied hereafter) include:

 semi-supervised training: some examples xt labeled, some not

 weakly-supervised training: labels yT related to collections of


examples {xt }t∈T , rather than individual examples xt

 reinforcement learning: observed xt depends on the actions


taken by the learning agent until time t.

Design choices 58
Design choices
The task to be addressed directly translates into choosing:
 the output activation function o(.)
 the cost function L(., .)

Typically:
 define the task in a statistical manner, i.e., learn p(y|x)
(supervised) or p(x) (unsupervised),
 derive the output activation function and the cost function
corresponding to maximum likelihood (ML) estimation, i.e.:

L(ŷ, y) ∝ − log p(y|x) or − log p(x).

log p rather than p itself because total cost = sum of individual


costs rather than product.
Design choices 59
Detection

Detection: f (x) = y ∈ {0, 1}.

Example: CAPTCHA

Design choices 60
Detection

Treat as posterior probability estimation problem: ŷ = P(y = 1|x).

ŷ ∈ [0, 1] ⇒ sigmoid output activation function:

ŷ = o(a) = σ(a)

Interestingly:

P(y = 1|x) σ(a)


log = log =a
P(y = 0|x) 1 − σ(a)

⇒ a interpretable as the log-likelihood ratio between the two


classes

Design choices 61
Detection

Bernoulli likelihood function:


(
ŷ if y = 1
P(y |x) =
1 − ŷ if y = 0

ML estimation equivalent to minimizing the cross-entropy:

L(ŷ , y ) = −y log ŷ − (1 − y ) log(1 − ŷ ).

Design choices 62
Classification

Classification (three or more classes):


f (x) = y ∈ {1, . . . , n}

Example: ImageNet (1,000 classes)

Design choices 63
Classification
Treat as posterior probability estimation problem: ŷi = P(y = i|x).
n
X
ŷi ≥ 0 and ŷi = 1 ⇒ softmax activation function:
i=1

exp(ai )
ŷi = o(a)i = Pn
i 0 =1 exp(ai 0 )

Categorical likelihood function:




ŷ1
 if y = 1
.
P(y |x) = ..


ŷ if y = n
n

Design choices 64
Classification
ML estimation equivalent to minimizing the cross-entropy:
n
X
L(ŷ, y) = − yi log ŷi
i=1

with y the one-hot vector representing the class y :


 
0
 .. 
.
 
0
 
y = 1 ← y -th position

0
 
 .. 
.
0

Design choices 65
Regression

Regression: f (x) = transformed real-valued vector ŷ

Example: estimate house location, size, price, etc., from an image

Design choices 66
Regression

ŷ typically unbounded ⇒ linear output activation function:

ŷ = o(a) = a

Gaussian likelihood function with fixed covariance σ 2 I:

p(y|x) = N (y; ŷ, σ 2 I).

ML estimation equivalent to minimizing the mean squared error:

L(ŷ, y) = kŷ − yk2 .

Also applicable to unsupervised compression aka. autoencoder.

Design choices 67
Autoencoder
Autoencoder: neural network trained to predict its input.

Unsupervised learning.

Consists of two parts:


 an encoder function h = f (x)
 a decoder function x̂ = r (h) such that x̂ ≈ x

The hidden activations h provide a nonlinear representation of the


input called an embedding.

Typically, the decoder has a linear output activation function and


performance is measured by the squared error loss

L(r (f (x)), x) = kr (f (x)) − xk2

Design choices 68
[email protected] Université
h(x)de Sh
Topics: autoencoder, encoder, decoder, tied=w
Autoencoder hugo.larochelle@us =
•October 17, 2012
Feed-forward neural network trained to repro
the output layer October 16,
• Deco
Abstract
ck
x
des “Autoencoders”. b = o(b
x a
Abstract
W =W = sig
Math for my slides “Autoencoders”.
(tied weights)
for binar
j P
h(x) = bg(a(x)) P
• b l(f (x)) =
• f (x) ⌘ x k (b
xk xk )2 l(f (x)) = k
= sigm(b + Wx) Enco
W
h(x) = g(a
x = sig
b
x = o(b
a(x))
= sigm(c + W⇤ h(x))

1
P P
)) = 2 k (b
xk xk )2 l(f (x)) = k (xk log(b
xk ) + (1 xk ) log(1 bk ))
x
Design choices b
x = o(b
a69(x)
(t) (t)
Parameter tying

The parameters of the encoder and the decoder may be different,


or they may be tied (i.e., their weight matrices are transposed
versions of each other).

When applying SGD to an autoencoder with tied parameters, the


gradient of the encoder and the transposed gradient of the decoder
must be summed together.

Design choices 70
Perfect vs. approximate reconstruction

Perfect reconstruction ŷ = r (f (x)) = x is useless.

Instead, autoencoders are designed to reconstruct approximately by


 making h smaller than x,
 or adding noise to the training data,
 or adding a regularization term (detailed later)

Because the model is forced to prioritize which aspects of the input


should be reproduced, the embedding h often represents salient
features of the data.

Design choices 71
Undercomplete autoencoder

Undercomplete autoencoder = encoder where the embedding h


has fewer dimensions than the input x.

Achieves a form of “compression”.

If f (.) and r (.) are linear, this is equivalent to principal component


analysis (PCA).

An undercomplete autoencoder is thus a nonlinear generalization


of PCA.

Design choices 72
EXAMPLE OF DATA SET: M
Example
L AROCHELLE , B ENGIO , L OURADOUR AND L AMBLIN

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Figure 5: Samples fromMNIST


the MNISThandwritten
digit recognitiondigits
data set.dataset
Here, a black pixel corresponds to
an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0
and 1).
Design choices 73
FILTERS (AUTOENCODER
a white pixel to a weight larger than 3, with the different shades of gray corresponding
−3 and 3. et al., JMLR2009)
to different weight values uniformly between (Larochelle 93
Example
1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Subset of learned weights (each column of W shown as an image)


6.7 – Display
Figure 6.6 Input weights
of the of a random
input weightssubset of the hidden
of a random subset units,
of thelearned by an learned
hidden units, autoas-
sociator
by an RBMwhen 250-dimensional
trained
when trained embedding
on samples from
on samples from forMNIST
the the
MNISTan input dimension
dataset. TheThe
dataset. of settingofisunits
display
activation the
same
of theas forhidden
first 6.6. is obtained28
Figurelayer by × 28 product
a dot = 784 of such a weight “image” with the
input image. Design
In these images, a black pixel corresponds to a weight smaller than −374and
choices
Denoising autoencoder

Denoising autoencoder: change the loss function to

L(r (f (x̃)), x) = kr (f (x̃)) − xk2

with x̃ a copy of x corrupted by some form of noise.

Encoder must learn to undo this corruption, which forces it to


learn salient features, rather than noise originally present in x.

Makes it possible to obtain embeddings h whose dimension is


bigger than the data x.

Design choices 75
Denoising autoencoder

Design choices 76
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)
• No corrupted inputs (cross-entropy loss)

Subset of learned
(a) Noweights — No
destroyed corrupted inputs
inputs (b)

Design choices 77
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)inputs
• 25% corrupted

nputs Subset of learned weights


(a) No
(b) 25% — 25%
destroyed corrupted inputs
inputs
destruction
(b) 25%
(c) 5

Design choices 78
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)inputs
• 50% corrupted

tion Subset of learned weights


(a) No — 50%
destroyed
(c) 50% corrupted inputs
inputs
destruction
(b) 25%

Design choices 79
Synthetic data generation
Synthetic data generation: compute and draw samples from p(x).

For discrete data, treat as a series of classification tasks:

p(x) = p(x1 ) × p(x2 |x1 ) × . . . p(xn |x1 , . . . , xn−1 ).

Example: language modeling / text generation

For continuous data, more complicated (see NLP-only course).


Design choices 80
STOCHASTIC GRADIENT DESCENT
Training ingredients
In this section, focus on supervised learning. Similar derivations
may be conducted for unsupervised or other forms of learning.

Training ingredients:
 training set = set of examples xt and associated targets yt
 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.
 loss function L(fθ (x), y)

Training = find θ that minimizes the total loss on the training set
1 X
T
L(fθ (xt ), yt )
T
t=1

Nonlinear function of θ ⇒ gradient-based optimization

Stochastic gradient descent 82


Scalar gradient descent algorithm
Suppose we have a scalar function L(θ) (1 single parameter θ).

dL(θ)
The derivative L0 (θ) = gives the slope of L(θ) at point θ.

It specifies how to scale a small change in the input in order to
obtain the corresponding change in the output:

L(θ + ) ≈ L(θ) + L0 (θ).

Gradient descent: repeatedly moving θ in small steps opposite of


the gradient direction

θ ← θ − L0 (θ)

makes L(θ) gradually smaller


Stochastic gradient descent 83
Scalar gradient descent algorithm

Stochastic gradient descent 84


Stationary points
Values of θ for which L0 (θ) = 0 are called stationary points.

APTERThese include:
4. NUMERICAL COMPUTATION
 local minima, i.e., L(θ) lower than all neighboring values,
 local maxima,
 saddle points

Minimum Maximum Saddle point

Stochastic gradient descent 85


Local vs. global minimum
APTER 4. NUMERICAL COMPUTATION
Under mild conditions, gradient descent converges to a local
minimum, but not necessarily to the global minimum.

This local minimum


performs nearly as well as
the global one,
so it is an acceptable
halting point.
f (x)

Ideally, we would like


to arrive at the global
minimum, but this
might not be possible.
This local minimum performs
poorly and should be avoided.

gure 4.3: Optimization algorithms may fail to find a global minimum when there
ltiple local minima
Stochastic or plateaus
gradient descent present. In the context of deep learning, we gener
86
Vector gradient descent algorithm
In the non-scalar case, consider the gradient of the function:
 
∂L(θ)
 ∂θ1 
 . 
∇θ L(θ) =   .. 

 ∂L(θ) 
∂θK

∂L(θ)
where are partial derivatives.
∂θk
Steepest descent: move θ in the direction opposite of the gradient

θ ← θ − ∇θ L(θ).

Learning rate  typically fixed to a small value.


Stochastic gradient descent 87
Vector gradient descent

Stochastic gradient descent 88


Computing the gradient of a neural network
To apply this approach to a feedforward neural network, we must
compute the partial derivative of the loss with respect to all
parameters θk (i.e., the weights and the biases of all layers):

∂L(fθ (x), y)
∂θk

To do so, we use the chain rule of derivation: if y = g(x) and


z = f (y), then
∂z X ∂z ∂yj
= .
∂xi ∂yj ∂xi
j
 T  
∂y ∂y
In matrix form: ∇x z = ∇y z with the Jacobian
∂x ∂x
matrix of g.
Stochastic gradient descent 89
Gradients in output layer
Gradient w.r.t. the outputs:

∇ŷ L(ŷ, y) = depends on the chosen loss L

Gradient w.r.t. output pre-activations:


 T
∂ ŷ
∇a (L+1) L(ŷ, y) = ∇ŷ L(ŷ, y)
∂a (L+1)
= o 0 (a (L+1) ) ∇ŷ L(ŷ, y)

where denotes elementwise multiplication

In the special case of a linear output layer ŷ = o(a) = a and a


squared error loss L(ŷ, y) = kŷ − yk2 , we have:

∇ŷ L(ŷ, y) = ∇a (L+1) L(ŷ, y) = 2(ŷ − y)


Stochastic gradient descent 90
Gradients in hidden layer

Gradient w.r.t. the activations:


!T
∂a (l+1)
∇h (l) L(ŷ, y) = ∇a (l+1) L(ŷ, y)
∂h (l)
= W (l+1) ∇a (l+1) L(ŷ, y)

Gradient w.r.t. the pre-activations:


!T
∂h (l)
∇a (l) L(ŷ, y) = ∇h (l) L(ŷ, y)
∂a (l)
= g 0 (a (l) ) ∇h (l) L(ŷ, y)

Stochastic gradient descent 91


Gradients w.r.t. weights and biases
Gradient w.r.t. the biases:
!T
∂a (l)
∇b (l) L(ŷ, y) = ∇a (l) L(ŷ, y)
∂b (l)
= ∇a (l) L(ŷ, y)

Gradient w.r.t. a row wj of the weight matrix W (l) :


(l)

!T
∂a (l)
∇w (l) L(ŷ, y) = (l)
∇a (l) L(ŷ, y)
j ∂wj
(l−1)
= hj ∇a (l) L(ŷ, y)

T
In matrix form: ∇W (l) L(ŷ, y) = ∇a (l) L(ŷ, y) h (l−1)
Stochastic gradient descent 92
Backpropagation algorithm
By considering all those expressions together, we obtain the
so-called backpropagation algorithm.

Compute the gradient g w.r.t. the output pre-activations:


g ← ∇ŷ L(ŷ, y)
g ← ∇a (L+1) L(ŷ, y) = o 0 (a (L+1) ) g
For l = L + 1, L − 1, . . . , 1:
Compute gradient w.r.t. biases and weights:
∇b (l) L(ŷ, y) = g
T
∇W (l) L(ŷ, y) = g h (l−1)
Propagate gradient to lower layer:
g ← ∇h (l−1) L(ŷ, y) = W (l) g
g ← ∇a (l−1) L(ŷ, y) = g 0 (a (l−1) ) g
Endfor
Stochastic gradient descent 93
Computational
Implementation Flow
of forward pass Graph
• Forward propagation can be represented
as an acyclic flow graph
In practice, forward pass implemented by
• Forward propagation can be implemented
 modeling
in a modular way: the network as an acyclic
computational flow graph,
Ø Each box can be an object with an fprop
 associating each box with an fprop
method, that computes the value of the
method, that computes the value of
box given its children
the box given its children,
Ø  calling
Calling themethod
the fprop method
fprop of of each
each box in box
the right in
order yields forward
bottom-up order. propagation

Stochastic gradient descent 94


Computational
Implementation Flow
of backpropagation Graph
• Forward propagation
Similarly, can be represented
backpropagation implemented by
as an acyclic flow graph
 associating each box with a bprop
• Forwardmethod,
propagation can be implemented
that computes the gradient
in a modular way:
with respect to each child box,

Ø Each calling
 box can the
be an method
object
bprop with anoffprop
each box
method, in reverse,
that top-down
computes order.
the value of the
box given its children
This differentiable programming approach
avoidsthe
Ø Calling computing the gradient
fprop method manually.
of each box in
the right order yields forward propagation
Introduced in Theano. Now standard in all
libraries (TensorFlow, PyTorch. . . )

Stochastic gradient descent 95


Back to gradient descent
Using backpropagation, we can compute the gradient of the loss
for each training example (xt , yt ) w.r.t. the network parameters θ:

∇θ L(fθ (xt ), yt )

The gradient of the total loss over the training set is therefore:
" #
1 X 1 X
T T
∇θ L(fθ (xt ), yt ) = ∇θ L(fθ (xt ), yt )
T T
t=1 t=1

Computing the gradient exactly is very expensive because it


requires a pass on the entire dataset at every iteration of the
gradient descent algorithm.

Stochastic gradient descent 96


Stochastic gradient descent
Stochastic gradient descent (SGD) algorithm:
 at each iteration i, compute the gradient over a random
subset Ti and perform one step of gradient descent:
1 X
θ ←θ−× ∇θ L(fθ (xt ), yt )
T
t∈Ti

 at the next iteration, pick another random subset


 when the entire dataset has been passed, start over again

The subsets must be random, disjoint, and cover the entire dataset.

Terminology:
 Ti is called a minibatch
 each pass over the entire dataset is called an epoch
Stochastic gradient descent 97
Stochastic gradient descent

Splitting the training set into B minibatches


 reduces the computation cost of one gradient by a factor of B
 increases √
the standard deviation on the gradient estimate by a
factor of B only.

More iterations, but fewer epochs (hence smaller total


computation cost).

Stochastic gradient descent 98


Practical implementation considerations

In practice, the gradients ∇θ L(fθ (xt ), yt ) for all examples (xt , yt )


are computed in parallel using a general-purpose graphical
processing unit (GPU) and summed within a given minibatch.

The choice of the minibatch size is governed by the following


practical considerations:
 the minibatch data and computations must fit in GPU memory
 too small minibatches do not exploit well GPU capabilities
 some kinds of hardware perform better with power-of-2 sizes

Typical minibatch sizes: from 32 to 256.

Stochastic gradient descent 99


Limitation of (stochastic) gradient descent

20

10

0
x2

−10

−20

−30
−30 −20 −10 0 10 20
Tends to “zigzag” when descending a “canyon”, which increases
x 1

the number of iterations


4.6: Gradient descent fails to exploit the curvature information contain
matrix. Here we use
Stochastic gradient
gradient descent descent to minimize a quadratic function
100 f(
Stochastic gradient descent with momentum
Solution: smooth the gradient estimates across several iterations.

Momentum = vector v representing the direction and speed at


which the parameters move through parameter space.

Defined as an exponentially decaying average of the negative


gradient.

SGD with momentum: initialize v = 0, then replace each iteration


of SGD by:
1 X
v ← αv −  × ∇θ L(fθ (xt ), yt )
T
t∈Ti
θ ←θ+v

Stochastic gradient descent 101


Stochastic gradient descent with momentum

20

10

−10

−20

−30
−30 −20 −10 0 10 20
Converges in fewer iterations than conventional SGD
8.5: Momentum aims primarily to solve two problems: poor condition
matrix and variance in the stochastic gradient. Here, we illustrate how m
mes the firstStochastic
of these two problems. The contour lines depict a 102
gradient descent
quad
Local optima
When properly tuned (learning rate not too large nor too small),
SGD converges to a local minimum.

How many local minima are they? Are they good or bad?

Neural networks always have multiple local minima because of


model identifiability issues:
 reordering the neurons in each layer (n!m possible orders for m
layers with n units each),
 or scaling the incoming weights and biases of a ReLU neuron
by α and its outgoing weights by 1/α
do not change the value of the cost function.

⇒ there can be a large or infinite number of local minima, but


they are are all equivalent to each other (hence not a problem).
Stochastic gradient descent 103
Local optima

For many years, people believed that large neural networks failed
because of poor local minima.

Recent theoretical and experimental results suggest that, for


sufficiently large neural networks:
 most stationary points are saddle points corresponding to a
high value of the cost function,
 but SGD manages to avoid them;
 most local minima correspond to a low value of the cost
function.

Stochastic gradient descent 104


Vanishing/exploding gradient

At each layer of the backpropagation algorithm, the gradient is


multiplied by W (l) and scaled by g 0 (a (l−1) ).

Muliplying by W (l) scales the activations in the order of the largest


eigenvalue of W (l) which can be  1 or  1.

g 0 (a (l−1) ) is typically ≤ 1.

For very deep networks, this can cause:


 vanishing gradient, g ≈ 0 ⇒ slow learning
 exploding gradient, kgk  1 ⇒ unstable learning.

Problem especially common for recurrent networks (see later) and


when g 0 (a) ≈ 0 for a large range of a values.

Stochastic gradient descent 105


Sigmoid unit
1.5
g( x)
g ′(x)
1

0.5
h

-0.5
-5 0 5
a
1
g(a) = σ(a) = ⇒ g 0 (a) = σ(a)(1 − σ(a))
1 + exp(−a)
g 0 (a) ≈ 0 for many (large) negative or positive values

Stochastic gradient descent 106


Rectified linear unit
5
g( x)
4 g ′(x)

2
h

-1
-5 0 5
a (
0 if x < 0
g(a) = max(0, a) ⇒ g 0 (a) =
1 if x ≥ 0
better than sigmoid but still g 0 (a) = 0 for all a ≤ 0 ⇒ problem if a
gets stuck there, solved by batch normalization (see later)
Stochastic gradient descent 107
CONVOLUTIONAL NETWORKS
Convolutional neural networks

Convolutional neural networks (CNNs) were invented by LeCun


[1989].

CNN = network that uses a linear operation called convolution


instead of general matrix multiplication in at least one layer.

CNNs are appropriate for processing data with a known, grid-like


topology:
 1D: time-series data (speech, text, finance, biosignals. . . ),
 2D: images,
 3D: video,
 etc.

Convolutional neural networks 109


1D convolution
Given two sequences xt and kt , the convolved sequence at is
defined by the following linear operation:
X∞
at = xτ kt−τ
τ =−∞

The convolution operation is typically denoted with an asterisk:


a =x ∗k

Terminology:
 x: input (or signal)
 k: convolution kernel (or filter)
 a: feature map

Examples: sliding window averaging, smoothing, etc

Convolutional neural networks 110


1D convolution

Convolutional neural networks 111


2D convolution

Convolution can be generalized to any dimension.

For instance, in 2D:



X ∞
X
Ai,j = (X ∗ K )i,j = Xm,n Ki−m,j−n
m=−∞ n=−∞

Examples: edge detection, sharpening, blurring. . .

Convolutional neural networks 112


2D convolution

     
0 0 0 −1 −1 −1 0 −1 0
K = 0 1 0 K = −1 8 −1 K = −1 5 −1
0 0 0 −1 −1 −1 0 −1 0

Convolutional neural networks 113


2D convolution

     
1 1 1 1 2 1 −1 −2 −1
1 1  1 
K= 1 1 1 K= 2 4 2 K= −2 28 −2
9 16 16
1 1 1 1 2 1 −1 −2 −1

Convolutional neural networks 114


Practical implementation

In pratice:

 Convolution is commutative:

X ∞
X
Ai,j = (X ∗ K )i,j = (K ∗ X)i,j = Km,n Xi−m,j−n .
m=−∞ n=−∞

 K has finite length M × N, hence m and n span finite


intervals:
M−1
X N−1
X
Ai,j = (X ∗ K )i,j = (K ∗ X)i,j = Km,n Xi−m,j−n .
m=0 n=0

Convolutional neural networks 115


Convolution vs. cross-correlation

Convolution is similar to cross-correlation:


M−1
X N−1
X
Ai,j = Km,n Xi+m,j+n .
m=0 n=0

The difference only lies in “flipping” the kernel.

Most machine learning libraries implement cross-correlation but


call it convolution.

Since kernel learned from data, makes no difference.

In the following, “convolution” refers to both convolution and


cross-correlation.

Convolutional neural networks 116


Cross-correlation
Input
Kernel
a b c d
w x
e f g h
y z
i j k l

Output

aw + bx + bw + cx + cw + dx +
ey + fz fy + gz gy + hz

ew + fx + fw + gx + gw + hx +
iy + jz jy + kz ky + lz

Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict 117
Convolutional neural networks
the output to only positions where the kernel lies entirely within the image, called “valid”
Motivation for CNNs

Convolution leverages three key machine learning concepts:


 sparse connectivity,
 parameter sharing,
 equivariant representation.

In addition, it provides a means for handling variable size inputs.

Convolutional neural networks 118


Sparse connectivity

Traditional networks: every output fully connected to every input.

CNNs: every output connected only to a few neighboring inputs


due to small kernel.

Still, in a deep CNN, each unit in the upper layers can indirectly
interact with many inputs forming its receptive field.

Benefits:
 smaller model size,
 less overfitting,
 faster computation.

Convolutional neural networks 119


Sparse connectivity
CHAPTER 9. CONVOLUTIONAL NETWORKS

Convolutional
CHAPTER 9. CONVOLUTIONAL NETWORKS

s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Fully connected
s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Figure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s3 ,
and also
Figure highlight
9.3: Sparse the input units
connectivity, in x from
viewed that above:
affect this unit. These
We highlight oneunits
outputareunit,
knowns3 ,
Convolutional neural networks 120
as the
and receptive
also highlightfield of s3 . units
the input (Top)When s isaffect
in x that formed
thisbyunit.
convolution with aare
These units kernel
known of
igure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s
nd also highlight the input units in x that affect this unit. These units are know
Receptive field
s the receptive field of s3 . (Top)When s is formed by convolution with a kernel
idth 3, only three inputs affect s 3 . (Bottom)When s is formed by matrix multiplicatio
onnectivity is no longer sparse, so all of the inputs affect s3 .

g1 g2 g3 g4 g5

h1 h2 h3 h4 h5

x1 x2 x3 x4 x5

igure 9.4: The receptive field of the units in the deeper layers of a convolutional networ
larger than the receptive field of the units in the shallow layers. This effect increases
he network includes architectural features like strided convolution (figure 9.12) or poolin
ection 9.3). This means that even though direct connections in a convolutional net ar
ery sparse, units in the deeper layers can be indirectly connected to all or most121of th
Convolutional neural networks
Receptive field

Convolutional neural networks 122


Parameter sharing

Traditional networks: every parameter used exactly once.

CNNs: parameters shared (tied) across all positions in the input.

Benefits:
 even smaller model size,
 even less overfitting.

Convolutional neural networks 123


Parameter sharing
HAPTER 9. CONVOLUTIONAL NETWORKS
HAPTER 9. CONVOLUTIONAL NETWORKS
Convolutional

s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Fully connected
s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

igure 9.5: Parameter sharing: Black arrows indicate the connections that use a particula
arameter
igure 9.5: in two different
Parameter sharing: models.
Black (Top)The black the
arrows indicate arrows indicatethat
connections usesuse
of athe centr
particula
Convolutional neural networks 124
ement of a 3-element kernel in a convolutional model. Due to parameter sharing, th
Parameter sharing

Convolutional neural networks 125


Computational efficiency

Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking
eachExample edge
pixel in the detection
original image and (280 × 320 pixels):
subtracting the value of its neighboring pixel on the
left. This shows the strength of all of the vertically oriented edges in the input image,
whichcanconvolutional:
be a useful operation1 × 2for kernel,
object 280 × 319Both
detection. × 3 images
≈ 2 × are
105280 pixels tall.
The input operations
image is 320 pixels wide while the output image is 319 pixels wide. This
transformation can be described by a convolution kernel containing 9two elements, and
 319
requires fully× connected:
280 × 3 = 267, 280 320 × point
960×floating 280 ×operations
319 ≈ 8(two × 10multiplications
weights, and
≈ 16 × 10 operations
one addition per 9
output pixel) to compute using convolution. To describe the same
transformation with a matrix multiplication would take 320 × 280× 319 × 280, or over
eight billion, entries in the matrix, making convolution four billion times more efficient for
representingConvolutional
this transformation.
neural networks
The straightforward matrix multiplication algorithm 126
Equivariance
Equivariance: if the input is transformed, the output is transformed
in the same way.

CNN layers are equivariant to translation: if the input signal is


shifted by τ , the output feature map is also shifted by τ .

Hence the term “feature map”: provides a map showing where


different features appear in the input.

Example: when detecting objects in images, because each object


will look the same anywhere in the image, it is practical to share
parameters across the entire image.

Note: convolution is not equivariant to other transformations, e.g.,


scaling or rotation.

Convolutional neural networks 127


CNN layer
One CNN layer l typically consists of three successive stages:

 linear convolution by multiple kernels Kf , each extracting a


(l)

different feature at many spatial locations from the input X or


from all outputs H (l−1) of the previous layer l − 1,

or
(1) (1) (l) (l)
Af = X ∗ Kf Af = H (l−1) ∗ Kf

 nonlinear activation of all feature maps (feature detection),


(l) (l)
H̃f = g (l) (Af )

 pooling across neighboring points in each map.

= pool(H̃f )
(l) (l)
Hf

Convolutional neural networks 128


CNN layer Complex layer terminology Simple layer terminology

Next layer Next layer

Convolutional Layer

Pooling stage Pooling layer

Detector stage:
Detector layer: Nonlinearity
Nonlinearity
e.g., rectified linear
e.g., rectified linear

Convolution stage: Convolution layer:


Affine transform Affine transform

Input to layer Input to layers

Convolutional neural networks 129


Dealing with multichannel inputs
By contrast with basic 1D/2D convolution, each entry of the grid
is usually not a scalar but a vector consisting of several channels:
 red, green, blue intensities at each pixel of an input image,
 multiple outputs from the previous layer.

The inputs are therefore 3D arrays (a.k.a. tensors) Xi,j,k : i, j index


the spatial coordinates, k indexes the input channels.

Convolution over the channels does not make sense.

In standard multichannel convolution, each input channel is fully


connected to all feature maps (a.k.a. output channels):
X M−1
X N−1
X
Ai,j,f = (X ∗1,2 K )i,j,f = Xm,n,k Ki+m,j+n,f ,k
k m=0 n=0

Convolutional neural networks 130


Dealing with multichannel inputs

Convolutional neural networks 131


Dealing with multichannel inputs
Standard 2D convolution requires F 3D kernels of size M × N × K ,
i.e., M × N × K × F parameters and multiplications for each
output pixel.

Depthwise separable convolution: first convolve the pixels in each


of the K input channels by one 2D kernel of size M × N (depthwise
convolution), then multiply the channels in each pixel by an F × K
matrix (pointwise convolution, a.k.a. 1 × 1 convolution).

X M−1
X N−1
X
Ai,j,f = (X ∗1,2 K )i,j,f = Kfpointwise
,k
depthwise
Xm,n,k Ki+m,j+n,k
k m=0 n=0

Depthwise separable convolution requires only M × N × K + K × F


parameters and multiplications for each output pixel. Benefits:
even smaller model size, even less overfitting, faster computation.
Convolutional neural networks 132
Zero-padding
Zero-padding = replacing undefined values of the input by 0.

Control the kernel width and the output size independently.

When X and K have finite sizes I and M along a given spatial


dimension, the size of A = X ∗ K along that dimension is equal to:

 I − M + 1 if the convolution kernel must remain entirely


within the image (no zero-padding, a.k.a. valid convolution)

 I if padding keeps the output size equal to the input (a.k.a.


same convolution)

 I + M − 1 if enough zeroes are added for every pixel to be


visited M times (a.k.a. full convolution)

Convolutional neural networks 133


Zero-padding

Zero-padding implies some boundary effects:


 either the border pixels are underrepresented in the model
(“valid”, “same”)
 or the border pixels are a function of fewer pixels, which
makes it difficult to learn a single kernel that performs well at
all positions (“full”)

Usually the optimal amount of zero padding lies somewhere


between “valid” and “same”.

Convolutional neural networks 134


CHAPTER 9. CONVOLUTIONAL NETWORKS

CHAPTER 9. CONVOLUTIONAL NETWORKS

Zero-padding
No zero padding (a.k.a. “valid”)

...

...
... ...

... ...

“same”
... ...
... ...
... ...
... ...
... ...
... ...
Figure 9.13: The effect of zero padding on network size: Consider a convolutional network
with a kernel of width six at every layer. In this example, we do not use any pooling, so
Figure
only the9.13:
convolution operation
The effect itself shrinks
of zero padding size: Consider
the network
on network a convolutional
size. (Top)In network
this convolutional
with a kernel
network, of width
we neural
Convolutional do sixany
at every
notnetworks
use layer.
implicit zeroInpadding.
this example,
Thiswe do not
causes theuse any pooling, so
representation to 135
Pooling

Pooling: replace neighboring points in a given feature map (a.k.a.


channel) H̃f by a single summary statistic.

Max pooling: maximum within a rectangular neighborhood

Hi,j,f = max H̃i+m,j+n,f


m∈{0,...,M−1}
n∈{0,...,N−1}

Other popular pooling functions:


 average of a rectangular neighborhood,
 `2 norm of a rectangular neighborhood,
 weighted average based on the distance from the central pixel.

Convolutional neural networks 136


Translation invariance
Pooling makes the representation more invariant to small
translations of the input.
POOLING STAGE

1. 1. 1. 0.2
... ...

... 0.1 1. 0.2 0.1 ...

DETECTOR STAGE

POOLING STAGE

... 0.3 1. 1. 1. ...

... 0.3 0.1 1. 0.2 ...

DETECTOR STAGE

FigureConvolutional
9.8: Max pooling introduces invariance. (Top)A view of the middle of the output
neural networks 137
Translation invariance

Translation invariance is useful when we care more about whether


some feature is present than exactly where it is.

Example: to detect a face, we search for an eye on the left side


and an eye on the right side, but we don’t need to locate them
with pixel-level accuracy.

Counter-example: to denoise an image, we must preserve the


location of the features. In that situation pooling is not desirable.

Convolutional neural networks 138


Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features
that are learned with separate parameters can learn to be invariant to transformations of

Stride the input. Here we show how a set of three learned filters and a max pooling unit can learn
to become invariant to rotation. All three filters are intended to detect a hand-written 5.
Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in
the input, the corresponding filter will match it and cause a large activation in a detector
unit. The max pooling unit then has a large activation regardless of which detector unit
was activated. We show here how the network processes two different inputs, resulting
Since pooling summarizes the responses over a neighborhood,
in two different detector units being activated. The effect on the pooling unit is roughly

report summary statistics every S pixels instead of every 1 pixel.


the same either way. This principle is leveraged by maxout networks (Goodfellow et al.,
2013a) and other convolutional networks. Max pooling over spatial positions is naturally
invariant to translation; this multi-channel approach is only necessary for learning other
transformations.
S is called the stride. It can be different for each direction in space.

1. 0.2 0.1

0.1 1. 0.2 0.1 0.0 0.1

Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of

This improves
of two, which computational efficiency because thenextnext
layer.layer
Note has
three and a stride between pools of two. This reduces the representation size by a factor
reduces the computational and statistical burden on the
roughly S times fewer inputs to process.
that the rightmost pooling region has a smaller size, but must be included if we do not
want to ignore some of the detector units.

344

Convolutional neural networks 139


CHAPTER 9. CONVOLUTIONAL NETWORKS

Stride
Efficient implementation based on strided convolution
Figure 9.12: Convolution with a stride. In this example, we use a stride of two.
(Top) Convolution with a stride length of two implemented in a single operation.
(Bottom) The inefficient implementation: convolution with a stride greater than one pixel is
mathematically equivalent to convolution with unit stride followed by downsampling.

Convolutional neural networks 140
Variable size inputs

Pooling also helps handle inputs of varying size.

Example: when classifying images of variable size, the input to the


classification layer must have a fixed size.

Solution: vary the pool size but not the number of pools such that
the classification layer always receives the same number of
summary statistics regardless of the input size.

For instance, pool over the whole image, independently of the


image size.
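For instance, in PyTorch (a sketch assuming torch is available; the layer sizes are arbitrary),
nn.AdaptiveAvgPool2d adapts the pool size to the input so that the classifier always receives
the same number of summary statistics:

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # pool over the whole feature map, whatever its size
    nn.Flatten(),
)
classifier = nn.Linear(64, 1000)

for size in [(256, 256), (180, 300)]:     # two different image sizes
    x = torch.randn(1, 3, *size)
    print(classifier(features(x)).shape)  # torch.Size([1, 1000]) in both cases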

Convolutional neural networks 141


Example CNN architectures for image classification
For illustration purposes only (modern CNNs are deeper)

 Left: 2 convolutional + 1 fully connected layer (fixed-sized image)

 Middle: 2 convolutional + 1 fully connected layer (variable-sized image)

 Right: fully convolutional

Figure 9.11: Examples of architectures for classification with convolutional networks. The
specific strides and depths used in this figure are not advisable for real use; they are designed
to be very shallow in order to fit onto the page. Real convolutional networks also often involve
significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A
convolutional network that processes a fixed image size.

Convolutional neural networks 142
After alternating between convolution and pooling for a few layers, the tensor for the
Graphical representation of CNN architectures

Example: LeNet-5 for digit recognition
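A minimal PyTorch sketch of a LeNet-5-style network for 32x32 grayscale digit images (simplified:
ReLU and max pooling instead of the original sigmoid units and average subsampling; layer sizes
follow the classic description):

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),    # 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                                             # 10 digit classes
)

print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])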

Convolutional neural networks 143


Graphical representation of CNN architectures

Example: AlexNet for image classification

Convolutional neural networks 144


Graphical representation of CNN architectures
Example: VGG-16 for image classification

Convolutional neural networks 145


LINKS WITH HUMAN VISION
Visual cortex

Links with human vision 147


Visual cortex

Wallisch & Movshon (2008), Figure 1: A scaled representation of the cortical visual areas of the
macaque. Each colored rectangle represents a visual area, for the most part following the names
and definitions used by Felleman and Van Essen (1991).

Links with human vision 148
Visual cortex

Links with human vision 149


Receptive field measurement

Reverse correlation approach for receptive field measurement in


animals:

 put an electrode in an individual neuron,

 display several white noise images in front of the retina,

 record the resulting neuron responses,

 fit a linear model in order to estimate the neuron’s weights.

Note: record first 100 ms only. After that, information begins to


flow backwards as the brain uses top-down feedback.
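A minimal NumPy sketch of this reverse-correlation procedure on simulated data (the "true"
receptive field, noise level and stimulus count are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_pixels = 5000, 16 * 16

# Simulated ground truth: the neuron responds linearly to a hidden receptive field.
true_rf = rng.standard_normal(n_pixels)

# Display white-noise images and record the (noisy) neuron responses.
stimuli = rng.standard_normal((n_stimuli, n_pixels))
responses = stimuli @ true_rf + 0.5 * rng.standard_normal(n_stimuli)

# Fit a linear model to estimate the neuron's weights (least squares).
estimated_rf, *_ = np.linalg.lstsq(stimuli, responses, rcond=None)

print(np.corrcoef(true_rf, estimated_rf)[0, 1])   # close to 1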

Links with human vision 150


Retina and lateral geniculate nucleus (LGN)
Foveal retina cells → parvocellular (P) layers of the LGN.
Provide fine details required to determine what an object is.

Peripheral retina cells → magnocellular (M) layers of the LGN.


Provide coarse information as to where an object is.

Their receptive fields are round.

Figure: receptive fields in the retina (ON/OFF, OFF/ON), in the LGN (ON/OFF, OFF/ON), and in V1
(simple 2-lobe, simple 3-lobe).
Links with human vision 151


Primary visual cortex (V1)
V1 has a two-dimensional structure mirroring the structure of the
image in the retina.

It contains three cell types [Hubel & Wiesel, 1981 Nobel Prize]:
 layer 4 cells have round receptive fields like retina/LGN cells,
 simple cells have elongated, localized receptive fields,
 complex cells also respond to elongated features, but are
invariant to small shifts in the position of the feature.

Links with human vision 152
Simple cells

Several retina cells, whose receptive fields lie along a common line,
converge onto a simple cell.

The corresponding weights are well modeled by a Gabor function
with specific:
 spatial position
 spatial orientation
 spatial scale along both axes
 spatial frequency
 spatial phase
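A minimal NumPy sketch of such a Gabor function (one common parameterization; exact conventions
vary across references):

import numpy as np

def gabor(size=32, x0=0.0, y0=0.0, theta=0.0, sigma_x=4.0, sigma_y=8.0,
          frequency=0.25, phase=0.0):
    # Gaussian envelope (position x0, y0; scales sigma_x, sigma_y along both axes)
    # multiplied by a sinusoid (orientation theta, spatial frequency, phase).
    y, x = np.meshgrid(np.arange(size) - size / 2, np.arange(size) - size / 2,
                       indexing='ij')
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    carrier = np.cos(2 * np.pi * frequency * xr + phase)
    return envelope * carrier

filters = [gabor(theta=t) for t in (0, np.pi / 4, np.pi / 2)]  # three orientations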

Links with human vision 153


Simple cells
Varying spatial position and orientation

Links with human vision 154


Simple cells
Varying spatial scale

Links with human vision 155


Simple cells
Varying spatial frequency and phase

Links with human vision 156


Complex cells

Several simple cells of the same orientation converge onto a complex cell.
Links with human vision 157
Link with artificial CNNs

Simple cell ↔ Convolution by Gabor atom + nonlinear activation

Complex cell ↔ Intra/cross-channel pooling
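A minimal sketch of this analogy, reusing the gabor() function sketched earlier (toy image;
SciPy's convolve2d is assumed to be available):

import numpy as np
from scipy.signal import convolve2d

image = np.random.randn(64, 64)   # toy input image

# Simple cells: convolution with a Gabor atom + nonlinear activation (here, ReLU),
# for two phases of the same orientation.
simple = [np.maximum(convolve2d(image, gabor(phase=p), mode='same'), 0)
          for p in (0, np.pi / 2)]

# Complex cell: pooling across the simple-cell channels, then over a spatial
# neighborhood, giving a response invariant to the exact phase and position.
complex_map = np.maximum(*simple)             # cross-channel pooling
complex_cell = complex_map[:32, :32].max()    # intra-channel (spatial) pooling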

Links with human vision 158


Secondary visual cortex (V2) and color center (V4)

V2 cells:
 are tuned to simple properties similar to V1: spatial
orientation, frequency, color, etc.
 but also: figure vs. ground, multiple orientations in different
regions of a receptive field, some attentional modulation.

V4 cells:
 are tuned to simple properties similar to V2,
 but also: object features of intermediate complexity, like
simple geometric shapes, strong attentional modulation.

Detect more and more complex features, similarly to moving up the


layers of a CNN.

Links with human vision 159


Inferotemporal (IT) cortex

Two types of cells found in the


IT cortex:

 primary cells respond to


slits, spots, ellipses,
squares

 elaborate cells respond to


specific complex shapes
(influenced by color,
texture)

Links with human vision 160


Inferotemporal (IT) cortex
Fusiform face area of the IT cortex: cells responsive to faces.

Links with human vision 161


Link with artificial CNNs
IT = closest analog to a CNN’s last layer of features.

Evidence from single-site IT firing rates.

Figure: predictions of single-site IT responses (mean firing rate 70-170 ms after image onset)
from layer 4 of the HMO 1.0 model, for images of animals, boats, cars, chairs, faces, fruits,
planes and tables that were never previously seen by the model (e.g., unit 1: r² = 0.48).

Source: Yamins, Daniel LK, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and
James J. DiCarlo. "Performance-optimized hierarchical models predict neural responses in higher
visual cortex." Proceedings of the National Academy of Sciences 111, no. 23 (2014): 8619-8624.

Links with human vision 162
Link with artificial CNNs

Better performing deep CNNs also better predict the patterns of IT neural responses.

Evidence from representation dissimilarity matrices.

Figure: representation dissimilarity matrices (relative distance, from 0.0 to 1.0) over images
of animals, cars, chairs, faces, fruits, planes and tables, for the V4 and IT neural
representations and for model representations (HMO, Krizhevsky et al. 2012, Zeiler & Fergus
2013, each with an IT fit).

Source: Cadieu, Charles F., Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A.
Solomon, Najib J. Majaj, and James J. DiCarlo. "Deep neural networks rival the representation of
primate IT cortex for core visual object recognition." PLoS Comput Biol 10, no. 12 (2014):
e1003963.

Links with human vision 163
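For reference, a representation dissimilarity matrix (RDM) can be computed from any set of
responses with a few lines of NumPy/SciPy (a sketch on random data, not the actual analysis of
Cadieu et al.):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy setting: responses of n_units units (neurons or CNN features) to n_images images.
rng = np.random.default_rng(0)
n_images, n_units = 64, 100
responses = rng.standard_normal((n_images, n_units))

# RDM: pairwise dissimilarity (here, correlation distance) between the response
# patterns evoked by every pair of images.
rdm = squareform(pdist(responses, metric='correlation'))

# Two representations (e.g., IT responses and CNN features) can then be compared
# by correlating the entries of their RDMs.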
Remaining differences with artificial CNNs

human vision                               artificial CNN

low-resolution peripheral retina cells     high-resolution image
continuous eye movement (saccades)         still image
integration with other senses              (often) images only
biological detection and pooling           ReLU, max pooling
top-down feedback                          bottom-up processing
unknown learning mechanism                 SGD

Links with human vision 164
