Neural Networks
University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing
Outline
Grading
Course material
Hugo Larochelle
Online Course on Neural Networks
http://info.usherbrooke.ca/hlarochelle/neural_networks/
Prerequisites
The course will use the following concepts from linear algebra:
scalars (a), vectors (a), matrices (A), tensors
vector norm
matrix multiplication, determinant
eigendecomposition
and the following concepts from statistics:
random variable, discrete/continuous probability distribution
Bernoulli, categorical, Gaussian distributions
expectation, mean, variance, covariance
joint, marginal, conditional probability, chain rule
INTRODUCTION
Artificial intelligence & Machine learning
Raw data (e.g., image = pixel values) correlates poorly with the
desired output. Hence conventional ML:
define relevant features of the data
map them to the desired output.
Impact of chosen features
Example: separate two categories of data by drawing a line
Example data variabilities
More data variabilities
[Fig. 1 from Russakovsky et al.: The diversity of data in the ILSVRC image classification and single-object localization tasks. For each of eight dimensions (object scale, number of instances, image clutter, ...), example object categories are shown along the range of that property.]
Combined variabilities

[Fig. 2 from Russakovsky et al.: The ILSVRC dataset contains many more fine-grained classes than the standard PASCAL VOC benchmark; for example, instead of the PASCAL "dog" category, there are 120 different breeds of dogs in the ILSVRC2012-2014 classification and single-object localization tasks. Example rows: birds, cats, dogs.]
Representation learning & Deep learning
Deep learning

[Figure 1.5 from Goodfellow et al.: Flowcharts showing how the different parts of an AI system relate to each other within different AI disciplines: rule-based systems use a hand-designed program; classic machine learning maps hand-designed features to the output; representation learning learns the features; deep learning adds additional layers of more and more abstract features, mapping input pixels to object identity (car, person, animal). Shaded boxes indicate components that are able to learn from data.]
Brief history of neural networks
[Figure from Goodfellow et al.: frequency of the word or phrase "cybernetics" and "connectionism + neural networks" in the literature, 1940–2000, showing the earlier waves of interest in neural network research.]
A linear model f(x) = w_1 x_1 + · · · + w_N x_N cannot represent the XOR function:

f([0, 1]) = 1
f([1, 0]) = 1
f([1, 1]) = 0
f([0, 0]) = 0
Connectionism (1980–2000)
Cognitive scientists moved from symbolic reasoning to models of
cognition that could be grounded in neural implementations.
Key concepts:
distributed representation: each input is represented by many
features and each feature takes part in representing many inputs,
e.g., use separate neurons for car make and color rather than one
neuron for every combination,
backpropagation algorithm to train neural networks (detailed
later), limited to shallow networks in practice.
Unrealistic commercial claims and advances in other areas (SVM,
graphical models) led to a second backlash.
Deep learning (2006–?)
In 2006, Geoffrey Hinton (University of Toronto) proposed an
efficient training strategy called greedy layerwise pretraining.
Increasing dataset sizes
Big data has made machine learning easier. With larger datasets:
the amount of human expertise required to avoid overfitting
decreases,
accuracy improves (think "law of large numbers").
Increasing dataset sizes
[Figure 1.8 from Goodfellow et al.: Dataset sizes (number of examples, logarithmic scale) have increased greatly over time, from hundreds or thousands of manually compiled measurements in the early 1900s (e.g., Iris) to modern datasets such as MNIST, CIFAR-10, ImageNet, SVHN, Sports-1M, the Canadian Hansard and WMT, with millions to billions of examples.]
Increasing model sizes
Increasing model sizes
[Figure from Goodfellow et al.: number of connections per neuron in artificial networks over time (logarithmic scale), approaching the levels found in biological nervous systems (fruit fly, mouse, cat, human).]
Increasing model sizes
[Figure 1.11 from Goodfellow et al.: number of neurons (logarithmic scale) in artificial neural networks over time. Since the introduction of hidden units, artificial neural networks have doubled in size roughly every 2.4 years; milestone networks 1 to 20 include the Perceptron (Rosenblatt, 1958, 1962). Biological neural network sizes (sponge, roundworm, leech, ant, bee, frog, octopus, human) from Wikipedia (2015).]
Example image classification performance
ImageNet Large Scale Visual Recognition Challenge (1,000 classes)
Example speech recognition performance
FEEDFORWARD NETWORKS
Artificial neuron
Typically:
parametric linear transform
then fixed scalar nonlinear function.
[Diagram: inputs x1, x2, x3 feeding a single unit h]

h = g(Σ_n w_n x_n + b) = g(w^T x + b)
Terminology
[Diagram: inputs x1, x2, x3 feeding a single unit h]

h = g(Σ_n w_n x_n + b) = g(w^T x + b)

x = [x_1, ..., x_N]^T: inputs
w = [w_1, ..., w_N]^T: weights
b: bias
a(x) = w^T x + b: pre-activation
g(·): activation function
h = g(a(x)): (output) activation
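To make the notation concrete, here is a minimal NumPy sketch of a single artificial neuron; the function names and the choice of the sigmoid for g are illustrative, not prescribed by the course.

```python
import numpy as np

def sigmoid(a):
    """One possible activation function g."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b, g=sigmoid):
    """Single artificial neuron: h = g(w^T x + b)."""
    a = np.dot(w, x) + b   # pre-activation a(x) = w^T x + b
    return g(a)            # (output) activation h = g(a(x))

x = np.array([1.0, -2.0, 0.5])   # inputs x1, x2, x3
w = np.array([0.3, 0.1, -0.4])   # weights
b = 0.2                          # bias
print(neuron(x, w, b))           # ~0.52
```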
Linear activation function
g(a) = a
unbounded, but limited to linear relations
Sigmoid activation function
g(a) = σ(a) = 1 / (1 + exp(−a))
bounded between 0 and 1, “squashes” small/large values of a
Rectified linear activation function
g(a) = max(0, a)
lower-bounded, “squashes” a below 0, favors sparse activations
the resulting neuron is called a rectified linear unit (ReLU)
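For illustration, the three activation functions above can be written directly in NumPy (a sketch; the function names are ours):

```python
import numpy as np

def linear(a):
    return a                          # unbounded, purely linear

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))   # squashes a into (0, 1)

def relu(a):
    return np.maximum(0.0, a)         # zero for a < 0, identity otherwise

a = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(linear(a))    # [-5. -1.  0.  1.  5.]
print(sigmoid(a))   # [0.0067 0.2689 0.5 0.7311 0.9933]
print(relu(a))      # [0. 0. 0. 1. 5.]
```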
Graphical representation of a single neuron

Topics: connection weights, bias, activation function

a(x) = b + Σ_i w_i x_i = b + w^T x
h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

The output range is determined by g(·).
The bias b only changes the position of the riff.

(from Pascal Vincent's slides)
Feedforward neural network a.k.a. multilayer perceptron
[Diagram: input layer (features), hidden layer #1, hidden layer #2, output layer]
First hidden layer
Hidden layer #1: in scalar/vector notation
a_i^(1) = Σ_j w_ji^(1) x_j + b_i^(1) = w_i^(1)T x + b_i^(1)

h_i^(1) = g^(1)(a_i^(1))

Overall: h^(1) = g^(1)(W^(1)T x + b^(1)).
Other hidden layers
Hidden layer #l: in scalar/vector notation
a_i^(l) = Σ_j w_ji^(l) h_j^(l−1) + b_i^(l) = w_i^(l)T h^(l−1) + b_i^(l)

h_i^(l) = g^(l)(a_i^(l))

Overall: h^(l) = g^(l)(W^(l)T h^(l−1) + b^(l)).
Output layer
Output layer: in scalar/vector notation
a_i^(L+1) = Σ_j w_ji^(L+1) h_j^(L) + b_i^(L+1) = w_i^(L+1)T h^(L) + b_i^(L+1)

ŷ_i = o(a_i^(L+1))

Overall: ŷ = o(W^(L+1)T h^(L) + b^(L+1)).
Function represented by the neural network
...

Hidden layer #l: h^(l) = g^(l)(W^(l)T h^(l−1) + b^(l))

...

Output layer: ŷ = o(W^(L+1)T h^(L) + b^(L+1))
Function represented by the neural network
Feedforward neural network with 1 hidden layer:
ŷ = f(x)
  = o(W^(2)T h^(1) + b^(2))
  = o(W^(2)T g^(1)(W^(1)T x + b^(1)) + b^(2))
And so on.
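A minimal NumPy sketch of this forward pass for a network with one hidden layer (the layer sizes, the sigmoid g, and the identity output o are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2):
    """y_hat = o(W2^T g(W1^T x + b1) + b2), with o = identity here."""
    h1 = sigmoid(W1.T @ x + b1)   # hidden layer h(1)
    return W2.T @ h1 + b2         # output layer

rng = np.random.default_rng(0)
x = rng.normal(size=3)                            # 3 input features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)     # 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)     # 4 hidden units -> 2 outputs
print(forward(x, W1, b1, W2, b2))
```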
IMPACT OF DEPTH
Graphical representation of a single neuron

Topics: connection weights, bias, activation function

a(x) = b + Σ_i w_i x_i = b + w^T x
h(x) = g(a(x)) = g(b + Σ_i w_i x_i)

The output range is determined by g(·).
The bias b only changes the position of the riff.

(from Pascal Vincent's slides)
Capacity of a single neuron

A single neuron can perform binary classification. For instance, the sigmoid output can be interpreted as estimating p(y = 1|x).
This is also known as the logistic regression classifier.

The decision boundary is linear:
if ŷ ≥ 0.5, predict class "yes"
if ŷ < 0.5, predict class "no"

(from Pascal Vincent's slides)
Capacity of a single neuron
Topics: capacity of a single neuron

Can solve linearly separable problems.

[Figure: the linear decision boundary of a single neuron for linearly separable Boolean functions such as OR(x1, x2) and AND(x1, x2).]
Capacity of a single neuron

Topics: capacity of a single neuron

Can't solve non-linearly separable problems, e.g., XOR(x1, x2)...
... unless the input is transformed into a better representation.

[Figure (caption translated from French): Example of modeling XOR with a one-hidden-layer network. Top, from left to right: the Boolean functions OR(x1, x2), AND(x1, x2) and AND(x̄1, x̄2). Bottom: the function XOR(x1, x2) expressed in terms of them.]
Capacity of a single hidden layer neural network

Topics: single hidden layer neural network

[Figure (from Pascal Vincent's slides): a network with two hidden units (y1, y2) and one output (z) combines two linear decision boundaries to solve a non-linearly separable problem; layers are labeled input i (x1, x2), hidden j (with bias), output k, with weights w_ji and w_kj.]
Capacity of a single hidden layer neural network

[Figure (from Pascal Vincent's slides, on the expressive power of neural networks): with more hidden units (y1, ..., y4), a single hidden layer can carve out more complex decision regions (R1, R2) in the (x1, x2) input space.]
Universal approximation theorem
A feedforward network with a single hidden layer containing enough units can approximate any continuous function on a compact domain to arbitrary accuracy [Cybenko, 1989; Hornik et al., 1989].

[Barron, 1993] provided some bounds on the size of a single-hidden-layer network needed to approximate a broad class of functions.
Exponential advantage of depth
Exponential advantage of depth
[Figure: panels (a) D1, (b) D1·D2, (c) D1·D2·D3 — each added layer multiplies the number of regions the network can distinguish.]
To sum up:
for single hidden layer networks: in the worst case, the model
capacity is linear in the model size;
for deeper networks: the model capacity is exponential in the
model size.
DESIGN CHOICES
Supervised training
Needs: a training set of examples x_t with associated targets y_t, a parametric model f_θ, and a loss function L.

Training = find θ that minimizes the total loss on the training set:

min_θ (1/T) Σ_{t=1}^{T} L(f_θ(x_t), y_t)
Unsupervised training
Needs: a training set of examples x_t (no targets), a parametric model f_θ, and a loss function L.

Training = find θ that minimizes the total loss:

min_θ (1/T) Σ_{t=1}^{T} L(f_θ(x_t))
Design choices
The task to be addressed directly translates into choosing:
the output activation function o(.)
the cost function L(., .)
Typically:
define the task in a statistical manner, i.e., learn p(y|x)
(supervised) or p(x) (unsupervised),
derive the output activation function and the cost function
corresponding to maximum likelihood (ML) estimation.
Example: CAPTCHA
Detection
Treat as posterior probability estimation: ŷ = P(y = 1|x) ⇒ sigmoid activation function:

ŷ = o(a) = σ(a)

Interestingly, ML estimation is then equivalent to minimizing the binary cross-entropy:

L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
Detection
Classification
Classification
Treat as posterior probability estimation problem: ŷi = P(y = i|x).
ŷ_i ≥ 0 and Σ_{i=1}^{n} ŷ_i = 1 ⇒ softmax activation function:

ŷ_i = o(a)_i = exp(a_i) / Σ_{i'=1}^{n} exp(a_{i'})
Classification
ML estimation equivalent to minimizing the cross-entropy:
L(ŷ, y) = − Σ_{i=1}^{n} y_i log ŷ_i
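A small NumPy sketch of the softmax output and the corresponding cross-entropy loss; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something stated on the slide.

```python
import numpy as np

def softmax(a):
    """y_hat_i = exp(a_i) / sum_i' exp(a_i')."""
    e = np.exp(a - np.max(a))   # stability: softmax is unchanged by shifting a
    return e / e.sum()

def cross_entropy(y_hat, y):
    """L(y_hat, y) = -sum_i y_i log(y_hat_i), with y a one-hot target."""
    return -np.sum(y * np.log(y_hat))

a = np.array([2.0, 0.5, -1.0])   # pre-activations for 3 classes
y = np.array([1.0, 0.0, 0.0])    # one-hot target: class 0
y_hat = softmax(a)
print(y_hat, y_hat.sum())        # probabilities summing to 1
print(cross_entropy(y_hat, y))   # equals -log(y_hat[0])
```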
Regression
Regression
ŷ = o(a) = a (linear/identity output activation)

Under a Gaussian noise assumption, ML estimation is equivalent to minimizing the squared error L(ŷ, y) = ‖ŷ − y‖².
Autoencoder
Autoencoder: neural network trained to predict its input.
Unsupervised learning.
[email protected] Université
h(x)de Sh
Topics: autoencoder, encoder, decoder, tied=w
Autoencoder hugo.larochelle@us =
•October 17, 2012
Feed-forward neural network trained to repro
the output layer October 16,
• Deco
Abstract
ck
x
des “Autoencoders”. b = o(b
x a
Abstract
W =W = sig
Math for my slides “Autoencoders”.
(tied weights)
for binar
j P
h(x) = bg(a(x)) P
• b l(f (x)) =
• f (x) ⌘ x k (b
xk xk )2 l(f (x)) = k
= sigm(b + Wx) Enco
W
h(x) = g(a
x = sig
b
x = o(b
a(x))
= sigm(c + W⇤ h(x))
•
1
P P
)) = 2 k (b
xk xk )2 l(f (x)) = k (xk log(b
xk ) + (1 xk ) log(1 bk ))
x
Design choices b
x = o(b
a69(x)
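A minimal NumPy sketch of the tied-weight autoencoder equations above (the layer sizes are illustrative; training would additionally require gradients, which are omitted here):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def autoencoder(x, W, b, c):
    """Encoder h(x) = sigm(b + W x); decoder x_hat = sigm(c + W^T h(x)) (tied weights)."""
    h = sigm(b + W @ x)
    x_hat = sigm(c + W.T @ h)
    return x_hat

def binary_cross_entropy(x_hat, x):
    """Reconstruction loss for binary inputs."""
    return -np.sum(x * np.log(x_hat) + (1 - x) * np.log(1 - x_hat))

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=8).astype(float)   # a binary input vector
W = 0.1 * rng.normal(size=(4, 8))              # 8 inputs -> 4 hidden units (undercomplete)
b, c = np.zeros(4), np.zeros(8)
print(binary_cross_entropy(autoencoder(x, W, b, c), x))
```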
Parameter tying
Perfect vs. approximate reconstruction
Undercomplete autoencoder
Example: MNIST dataset

[Figure (from Larochelle, Bengio, Louradour and Lamblin): sample handwritten digit images from the MNIST dataset.]
Denoising autoencoder
Example (MNIST) — (Vincent, Larochelle, Bengio and Manzagol, ICML 2008)

No corrupted inputs (cross-entropy loss).

[Figure: subset of learned weights (filters): (a) no destroyed inputs, (b) no corrupted inputs.]
Example (MNIST) — (Vincent, Larochelle, Bengio and Manzagol, ICML 2008)

25% corrupted inputs.

[Figure: subset of learned weights (filters).]
Example (MNIST) — (Vincent, Larochelle, Bengio and Manzagol, ICML 2008)

50% corrupted inputs.

[Figure: subset of learned weights (filters).]
Synthetic data generation
Synthetic data generation: compute and draw samples from p(x).
STOCHASTIC GRADIENT DESCENT

Training ingredients:
training set = set of examples x_t and associated targets y_t,
vector of model parameters θ = {W^(1), b^(1), W^(2), b^(2), ...},
loss function L(f_θ(x), y).

Training = find θ that minimizes the total loss on the training set:

(1/T) Σ_{t=1}^{T} L(f_θ(x_t), y_t)
The derivative L′(θ) = dL(θ)/dθ gives the slope of L(θ) at point θ.
It specifies how to scale a small change in the input in order to obtain the corresponding change in the output.

Gradient descent update: θ ← θ − α L′(θ), where α > 0 is a small step size (learning rate).
Gradient descent can get stuck at points where the derivative is (near) zero. These include:
local minima, i.e., L(θ) lower than all neighboring values,
local maxima,
saddle points.

[Figure 4.3 from Goodfellow et al.: optimization algorithms may fail to find a global minimum when there are multiple local minima or plateaus present. In the context of deep learning, we generally accept such solutions.]
Vector gradient descent algorithm
In the non-scalar case, consider the gradient of the function:

∇_θ L(θ) = [ ∂L(θ)/∂θ_1, ..., ∂L(θ)/∂θ_K ]^T

where the ∂L(θ)/∂θ_k are partial derivatives.

Steepest descent: move θ in the direction opposite to the gradient:

θ ← θ − α ∇_θ L(θ)
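As a toy illustration of steepest descent (this example is ours, not from the slides), minimizing L(θ) = (θ1 − 3)² + 2 θ2²:

```python
import numpy as np

def loss(theta):
    return (theta[0] - 3.0) ** 2 + 2.0 * theta[1] ** 2

def grad(theta):
    # Gradient of L: [2 (theta_1 - 3), 4 theta_2]
    return np.array([2.0 * (theta[0] - 3.0), 4.0 * theta[1]])

theta = np.array([0.0, 5.0])
alpha = 0.1                               # learning rate
for step in range(100):
    theta = theta - alpha * grad(theta)   # theta <- theta - alpha * grad L(theta)
print(theta, loss(theta))                 # theta approaches the minimum (3, 0)
```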
Backpropagation computes the partial derivatives ∂L(f_θ(x), y)/∂θ_k for all parameters. For the weights of layer l:

∇_{w_j^(l)} L(ŷ, y) = ( ∂a^(l)/∂w_j^(l) )^T ∇_{a^(l)} L(ŷ, y) = h_j^(l−1) ∇_{a^(l)} L(ŷ, y)

In matrix form: ∇_{W^(l)} L(ŷ, y) = ∇_{a^(l)} L(ŷ, y) h^(l−1)T
Backpropagation algorithm
By considering all those expressions together, we obtain the
so-called backpropagation algorithm.
Each box can be an object with an fprop method that computes the value of the box given its children.
Calling the fprop method of each box in the right order yields forward propagation.
Each box also has a bprop method; calling the bprop method of each box in reverse, top-down order yields backpropagation.

This differentiable programming approach avoids computing the gradient manually.
Introduced in Theano. Now standard in all libraries (TensorFlow, PyTorch, ...).
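For instance, with PyTorch's autograd (a sketch assuming PyTorch is installed; this tiny one-hidden-layer model is ours), one forward pass builds the graph and a single backward() call yields all parameter gradients:

```python
import torch

x = torch.tensor([1.0, -2.0, 0.5])
y = torch.tensor([1.0])

W = torch.randn(3, 4, requires_grad=True)   # parameters whose gradients are tracked
b = torch.zeros(4, requires_grad=True)
v = torch.randn(4, 1, requires_grad=True)

h = torch.sigmoid(x @ W + b)                # forward propagation ("fprop")
y_hat = torch.sigmoid(h @ v)
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()

loss.backward()                             # reverse pass ("bprop") through the graph
print(W.grad.shape, b.grad.shape, v.grad.shape)
```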
Backpropagation gives the gradient of the loss for one training example: ∇_θ L(f_θ(x_t), y_t).

The gradient of the total loss over the training set is therefore:

∇_θ [ (1/T) Σ_{t=1}^{T} L(f_θ(x_t), y_t) ] = (1/T) Σ_{t=1}^{T} ∇_θ L(f_θ(x_t), y_t)
In practice, the gradient is estimated on small random subsets T_i of the training set rather than on the full set at each update (as sketched in the training loop below). The subsets must be random, disjoint, and cover the entire dataset.

Terminology:
T_i is called a minibatch,
each pass over the entire dataset is called an epoch.
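A minimal sketch of the resulting minibatch SGD loop; `grad_loss` is a hypothetical function returning the average gradient of the loss over a minibatch, one entry per parameter array in `params`.

```python
import numpy as np

def sgd(params, grad_loss, X, Y, alpha=0.01, batch_size=32, epochs=10, seed=0):
    """Minibatch SGD: one epoch = one pass over random, disjoint minibatches."""
    rng = np.random.default_rng(seed)
    T = len(X)
    for epoch in range(epochs):
        order = rng.permutation(T)                  # random, disjoint minibatches covering the data
        for start in range(0, T, batch_size):
            idx = order[start:start + batch_size]   # indices of minibatch T_i
            grads = grad_loss(params, X[idx], Y[idx])
            params = [p - alpha * g for p, g in zip(params, grads)]
    return params
```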
Stochastic gradient descent
[Figure: contour plot of a poorly conditioned quadratic loss. Plain gradient descent tends to "zigzag" when descending a "canyon", which increases the number of iterations.]

[Figure 8.5 from Goodfellow et al.: momentum aims primarily to solve two problems: poor conditioning of the Hessian matrix and variance in the stochastic gradient. With momentum, the trajectory crosses the canyon more directly and converges in fewer iterations than conventional SGD.]
Local optima
When properly tuned (learning rate not too large nor too small),
SGD converges to a local minimum.
How many local minima are there? Are they good or bad?
For many years, people believed that large neural networks failed
because of poor local minima.
In the backpropagated gradient, each layer contributes a factor g′(a^(l−1)), which is typically ≤ 1.

Sigmoid: g(a) = σ(a) = 1 / (1 + exp(−a)) ⇒ g′(a) = σ(a)(1 − σ(a))
g′(a) ≈ 0 for many (large) negative or positive values of a.

ReLU: g(a) = max(0, a) ⇒ g′(a) = 0 if a < 0, 1 if a ≥ 0
Better than sigmoid, but still g′(a) = 0 for all a < 0 ⇒ problem if a gets stuck there, solved by batch normalization (see later).
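A quick numerical check of the two derivative formulas above (the sample points are ours):

```python
import numpy as np

a = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigma = 1.0 / (1.0 + np.exp(-a))
d_sigmoid = sigma * (1.0 - sigma)    # sigma'(a) = sigma(a) (1 - sigma(a))
d_relu = (a >= 0).astype(float)      # ReLU'(a) = 0 if a < 0, 1 if a >= 0

print(d_sigmoid)   # [4.5e-05 0.105 0.25 0.105 4.5e-05] -> vanishes for large |a|
print(d_relu)      # [0. 0. 1. 1. 1.]                   -> zero only for a < 0
```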
CONVOLUTIONAL NETWORKS
Convolutional neural networks
Terminology:
x: input (or signal)
k: convolution kernel (or filter)
a: feature map
K = [[0, 0, 0], [0, 1, 0], [0, 0, 0]] (identity)
K = [[−1, −1, −1], [−1, 8, −1], [−1, −1, −1]] (edge detection)
K = [[0, −1, 0], [−1, 5, −1], [0, −1, 0]] (sharpening)
K = (1/9) [[1, 1, 1], [1, 1, 1], [1, 1, 1]] (box blur)
K = (1/16) [[1, 2, 1], [2, 4, 2], [1, 2, 1]] (Gaussian blur)
K = (1/16) [[−1, −2, −1], [−2, 28, −2], [−1, −2, −1]] (unsharp masking)
In practice:

Convolution is commutative:

A_{i,j} = (X ∗ K)_{i,j} = (K ∗ X)_{i,j} = Σ_{m=−∞}^{+∞} Σ_{n=−∞}^{+∞} K_{m,n} X_{i−m, j−n}
[Figure 9.1 from Goodfellow et al.: An example of 2-D convolution without kernel flipping (i.e., cross-correlation), restricting the output to positions where the kernel lies entirely within the image, called "valid" convolution. Each output entry is a weighted sum of an image patch, e.g., aw + bx + ey + fz.]
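A minimal NumPy sketch of this "valid" 2-D convolution without kernel flipping (i.e., cross-correlation), written with explicit loops for clarity rather than speed:

```python
import numpy as np

def conv2d_valid(X, K):
    """A[i, j] = sum_{m,n} X[i+m, j+n] * K[m, n], for positions where K fits inside X."""
    H, W = X.shape
    M, N = K.shape
    A = np.zeros((H - M + 1, W - N + 1))
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            A[i, j] = np.sum(X[i:i + M, j:j + N] * K)
    return A

X = np.arange(16, dtype=float).reshape(4, 4)   # a 4x4 "image"
K = np.array([[0.0, -1.0, 0.0],                # the sharpening kernel from the table above
              [-1.0, 5.0, -1.0],
              [0.0, -1.0, 0.0]])
print(conv2d_valid(X, K))                      # 2x2 feature map (valid positions only)
```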
Motivation for CNNs
Still, in a deep CNN, each unit in the upper layers can indirectly
interact with many inputs forming its receptive field.
Benefits:
smaller model size,
less overfitting,
faster computation.
[Figure 9.3 from Goodfellow et al.: Sparse connectivity, viewed from above. One output unit, s3, is highlighted together with the input units in x that affect it; these units are known as the receptive field of s3. Top: when s is formed by convolution with a kernel of width 3, only three inputs affect s3. Bottom: when s is formed by matrix multiplication (fully connected), connectivity is no longer sparse, so all of the inputs affect s3.]
Receptive field
[Figure 9.4 from Goodfellow et al.: The receptive field of the units in the deeper layers of a convolutional network is larger than the receptive field of the units in the shallow layers. This effect increases if the network includes architectural features like strided convolution or pooling. Even though direct connections in a convolutional net are very sparse, units in the deeper layers can be indirectly connected to all or most of the input.]
Receptive field
Benefits:
even smaller model size,
even less overfitting.
[Figure 9.5 from Goodfellow et al.: Parameter sharing. Black arrows indicate the connections that use a particular parameter in two different models. Top: uses of the central element of a 3-element kernel in a convolutional model; due to parameter sharing, this single parameter is used at all input positions. Bottom: in a fully connected model, the corresponding parameter is used only once.]
Parameter sharing
Example: edge detection (280 × 320 pixels)

convolutional: 1 × 2 kernel, 280 × 319 × 3 ≈ 2 × 10^5 operations
fully connected: 280 × 320 × 280 × 319 ≈ 8 × 10^9 weights, ≈ 16 × 10^9 operations

[Figure 9.6 from Goodfellow et al.: Efficiency of edge detection. The output image was formed by taking each pixel in the original image and subtracting the value of its neighboring pixel on the left, showing the strength of all of the vertically oriented edges, which can be a useful operation for object detection. Both images are 280 pixels tall; the input is 320 pixels wide and the output 319 pixels wide. The transformation is described by a convolution kernel containing two elements and requires 319 × 280 × 3 = 267,960 floating-point operations, whereas describing the same transformation with a matrix multiplication would take over eight billion entries in the matrix.]
Equivariance
Equivariance: if the input is transformed, the output is transformed
in the same way.
A convolutional layer has three stages:

Convolution stage: A_f^(1) = X ∗ K_f^(1), or for deeper layers A_f^(l) = H^(l−1) ∗ K_f^(l)

Detector stage: nonlinearity, e.g., rectified linear: H̃_f^(l) = g(A_f^(l))

Pooling stage: H_f^(l) = pool(H̃_f^(l))

For a multi-channel input X (e.g., an RGB image), the convolution also sums over channels k:

A_{i,j,f} = (X ∗_{1,2} K^(f))_{i,j} = Σ_k Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} X_{i+m, j+n, k} K^(f)_{m,n,k}
Zero-padding
No zero-padding (a.k.a. "valid")

Zero-padding to keep the same output size (a.k.a. "same")

[Figure 9.13 from Goodfellow et al.: The effect of zero padding on network size. Consider a convolutional network with a kernel of width six at every layer and no pooling, so only the convolution operation itself shrinks the network size. Top: without implicit zero padding, the representation shrinks rapidly with depth. Bottom: with zero padding, the spatial size can be preserved.]
Pooling

[Figure 9.8 from Goodfellow et al.: Max pooling introduces invariance. A view of the detector stage and the pooling stage; small translations of the input change the detector outputs but leave most max-pooled outputs unchanged.]

Translation invariance

[Figure 9.9 from Goodfellow et al.: a set of three learned filters and a max pooling unit can learn to become invariant to rotation. Each filter matches a slightly different orientation of a hand-written 5; when a 5 appears in the input, the corresponding filter matches it and causes a large activation in a detector unit. The max pooling unit then has a large activation regardless of which detector unit was activated.]

Since pooling summarizes the responses over a neighborhood, fewer pooling units than detector units are needed. This reduces the computational and statistical burden on the next layer: with a stride of S between pools, the next layer has roughly S times fewer inputs to process.

[Figure 9.10 from Goodfellow et al.: Pooling with downsampling. Max-pooling with a pool width of three and a stride between pools of two reduces the representation size by a factor of two. Note that the rightmost pooling region has a smaller size, but must be included if we do not want to ignore some of the detector units.]
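A small NumPy sketch of 1-D max pooling with a pool width of three and a stride of two, as in the downsampling figure (the function name is ours):

```python
import numpy as np

def max_pool_1d(h, width=3, stride=2):
    """Each output is the max over a window of `width` detector units,
    with windows spaced `stride` apart (the last window may be smaller)."""
    return np.array([np.max(h[start:start + width])
                     for start in range(0, len(h), stride)])

h = np.array([0.1, 1.0, 0.2, 0.1, 0.0, 0.3])   # detector-stage activations
print(max_pool_1d(h))                          # [1.0 0.2 0.3] -- roughly half as many units
```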
Stride
Efficient implementation based on strided convolution.

[Figure 9.12 from Goodfellow et al.: Convolution with a stride of two. Top: strided convolution implemented in a single operation. Bottom: the mathematically equivalent but inefficient implementation, a convolution with unit stride followed by downsampling.]
Variable size inputs
Solution: vary the pool size but not the number of pools such that
the classification layer always receives the same number of
summary statistics regardless of the input size.
(Modern CNNs are deeper.)

[Figure 9.11 from Goodfellow et al.: Examples of architectures for classification with convolutional networks. The specific strides and depths are not advisable for real use; they are designed to be very shallow to fit onto the page, and real convolutional networks often involve significant branching. Left and middle: networks that process a fixed-sized image, ending with a reshape to a vector (16,384 or 576 units) and a matrix multiply producing 1,000 output units. Right: a network that handles variable-sized images, ending with a convolutional output (e.g., 16×16×1,000) followed by global average pooling to 1×1×1,000.]
Graphical representation of CNN architectures
LINKS WITH HUMAN VISION

[Figure: simple-cell receptive fields with two lobes (ON/OFF) and three lobes (OFF/ON).]

V1 contains three cell types [Hubel & Wiesel, 1981 Nobel Prize]:
layer 4 cells have round receptive fields like those of retina/LGN and ganglion cells,
simple cells have elongated, localized receptive fields,
complex cells also respond to elongated features, but are invariant to small shifts in the position of the feature, and fire more to moving lines.
V2 cells:
are tuned to simple properties similar to V1: spatial
orientation, frequency, color. . .
but also: figure vs. ground, multiple orientations in different
regions of a receptive field, some attentional modulation.
V4 cells:
are tuned to simple properties similar to V2,
but also: object features of intermediate complexity, like
simple geometric shapes, strong attentional modulation.
Evidence from single-site IT firing rates

[Figure (Cadieu et al.): predictions of the HMO model vs. the response of an IT neural site (Unit 1: r² = 0.48) for objects from several categories (cars, chairs, faces, fruits, planes, tables); all of these objects and images were never previously seen by the HMO model. Relative distance matrices are shown for the HMO model and IT, on a 0.0–1.0 scale.]

Cadieu, Charles F., Ha Hong, Daniel L. K. Yamins, Nicolas Pinto, Diego Ardila, Ethan A. Solomon, Najib J. Majaj, and James J. DiCarlo. "Deep neural networks rival the representation of primate IT cortex for core visual object recognition." PLoS Computational Biology, 2014.