0% found this document useful (0 votes)
30 views164 pages

Neural Networks1

Uploaded by

madi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
30 views164 pages

Neural Networks1

Uploaded by

madi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 164

NEURAL NETWORKS

Emmanuel Vincent, Inria Nancy – Grand Est

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing
Outline

For students in cognitive science and NLP:


 Sep 27: introduction, feedforward networks, impact of depth
 Oct 4: design choices, stochastic gradient descent
 Oct 11: convolutional networks, links with human vision
 Nov 22: advanced training, computer vision applications,
visualization

For students in NLP only:


 Dec 7: recurrent networks
 Dec 12: generative networks, NLP applications

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 2
Grading

For students in cognitive science:


 60%: written exam
 40%: lab work

For students in NLP:


 30%: written exam
 30%: lab work
 20%: multiple choice exam (theory only)
 20%: software project (group of 3 people)

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 3
Course material

This course is strongly inspired and borrows material from:

 Ian Goodfellow, Yoshua Bengio, Aaron Courville


Deep Learning
https://www.deeplearningbook.org/

 Hugo Larochelle
Online Course on Neural Networks
http://info.usherbrooke.ca/hlarochelle/neural_
networks/

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 4
Prerequisites

The course will use the following concepts from linear algebra:
 scalars (a), vectors (a), matrices (A), tensors
 vector norm
 matrix multiplication, determinant
 eigendecomposition
and the following concepts from statistics:
 random variable, discrete/continuous probability distribution
 Bernoulli, categorical, Gaussian distributions
 expectation, mean, variance, covariance
 joint, marginal, conditional probability, chain rule

University of Lorraine – MSc in Cognitive Science & MSc in Natural Language Processing 5
INTRODUCTION
Artificial intelligence & Machine learning

Artificial intelligence (AI) = creating machines that assist humans.

Early days of AI: solve problems defined by formal rules, e.g.,


IBM’s Deep Blue chess-playing system.

Machine learning (ML) = learn from data, no hardcoded rules.

Raw data (e.g., image = pixel values) correlates poorly with the
desired output. Hence conventional ML:
 define relevant features of the data
 map them to the desired output.

Performance depends heavily on the chosen features.

Introduction 7
Impact of chosen features
PTER 1. INTRODUCTION
Example: separate two categories of data by drawing a line

 

 

e 1.1: Example of different representations: suppose we want to separate


Introduction 8
How to define relevant features?

For many tasks, manually defining relevant features is difficult.

Example: car detection in images.

Building a feature indicating the presence of a wheel is hard due to


variable shape, illumination, etc.

Also, the wheel may be hidden by an obstacle, or simply missing.

Introduction 9
Example data variabilities
6 Olga Russakovsky* et al.

Introduction 10
More data variabilities

Fig. 1 The diversity of data in the ILSVRC image classification and single-object localization tasks. For each of the eight
dimensions, we showIntroduction
example object categories along the range of that property. Object scale, number of instances and image
11
Combined variabilities Olga Russakovsky* e

PASCAL ILSVRC
birds

···
cats

···
dogs

···

g. 2 The ILSVRC dataset contains many more fine-grained classes compared to the standard PASCAL VOC benchm
example, instead of the PASCAL “dog” category there are 120 different breeds of dogs in ILSVRC2012-2014 classifica
Introduction
d single-object localization tasks. 12
Representation learning & Deep learning

Representation learning = learn optimal features from data.

Deep learning achieves this goal by learning complex features


expressed in terms of simpler features.

Each feature is computed by means of a simple parameterizable


function called neuron.

The neurons may be arranged into layers.

The set of all interconnected neurons forms a neural network.

Introduction 13
CHAPTER 1. INTRODUCTION

Deep learning
Output
CAR PERSON ANIMAL
(object identity)

3rd hidden layer


(object parts)

2nd hidden layer


(corners and
contours)

1st hidden layer


(edges)

Visible layer
(input pixels)

Figure 1.2: Illustration


Introduction
of a deep learning model. It is difficult for a computer to understand 14
the meaning of raw sensory input data, such as this image represented as a collection
CHAPTER 1. INTRODUCTION

Overview
Output

Mapping from
Output Output
features

Additional
Mapping from Mapping from layers of more
Output
features features abstract
features

Hand- Hand-
Simple
designed designed Features
features
program features

Input Input Input Input

Deep
Classic learning
Rule-based
machine
systems Representation
learning
learning

Figure 1.5: Flowcharts showing how the different parts of an AI system relate to each
Introduction 15
other within different AI disciplines. Shaded boxes indicate components that are able to
Brief history of neural networks

0.000250
Frequency of Word or Phrase

cybernetics
0.000200
(connectionism + neural networks)
0.000150

0.000100

0.000050

0.000000
1940 1950 1960 1970 1980 1990 2000
Year

Figure 1.7: The figure shows (deep


twolearning not historical
of the three shown here)waves of artificial neural nets
research, as measured by the frequency of the phrases “cybernetics” and “connectionism” or
“neural networks” according to Google Books (the third wave is too recent to appear). The
first wave started with cybernetics in the 1940s–1960s, with the development of theories
Introduction 16
Cybernetics (1940–1970)
Linear neuron = take input features x1 ,. . . ,xN and compute

f (x) = w1 x1 + · · · + wN xN

Weights w1 , . . . , wN fixed manually or learned from data by


stochastic gradient descent (detailed later).

Limited representation capability, e.g., cannot represent XOR:

f ([0, 1]) = 1
f ([1, 0]) = 1
f ([1, 1]) = 0
f ([0, 0]) = 0

Such limitations caused a first backlash against neural networks.

Introduction 17
Connectionism (1980–2000)
Cognitive scientists moved from symbolic reasoning to models of
cognition that could be grounded in neural implementations.

Central idea: many simple computational units can achieve various


intelligent behaviors when networked together.

Key concepts:
 distributed representation: each data is represented by many
features and vice-versa, e.g., use separate neurons for car
make and color rather than neurons for every combination,
 backpropagation algorithm to train neural networks (detailed
later), limited to shallow networks in practice.
Unrealistic commercial claims and advances in other areas (SVM,
graphical models) led to a second backlash.

Introduction 18
Deep learning (2006–?)
In 2006, Geoffrey Hinton (University of Toronto) proposed an
efficient training strategy called greedy layerwise pretraining.

His group and those of Yoshua Bengio (University of Montreal)


and Yann LeCun (New York University) showed that it helped
training deeper neural networks than had been possible before.

This third wave of popularity is still ongoing.

In hindsight, the success of deep learning is not due to greedy


layerwise pretraining which is not used anymore but to
 increased dataset sizes
 increased model sizes
 increased computational power.

Introduction 19
Increasing dataset sizes

The learning algorithms reaching human performance on complex


tasks today are nearly identical to those which struggled to solve
toy problems in the 1980s.

Big data has made machine learning easier. With larger datasets:
 the amount of human expertise required to avoid overfitting
reduces
 accuracy improves (think “law of large numbers”).

As a rule of thumb, supervised deep learning works well with


 10 million labeled training examples
 5,000 labeled examples per category.

Introduction 20
Increasing dataset sizes

10 9
Dataset size (number examples)

10 8 Canadian Hansard
WMT Sports-1M
10 7 ImageNet10k
10 6 Public SVHN
10 5 Criminals ImageNet ILSVRC 2014
10 4
MNIST CIFAR-10
10 3
10 2 T vs. G vs. F Rotated T vs. C
10 1 Iris

10 0
1900 1950 1985 2000 2015
Year
ure 1.8: Dataset sizes have increased greatly over time. In the early 1900s, statist
died datasets using hundreds or thousands of manually compiled measurements (G
0; Gosset, 1908 ; Anderson, 1935; Fisher, 1936). In the 1950s through 1980s, the
Introduction 21 pi
Increasing model sizes

Biological neurons are not densely connected.

Artificial neurons already reached a similar number of connections.

Introduction 22
Increasing model sizes

10 4 Human

6 Cat
Connections per neuron

9 7
4
10 3 Mouse
2
10
5
8
10 2 Fruit fly
3
1

10 1
1950 1985 2000 2015

(1 to 10 = various neural networks)


Year
gure 1.10: Initially, the number of connections between neurons in artificial n
tworks was limited by hardware capabilities. Today, the number of connections be
urons is mostly a design consideration. Some artificial neural networks have nea
Introduction 23
any connections per neuron as a cat, and it is quite common for other neural net
Increasing model sizes

Rather than the number of connections, the total number of


neurons is key (think “brain”).

Larger networks can represent deeper, more complex concepts.

Artificial neural networks have doubled in size roughly every 2.4


years, but have remained small until quite recently.

This growth is driven by


 computers with larger memory
 suitable libraries (Theano, Torch, Caffe, Keras, TensorFlow,
PyTorch. . . ) exploiting graphical processing units (GPU)
 the availability of larger datasets.

Introduction 24
Increasing model sizes
Number of neurons (logarithmic scale)

1011 Human
1010
17 20
109 16 19 Octopus
108 14 18
107 11 Frog
106 8
105 3 Bee
Ant
104
103 Leech
102 13
101 1 2 12 15 Roundworm
100 6 9
5 10
10−1 4 7
10−2 Sponge
1950 1985 2000 2015 2056
Year
(1 to 20 =
ure 1.11: Since the introduction of various neural artificial
hidden units, networks)neural networks have do
size roughly every 2.4 years. Biological neural network sizes from Wikipedia (20
Introduction 25
1. Perceptron (Rosenblatt, 1958, 1962)
Example image classification performance
ImageNet Large Scale Visual Recognition Challenge (1,000 classes)

Introduction 26
Example speech recognition performance

Human error rate in phone conversations (Switchboard): 5 to 10%

Introduction 27
FEEDFORWARD NETWORKS
Artificial neuron

Neuron: multiple-input single-output parametric nonlinear function.

Also known as perceptron or elementary computing unit.

Typically:
 parametric linear transform
 then fixed scalar nonlinear function.

x1 !
X
x2 h h=g wn xn + b = g(w T x + b)
n
x3

Feedforward networks 29
Terminology

x1 !
X
x2 h h=g wn xn + b = g(w T x + b)
n
x3

x = [x1 , . . . , xN ]T : inputs
w = [w1 , . . . , wN ]T : weights
b: bias
a(x) = w T x + b: pre-activation
g(.): activation function
h = g(a(x)): (output) activation

Feedforward networks 30
Linear activation function
5

0
h

-5
-5 0 5
a

g(a) = a
unbounded, but limited to linear relations

Feedforward networks 31
Sigmoid activation function
1.5

0.5
h

-0.5
-5 0 5
a
1
g(a) = σ(a) =
1 + exp(−a)
bounded between 0 and 1, “squashes” small/large values of a

Feedforward networks 32
Rectified linear activation function
5

0
h

-5
-5 0 5
a

g(a) = max(0, a)
lower-bounded, “squashes” a below 0, favors sparse activations
resulting neuron call rectified linear unit (ReLU)

Feedforward networks 33
P
• a(x) = b + w i xi = b + w > x
Topics: connection weights, bias, activationP
i function
•w
Graphical representation of •
• h(x) = g(a(x)) = g(b +
aPsingle
a(x)neuron
= bi
i w i xi )
• h(x)
• h(x) = g(a(x)) = gP
= g(b i
i
•wia(x)
xi ==bbi

• w
• w
• { y1 • w x2
range

•{
1
ange determined • { -1
•by g(·)determined
b
by g(.)
0 1
-1 • g(·)
bias b on
0
changes th
-1
0 biais
bias b only
position
.5 o
-1
changes the
1 the riff
x1 position of
the riff
(from Pascal Vincent’s slides)
Feedforward networks 34
Feedforward neural network a.k.a. multilayer perceptron
Input layer Hidden Hidden Output
(features) layer #1 layer #2 layer

 DEEP LEARNING FOR


BASED
 (1)   (2) 
h1 h1
 
DISTANT-MICROPHONE
SPEECH ENHANCEMENT SPEECH
x1  (1)   (2) 
h  h  y
x = x2  h =  (1)  h =  (2) 
(1) 2 (2) 2 ŷ = 1
x3 ENHANCEMENT AND RECOGNITION —
h3  h3  y2
(1) (2)
h4 h4
EXPECTED AND UNEXPECTED RESULTS
Feedforward networks 35
Feedforward neural network a.k.a. multilayer perceptron

Every neuron has different parameters w and b.

We index them as follows:


 wji : weight of input j for neuron i at layer l
(l)

 bi : bias for neuron i at layer l


(l)

Feedforward networks 36
First hidden layer
Hidden layer #1: in scalar/vector notation

(1)
X (1) (1) (1) T (1)
ai = wji xj + bi = wi x + bi
j
(1) (1)
hi = g (1) (ai )

In matrix notation (note: g (1) applied to each entry of a (1) ):


T
a (1) = W (1) x + b (1)
h (1) = g (1) (a (1) )

T
Overall: h (1) = g (1) (W (1) x + b (1) ).

Feedforward networks 37
Other hidden layers
Hidden layer #l: in scalar/vector notation

(l)
X (l) (l−1) (l) (l) T (l)
ai = wji hj + bi = wi h (l−1) + bi
j
(l) (l)
hi = g (l) (ai )

In matrix notation (note: g (l) applied to each entry of a (l) ):


T
a (l) = W (l) h (l−1) + b (l)
h (l) = g (l) (a (l) )

T
Overall: h (l) = g (l) (W (l) h (l−1) + b (l) ).

Feedforward networks 38
Output layer
Output layer: in scalar/vector notation

(L+1)
X (L+1) (L) (L+1) (L+1) T (L+1)
ai = wji hj + bi = wi h (L) + bi
j
(L+1)
ŷi = o(ai )

with o(.) output activation function.

In matrix notation (note: o applied to each entry of a (L+1) ):


T
a (L+1) = W (L+1) h (L) + b (L+1)
ŷ = o(a (L+1) )

T
Overall: ŷ = o(W (L+1) h (L) + b (L+1) ).
Feedforward networks 39
Function represented by the neural network

Feedforward neural network with L hidden layers:


T
Hidden layer #1: h (1) = g (1) (W (1) x + b (1) )
T
Hidden layer #2: h (2) = g (2) (W (2) h (1) + b (2) )

...
T
Hidden layer #l: h (l) = g (l) (W (l) h (l−1) + b (l) )

...
T
Output layer: ŷ = o(W (L+1) h (L) + b (L+1) )

Feedforward networks 40
Function represented by the neural network
Feedforward neural network with 1 hidden layer:
ŷ = f (x)
T
= o(W (2) h (1) + b (2) )
T T
= o(W (2) g (1) (W (1) x + b (1) ) + b (2) )

Feedforward neural network with 2 hidden layers:


ŷ = f (x)
T
= o(W (3) h (2) + b (3) )
T T
= o(W (3) g (2) (W (2) h (1) + b (2) ) + b (3) )
T T T
= o(W (3) g (2) (W (2) g (1) (W (1) x + b (1) ) + b (2) ) + b (3) )

And so on.
Feedforward networks 41
IMPACT OF DEPTH
P
• a(x) = b + w i xi = b + w > x
Topics: connection weights, bias, activationP
i function
•w
Graphical representation of •
• h(x) = g(a(x)) = g(b +
aPsingle
a(x)neuron
= bi
i w i xi )
• h(x)
• h(x) = g(a(x)) = gP
= g(b i
i
•wia(x)
xi ==bbi

• w
• w
• { y1 • w x2
range

•{
1
ange determined • { -1
•by g(·)determined
b
by g(.)
0 1
-1 • g(·)
bias b on
0
changes th
-1
0 biais
bias b only
position
.5 o
-1
changes the
1 the riff
x1 position of
the riff
(from Pascal Vincent’s slides)
Impact of depth 43
• g(a) = tanh(a) = exp(a)+exp( a) = exp(2a)+

Capacity
decision of a single
boundary of neuron
neuron 7
Réseaux de neurones
• g(a) = max(0, a)
A single neuron can• perform
classification: g(a) =binary classification.
reclin(a) = max(0, a)
terpret For instance,
neuron sigmoid can be
• p(y
as estimating seen as estimating P(y = yes|x).
= 1|x)
Also known as logistic regression classifier.
eic regression
expressive classifier des réseaux
• g(·) b de neurones
decision boundary is linear
(1) (1) linear decision
(2) boundary
(2)
x
• Wi,j2 bi xj h(x)i wi b
if y ≥ 0.5, predict• h(x) = g(a(x))
class yes ⇣ P
R(1) (1)
uches • a(x) = b 1 + W(1) x a(x)i = bi + j W
⇣ ⌘
if y < 0.5, predict• f (x) = o b(2)R+ (2) >
2
w x
x1
class no x2
• p(y = c|x) x1
x2 h
(from Pascal Vincent’s slides) Pexp(a1 )
Impact of depth • o(a) = softmax(a) = . . . Pexp(a
44
C
c exp(ac ) c exp(
ARTIFICIAL NEURON
Capacity of a single neuron
Topics: capacity of single neuron
• Can solvelinearly
Can solve linearly separable
separable problems
problems 21

OR (x1 , x2 ) AND (x1 , x2 ) AND (x1 , x2 )


1 1 1
, x2 )

, x2 )

, x2 )
0 0 0

0 1 0 1 0 1
(x1 (x1 (x1
XOR (x1 , x2 ) XOR (x1 , x2 )
(x1 , x2 )

1 1
2)

Impact of depth 45
ARTIFICIAL NEURON
, x2

, x2

, x2
Capacity
0
Topics: of a single
capacity neuron
0
of single neuron 0

• Can’t solve
0 non 1linearly separable
0 problems...
1 0 1
Can’t solve non linearly separable problems.
(x (x .. (x1
1 1

XOR (x1 , x2 ) XOR (x1 , x2 )

AND (x1 , x2 )
1 1
, x2 )

0
?
0

0 1 0 1
(x1 AND (x1 , x2 )

• .... unless
. . unlessthe
theinput
input is
is transformed
transformed in in
a better representation
a better representation
Figure 1.8 – Exemple de modélisation de XOR par un réseau à une couche cachée. E
haut, de gauche à droite, illustration des fonctions booléennes OR(x1 , x2 ), AND (x1 , x
, x2 ).ofEn
et AND (x1Impact bas, on présente l’illustration de la fonction XOR(x1 , x2 ) en
depth 46 fon
opics: single hidden layer neural network
Capacity of a R
single hidden layer
éseaux neural network
de neurones
z x2
1

0 1
-1
0
-1
0
-1
1
x1
zk

sortie k
y1 x2 y2
x2
y2 wkj
1 -1 -.4 1

0
y1 .7
1 0 1
-1
0 -1.5 cachée j -1
0
-1
0
-1
biais .5
-1
0
-1
1 1 wji 1
x1 1 1 1
x1

entrée i
x1 x2
x2
Impact of depth (from Pascal Vincent’s slides) 47
• La puissance expressive des réseaux de neurones
Capacitysingle
Topics: of a single
hiddenhidden
layer layer
neuralneural network
network
z1 x2

x1
y2 z1 y3

y1
y1 y2 y3 y4
y4

x1 x2

(from Pascal Vincent’s slides)


Impact of depth 48
Topics:
Capacitysingle hidden
of a single layer layer
hidden neuralneural
network
network
R2
x1 x2
x1
x2

trois couches R1

... R2
R2
R1
x1 x2
x1

(from Pascal Vincent’s slides)


Impact of depth 49
Universal approximation theorem

A linear model can represent linear functions only.

Which class of functions can a neural network represent? Does it


depend on the chosen activation functions?

Universal approximation theorem [Hornik 1991]: “A single hidden


layer neural network with any “squashing” activation function and
with a linear output unit can approximate any continuous function
arbitrarily well, given enough hidden units.”

In other words, regardless of what function we are trying to learn, a


large single hidden layer network will be able to represent it.

How large is this network (i.e., how many hidden units)?

Impact of depth 50
Universal approximation theorem
[Barron, 1993] provided some bounds on the size of a single-layer
network needed to approximate a broad class of functions.

In the worst case, an exponential number of hidden units (possibly


with one hidden unit corresponding to each input configuration
that needs to be distinguished) may be required.

Example: the number of possible binary functions on vectors


v ∈ {0, 1}n is 22 . Representing one such function requires 2n bits,
n

which will in general require in the order of 2n parameters, i.e., the


model capacity is linear in the model size.

In summary, a network with a single layer is sufficient to represent


any function, but this network can generally not be learned due to
its huge size (too many parameters) which results in overfitting.

Impact of depth 51
Exponential advantage of depth

[Montufar, 2014]: the number of linear regions modeled by a


piecewise linear network (i.e., a network with ReLU neurons) with
d inputs, l layers, and n units per layer is in the order of
 d(l−1)
n
nd
d

i.e., the model capacity is exponential in the depth l (hence in the


model size).

These regions are defined by all possible products of individual


columns of W (1) , W (2) . . .

Impact of depth 52
Exponential advantage of depth
D1 D1D2

(a) (b)
D1D2D3

(c)

Figure 6. An ML-CSC model trained


(a): W trained on MNIST (each column = one image)
(1) on the MNIST data set. (a) The local filters of the dictionary D 1 . (b) The local filters of the effective dictionary
D (2) = D 1 D 2 . (c) Some of the 1,024 local atoms of the effective dictionary D (3), which are global atoms of size 28 # 28.

(b): products of columns of W (1) and W (2)


Definition (c): some products of columnsasof W (1)The, W
molecules. same signaland
(2)
can beW
(3)
described this way using
The set of K-layered ML-CSC signals of cardinalities {k 1, D 1 D 2 D 3, and so on, all the way to D 1 D 2 gD K C K .
k 2, f, k K } over the convolutional dictionaries {D 1, D 2, f, D K } Perhaps the following explanation could help in giving intu-
"" D i ,Ki =of1,depth
is defined as MSImpact " k i ,Ki = 1 ,. A signal X belongs to this ition to this wealth of descriptions of X. A human-being body
53can
Exponential advantage of depth

To sum up:
 for single hidden layer networks: in the worst case, the model
capacity is linear in the model size;
 for deeper networks: the model capacity is exponential in the
model size.

Deeper networks make it possible to learn more complex functions


without overfitting, provided that that these functions involve the
composition of several simpler functions.

Impact of depth 54
DESIGN CHOICES
Supervised training
Needs:

 labeled training set = set of examples xt and associated


targets yt related to the task (classification, detection,
regression, etc.)

 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.

 loss function L(fθ (x), y)

Training = find θ that minimizes the total loss on the training set

1 X
T
min L(fθ (xt ), yt )
θ T
t=1

Design choices 56
Unsupervised training

Needs:

 unlabeled training set = set of examples xt without targets


 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.
 loss function L(fθ (x))

1 X
T
Training = find θ that minimizes the total loss min L(fθ (xt ))
θ T
t=1

Less common in deep learning until recently. Typically used to:


 extract an embedding fθ (x) used as input to a supervised task,
 or pretrain a model fθ (.) on a large unlabeled dataset that will
be fine-tuned for a supervised task on a smaller labeled set,
 or train a generative model for synthetic data generation.
Design choices 57
Other forms of training

Variants (not studied hereafter) include:

 semi-supervised training: some examples xt labeled, some not

 weakly-supervised training: labels yT related to collections of


examples {xt }t∈T , rather than individual examples xt

 reinforcement learning: observed xt depends on the actions


taken by the learning agent until time t.

Design choices 58
Design choices
The task to be addressed directly translates into choosing:
 the output activation function o(.)
 the cost function L(., .)

Typically:
 define the task in a statistical manner, i.e., learn p(y|x)
(supervised) or p(x) (unsupervised),
 derive the output activation function and the cost function
corresponding to maximum likelihood (ML) estimation, i.e.:

L(ŷ, y) ∝ − log p(y|x) or − log p(x).

log p rather than p itself because total cost = sum of individual


costs rather than product.
Design choices 59
Detection

Detection: f (x) = y ∈ {0, 1}.

Example: CAPTCHA

Design choices 60
Detection

Treat as posterior probability estimation problem: ŷ = P(y = 1|x).

ŷ ∈ [0, 1] ⇒ sigmoid output activation function:

ŷ = o(a) = σ(a)

Interestingly:

P(y = 1|x) σ(a)


log = log =a
P(y = 0|x) 1 − σ(a)

⇒ a interpretable as the log-likelihood ratio between the two


classes

Design choices 61
Detection

Bernoulli likelihood function:


(
ŷ if y = 1
P(y |x) =
1 − ŷ if y = 0

ML estimation equivalent to minimizing the cross-entropy:

L(ŷ , y ) = −y log ŷ − (1 − y ) log(1 − ŷ ).

Design choices 62
Classification

Classification (three or more classes):


f (x) = y ∈ {1, . . . , n}

Example: ImageNet (1,000 classes)

Design choices 63
Classification
Treat as posterior probability estimation problem: ŷi = P(y = i|x).
n
X
ŷi ≥ 0 and ŷi = 1 ⇒ softmax activation function:
i=1

exp(ai )
ŷi = o(a)i = Pn
i 0 =1 exp(ai 0 )

Categorical likelihood function:




ŷ1
 if y = 1
.
P(y |x) = ..


ŷ if y = n
n

Design choices 64
Classification
ML estimation equivalent to minimizing the cross-entropy:
n
X
L(ŷ, y) = − yi log ŷi
i=1

with y the one-hot vector representing the class y :


 
0
 .. 
.
 
0
 
y = 1 ← y -th position

0
 
 .. 
.
0

Design choices 65
Regression

Regression: f (x) = transformed real-valued vector ŷ

Example: estimate house location, size, price, etc., from an image

Design choices 66
Regression

ŷ typically unbounded ⇒ linear output activation function:

ŷ = o(a) = a

Gaussian likelihood function with fixed covariance σ 2 I:

p(y|x) = N (y; ŷ, σ 2 I).

ML estimation equivalent to minimizing the mean squared error:

L(ŷ, y) = kŷ − yk2 .

Also applicable to unsupervised compression aka. autoencoder.

Design choices 67
Autoencoder
Autoencoder: neural network trained to predict its input.

Unsupervised learning.

Consists of two parts:


 an encoder function h = f (x)
 a decoder function x̂ = r (h) such that x̂ ≈ x

The hidden activations h provide a nonlinear representation of the


input called an embedding.

Typically, the decoder has a linear output activation function and


performance is measured by the squared error loss

L(r (f (x)), x) = kr (f (x)) − xk2

Design choices 68
[email protected] Université
h(x)de Sh
Topics: autoencoder, encoder, decoder, tied=w
Autoencoder hugo.larochelle@us =
•October 17, 2012
Feed-forward neural network trained to repro
the output layer October 16,
• Deco
Abstract
ck
x
des “Autoencoders”. b = o(b
x a
Abstract
W =W = sig
Math for my slides “Autoencoders”.
(tied weights)
for binar
j P
h(x) = bg(a(x)) P
• b l(f (x)) =
• f (x) ⌘ x k (b
xk xk )2 l(f (x)) = k
= sigm(b + Wx) Enco
W
h(x) = g(a
x = sig
b
x = o(b
a(x))
= sigm(c + W⇤ h(x))

1
P P
)) = 2 k (b
xk xk )2 l(f (x)) = k (xk log(b
xk ) + (1 xk ) log(1 bk ))
x
Design choices b
x = o(b
a69(x)
(t) (t)
Parameter tying

The parameters of the encoder and the decoder may be different,


or they may be tied (i.e., their weight matrices are transposed
versions of each other).

When applying SGD to an autoencoder with tied parameters, the


gradient of the encoder and the transposed gradient of the decoder
must be summed together.

Design choices 70
Perfect vs. approximate reconstruction

Perfect reconstruction ŷ = r (f (x)) = x is useless.

Instead, autoencoders are designed to reconstruct approximately by


 making h smaller than x,
 or adding noise to the training data,
 or adding a regularization term (detailed later)

Because the model is forced to prioritize which aspects of the input


should be reproduced, the embedding h often represents salient
features of the data.

Design choices 71
Undercomplete autoencoder

Undercomplete autoencoder = encoder where the embedding h


has fewer dimensions than the input x.

Achieves a form of “compression”.

If f (.) and r (.) are linear, this is equivalent to principal component


analysis (PCA).

An undercomplete autoencoder is thus a nonlinear generalization


of PCA.

Design choices 72
EXAMPLE OF DATA SET: M
Example
L AROCHELLE , B ENGIO , L OURADOUR AND L AMBLIN

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Figure 5: Samples fromMNIST


the MNISThandwritten
digit recognitiondigits
data set.dataset
Here, a black pixel corresponds to
an input value of 0 and a white pixel corresponds to 1 (the inputs are scaled between 0
and 1).
Design choices 73
FILTERS (AUTOENCODER
a white pixel to a weight larger than 3, with the different shades of gray corresponding
−3 and 3. et al., JMLR2009)
to different weight values uniformly between (Larochelle 93
Example
1

0.9

0.8

0.7

0.6

0.5

0.4

0.3

0.2

0.1

0
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

Subset of learned weights (each column of W shown as an image)


6.7 – Display
Figure 6.6 Input weights
of the of a random
input weightssubset of the hidden
of a random subset units,
of thelearned by an learned
hidden units, autoas-
sociator
by an RBMwhen 250-dimensional
trained
when trained embedding
on samples from
on samples from forMNIST
the the
MNISTan input dimension
dataset. TheThe
dataset. of settingofisunits
display
activation the
same
of theas forhidden
first 6.6. is obtained28
Figurelayer by × 28 product
a dot = 784 of such a weight “image” with the
input image. Design
In these images, a black pixel corresponds to a weight smaller than −374and
choices
Denoising autoencoder

Denoising autoencoder: change the loss function to

L(r (f (x̃)), x) = kr (f (x̃)) − xk2

with x̃ a copy of x corrupted by some form of noise.

Encoder must learn to undo this corruption, which forces it to


learn salient features, rather than noise originally present in x.

Makes it possible to obtain embeddings h whose dimension is


bigger than the data x.

Design choices 75
Denoising autoencoder

Design choices 76
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)
• No corrupted inputs (cross-entropy loss)

Subset of learned
(a) Noweights — No
destroyed corrupted inputs
inputs (b)

Design choices 77
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)inputs
• 25% corrupted

nputs Subset of learned weights


(a) No
(b) 25% — 25%
destroyed corrupted inputs
inputs
destruction
(b) 25%
(c) 5

Design choices 78
(Vincent, Larochelle, Bengio and Manzagol, ICML 2008)
Example (MNIST)inputs
• 50% corrupted

tion Subset of learned weights


(a) No — 50%
destroyed
(c) 50% corrupted inputs
inputs
destruction
(b) 25%

Design choices 79
Synthetic data generation
Synthetic data generation: compute and draw samples from p(x).

For discrete data, treat as a series of classification tasks:

p(x) = p(x1 ) × p(x2 |x1 ) × . . . p(xn |x1 , . . . , xn−1 ).

Example: language modeling / text generation

For continuous data, more complicated (see NLP-only course).


Design choices 80
STOCHASTIC GRADIENT DESCENT
Training ingredients
In this section, focus on supervised learning. Similar derivations
may be conducted for unsupervised or other forms of learning.

Training ingredients:
 training set = set of examples xt and associated targets yt
 vector of model parameters θ = {W (1) , b (1) , W (2) , b (2) , . . . }.
 loss function L(fθ (x), y)

Training = find θ that minimizes the total loss on the training set
1 X
T
L(fθ (xt ), yt )
T
t=1

Nonlinear function of θ ⇒ gradient-based optimization

Stochastic gradient descent 82


Scalar gradient descent algorithm
Suppose we have a scalar function L(θ) (1 single parameter θ).

dL(θ)
The derivative L0 (θ) = gives the slope of L(θ) at point θ.

It specifies how to scale a small change in the input in order to
obtain the corresponding change in the output:

L(θ + ) ≈ L(θ) + L0 (θ).

Gradient descent: repeatedly moving θ in small steps opposite of


the gradient direction

θ ← θ − L0 (θ)

makes L(θ) gradually smaller


Stochastic gradient descent 83
Scalar gradient descent algorithm

Stochastic gradient descent 84


Stationary points
Values of θ for which L0 (θ) = 0 are called stationary points.

APTERThese include:
4. NUMERICAL COMPUTATION
 local minima, i.e., L(θ) lower than all neighboring values,
 local maxima,
 saddle points

Minimum Maximum Saddle point

Stochastic gradient descent 85


Local vs. global minimum
APTER 4. NUMERICAL COMPUTATION
Under mild conditions, gradient descent converges to a local
minimum, but not necessarily to the global minimum.

This local minimum


performs nearly as well as
the global one,
so it is an acceptable
halting point.
f (x)

Ideally, we would like


to arrive at the global
minimum, but this
might not be possible.
This local minimum performs
poorly and should be avoided.

gure 4.3: Optimization algorithms may fail to find a global minimum when there
ltiple local minima
Stochastic or plateaus
gradient descent present. In the context of deep learning, we gener
86
Vector gradient descent algorithm
In the non-scalar case, consider the gradient of the function:
 
∂L(θ)
 ∂θ1 
 . 
∇θ L(θ) =   .. 

 ∂L(θ) 
∂θK

∂L(θ)
where are partial derivatives.
∂θk
Steepest descent: move θ in the direction opposite of the gradient

θ ← θ − ∇θ L(θ).

Learning rate  typically fixed to a small value.


Stochastic gradient descent 87
Vector gradient descent

Stochastic gradient descent 88


Computing the gradient of a neural network
To apply this approach to a feedforward neural network, we must
compute the partial derivative of the loss with respect to all
parameters θk (i.e., the weights and the biases of all layers):

∂L(fθ (x), y)
∂θk

To do so, we use the chain rule of derivation: if y = g(x) and


z = f (y), then
∂z X ∂z ∂yj
= .
∂xi ∂yj ∂xi
j
 T  
∂y ∂y
In matrix form: ∇x z = ∇y z with the Jacobian
∂x ∂x
matrix of g.
Stochastic gradient descent 89
Gradients in output layer
Gradient w.r.t. the outputs:

∇ŷ L(ŷ, y) = depends on the chosen loss L

Gradient w.r.t. output pre-activations:


 T
∂ ŷ
∇a (L+1) L(ŷ, y) = ∇ŷ L(ŷ, y)
∂a (L+1)
= o 0 (a (L+1) ) ∇ŷ L(ŷ, y)

where denotes elementwise multiplication

In the special case of a linear output layer ŷ = o(a) = a and a


squared error loss L(ŷ, y) = kŷ − yk2 , we have:

∇ŷ L(ŷ, y) = ∇a (L+1) L(ŷ, y) = 2(ŷ − y)


Stochastic gradient descent 90
Gradients in hidden layer

Gradient w.r.t. the activations:


!T
∂a (l+1)
∇h (l) L(ŷ, y) = ∇a (l+1) L(ŷ, y)
∂h (l)
= W (l+1) ∇a (l+1) L(ŷ, y)

Gradient w.r.t. the pre-activations:


!T
∂h (l)
∇a (l) L(ŷ, y) = ∇h (l) L(ŷ, y)
∂a (l)
= g 0 (a (l) ) ∇h (l) L(ŷ, y)

Stochastic gradient descent 91


Gradients w.r.t. weights and biases
Gradient w.r.t. the biases:
!T
∂a (l)
∇b (l) L(ŷ, y) = ∇a (l) L(ŷ, y)
∂b (l)
= ∇a (l) L(ŷ, y)

Gradient w.r.t. a row wj of the weight matrix W (l) :


(l)

!T
∂a (l)
∇w (l) L(ŷ, y) = (l)
∇a (l) L(ŷ, y)
j ∂wj
(l−1)
= hj ∇a (l) L(ŷ, y)

T
In matrix form: ∇W (l) L(ŷ, y) = ∇a (l) L(ŷ, y) h (l−1)
Stochastic gradient descent 92
Backpropagation algorithm
By considering all those expressions together, we obtain the
so-called backpropagation algorithm.

Compute the gradient g w.r.t. the output pre-activations:


g ← ∇ŷ L(ŷ, y)
g ← ∇a (L+1) L(ŷ, y) = o 0 (a (L+1) ) g
For l = L + 1, L − 1, . . . , 1:
Compute gradient w.r.t. biases and weights:
∇b (l) L(ŷ, y) = g
T
∇W (l) L(ŷ, y) = g h (l−1)
Propagate gradient to lower layer:
g ← ∇h (l−1) L(ŷ, y) = W (l) g
g ← ∇a (l−1) L(ŷ, y) = g 0 (a (l−1) ) g
Endfor
Stochastic gradient descent 93
Computational
Implementation Flow
of forward pass Graph
• Forward propagation can be represented
as an acyclic flow graph
In practice, forward pass implemented by
• Forward propagation can be implemented
 modeling
in a modular way: the network as an acyclic
computational flow graph,
Ø Each box can be an object with an fprop
 associating each box with an fprop
method, that computes the value of the
method, that computes the value of
box given its children
the box given its children,
Ø  calling
Calling themethod
the fprop method
fprop of of each
each box in box
the right in
order yields forward
bottom-up order. propagation

Stochastic gradient descent 94


Computational
Implementation Flow
of backpropagation Graph
• Forward propagation
Similarly, can be represented
backpropagation implemented by
as an acyclic flow graph
 associating each box with a bprop
• Forwardmethod,
propagation can be implemented
that computes the gradient
in a modular way:
with respect to each child box,

Ø Each calling
 box can the
be an method
object
bprop with anoffprop
each box
method, in reverse,
that top-down
computes order.
the value of the
box given its children
This differentiable programming approach
avoidsthe
Ø Calling computing the gradient
fprop method manually.
of each box in
the right order yields forward propagation
Introduced in Theano. Now standard in all
libraries (TensorFlow, PyTorch. . . )

Stochastic gradient descent 95


Back to gradient descent
Using backpropagation, we can compute the gradient of the loss
for each training example (xt , yt ) w.r.t. the network parameters θ:

∇θ L(fθ (xt ), yt )

The gradient of the total loss over the training set is therefore:
" #
1 X 1 X
T T
∇θ L(fθ (xt ), yt ) = ∇θ L(fθ (xt ), yt )
T T
t=1 t=1

Computing the gradient exactly is very expensive because it


requires a pass on the entire dataset at every iteration of the
gradient descent algorithm.

Stochastic gradient descent 96


Stochastic gradient descent
Stochastic gradient descent (SGD) algorithm:
 at each iteration i, compute the gradient over a random
subset Ti and perform one step of gradient descent:
1 X
θ ←θ−× ∇θ L(fθ (xt ), yt )
T
t∈Ti

 at the next iteration, pick another random subset


 when the entire dataset has been passed, start over again

The subsets must be random, disjoint, and cover the entire dataset.

Terminology:
 Ti is called a minibatch
 each pass over the entire dataset is called an epoch
Stochastic gradient descent 97
Stochastic gradient descent

Splitting the training set into B minibatches


 reduces the computation cost of one gradient by a factor of B
 increases √
the standard deviation on the gradient estimate by a
factor of B only.

More iterations, but fewer epochs (hence smaller total


computation cost).

Stochastic gradient descent 98


Practical implementation considerations

In practice, the gradients ∇θ L(fθ (xt ), yt ) for all examples (xt , yt )


are computed in parallel using a general-purpose graphical
processing unit (GPU) and summed within a given minibatch.

The choice of the minibatch size is governed by the following


practical considerations:
 the minibatch data and computations must fit in GPU memory
 too small minibatches do not exploit well GPU capabilities
 some kinds of hardware perform better with power-of-2 sizes

Typical minibatch sizes: from 32 to 256.

Stochastic gradient descent 99


Limitation of (stochastic) gradient descent

20

10

0
x2

−10

−20

−30
−30 −20 −10 0 10 20
Tends to “zigzag” when descending a “canyon”, which increases
x 1

the number of iterations


4.6: Gradient descent fails to exploit the curvature information contain
matrix. Here we use
Stochastic gradient
gradient descent descent to minimize a quadratic function
100 f(
Stochastic gradient descent with momentum
Solution: smooth the gradient estimates across several iterations.

Momentum = vector v representing the direction and speed at


which the parameters move through parameter space.

Defined as an exponentially decaying average of the negative


gradient.

SGD with momentum: initialize v = 0, then replace each iteration


of SGD by:
1 X
v ← αv −  × ∇θ L(fθ (xt ), yt )
T
t∈Ti
θ ←θ+v

Stochastic gradient descent 101


Stochastic gradient descent with momentum

20

10

−10

−20

−30
−30 −20 −10 0 10 20
Converges in fewer iterations than conventional SGD
8.5: Momentum aims primarily to solve two problems: poor condition
matrix and variance in the stochastic gradient. Here, we illustrate how m
mes the firstStochastic
of these two problems. The contour lines depict a 102
gradient descent
quad
Local optima
When properly tuned (learning rate not too large nor too small),
SGD converges to a local minimum.

How many local minima are they? Are they good or bad?

Neural networks always have multiple local minima because of


model identifiability issues:
 reordering the neurons in each layer (n!m possible orders for m
layers with n units each),
 or scaling the incoming weights and biases of a ReLU neuron
by α and its outgoing weights by 1/α
do not change the value of the cost function.

⇒ there can be a large or infinite number of local minima, but


they are are all equivalent to each other (hence not a problem).
Stochastic gradient descent 103
Local optima

For many years, people believed that large neural networks failed
because of poor local minima.

Recent theoretical and experimental results suggest that, for


sufficiently large neural networks:
 most stationary points are saddle points corresponding to a
high value of the cost function,
 but SGD manages to avoid them;
 most local minima correspond to a low value of the cost
function.

Stochastic gradient descent 104


Vanishing/exploding gradient

At each layer of the backpropagation algorithm, the gradient is


multiplied by W (l) and scaled by g 0 (a (l−1) ).

Muliplying by W (l) scales the activations in the order of the largest


eigenvalue of W (l) which can be  1 or  1.

g 0 (a (l−1) ) is typically ≤ 1.

For very deep networks, this can cause:


 vanishing gradient, g ≈ 0 ⇒ slow learning
 exploding gradient, kgk  1 ⇒ unstable learning.

Problem especially common for recurrent networks (see later) and


when g 0 (a) ≈ 0 for a large range of a values.

Stochastic gradient descent 105


Sigmoid unit
1.5
g( x)
g ′(x)
1

0.5
h

-0.5
-5 0 5
a
1
g(a) = σ(a) = ⇒ g 0 (a) = σ(a)(1 − σ(a))
1 + exp(−a)
g 0 (a) ≈ 0 for many (large) negative or positive values

Stochastic gradient descent 106


Rectified linear unit
5
g( x)
4 g ′(x)

2
h

-1
-5 0 5
a (
0 if x < 0
g(a) = max(0, a) ⇒ g 0 (a) =
1 if x ≥ 0
better than sigmoid but still g 0 (a) = 0 for all a ≤ 0 ⇒ problem if a
gets stuck there, solved by batch normalization (see later)
Stochastic gradient descent 107
CONVOLUTIONAL NETWORKS
Convolutional neural networks

Convolutional neural networks (CNNs) were invented by LeCun


[1989].

CNN = network that uses a linear operation called convolution


instead of general matrix multiplication in at least one layer.

CNNs are appropriate for processing data with a known, grid-like


topology:
 1D: time-series data (speech, text, finance, biosignals. . . ),
 2D: images,
 3D: video,
 etc.

Convolutional neural networks 109


1D convolution
Given two sequences xt and kt , the convolved sequence at is
defined by the following linear operation:
X∞
at = xτ kt−τ
τ =−∞

The convolution operation is typically denoted with an asterisk:


a =x ∗k

Terminology:
 x: input (or signal)
 k: convolution kernel (or filter)
 a: feature map

Examples: sliding window averaging, smoothing, etc

Convolutional neural networks 110


1D convolution

Convolutional neural networks 111


2D convolution

Convolution can be generalized to any dimension.

For instance, in 2D:



X ∞
X
Ai,j = (X ∗ K )i,j = Xm,n Ki−m,j−n
m=−∞ n=−∞

Examples: edge detection, sharpening, blurring. . .

Convolutional neural networks 112


2D convolution

     
0 0 0 −1 −1 −1 0 −1 0
K = 0 1 0 K = −1 8 −1 K = −1 5 −1
0 0 0 −1 −1 −1 0 −1 0

Convolutional neural networks 113


2D convolution

     
1 1 1 1 2 1 −1 −2 −1
1 1  1 
K= 1 1 1 K= 2 4 2 K= −2 28 −2
9 16 16
1 1 1 1 2 1 −1 −2 −1

Convolutional neural networks 114


Practical implementation

In pratice:

 Convolution is commutative:

X ∞
X
Ai,j = (X ∗ K )i,j = (K ∗ X)i,j = Km,n Xi−m,j−n .
m=−∞ n=−∞

 K has finite length M × N, hence m and n span finite


intervals:
M−1
X N−1
X
Ai,j = (X ∗ K )i,j = (K ∗ X)i,j = Km,n Xi−m,j−n .
m=0 n=0

Convolutional neural networks 115


Convolution vs. cross-correlation

Convolution is similar to cross-correlation:


M−1
X N−1
X
Ai,j = Km,n Xi+m,j+n .
m=0 n=0

The difference only lies in “flipping” the kernel.

Most machine learning libraries implement cross-correlation but


call it convolution.

Since kernel learned from data, makes no difference.

In the following, “convolution” refers to both convolution and


cross-correlation.

Convolutional neural networks 116


Cross-correlation
Input
Kernel
a b c d
w x
e f g h
y z
i j k l

Output

aw + bx + bw + cx + cw + dx +
ey + fz fy + gz gy + hz

ew + fx + fw + gx + gw + hx +
iy + jz jy + kz ky + lz

Figure 9.1: An example of 2-D convolution without kernel-flipping. In this case we restrict 117
Convolutional neural networks
the output to only positions where the kernel lies entirely within the image, called “valid”
Motivation for CNNs

Convolution leverages three key machine learning concepts:


 sparse connectivity,
 parameter sharing,
 equivariant representation.

In addition, it provides a means for handling variable size inputs.

Convolutional neural networks 118


Sparse connectivity

Traditional networks: every output fully connected to every input.

CNNs: every output connected only to a few neighboring inputs


due to small kernel.

Still, in a deep CNN, each unit in the upper layers can indirectly
interact with many inputs forming its receptive field.

Benefits:
 smaller model size,
 less overfitting,
 faster computation.

Convolutional neural networks 119


Sparse connectivity
CHAPTER 9. CONVOLUTIONAL NETWORKS

Convolutional
CHAPTER 9. CONVOLUTIONAL NETWORKS

s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Fully connected
s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Figure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s3 ,
and also
Figure highlight
9.3: Sparse the input units
connectivity, in x from
viewed that above:
affect this unit. These
We highlight oneunits
outputareunit,
knowns3 ,
Convolutional neural networks 120
as the
and receptive
also highlightfield of s3 . units
the input (Top)When s isaffect
in x that formed
thisbyunit.
convolution with aare
These units kernel
known of
igure 9.3: Sparse connectivity, viewed from above: We highlight one output unit, s
nd also highlight the input units in x that affect this unit. These units are know
Receptive field
s the receptive field of s3 . (Top)When s is formed by convolution with a kernel
idth 3, only three inputs affect s 3 . (Bottom)When s is formed by matrix multiplicatio
onnectivity is no longer sparse, so all of the inputs affect s3 .

g1 g2 g3 g4 g5

h1 h2 h3 h4 h5

x1 x2 x3 x4 x5

igure 9.4: The receptive field of the units in the deeper layers of a convolutional networ
larger than the receptive field of the units in the shallow layers. This effect increases
he network includes architectural features like strided convolution (figure 9.12) or poolin
ection 9.3). This means that even though direct connections in a convolutional net ar
ery sparse, units in the deeper layers can be indirectly connected to all or most121of th
Convolutional neural networks
Receptive field

Convolutional neural networks 122


Parameter sharing

Traditional networks: every parameter used exactly once.

CNNs: parameters shared (tied) across all positions in the input.

Benefits:
 even smaller model size,
 even less overfitting.

Convolutional neural networks 123


Parameter sharing
HAPTER 9. CONVOLUTIONAL NETWORKS
HAPTER 9. CONVOLUTIONAL NETWORKS
Convolutional

s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

Fully connected
s1 s2 s3 s4 s5
s1 s2 s3 s4 s5

x1 x2 x3 x4 x5
x1 x2 x3 x4 x5

igure 9.5: Parameter sharing: Black arrows indicate the connections that use a particula
arameter
igure 9.5: in two different
Parameter sharing: models.
Black (Top)The black the
arrows indicate arrows indicatethat
connections usesuse
of athe centr
particula
Convolutional neural networks 124
ement of a 3-element kernel in a convolutional model. Due to parameter sharing, th
Parameter sharing

Convolutional neural networks 125


Computational efficiency

Figure 9.6: Efficiency of edge detection. The image on the right was formed by taking
eachExample edge
pixel in the detection
original image and (280 × 320 pixels):
subtracting the value of its neighboring pixel on the
left. This shows the strength of all of the vertically oriented edges in the input image,
whichcanconvolutional:
be a useful operation1 × 2for kernel,
object 280 × 319Both
detection. × 3 images
≈ 2 × are
105280 pixels tall.
The input operations
image is 320 pixels wide while the output image is 319 pixels wide. This
transformation can be described by a convolution kernel containing 9two elements, and
 319
requires fully× connected:
280 × 3 = 267, 280 320 × point
960×floating 280 ×operations
319 ≈ 8(two × 10multiplications
weights, and
≈ 16 × 10 operations
one addition per 9
output pixel) to compute using convolution. To describe the same
transformation with a matrix multiplication would take 320 × 280× 319 × 280, or over
eight billion, entries in the matrix, making convolution four billion times more efficient for
representingConvolutional
this transformation.
neural networks
The straightforward matrix multiplication algorithm 126
Equivariance
Equivariance: if the input is transformed, the output is transformed
in the same way.

CNN layers are equivariant to translation: if the input signal is


shifted by τ , the output feature map is also shifted by τ .

Hence the term “feature map”: provides a map showing where


different features appear in the input.

Example: when detecting objects in images, because each object


will look the same anywhere in the image, it is practical to share
parameters across the entire image.

Note: convolution is not equivariant to other transformations, e.g.,


scaling or rotation.

Convolutional neural networks 127


CNN layer
One CNN layer l typically consists of three successive stages:

 linear convolution by multiple kernels Kf , each extracting a


(l)

different feature at many spatial locations from the input X or


from all outputs H (l−1) of the previous layer l − 1,

or
(1) (1) (l) (l)
Af = X ∗ Kf Af = H (l−1) ∗ Kf

 nonlinear activation of all feature maps (feature detection),


(l) (l)
H̃f = g (l) (Af )

 pooling across neighboring points in each map.

= pool(H̃f )
(l) (l)
Hf

Convolutional neural networks 128


CNN layer Complex layer terminology Simple layer terminology

Next layer Next layer

Convolutional Layer

Pooling stage Pooling layer

Detector stage:
Detector layer: Nonlinearity
Nonlinearity
e.g., rectified linear
e.g., rectified linear

Convolution stage: Convolution layer:


Affine transform Affine transform

Input to layer Input to layers

Convolutional neural networks 129


Dealing with multichannel inputs
By contrast with basic 1D/2D convolution, each entry of the grid
is usually not a scalar but a vector consisting of several channels:
 red, green, blue intensities at each pixel of an input image,
 multiple outputs from the previous layer.

The inputs are therefore 3D arrays (a.k.a. tensors) Xi,j,k : i, j index


the spatial coordinates, k indexes the input channels.

Convolution over the channels does not make sense.

In standard multichannel convolution, each input channel is fully


connected to all feature maps (a.k.a. output channels):
X M−1
X N−1
X
Ai,j,f = (X ∗1,2 K )i,j,f = Xm,n,k Ki+m,j+n,f ,k
k m=0 n=0

Convolutional neural networks 130


Dealing with multichannel inputs

Convolutional neural networks 131


Dealing with multichannel inputs
Standard 2D convolution requires F 3D kernels of size M × N × K ,
i.e., M × N × K × F parameters and multiplications for each
output pixel.

Depthwise separable convolution: first convolve the pixels in each


of the K input channels by one 2D kernel of size M × N (depthwise
convolution), then multiply the channels in each pixel by an F × K
matrix (pointwise convolution, a.k.a. 1 × 1 convolution).

X M−1
X N−1
X
Ai,j,f = (X ∗1,2 K )i,j,f = Kfpointwise
,k
depthwise
Xm,n,k Ki+m,j+n,k
k m=0 n=0

Depthwise separable convolution requires only M × N × K + K × F


parameters and multiplications for each output pixel. Benefits:
even smaller model size, even less overfitting, faster computation.
Convolutional neural networks 132
Zero-padding
Zero-padding = replacing undefined values of the input by 0.

Control the kernel width and the output size independently.

When X and K have finite sizes I and M along a given spatial


dimension, the size of A = X ∗ K along that dimension is equal to:

 I − M + 1 if the convolution kernel must remain entirely


within the image (no zero-padding, a.k.a. valid convolution)

 I if padding keeps the output size equal to the input (a.k.a.


same convolution)

 I + M − 1 if enough zeroes are added for every pixel to be


visited M times (a.k.a. full convolution)

Convolutional neural networks 133


Zero-padding

Zero-padding implies some boundary effects:


 either the border pixels are underrepresented in the model
(“valid”, “same”)
 or the border pixels are a function of fewer pixels, which
makes it difficult to learn a single kernel that performs well at
all positions (“full”)

Usually the optimal amount of zero padding lies somewhere


between “valid” and “same”.

Convolutional neural networks 134


CHAPTER 9. CONVOLUTIONAL NETWORKS

CHAPTER 9. CONVOLUTIONAL NETWORKS

Zero-padding
No zero padding (a.k.a. “valid”)

...

...
... ...

... ...

“same”
... ...
... ...
... ...
... ...
... ...
... ...
Figure 9.13: The effect of zero padding on network size: Consider a convolutional network
with a kernel of width six at every layer. In this example, we do not use any pooling, so
Figure
only the9.13:
convolution operation
The effect itself shrinks
of zero padding size: Consider
the network
on network a convolutional
size. (Top)In network
this convolutional
with a kernel
network, of width
we neural
Convolutional do sixany
at every
notnetworks
use layer.
implicit zeroInpadding.
this example,
Thiswe do not
causes theuse any pooling, so
representation to 135
Pooling

Pooling: replace neighboring points in a given feature map (a.k.a.


channel) H̃f by a single summary statistic.

Max pooling: maximum within a rectangular neighborhood

Hi,j,f = max H̃i+m,j+n,f


m∈{0,...,M−1}
n∈{0,...,N−1}

Other popular pooling functions:


 average of a rectangular neighborhood,
 `2 norm of a rectangular neighborhood,
 weighted average based on the distance from the central pixel.

Convolutional neural networks 136


Translation invariance
Pooling makes the representation more invariant to small
translations of the input.
POOLING STAGE

1. 1. 1. 0.2
... ...

... 0.1 1. 0.2 0.1 ...

DETECTOR STAGE

POOLING STAGE

... 0.3 1. 1. 1. ...

... 0.3 0.1 1. 0.2 ...

DETECTOR STAGE

FigureConvolutional
9.8: Max pooling introduces invariance. (Top)A view of the middle of the output
neural networks 137
Translation invariance

Translation invariance is useful when we care more about whether


some feature is present than exactly where it is.

Example: to detect a face, we search for an eye on the left side


and an eye on the right side, but we don’t need to locate them
with pixel-level accuracy.

Counter-example: to denoise an image, we must preserve the


location of the features. In that situation pooling is not desirable.

Convolutional neural networks 138


Figure 9.9: Example of learned invariances: A pooling unit that pools over multiple features
that are learned with separate parameters can learn to be invariant to transformations of

Stride the input. Here we show how a set of three learned filters and a max pooling unit can learn
to become invariant to rotation. All three filters are intended to detect a hand-written 5.
Each filter attempts to match a slightly different orientation of the 5. When a 5 appears in
the input, the corresponding filter will match it and cause a large activation in a detector
unit. The max pooling unit then has a large activation regardless of which detector unit
was activated. We show here how the network processes two different inputs, resulting
Since pooling summarizes the responses over a neighborhood,
in two different detector units being activated. The effect on the pooling unit is roughly

report summary statistics every S pixels instead of every 1 pixel.


the same either way. This principle is leveraged by maxout networks (Goodfellow et al.,
2013a) and other convolutional networks. Max pooling over spatial positions is naturally
invariant to translation; this multi-channel approach is only necessary for learning other
transformations.
S is called the stride. It can be different for each direction in space.

1. 0.2 0.1

0.1 1. 0.2 0.1 0.0 0.1

Figure 9.10: Pooling with downsampling. Here we use max-pooling with a pool width of

This improves
of two, which computational efficiency because thenextnext
layer.layer
Note has
three and a stride between pools of two. This reduces the representation size by a factor
reduces the computational and statistical burden on the
roughly S times fewer inputs to process.
that the rightmost pooling region has a smaller size, but must be included if we do not
want to ignore some of the detector units.

344

Convolutional neural networks 139


CHAPTER 9. CONVOLUTIONAL NETWORKS

Stride
Efficient implementation based on strided convolution
Figure 9.12: Convolution with a stride. In this example, we use a stride of two.
(Top) Convolution with a stride length of two implemented in a single operation.
(Bottom) The inefficient implementation: convolution with a stride greater than one pixel is
mathematically equivalent to convolution with unit stride followed by downsampling.

Convolutional neural networks 140
Variable size inputs

Pooling also helps handle inputs of varying size.

Example: when classifying images of variable size, the input to the


classification layer must have a fixed size.

Solution: vary the pool size but not the number of pools such that
the classification layer always receives the same number of
summary statistics regardless of the input size.

For instance, pool over the whole image, independently of the


image size.
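For instance, in PyTorch (a sketch assuming torch is available; the layer sizes are arbitrary),
nn.AdaptiveAvgPool2d adapts the pool size to the input so that the classifier always receives
the same number of summary statistics:

import torch
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),   # pool over the whole feature map, whatever its size
    nn.Flatten(),
)
classifier = nn.Linear(64, 1000)

for size in [(256, 256), (180, 300)]:     # two different image sizes
    x = torch.randn(1, 3, *size)
    print(classifier(features(x)).shape)  # torch.Size([1, 1000]) in both cases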

Convolutional neural networks 141


Example CNN architectures for image classification
For illustration purposes only (modern CNNs are deeper)

 Left: 2 convolutional + 1 fully connected layer (fixed-sized image)

 Middle: 2 convolutional + 1 fully connected layer (variable-sized image)

 Right: fully convolutional

Figure 9.11: Examples of architectures for classification with convolutional networks. The
specific strides and depths used in this figure are not advisable for real use; they are designed
to be very shallow in order to fit onto the page. Real convolutional networks also often involve
significant amounts of branching, unlike the chain structures used here for simplicity. (Left) A
convolutional network that processes a fixed image size.

Convolutional neural networks 142
After alternating between convolution and pooling for a few layers, the tensor for the
Graphical representation of CNN architectures

Example: LeNet-5 for digit recognition
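A minimal PyTorch sketch of a LeNet-5-style network for 32x32 grayscale digit images (simplified:
ReLU and max pooling instead of the original sigmoid units and average subsampling; layer sizes
follow the classic description):

import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),    # 6 x 14 x 14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),   # 16 x 5 x 5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
    nn.Linear(120, 84), nn.ReLU(),
    nn.Linear(84, 10),                                             # 10 digit classes
)

print(lenet(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])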

Convolutional neural networks 143


Graphical representation of CNN architectures

Example: AlexNet for image classification

Convolutional neural networks 144


Graphical representation of CNN architectures
Example: VGG-16 for image classification

Convolutional neural networks 145


LINKS WITH HUMAN VISION
Visual cortex

Links with human vision 147


Visual cortex

Wallisch & Movshon (2008), Figure 1: A scaled representation of the cortical visual areas of the
macaque. Each colored rectangle represents a visual area, for the most part following the names
and definitions used by Felleman and Van Essen (1991).

Links with human vision 148
Visual cortex

Links with human vision 149


Receptive field measurement

Reverse correlation approach for receptive field measurement in


animals:

 put an electrode in an individual neuron,

 display several white noise images in front of the retina,

 record the resulting neuron responses,

 fit a linear model in order to estimate the neuron’s weights.

Note: record first 100 ms only. After that, information begins to


flow backwards as the brain uses top-down feedback.
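A minimal NumPy sketch of this reverse-correlation procedure on simulated data (the "true"
receptive field, noise level and stimulus count are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
n_stimuli, n_pixels = 5000, 16 * 16

# Simulated ground truth: the neuron responds linearly to a hidden receptive field.
true_rf = rng.standard_normal(n_pixels)

# Display white-noise images and record the (noisy) neuron responses.
stimuli = rng.standard_normal((n_stimuli, n_pixels))
responses = stimuli @ true_rf + 0.5 * rng.standard_normal(n_stimuli)

# Fit a linear model to estimate the neuron's weights (least squares).
estimated_rf, *_ = np.linalg.lstsq(stimuli, responses, rcond=None)

print(np.corrcoef(true_rf, estimated_rf)[0, 1])   # close to 1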

Links with human vision 150


Retina and lateral geniculate nucleus (LGN)
Foveal retina cells → parvocellular (P) layers of the LGN.
Provide fine details required to determine what an object is.

Peripheral retina cells → magnocellular (M) layers of the LGN.


Provide coarse information as to where an object is.

Their receptive fields are round.

Figure: receptive fields in the retina (ON/OFF, OFF/ON), in the LGN (ON/OFF, OFF/ON), and in V1
(simple 2-lobe, simple 3-lobe).
Links with human vision 151


Primary visual cortex (V1)
V1 has a two-dimensional structure mirroring the structure of the
image in the retina.

It contains three cell types [Hubel & Wiesel, 1981 Nobel Prize]:
 layer 4 cells have round receptive fields like retina/LGN cells,
 simple cells have elongated, localized receptive fields,
 complex cells also respond to elongated features, but are
invariant to small shifts in the position of the feature.

Links with human vision 152
Simple cells

Several retina cells, whose receptive fields lie along a common line,
converge onto a simple cell.

The corresponding weights are well modeled by a Gabor function
with specific:
 spatial position
 spatial orientation
 spatial scale along both axes
 spatial frequency
 spatial phase
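A minimal NumPy sketch of such a Gabor function (one common parameterization; exact conventions
vary across references):

import numpy as np

def gabor(size=32, x0=0.0, y0=0.0, theta=0.0, sigma_x=4.0, sigma_y=8.0,
          frequency=0.25, phase=0.0):
    # Gaussian envelope (position x0, y0; scales sigma_x, sigma_y along both axes)
    # multiplied by a sinusoid (orientation theta, spatial frequency, phase).
    y, x = np.meshgrid(np.arange(size) - size / 2, np.arange(size) - size / 2,
                       indexing='ij')
    xr = (x - x0) * np.cos(theta) + (y - y0) * np.sin(theta)
    yr = -(x - x0) * np.sin(theta) + (y - y0) * np.cos(theta)
    envelope = np.exp(-(xr**2 / (2 * sigma_x**2) + yr**2 / (2 * sigma_y**2)))
    carrier = np.cos(2 * np.pi * frequency * xr + phase)
    return envelope * carrier

filters = [gabor(theta=t) for t in (0, np.pi / 4, np.pi / 2)]  # three orientations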

Links with human vision 153


Simple cells
Varying spatial position and orientation

Links with human vision 154


Simple cells
Varying spatial scale

Links with human vision 155


Simple cells
Varying spatial frequency and phase

Links with human vision 156


Complex cells

Several simple cells of the same orientation converge onto a complex cell.
Links with human vision 157
Link with artificial CNNs

Simple cell ↔ Convolution by Gabor atom + nonlinear activation

Complex cell ↔ Intra/cross-channel pooling
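A minimal sketch of this analogy, reusing the gabor() function sketched earlier (toy image;
SciPy's convolve2d is assumed to be available):

import numpy as np
from scipy.signal import convolve2d

image = np.random.randn(64, 64)   # toy input image

# Simple cells: convolution with a Gabor atom + nonlinear activation (here, ReLU),
# for two phases of the same orientation.
simple = [np.maximum(convolve2d(image, gabor(phase=p), mode='same'), 0)
          for p in (0, np.pi / 2)]

# Complex cell: pooling across the simple-cell channels, then over a spatial
# neighborhood, giving a response invariant to the exact phase and position.
complex_map = np.maximum(*simple)             # cross-channel pooling
complex_cell = complex_map[:32, :32].max()    # intra-channel (spatial) pooling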

Links with human vision 158


Secondary visual cortex (V2) and color center (V4)

V2 cells:
 are tuned to simple properties similar to V1: spatial
orientation, frequency, color, etc.
 but also: figure vs. ground, multiple orientations in different
regions of a receptive field, some attentional modulation.

V4 cells:
 are tuned to simple properties similar to V2,
 but also: object features of intermediate complexity, like
simple geometric shapes, strong attentional modulation.

Detect more and more complex features, similarly to moving up the


layers of a CNN.

Links with human vision 159


Inferotemporal (IT) cortex

Two types of cells found in the


IT cortex:

 primary cells respond to


slits, spots, ellipses,
squares

 elaborate cells respond to


specific complex shapes
(influenced by color,
texture)

Links with human vision 160


Inferotemporal (IT) cortex
Fusiform face area of the IT cortex: cells responsive to faces.

Links with human vision 161


Link with artificial CNNs
IT = closest analog to a CNN’s last layer of features.

Evidence from single-site IT firing rates.

Figure: predictions of single-site IT responses (mean firing rate 70-170 ms after image onset)
from layer 4 of the HMO 1.0 model, for images of animals, boats, cars, chairs, faces, fruits,
planes and tables that were never previously seen by the model (e.g., unit 1: r² = 0.48).

Source: Yamins, Daniel LK, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and
James J. DiCarlo. "Performance-optimized hierarchical models predict neural responses in higher
visual cortex." Proceedings of the National Academy of Sciences 111, no. 23 (2014): 8619-8624.

Links with human vision 162
Link with artificial CNNs

Better performing deep CNNs also better predict the patterns of IT neural responses.

Evidence from representation dissimilarity matrices.

Figure: representation dissimilarity matrices (relative distance, from 0.0 to 1.0) over images
of animals, cars, chairs, faces, fruits, planes and tables, for the V4 and IT neural
representations and for model representations (HMO, Krizhevsky et al. 2012, Zeiler & Fergus
2013, each with an IT fit).

Source: Cadieu, Charles F., Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A.
Solomon, Najib J. Majaj, and James J. DiCarlo. "Deep neural networks rival the representation of
primate IT cortex for core visual object recognition." PLoS Comput Biol 10, no. 12 (2014):
e1003963.

Links with human vision 163
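For reference, a representation dissimilarity matrix (RDM) can be computed from any set of
responses with a few lines of NumPy/SciPy (a sketch on random data, not the actual analysis of
Cadieu et al.):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy setting: responses of n_units units (neurons or CNN features) to n_images images.
rng = np.random.default_rng(0)
n_images, n_units = 64, 100
responses = rng.standard_normal((n_images, n_units))

# RDM: pairwise dissimilarity (here, correlation distance) between the response
# patterns evoked by every pair of images.
rdm = squareform(pdist(responses, metric='correlation'))

# Two representations (e.g., IT responses and CNN features) can then be compared
# by correlating the entries of their RDMs.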
Remaining differences with artificial CNNs

human vision                               artificial CNN

low-resolution peripheral retina cells     high-resolution image
continuous eye movement (saccades)         still image
integration with other senses              (often) images only
biological detection and pooling           ReLU, max pooling
top-down feedback                          bottom-up processing
unknown learning mechanism                 SGD

Links with human vision 164
