Unit 3

The lecture discusses the limitations of linear classifiers, particularly in relation to non-linearly separable functions like XOR, and introduces multilayer perceptrons as a solution. It explains the architecture of neural networks, including the use of activation functions and the concept of feature learning. Finally, it outlines the backpropagation algorithm as a method for learning in neural networks, emphasizing its role in computing gradients for optimization.


CS 4995 Lecture 2:

Multilayer Perceptrons & Backpropagation

Richard Zemel

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 1 / 45


Limits of Linear Classification

Single neurons (linear classifiers) are very limited in expressive power.


XOR is a classic example of a function that’s not linearly separable.

There’s an elegant proof using convexity.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 2 / 45


Limits of Linear Classification
Convex Sets

A set S is convex if any line segment connecting points in S lies entirely within S. Mathematically,

x1, x2 ∈ S  =⇒  λx1 + (1 − λ)x2 ∈ S  for 0 ≤ λ ≤ 1.

A simple inductive argument shows that for x1, . . . , xN ∈ S, weighted averages, or convex combinations, lie within the set:

λ1 x1 + · · · + λN xN ∈ S  for λi > 0,  λ1 + · · · + λN = 1.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 3 / 45


Limits of Linear Classification

Showing that XOR is not linearly separable


Half-spaces are obviously convex.
Suppose there were some feasible hypothesis. If the positive examples (0, 1) and (1, 0) are in the positive half-space, then the line segment connecting them must be as well.
Similarly, the line segment connecting the negative examples (0, 0) and (1, 1) must lie within the negative half-space.

But the two segments intersect at (0.5, 0.5), which can't lie in both half-spaces. Contradiction!

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 4 / 45
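
As a quick sanity check on the argument (this snippet is an illustrative addition, not from the slides), one can search a coarse grid of linear classifiers and confirm that none of them labels all four XOR points correctly. A finite grid is of course not a proof; the convexity argument above is.

```python
# Grid-search sanity check: no linear classifier y = 1[w1*x1 + w2*x2 + b > 0]
# on this grid labels all four XOR points correctly.
import itertools

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 1, 1, 0])  # XOR targets

grid = np.linspace(-2, 2, 41)
separable = False
for w1, w2, b in itertools.product(grid, grid, grid):
    y = (w1 * X[:, 0] + w2 * X[:, 1] + b > 0).astype(int)
    if np.array_equal(y, t):
        separable = True
        break

print("found a separating linear classifier:", separable)  # False
```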


Limits of Linear Classification
A more troubling example

Suppose we just use the pixels as the features. Can a binary threshold unit discriminate between different patterns that have the same number of on pixels?
Not if the patterns can translate with wrap-around!

[Figure: several translated copies of pattern A and pattern B. These images represent 16-dimensional vectors. White = 0, black = 1.]

We want to distinguish patterns A and B in all possible translations (with wrap-around).
Translation invariance is commonly desired in vision!

Suppose there's a feasible solution. The average of all translations of A is the vector (0.25, 0.25, . . . , 0.25). Therefore, this point must be classified as A.
Similarly, the average of all translations of B is also (0.25, 0.25, . . . , 0.25). Therefore, it must be classified as B. Contradiction!

Credit: Geoffrey Hinton


Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 5 / 45
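
The averaging step is easy to verify numerically. The actual patterns A and B from the figure aren't reproduced in this transcript, so the sketch below uses arbitrary 16-pixel patterns with four "on" pixels each; the average over all wrap-around translations comes out to the same constant vector regardless of the pattern.

```python
# The average of a binary pattern over all circular shifts is constant
# (here 4 on-pixels / 16 positions = 0.25 everywhere).
import numpy as np

def average_over_translations(pattern):
    """Average a 1-D binary pattern over all circular shifts (wrap-around)."""
    shifts = [np.roll(pattern, k) for k in range(len(pattern))]
    return np.mean(shifts, axis=0)

A = np.zeros(16); A[[0, 1, 2, 3]] = 1      # hypothetical pattern A
B = np.zeros(16); B[[0, 2, 5, 11]] = 1     # hypothetical pattern B

print(average_over_translations(A))  # [0.25 0.25 ... 0.25]
print(average_over_translations(B))  # [0.25 0.25 ... 0.25]
```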
Limits of Linear Classification
Sometimes we can overcome this limitation using feature maps, just like for linear regression. E.g., for XOR:

ψ(x) = [x1, x2, x1 x2]⊤

x1  x2  ψ1(x)  ψ2(x)  ψ3(x)  t
0   0   0      0      0      0
0   1   0      1      0      1
1   0   1      0      0      1
1   1   1      1      1      0

This is linearly separable. (Try it!)

Not a general solution: it can be hard to pick good basis functions.
Instead, we'll use neural nets to learn nonlinear hypotheses directly.
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 6 / 45
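
To "try it": here is one concrete choice of weights in the feature space that computes XOR. The specific values (1, 1, −2) and bias −0.5 are my own illustration, not from the slide.

```python
# A linear classifier in the feature space psi(x) = (x1, x2, x1*x2) that
# computes XOR. The weights below are just one choice that works.
import numpy as np

def psi(x1, x2):
    return np.array([x1, x2, x1 * x2])

w = np.array([1.0, 1.0, -2.0])
b = -0.5

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = int(w @ psi(x1, x2) + b > 0)
    print(x1, x2, "->", y)   # 0, 1, 1, 0: XOR
```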
Multilayer Perceptrons

We can connect lots of units together into a directed acyclic graph.
This gives a feed-forward neural network. That's in contrast to recurrent neural networks, which can have cycles. (We'll talk about those later.)
Typically, units are grouped together into layers.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 7 / 45


Multilayer Perceptrons

Each layer connects N input units to M output units.


In the simplest case, all input units are connected to all output units. We call this
a fully connected layer. We’ll consider other layer types later.
Note: the inputs and outputs for a layer are distinct from the inputs and outputs
to the network.

Recall from softmax regression: this means we need an M × N weight matrix.
The output units are a function of the input units:

y = f(x) = φ(Wx + b)

A multilayer network consisting of fully connected layers is called a multilayer perceptron. Despite the name, it has nothing to do with perceptrons!

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 8 / 45
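
A minimal sketch of such a layer, assuming a logistic activation purely as an example (the class name and initialization scale are my own):

```python
# Minimal fully connected layer: y = phi(Wx + b), with W of shape (M, N).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class FullyConnectedLayer:
    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W = 0.1 * rng.standard_normal((n_out, n_in))  # M x N weight matrix
        self.b = np.zeros(n_out)

    def __call__(self, x):
        return sigmoid(self.W @ x + self.b)

layer = FullyConnectedLayer(n_in=4, n_out=3)
print(layer(np.ones(4)).shape)  # (3,)
```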


Multilayer Perceptrons

Some activation functions:

Linear:  y = z
Rectified Linear Unit (ReLU):  y = max(0, z)
Soft ReLU:  y = log(1 + e^z)

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 9 / 45


Multilayer Perceptrons

Some activation functions:

Hard Threshold:  y = 1 if z > 0,  y = 0 if z ≤ 0
Logistic:  y = 1 / (1 + e^{−z})
Hyperbolic Tangent (tanh):  y = (e^z − e^{−z}) / (e^z + e^{−z})

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 10 / 45
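
The six activation functions from the two slides above, written out as NumPy functions (a straightforward transcription; the function names are my own):

```python
# Activation functions from the slides, as NumPy functions.
import numpy as np

def linear(z):          return z
def relu(z):            return np.maximum(0, z)
def soft_relu(z):       return np.log1p(np.exp(z))        # log(1 + e^z)
def hard_threshold(z):  return (z > 0).astype(float)      # 1 if z > 0 else 0
def logistic(z):        return 1.0 / (1.0 + np.exp(-z))
def tanh(z):            return np.tanh(z)                 # (e^z - e^-z)/(e^z + e^-z)

z = np.linspace(-3, 3, 7)
print(relu(z))
```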


Multilayer Perceptrons

Designing a network to compute XOR:

Assume hard threshold activation function

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 11 / 45


Multilayer Perceptrons

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 12 / 45
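
The network shown on this slide appears only as a figure and isn't reproduced in this transcript. Below is one standard two-layer hard-threshold construction for XOR; this is an assumption on my part and not necessarily the exact weights in the figure.

```python
# One standard hard-threshold network for XOR:
#   h1 = step(x1 + x2 - 0.5)   # "at least one input on"  (OR)
#   h2 = step(x1 + x2 - 1.5)   # "both inputs on"         (AND)
#   y  = step(h1 - h2 - 0.5)   # OR and not AND = XOR
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return step(h1 - h2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0
```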


Multilayer Perceptrons

Each layer computes a function, so the network computes a composition of functions:

h(1) = f (1)(x)
h(2) = f (2)(h(1))
...
y = f (L)(h(L−1))

Or more simply:

y = f (L) ◦ · · · ◦ f (1)(x).

Neural nets provide modularity: we can implement each layer's computations as a black box.
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 13 / 45
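
As a tiny illustration of this modularity (the layer functions below are arbitrary stand-ins for φ(Wx + b), and the `compose` helper is my own):

```python
# The network as a composition of per-layer functions, y = f3(f2(f1(x))).
from functools import reduce

import numpy as np

f1 = lambda x: np.tanh(x)
f2 = lambda h: np.maximum(0, h - 0.1)
f3 = lambda h: h.sum()

def compose(*fs):
    """Compose functions left to right: compose(f1, f2, f3)(x) = f3(f2(f1(x)))."""
    return reduce(lambda f, g: lambda x: g(f(x)), fs)

network = compose(f1, f2, f3)
print(network(np.array([0.5, -1.0, 2.0])))
```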
Feature Learning
Neural nets can be viewed as a way of learning features:

The goal:

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 14 / 45


Feature Learning

Input representation of a digit: a 784-dimensional vector.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 15 / 45


Feature Learning

Each first-layer hidden unit computes σ(wi⊤ x).

Here is one of the weight vectors wi (also called a feature).
It's reshaped into an image, with gray = 0, white = +, black = −.
To compute wi⊤ x, multiply the corresponding pixels of wi and x, and sum the results.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 16 / 45


Feature Learning

There are 256 first-level features total. Here are some of them.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 17 / 45


Expressive Power

We’ve seen that there are some functions that linear classifiers can’t
represent. Are deep networks any better?
Any sequence of linear layers can be equivalently represented with a single linear layer:

y = W(3) W(2) W(1) x = W′ x,   where W′ ≜ W(3) W(2) W(1).

Deep linear networks are no more expressive than linear regression!


Linear layers do have their uses — stay tuned!

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 18 / 45
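
A quick numerical confirmation of the collapse (the random matrices and shapes below are illustrative):

```python
# A stack of linear layers is itself linear: W3 @ W2 @ W1 collapses into a
# single matrix W_prime, so the deep linear net and the one-layer linear
# model compute exactly the same function.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 4))
W2 = rng.standard_normal((3, 5))
W3 = rng.standard_normal((2, 3))
x = rng.standard_normal(4)

deep = W3 @ (W2 @ (W1 @ x))
W_prime = W3 @ W2 @ W1
print(np.allclose(deep, W_prime @ x))  # True
```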


Expressive Power

Multilayer feed-forward neural nets with nonlinear activation functions are universal approximators: they can approximate any function arbitrarily well.
This has been shown for various activation functions (thresholds, logistic, ReLU, etc.).
Even though ReLU is “almost” linear, it’s nonlinear enough!

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 19 / 45


Expressive Power
Universality for binary inputs and targets:
Hard threshold hidden units, linear output
Strategy: 2^D hidden units, each of which responds to one particular input configuration.

Only requires one hidden layer, though it needs to be extremely wide!


Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 20 / 45
Expressive Power

What about the logistic activation function?


You can approximate a hard threshold by scaling up the weights and biases:

[Figure: plots of y = σ(x) and y = σ(5x); the latter is much closer to a step function.]
This is good: logistic units are differentiable, so we can tune them
with gradient descent. (Stay tuned!)

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 21 / 45
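
Numerically, σ(kz) approaches the hard threshold as the scale k grows (the test points and scales below are arbitrary):

```python
# As we scale up weights and biases, the logistic sigma(k*z) approaches the
# hard threshold.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-1.0, -0.1, 0.1, 1.0])
for k in [1, 5, 50]:
    print(k, np.round(sigmoid(k * z), 3))
# At k = 50 the values are already very close to the step function [0, 0, 1, 1].
```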


Expressive Power

Limits of universality
You may need to represent an exponentially large network.
If you can learn any function, you’ll just overfit.
Really, we desire a compact representation!
We’ve derived units which compute the functions AND, OR, and
NOT. Therefore, any Boolean circuit can be translated into a
feed-forward neural net.
This suggests you might be able to learn compact representations of
some complicated functions

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 22 / 45


Overview

We’ve seen that multilayer neural networks are powerful. But how can
we actually learn them?
Backpropagation is the central algorithm in this course.
It’s an algorithm for computing gradients.
Really it’s an instance of reverse mode automatic differentiation, which
is much more broadly applicable than just neural nets.
This is “just” a clever and efficient use of the Chain Rule for derivatives.
We’ll see how to implement an automatic differentiation system next
week.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 23 / 45


Recap: Gradient Descent
Recall: gradient descent moves opposite the gradient (the direction of
steepest descent)

Weight space for a multilayer neural net: one coordinate for each weight or
bias of the network, in all the layers
Conceptually, not any different from what we’ve seen so far — just higher
dimensional and harder to visualize!
We want to compute the cost gradient dJ/dw, which is the vector of
partial derivatives.
This is the average of dL/dw over all the training examples, so in this
lecture we focus on computing dL/dw.
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 24 / 45
Univariate Chain Rule

We’ve already been using the univariate Chain Rule.


Recall: if f (x) and x(t) are univariate functions, then

d/dt f (x(t)) = (df/dx) (dx/dt).

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 25 / 45


Univariate Chain Rule

Recall: Univariate logistic least squares model

z = wx + b
y = σ(z)
L = (1/2) (y − t)²

Let’s compute the loss derivatives.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 26 / 45


Univariate Chain Rule

How you would have done it in calculus class


L = (1/2) (σ(wx + b) − t)²

∂L/∂w = ∂/∂w [ (1/2) (σ(wx + b) − t)² ]
      = (1/2) ∂/∂w (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂w (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂w (wx + b)
      = (σ(wx + b) − t) σ′(wx + b) x

∂L/∂b = ∂/∂b [ (1/2) (σ(wx + b) − t)² ]
      = (1/2) ∂/∂b (σ(wx + b) − t)²
      = (σ(wx + b) − t) ∂/∂b (σ(wx + b) − t)
      = (σ(wx + b) − t) σ′(wx + b) ∂/∂b (wx + b)
      = (σ(wx + b) − t) σ′(wx + b)

What are the disadvantages of this approach?

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 27 / 45


Univariate Chain Rule

A more structured way to do it

Computing the loss:

z = wx + b
y = σ(z)
L = (1/2) (y − t)²

Computing the derivatives:

dL/dy = y − t
dL/dz = (dL/dy) σ′(z)
∂L/∂w = (dL/dz) x
∂L/∂b = dL/dz

Remember, the goal isn't to obtain closed-form solutions, but to be able to write a program that efficiently computes the derivatives.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 28 / 45


Univariate Chain Rule

We can diagram out the computations using a computation graph.


The nodes represent all the inputs and computed quantities, and the
edges represent which nodes are computed directly as a function of
which other nodes.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 29 / 45


Univariate Chain Rule

A slightly more convenient notation:

Use ȳ to denote the derivative dL/dy, sometimes called the error signal.
This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).
This is not a standard notation, but I couldn't find another one that I liked.

Computing the loss:

z = wx + b
y = σ(z)
L = (1/2) (y − t)²

Computing the derivatives:

ȳ = y − t
z̄ = ȳ σ′(z)
w̄ = z̄ x
b̄ = z̄

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 30 / 45
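
These equations translate directly into a program. The sketch below is a straightforward transcription; variable names ending in `_bar` stand for the barred error signals, and the test inputs are arbitrary.

```python
# Forward and backward pass for univariate logistic least squares,
# following the error-signal equations on the slide.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(w, b, x, t):
    # Forward pass
    z = w * x + b
    y = sigmoid(z)
    L = 0.5 * (y - t) ** 2
    # Backward pass
    y_bar = y - t
    z_bar = y_bar * y * (1 - y)   # sigma'(z) = sigma(z)(1 - sigma(z)) = y(1 - y)
    w_bar = z_bar * x
    b_bar = z_bar
    return L, w_bar, b_bar

L, w_bar, b_bar = forward_backward(w=0.3, b=-0.2, x=1.5, t=1.0)
print(L, w_bar, b_bar)
```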


Multivariate Chain Rule
Problem: what if the computation graph has fan-out > 1?
This requires the multivariate Chain Rule!

L2-regularized regression:

z = wx + b
y = σ(z)
L = (1/2) (y − t)²
R = (1/2) w²
Lreg = L + λR

Multiclass logistic regression:

zℓ = Σj wℓj xj + bℓ
yk = e^{zk} / Σℓ e^{zℓ}
L = − Σk tk log yk
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 31 / 45
Multivariate Chain Rule
Suppose we have a function f (x, y) and functions x(t) and y(t). (All the variables here are scalar-valued.) Then

d/dt f (x(t), y(t)) = (∂f/∂x) (dx/dt) + (∂f/∂y) (dy/dt)

Example:

f (x, y) = y + e^{xy}
x(t) = cos t
y(t) = t²

Plug in to Chain Rule:

df/dt = (∂f/∂x) (dx/dt) + (∂f/∂y) (dy/dt)
      = (y e^{xy}) · (− sin t) + (1 + x e^{xy}) · 2t

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 32 / 45
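
The analytic derivative can be sanity-checked against a finite difference (the test point t = 1.0 below is arbitrary):

```python
# Check df/dt = (y*e^{xy})(-sin t) + (1 + x*e^{xy})(2t) against a centered
# finite difference at an arbitrary point.
import numpy as np

def f_of_t(t):
    x, y = np.cos(t), t ** 2
    return y + np.exp(x * y)

def dfdt_analytic(t):
    x, y = np.cos(t), t ** 2
    return y * np.exp(x * y) * (-np.sin(t)) + (1 + x * np.exp(x * y)) * 2 * t

t0, eps = 1.0, 1e-5
numeric = (f_of_t(t0 + eps) - f_of_t(t0 - eps)) / (2 * eps)
print(dfdt_analytic(t0), numeric)   # the two should agree to several decimals
```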


Multivariable Chain Rule

In the context of backpropagation:

In our notation:
t̄ = x̄ (dx/dt) + ȳ (dy/dt)

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 33 / 45


Backpropagation

Full backpropagation algorithm:


Let v1 , . . . , vN be a topological ordering of the computation graph
(i.e. parents come before children.)
vN denotes the variable we’re trying to compute derivatives of (e.g. loss).

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 34 / 45
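
The algorithm itself appears as a figure on this slide and isn't reproduced in this transcript. The sketch below paraphrases its standard message-passing form; the small `Node` class is my own scaffolding, not part of the lecture.

```python
# Sketch of backpropagation over a topologically ordered computation graph
# v1, ..., vN, where vN is the loss.

class Node:
    def __init__(self, name, parents=(), fn=None, local_grads=None):
        self.name = name
        self.parents = list(parents)
        self.fn = fn                    # computes value from parent values (None for leaves)
        self.local_grads = local_grads  # returns dv/dparent for each parent
        self.value = None
        self.bar = 0.0

def backprop(nodes, inputs):
    # Forward pass: nodes are in topological order; leaves read from `inputs`.
    for v in nodes:
        if v.fn is None:
            v.value = inputs[v.name]
        else:
            v.value = v.fn(*[p.value for p in v.parents])
    # Backward pass: each v_bar accumulates sum over children of child_bar * d(child)/dv.
    for v in nodes:
        v.bar = 0.0
    nodes[-1].bar = 1.0                 # d(loss)/d(loss) = 1
    for v in reversed(nodes):
        if v.fn is None:
            continue
        grads = v.local_grads(*[p.value for p in v.parents])
        for p, g in zip(v.parents, grads):
            p.bar += v.bar * g
    return {v.name: v.bar for v in nodes}

# Example: L = (w*x - t)^2 with x = 2, w = 3, t = 1  ->  dL/dw = 2*(wx - t)*x = 20
x = Node("x"); w = Node("w"); t = Node("t")
z = Node("z", [w, x], fn=lambda w_, x_: w_ * x_,
         local_grads=lambda w_, x_: (x_, w_))
L = Node("L", [z, t], fn=lambda z_, t_: (z_ - t_) ** 2,
         local_grads=lambda z_, t_: (2 * (z_ - t_), -2 * (z_ - t_)))
print(backprop([x, w, t, z, L], {"x": 2.0, "w": 3.0, "t": 1.0})["w"])  # 20.0
```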


Backpropagation
Example: univariate logistic least squares regression

Forward pass:

z = wx + b
y = σ(z)
L = (1/2) (y − t)²
R = (1/2) w²
Lreg = L + λR

Backward pass:

L̄reg = 1
R̄ = L̄reg (dLreg/dR) = L̄reg λ
L̄ = L̄reg (dLreg/dL) = L̄reg
ȳ = L̄ (dL/dy) = L̄ (y − t)
z̄ = ȳ (dy/dz) = ȳ σ′(z)
w̄ = z̄ (∂z/∂w) + R̄ (dR/dw) = z̄ x + R̄ w
b̄ = z̄ (∂z/∂b) = z̄

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 35 / 45
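
A direct transcription of this forward/backward pass into code, with a finite-difference check on w̄ (the value of λ and the inputs below are arbitrary):

```python
# Forward and backward pass for L2-regularized univariate logistic least
# squares, following the slide's equations.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, x, t, lam):
    z = w * x + b
    y = sigmoid(z)
    L = 0.5 * (y - t) ** 2
    R = 0.5 * w ** 2
    return L + lam * R

def backward(w, b, x, t, lam):
    z = w * x + b
    y = sigmoid(z)
    L_reg_bar = 1.0
    R_bar = L_reg_bar * lam
    L_bar = L_reg_bar
    y_bar = L_bar * (y - t)
    z_bar = y_bar * y * (1 - y)          # sigma'(z) = y(1 - y)
    w_bar = z_bar * x + R_bar * w
    b_bar = z_bar
    return w_bar, b_bar

w, b, x, t, lam = 0.3, -0.2, 1.5, 1.0, 0.1
eps = 1e-6
fd = (forward(w + eps, b, x, t, lam) - forward(w - eps, b, x, t, lam)) / (2 * eps)
print(backward(w, b, x, t, lam)[0], fd)   # the two should agree closely
```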


Backpropagation
Multilayer Perceptron (multiple outputs):

Forward pass:

zi = Σj wij(1) xj + bi(1)
hi = σ(zi)
yk = Σi wki(2) hi + bk(2)
L = (1/2) Σk (yk − tk)²

Backward pass:

L̄ = 1
ȳk = L̄ (yk − tk)
w̄ki(2) = ȳk hi
b̄k(2) = ȳk
h̄i = Σk ȳk wki(2)
z̄i = h̄i σ′(zi)
w̄ij(1) = z̄i xj
b̄i(1) = z̄i

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 36 / 45


Vector Form

Computation graphs showing individual units are cumbersome.


As you might have guessed, we typically draw graphs over the
vectorized variables.

We pass messages back analogous to the ones for scalar-valued nodes.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 37 / 45


Vector Form
Consider this computation graph:

Backprop rules:

z̄j = Σk ȳk (∂yk/∂zj),   or in vector form:   z̄ = (∂y/∂z)⊤ ȳ,

where ∂y/∂z is the Jacobian matrix:

∂y/∂z =
  [ ∂y1/∂z1  · · ·  ∂y1/∂zn ]
  [    ⋮       ⋱       ⋮    ]
  [ ∂ym/∂z1  · · ·  ∂ym/∂zn ]

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 38 / 45


Vector Form

Examples:

Matrix-vector product:

z = Wx,   ∂z/∂x = W,   x̄ = W⊤ z̄

Elementwise operations:

y = exp(z),   ∂y/∂z = diag(exp(z))  (a diagonal matrix with entries exp(z1), . . . , exp(zD)),   z̄ = exp(z) ◦ ȳ

Note: we never explicitly construct the Jacobian. It's usually simpler and more efficient to compute the vector-Jacobian product (VJP) directly.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 39 / 45
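
The two VJP rules above in code. For the elementwise case, the explicit (diagonal) Jacobian is built only to verify the shortcut; in practice it is never formed.

```python
# Vector-Jacobian products for the two examples above.
import numpy as np

rng = np.random.default_rng(0)

# Matrix-vector product: z = Wx  =>  x_bar = W^T z_bar
W = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
z_bar = rng.standard_normal(3)
x_bar = W.T @ z_bar
print(x_bar.shape)  # (4,) -- same shape as x

# Elementwise exp: y = exp(z)  =>  z_bar = exp(z) * y_bar
z = rng.standard_normal(5)
y_bar = rng.standard_normal(5)
z_bar_fast = np.exp(z) * y_bar                # no Jacobian constructed
J = np.diag(np.exp(z))                        # explicit Jacobian, only for checking
print(np.allclose(z_bar_fast, J.T @ y_bar))   # True
```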


Vector Form

Full backpropagation algorithm (vector form):


Let v1 , . . . , vN be a topological ordering of the computation graph
(i.e. parents come before children.)
vN denotes the variable we’re trying to compute derivatives of (e.g. loss).
It’s a scalar, which we can treat as a 1-D vector.

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 40 / 45


Vector Form

MLP example in vectorized form:

Forward pass:

z = W(1) x + b(1)
h = σ(z)
y = W(2) h + b(2)
L = (1/2) ‖t − y‖²

Backward pass:

L̄ = 1
ȳ = L̄ (y − t)
W̄(2) = ȳ h⊤
b̄(2) = ȳ
h̄ = W(2)⊤ ȳ
z̄ = h̄ ◦ σ′(z)
W̄(1) = z̄ x⊤
b̄(1) = z̄
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 41 / 45
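
The vectorized equations map almost line-for-line onto NumPy. This is a sketch; the shapes, function name, and the finite-difference check are my own additions.

```python
# Vectorized forward/backward pass for the two-layer MLP above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_loss_and_grads(W1, b1, W2, b2, x, t):
    # Forward pass
    z = W1 @ x + b1
    h = sigmoid(z)
    y = W2 @ h + b2
    L = 0.5 * np.sum((t - y) ** 2)
    # Backward pass
    y_bar = y - t
    W2_bar = np.outer(y_bar, h)
    b2_bar = y_bar
    h_bar = W2.T @ y_bar
    z_bar = h_bar * h * (1 - h)      # sigma'(z) = h(1 - h)
    W1_bar = np.outer(z_bar, x)
    b1_bar = z_bar
    return L, (W1_bar, b1_bar, W2_bar, b2_bar)

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
x, t = rng.standard_normal(3), rng.standard_normal(2)

L, grads = mlp_loss_and_grads(W1, b1, W2, b2, x, t)

# Finite-difference check on one weight, W1[0, 0]:
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
fd = (mlp_loss_and_grads(Wp, b1, W2, b2, x, t)[0]
      - mlp_loss_and_grads(Wm, b1, W2, b2, x, t)[0]) / (2 * eps)
print(grads[0][0, 0], fd)   # should agree closely
```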
Computational Cost
Computational cost of forward pass: one add-multiply operation per weight:

zi = Σj wij(1) xj + bi(1)

Computational cost of backward pass: two add-multiply operations per weight:

w̄ki(2) = ȳk hi
h̄i = Σk ȳk wki(2)

Rule of thumb: the backward pass is about as expensive as two forward passes.
For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer.
Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 42 / 45
Closing Thoughts

Backprop is used to train the overwhelming majority of neural nets today.


Even optimization algorithms much fancier than gradient descent
(e.g. second-order methods) use backprop to compute the gradients.
Despite its practical success, backprop is believed to be neurally implausible.
No evidence for biological signals analogous to error derivatives.
All the biologically plausible alternatives we know about learn much
more slowly (on computers).
So how on earth does the brain learn?

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 43 / 45


Closing Thoughts

The psychological profiling [of a programmer] is mostly the ability to shift levels of abstraction, from low level to high level. To see something in the small and to see something in the large.

– Don Knuth
By now, we’ve seen three different ways of looking at gradients:
Geometric: visualization of gradient in weight space
Algebraic: mechanics of computing the derivatives
Implementational: efficient implementation on the computer
When thinking about neural nets, it’s important to be able to shift
between these different perspectives!

Richard Zemel CS 4995 Lecture 2: Multilayer Perceptrons & Backpropagation 44 / 45
