Chapter 20: Neural Networks
Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer networks
♦ Applications of neural networks
Brains
10^11 neurons of > 20 types, 10^14 synapses, 1 ms–10 ms cycle time
Signals are noisy “spike trains” of electrical potential
[Figure: schematic neuron, showing the nucleus, dendrites, axon, axonal arborization, and synapses]
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
    ai ← g(ini) = g(Σj Wj,i aj)

[Figure: unit i sums its weighted inputs Wj,i aj, plus a bias input a0 = −1 with weight W0,i, to form ini, then applies the activation function g to produce the output ai = g(ini)]
Activation functions
[Figure: two common activation functions g(ini): (a) a hard threshold (step) function, (b) a sigmoid, both saturating at +1]
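As a concrete illustration, here is a minimal Python sketch of such a unit with both activation functions; the names (step, sigmoid, unit_output) and the example weights are my own, not from the text:

    import math

    def step(in_i):
        """Threshold activation: outputs 1 iff the weighted input sum is non-negative."""
        return 1.0 if in_i >= 0 else 0.0

    def sigmoid(in_i):
        """Smooth "squashing" activation g(in_i) = 1 / (1 + e^-in_i)."""
        return 1.0 / (1.0 + math.exp(-in_i))

    def unit_output(weights, inputs, g=sigmoid):
        """McCulloch-Pitts unit: a_i = g(sum_j W_j,i * a_j).
        By convention inputs[0] is the bias input a_0 = -1 and weights[0] is W_0,i."""
        in_i = sum(w * a for w, a in zip(weights, inputs))
        return g(in_i)

    # Example: a unit with bias weight 0.5 and two ordinary inputs.
    print(unit_output([0.5, 1.0, 1.0], [-1.0, 0.0, 1.0], g=step))     # 1.0
    print(unit_output([0.5, 1.0, 1.0], [-1.0, 0.0, 1.0], g=sigmoid))  # ≈ 0.62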
Implementing logical functions
[Figure: threshold units implementing AND (W1 = 1, W2 = 1), OR (W1 = 1, W2 = 1), and NOT (W1 = −1); the bias weights, not shown here, set each unit's threshold]
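A short runnable sketch of these three units, using a step-function unit with a bias input a0 = −1; the particular bias weights (1.5, 0.5, −0.5) are one standard choice consistent with the input weights in the figure, not values given in the text:

    def threshold_unit(weights, inputs):
        """Step-function unit with bias input a_0 = -1: output 1 iff the weighted sum >= 0."""
        in_i = sum(w * a for w, a in zip(weights, [-1.0] + list(inputs)))
        return 1 if in_i >= 0 else 0

    # Bias weights chosen so each threshold falls in the right place (one common choice).
    AND = lambda x1, x2: threshold_unit([1.5, 1, 1], [x1, x2])   # W0 = 1.5, W1 = W2 = 1
    OR  = lambda x1, x2: threshold_unit([0.5, 1, 1], [x1, x2])   # W0 = 0.5, W1 = W2 = 1
    NOT = lambda x1:     threshold_unit([-0.5, -1], [x1])        # W0 = -0.5, W1 = -1

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, AND(x1, x2), OR(x1, x2))
    print(NOT(0), NOT(1))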
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer networks
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– Hopfield networks have symmetric weights (Wi,j = Wj,i)
g(x) = sign(x), ai = ± 1; holographic associative memory
– Boltzmann machines use stochastic activation functions,
≈ MCMC in BNs
– recurrent neural nets have directed cycles with delays
⇒ have internal state (like flip-flops), can oscillate etc.
Feed-forward example
[Figure: feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5, connected by weights W1,3, W1,4, W2,3, W2,4, W3,5, W4,5]
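The network in the figure computes a composed, parameterized function of its inputs:

    a5 = g(W3,5 · a3 + W4,5 · a4)
       = g(W3,5 · g(W1,3 · a1 + W2,3 · a2) + W4,5 · g(W1,4 · a1 + W2,4 · a2))

Adjusting the weights changes this function; that is what learning does. A minimal Python sketch of the forward computation, with illustrative weight values of my own choosing and bias inputs omitted for brevity:

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))  # sigmoid activation

    def feedforward(a1, a2, W):
        """Network of the figure: inputs 1, 2 -> hidden units 3, 4 -> output unit 5."""
        a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
        a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
        a5 = g(W[(3, 5)] * a3 + W[(4, 5)] * a4)
        return a5

    # Illustrative weight values (not from the text); changing them changes the function.
    W = {(1, 3): 0.4, (2, 3): -0.6, (1, 4): 0.7, (2, 4): 0.1, (3, 5): 1.2, (4, 5): -0.9}
    print(feedforward(1.0, 0.0, W))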
Perceptrons
[Figure: left, a single-layer perceptron: input units connected directly to output units by weights Wj,i; right, the output of a two-input perceptron unit plotted as a function of its inputs x1 and x2]
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
Σj Wj xj > 0 or W · x > 0
[Figure: points in input space (I1, I2) for (a) I1 and I2, (b) I1 or I2 — both linearly separable — and (c) I1 xor I2, which is not linearly separable]
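To see the XOR limitation concretely, the following illustrative brute-force search (my own construction, not from the text) looks for threshold-unit weights on a coarse grid that reproduce each Boolean function; it finds weights for AND and OR but, because no linear separator exists, necessarily fails for XOR:

    from itertools import product

    def threshold(w0, w1, w2, x1, x2):
        """Single threshold unit: output 1 iff w1*x1 + w2*x2 > w0 (i.e. W · x > 0 with bias)."""
        return 1 if w1 * x1 + w2 * x2 > w0 else 0

    def representable(target):
        """Search a coarse weight grid for weights realizing a 2-input Boolean function."""
        grid = [i / 2 for i in range(-6, 7)]          # weights in {-3.0, -2.5, ..., 3.0}
        for w0, w1, w2 in product(grid, repeat=3):
            if all(threshold(w0, w1, w2, x1, x2) == target(x1, x2)
                   for x1, x2 in product((0, 1), repeat=2)):
                return True
        return False

    print(representable(lambda a, b: a & b))   # AND: True
    print(representable(lambda a, b: a | b))   # OR:  True
    print(representable(lambda a, b: a ^ b))   # XOR: False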
Perceptron learning

Learn by adjusting weights to reduce error on the training set.

The squared error for an example with input x and true output y is

    E = (1/2) Err^2 ≡ (1/2) (y − hW(x))^2

Perform optimization search by gradient descent:

    ∂E/∂Wj = Err × ∂Err/∂Wj = Err × ∂/∂Wj (y − g(Σj=0..n Wj xj))
           = −Err × g′(in) × xj

Simple weight update rule:

    Wj ← Wj + α × Err × g′(in) × xj

E.g., +ve error ⇒ increase network output
⇒ increase weights on +ve inputs, decrease on −ve inputs
The full learning loop over the training set:

    W = random initial values
    for iter = 1 to T
        for i = 1 to N (all examples)
            x = input vector for example i
            y = output for example i
            W_old = W
            Err = y − g(W_old · x)
            for j = 1 to M (all weights)
                Wj = Wj + α · Err · g′(W_old · x) · xj
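Here is one way the pseudocode above might look as runnable Python for a single sigmoid unit; the training task (the OR function), the learning rate, and the epoch count are illustrative choices of mine:

    import math, random

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))           # sigmoid activation

    def g_prime(x):
        return g(x) * (1.0 - g(x))                  # derivative (see the derivation below)

    def train_perceptron(examples, alpha=0.5, epochs=1000):
        """Gradient-descent perceptron learning, following the pseudocode above.
        Each example is (x, y) with x including the bias input x[0] = -1."""
        m = len(examples[0][0])
        W = [random.uniform(-0.5, 0.5) for _ in range(m)]
        for _ in range(epochs):                     # iter = 1 to T
            for x, y in examples:                   # i = 1 to N
                in_ = sum(wj * xj for wj, xj in zip(W, x))
                err = y - g(in_)
                W = [wj + alpha * err * g_prime(in_) * xj for wj, xj in zip(W, x)]
        return W

    # Learn the (linearly separable) OR function; x[0] = -1 is the bias input.
    data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
    W = train_perceptron(data)
    # Outputs typically approach 0, 1, 1, 1.
    print([round(g(sum(wj * xj for wj, xj in zip(W, x))), 2) for x, _ in data])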
Perceptron learning contd.

The derivative of the sigmoid g(x) can be written in a simple form:

    g(x) = 1 / (1 + e^−x)

    g′(x) = e^−x / (1 + e^−x)^2 = e^−x g(x)^2

Also,

    g(x) = 1 / (1 + e^−x)  ⇒  g(x) + e^−x g(x) = 1  ⇒  e^−x = (1 − g(x)) / g(x)

So

    g′(x) = [(1 − g(x)) / g(x)] g(x)^2 = (1 − g(x)) g(x)
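A quick numerical sanity check of the identity g′(x) = g(x)(1 − g(x)) against a central finite difference:

    import math

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Compare the closed form g(x)(1 - g(x)) with a finite-difference estimate of g'(x).
    h = 1e-6
    for x in (-2.0, 0.0, 0.5, 3.0):
        closed_form = g(x) * (1.0 - g(x))
        numeric = (g(x + h) - g(x - h)) / (2 * h)
        print(x, round(closed_form, 6), round(numeric, 6))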
Perceptron learning contd.

The perceptron learning rule converges to a consistent function for any linearly separable data set.

[Figure: learning curves, proportion correct on the test set]
Multilayer networks
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: layered feed-forward network: input units ak feed hidden units aj via weights Wk,j, which feed output units ai via weights Wj,i]
Expressiveness of MLPs
All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers
Back-propagation learning

In general we have n output nodes, so

    E ≡ (1/2) Σi Erri^2 ,

where Erri = (yi − ai) and i runs over all nodes in the output layer.

Output layer: same as for a single-layer perceptron,

    Wj,i ← Wj,i + α × aj × ∆i

where ∆i = Erri × g′(ini).

Hidden layers: back-propagate the error from the output layer:

    ∆j = g′(inj) Σi Wj,i ∆i .
Back-propagation derivation

For a node i in the output layer:

    ∂E/∂Wj,i = −(yi − ai) ∂ai/∂Wj,i = −(yi − ai) ∂g(ini)/∂Wj,i
             = −(yi − ai) g′(ini) ∂ini/∂Wj,i
             = −(yi − ai) g′(ini) ∂/∂Wj,i (Σk Wk,i ak)
             = −(yi − ai) g′(ini) aj = −aj ∆i
Back-propagation derivation: hidden layer

For a node j in a hidden layer:

    ∂E/∂Wk,j = ?
“Reminder”: chain rule for partial derivatives

For f(x, y), with f differentiable wrt x and y, and x and y differentiable wrt u and v:

    ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)

and

    ∂f/∂v = (∂f/∂x)(∂x/∂v) + (∂f/∂y)(∂y/∂v)
Viewing E as a function E(aj1, aj2, . . . , ajm) of the activations of the nodes {ji} in the same layer as node j, the chain rule gives

    ∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) + Σi (∂E/∂ai)(∂ai/∂Wk,j)

where i runs over all other nodes i in the same layer as node j. Since ∂ai/∂Wk,j = 0 for i ≠ j,

    ∂E/∂Wk,j = (∂E/∂aj)(∂aj/∂Wk,j) = (∂E/∂aj) · g′(inj) ak

For ∂E/∂aj, view E as a function E(ak1, ak2, . . . , akm) of the activations of the nodes {ki} in the layer after node j and apply the chain rule again:

    ∂E/∂aj = Σk (∂E/∂ak)(∂ak/∂aj)

Each term satisfies ∂ak/∂aj = g′(ink) Wj,k and, working back from the output layer, ∂E/∂ak = −∆k / g′(ink), so

    ∂E/∂aj = −Σk Wj,k ∆k

If we define

    ∆j ≡ g′(inj) Σk Wj,k ∆k

then

    ∂E/∂Wk,j = −∆j ak
Back-propagation pseudocode

    for iter = 1 to T
        for e = 1 to N (all examples)
            x = input for example e
            y = output for example e
            run x forward through the network, computing all {ai}, {ini}
            for all weights (j, i) (in reverse order)
                compute ∆i = (yi − ai) × g′(ini)       if i is an output node
                           = g′(ini) × Σk Wi,k ∆k      otherwise
                Wj,i = Wj,i + α × aj × ∆i
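A compact runnable sketch of this procedure for a network with one hidden layer of sigmoid units, trained on XOR purely as an illustration; the hyperparameters, variable names, and task are my own choices, and a poor random initialization can occasionally leave the network stuck in a local minimum:

    import math, random

    def g(x):
        return 1.0 / (1.0 + math.exp(-x))

    def backprop_train(examples, n_hidden=3, alpha=0.5, epochs=5000):
        """Back-propagation for one hidden layer of sigmoid units (bias input = -1 everywhere)."""
        n_in = len(examples[0][0])
        # W_hid[j][k]: weight from input k to hidden unit j; W_out[j]: weight from hidden j to output.
        W_hid = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        W_out = [random.uniform(-1, 1) for _ in range(n_hidden + 1)]
        for _ in range(epochs):
            for x, y in examples:
                xb = [-1.0] + list(x)                                  # add bias input
                in_hid = [sum(w * xi for w, xi in zip(Wj, xb)) for Wj in W_hid]
                a_hid = [-1.0] + [g(v) for v in in_hid]                # bias input for output unit
                a_out = g(sum(w * a for w, a in zip(W_out, a_hid)))
                # Output layer: Delta_i = Err_i * g'(in_i), with g'(in) = g(in)(1 - g(in)).
                delta_out = (y - a_out) * a_out * (1 - a_out)
                # Hidden layer: Delta_j = g'(in_j) * W_j,i * Delta_i (single output node here).
                delta_hid = [g(v) * (1 - g(v)) * W_out[j + 1] * delta_out
                             for j, v in enumerate(in_hid)]
                # Weight updates: W_j,i += alpha * a_j * Delta_i.
                W_out = [w + alpha * a * delta_out for w, a in zip(W_out, a_hid)]
                W_hid = [[w + alpha * xi * delta_hid[j] for w, xi in zip(W_hid[j], xb)]
                         for j in range(n_hidden)]
        return W_hid, W_out

    # Illustration: XOR, which no single perceptron can represent.
    data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
    W_hid, W_out = backprop_train(data)
    for x, y in data:
        xb = [-1.0] + list(x)
        a_hid = [-1.0] + [g(sum(w * xi for w, xi in zip(Wj, xb))) for Wj in W_hid]
        print(x, y, round(g(sum(w * a for w, a in zip(W_out, a_hid))), 2))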
Back-propagation learning contd.
At each epoch, sum the gradient updates over all examples and apply them.

Restaurant data:

[Figure: total error on the training set vs. number of epochs]

[Figure: % correct on the test set vs. training set size]
Handwritten digit recognition
Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, credit cards, etc.