
Neural networks

Chapter 20

Outline
♦ Brains
♦ Neural networks
♦ Perceptrons
♦ Multilayer networks
♦ Applications of neural networks

Brains
$10^{11}$ neurons of > 20 types, $10^{14}$ synapses, 1 ms–10 ms cycle time
Signals are noisy “spike trains” of electrical potential

[Figure: schematic of a biological neuron — cell body (soma) containing the nucleus, dendrites, an axon with axonal arborization, and synapses connecting to axons from other cells]
McCulloch–Pitts “unit”
Output is a “squashed” linear function of the inputs:
$a_i \leftarrow g(in_i) = g\Big(\sum_j W_{j,i}\, a_j\Big)$

[Figure: a unit with input links carrying activations $a_j$ weighted by $W_{j,i}$, a bias input $a_0 = -1$ with weight $W_{0,i}$, an input function computing $in_i = \sum_j W_{j,i} a_j$, an activation function $g$, and output links carrying $a_i = g(in_i)$]
Activation functions

[Figure: two activation functions $g(in_i)$ — (a) a step function jumping from 0 to +1 at the threshold; (b) a sigmoid rising smoothly from 0 to +1]

(a) is a step function or threshold function


(b) is a sigmoid function $1/(1 + e^{-x})$
Changing the bias weight $W_{0,i}$ moves the threshold location
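To make the unit and the two activation functions concrete, here is a minimal Python sketch (not from the chapter; the function names are illustrative) of a single unit computing $a_i = g\big(\sum_j W_{j,i} a_j\big)$, with the bias handled as a fixed input $a_0 = -1$:

```python
import math

def step(x):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    """Soft threshold: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def unit_output(weights, inputs, g=step):
    """Compute a_i = g(sum_j W_ji * a_j).

    weights[0] is the bias weight W_0i; its activation is the fixed bias
    input a_0 = -1, so W_0i acts as the unit's threshold.
    """
    activations = [-1.0] + list(inputs)                 # prepend a_0 = -1
    in_i = sum(w * a for w, a in zip(weights, activations))
    return g(in_i)

# A unit with bias weight 0.5 and input weights 1, 1 (the OR unit of a later slide)
print(unit_output([0.5, 1.0, 1.0], [0, 1]))             # step    -> 1.0
print(unit_output([0.5, 1.0, 1.0], [0, 1], sigmoid))    # sigmoid -> ~0.62
```

With the step activation this is exactly the kind of threshold unit used for the logic gates on the next slide.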

Implementing logical functions

McCulloch and Pitts: every Boolean function can be implemented with a large enough network, e.g., AND, OR, NOT, MAJORITY.

Threshold units for the basic gates (bias input $a_0 = -1$, so $W_0$ acts as the threshold; NOT takes a single input):

Gate   $W_0$    $W_1$   $W_2$
AND     1.5      1       1
OR      0.5      1       1
NOT    -0.5     -1       —
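As a quick sanity check of these weights, the following self-contained sketch (illustrative, not from the chapter) builds each gate as a single threshold unit and prints its truth table:

```python
def threshold_unit(w0, *ws):
    """Return the Boolean function computed by one threshold unit.

    The bias input is a_0 = -1 with weight w0, so the unit outputs 1
    exactly when sum_j w_j * a_j exceeds the threshold w0.
    """
    def unit(*inputs):
        total = -w0 + sum(w * a for w, a in zip(ws, inputs))
        return 1 if total > 0 else 0
    return unit

AND = threshold_unit(1.5, 1, 1)
OR  = threshold_unit(0.5, 1, 1)
NOT = threshold_unit(-0.5, -1)

print("a NOT(a)")
for a in (0, 1):
    print(a, NOT(a))
print("a b AND OR")
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
```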
Network structures
Feed-forward networks:
– single-layer perceptrons
– multi-layer networks
Feed-forward networks implement functions, have no internal state
Recurrent networks:
– Hopfield networks have symmetric weights ($W_{i,j} = W_{j,i}$);
  $g(x) = \mathrm{sign}(x)$, $a_i = \pm 1$; holographic associative memory
– Boltzmann machines use stochastic activation functions,
≈ MCMC in BNs
– recurrent neural nets have directed cycles with delays
⇒ have internal state (like flip-flops), can oscillate etc.

Feed-forward example

[Figure: a feed-forward network with input units 1 and 2, hidden units 3 and 4, and output unit 5; weights $W_{1,3}, W_{1,4}, W_{2,3}, W_{2,4}$ connect the inputs to the hidden units, and $W_{3,5}, W_{4,5}$ connect the hidden units to the output]
Feed-forward network = a parameterized family of nonlinear functions:

$a_5 = g(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g\big(W_{3,5}\, g(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g(W_{1,4}\, a_1 + W_{2,4}\, a_2)\big)$
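Transcribing this example directly into Python makes the composition of functions explicit; the weight values below are made up purely for illustration:

```python
import math

def g(x):
    """Sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative weight values (not from the chapter)
W13, W14, W23, W24, W35, W45 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7

def network(a1, a2):
    a3 = g(W13 * a1 + W23 * a2)          # hidden unit 3
    a4 = g(W14 * a1 + W24 * a2)          # hidden unit 4
    a5 = g(W35 * a3 + W45 * a4)          # output unit 5
    return a5

print(network(1.0, 0.0))                  # one point of the nonlinear function a5(a1, a2)
```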

Perceptrons

[Figure: a single-layer perceptron — input units connected directly to output units by weights $W_{j,i}$; plot of the output of a two-input sigmoid perceptron as a function of $x_1$ and $x_2$, showing a soft-threshold surface]
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
$\sum_j W_j x_j > 0$, i.e., $\mathbf{W} \cdot \mathbf{x} > 0$
[Figure: points in the $(I_1, I_2)$ plane with the positive examples marked for (a) $I_1$ AND $I_2$, (b) $I_1$ OR $I_2$, (c) $I_1$ XOR $I_2$ — (a) and (b) are linearly separable, (c) is not]
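The XOR case can be illustrated (not proved) with a brute-force sketch, which is an addition to the slides: searching a coarse grid of weights $(W_0, W_1, W_2)$ finds a threshold unit for AND and for OR but none for XOR.

```python
import itertools

def find_separator(truth_table, grid):
    """Search a coarse grid of (w0, w1, w2) for a threshold unit computing
    the given Boolean function (bias input a_0 = -1, so the test is
    w1*a + w2*b - w0 > 0)."""
    for w0, w1, w2 in itertools.product(grid, repeat=3):
        if all((w1 * a + w2 * b - w0 > 0) == bool(out)
               for (a, b), out in truth_table.items()):
            return (w0, w1, w2)
    return None

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = {p: p[0] & p[1] for p in points}
OR  = {p: p[0] | p[1] for p in points}
XOR = {p: p[0] ^ p[1] for p in points}

grid = [x / 2 for x in range(-4, 5)]      # weights -2.0 ... 2.0 in steps of 0.5
print("AND:", find_separator(AND, grid))  # some separating weights, e.g. (1.5, 1, 1)
print("OR: ", find_separator(OR, grid))   # some separating weights
print("XOR:", find_separator(XOR, grid))  # None: no grid point separates XOR
```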
Perceptron learning

Learn by adjusting weights to reduce error on training set
The squared error for an example with input $\mathbf{x}$ and true output $y$ is
$E = \frac{1}{2}\,Err^2 \equiv \frac{1}{2}\big(y - h_{\mathbf{W}}(\mathbf{x})\big)^2$
Perform optimization search by gradient descent:
$\dfrac{\partial E}{\partial W_j} = Err \times \dfrac{\partial Err}{\partial W_j} = Err \times \dfrac{\partial}{\partial W_j}\Big(y - g\big(\textstyle\sum_{j=0}^{n} W_j x_j\big)\Big) = -Err \times g'(in) \times x_j$
Simple weight update rule:
$W_j \leftarrow W_j + \alpha \times Err \times g'(in) \times x_j$
E.g., a positive error ⇒ increase the network output
⇒ increase weights on positive inputs, decrease weights on negative inputs
Perceptron learning

W = random initial values
for iter = 1 to T
  for i = 1 to N (all examples)
    x = input vector for example i
    y = output for example i
    W_old = W
    Err = y − g(W_old · x)
    for j = 1 to M (all weights)
      W_j = W_j + α · Err · g′(W_old · x) · x_j
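The pseudocode translates almost line for line into Python. The sketch below is a minimal illustration (the OR training set, learning rate, and epoch count are made up); it uses the sigmoid for $g$ and the identity $g'(in) = g(in)(1 - g(in))$ derived on the next slide:

```python
import math, random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))          # sigmoid activation

def g_prime(x):
    s = g(x)
    return s * (1.0 - s)                        # g'(x) = g(x)(1 - g(x)), see next slide

def train_perceptron(examples, alpha=0.5, epochs=2000, seed=0):
    """examples: list of (inputs, target) pairs; inputs exclude the bias.
    Returns the weights, with W[0] the bias weight (its input is a_0 = -1)."""
    rng = random.Random(seed)
    m = len(examples[0][0]) + 1                 # +1 for the bias weight
    W = [rng.uniform(-0.5, 0.5) for _ in range(m)]
    for _ in range(epochs):
        for x, y in examples:
            xb = [-1.0] + list(x)               # bias input a_0 = -1
            in_ = sum(w * xi for w, xi in zip(W, xb))
            err = y - g(in_)
            for j in range(m):                  # W_j <- W_j + alpha*Err*g'(in)*x_j
                W[j] += alpha * err * g_prime(in_) * xb[j]
    return W

# Learn OR from its truth table (an illustrative, linearly separable data set)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
W = train_perceptron(data)
for x, y in data:
    out = g(sum(w * xi for w, xi in zip(W, [-1.0] + list(x))))
    print(x, y, round(out, 2))
```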

Perceptron learning contd.

Derivative of the sigmoid $g(x)$ can be written in simple form:
$g(x) = \dfrac{1}{1 + e^{-x}}$
$g'(x) = \dfrac{e^{-x}}{(1 + e^{-x})^2} = e^{-x}\, g(x)^2$
Also,
$g(x) = \dfrac{1}{1 + e^{-x}} \;\Rightarrow\; g(x) + e^{-x} g(x) = 1 \;\Rightarrow\; e^{-x} = \dfrac{1 - g(x)}{g(x)}$
So
$g'(x) = g(x)^2\, \dfrac{1 - g(x)}{g(x)} = \big(1 - g(x)\big)\, g(x)$
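A quick numerical check of the identity (a sketch, not from the chapter), comparing a finite-difference estimate of $g'(x)$ against $g(x)(1 - g(x))$:

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

h = 1e-6
for x in (-2.0, 0.0, 0.7, 3.0):
    finite_diff = (g(x + h) - g(x - h)) / (2 * h)   # numerical estimate of g'(x)
    identity = g(x) * (1.0 - g(x))                  # g'(x) = g(x)(1 - g(x))
    print(f"x = {x:5.1f}: {finite_diff:.6f} vs {identity:.6f}")
```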
Perceptron learning contd.
Perceptron learning rule converges to a consistent function
for any linearly separable data set
[Figure: proportion correct on the test set vs. training set size, comparing perceptron and decision-tree learning — (left) the MAJORITY function on 11 inputs, (right) the RESTAURANT data]

Multilayer networks
Layers are usually fully connected;
numbers of hidden units typically chosen by hand
[Figure: a layered feed-forward network — input units with activations $a_k$, connected by weights $W_{k,j}$ to hidden units with activations $a_j$, connected by weights $W_{j,i}$ to output units with activations $a_i$]
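A minimal sketch of the forward pass through such a layered network, with each layer's weights stored as a list of rows (one row of weights per unit, bias weight first); the layer sizes and weight values are made up for illustration:

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """One fully connected layer: weights[i] = [W_0i, W_1i, ..., W_ki] for
    output unit i, with the bias input a_0 = -1 as in the single-unit case."""
    xb = [-1.0] + list(inputs)
    return [g(sum(w * a for w, a in zip(row, xb))) for row in weights]

# Illustrative 2-input, 3-hidden-unit, 1-output network (made-up weights)
W_hidden = [[ 0.3,  1.0, -1.0],
            [-0.2,  0.5,  0.5],
            [ 0.7, -1.5,  2.0]]
W_output = [[ 0.1,  1.0, -1.0,  0.5]]

a_hidden = layer([0.0, 1.0], W_hidden)     # activations a_j of the hidden units
a_output = layer(a_hidden, W_output)       # activations a_i of the output units
print(a_hidden, a_output)
```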
Expressiveness of MLPs
All continuous functions w/ 1 hidden layer, all functions w/ 2 hidden layers

[Figure: output surfaces $h_{\mathbf{W}}(x_1, x_2)$ of small two-input multilayer networks — combining sigmoid units produces a ridge, and combining ridges produces a bump]
Back-propagation learning

In general there are $n$ output nodes, and
$E \equiv \frac{1}{2} \sum_i Err_i^2$,
where $Err_i = (y_i - a_i)$ and the sum over $i$ runs over all nodes in the output layer.
Output layer: same as for the single-layer perceptron,
$W_{j,i} \leftarrow W_{j,i} + \alpha \times a_j \times \Delta_i$
where $\Delta_i = Err_i \times g'(in_i)$.
Hidden layers: back-propagate the error from the output layer:
$\Delta_j = g'(in_j) \sum_i W_{j,i}\, \Delta_i$.
Update rule for weights in hidden layers:
$W_{k,j} \leftarrow W_{k,j} + \alpha \times a_k \times \Delta_j$.
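To see the two rules on concrete numbers, this sketch (weights, inputs, and target are made up) runs the earlier five-unit example network forward, computes $\Delta_5$ for the output unit, back-propagates it to get $\Delta_3$ and $\Delta_4$, and applies one weight update of each kind:

```python
import math

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def g_prime(a):
    return a * (1.0 - a)        # for the sigmoid, g'(in) = a(1 - a) where a = g(in)

# Made-up weights, inputs, and target for the earlier five-unit example network
W13, W14, W23, W24, W35, W45 = 0.5, -0.3, 0.8, 0.1, 1.2, -0.7
a1, a2, y5 = 1.0, 0.0, 1.0
alpha = 0.1

# Forward pass
in3, in4 = W13 * a1 + W23 * a2, W14 * a1 + W24 * a2
a3, a4 = g(in3), g(in4)
in5 = W35 * a3 + W45 * a4
a5 = g(in5)

# Output layer: Delta_5 = Err_5 * g'(in_5)
delta5 = (y5 - a5) * g_prime(a5)

# Hidden layer: Delta_j = g'(in_j) * sum_i W_ji * Delta_i
delta3 = g_prime(a3) * W35 * delta5
delta4 = g_prime(a4) * W45 * delta5

# One weight update per rule, e.g. W_35 <- W_35 + alpha*a_3*Delta_5
W35 += alpha * a3 * delta5
W13 += alpha * a1 * delta3
print(round(delta5, 4), round(delta3, 4), round(delta4, 4))
```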
Back-propagation derivation

For a node $i$ in the output layer:
$\dfrac{\partial E}{\partial W_{j,i}} = -(y_i - a_i)\,\dfrac{\partial a_i}{\partial W_{j,i}} = -(y_i - a_i)\,\dfrac{\partial g(in_i)}{\partial W_{j,i}}$
$= -(y_i - a_i)\, g'(in_i)\, \dfrac{\partial in_i}{\partial W_{j,i}} = -(y_i - a_i)\, g'(in_i)\, \dfrac{\partial}{\partial W_{j,i}}\Big(\sum_k W_{k,i}\, a_k\Big)$
$= -(y_i - a_i)\, g'(in_i)\, a_j = -a_j\, \Delta_i$
where $\Delta_i = (y_i - a_i)\, g'(in_i)$.
Back-propagation derivation: hidden layer

For a node $j$ in a hidden layer:
$\dfrac{\partial E}{\partial W_{k,j}} = \;?$
“Reminder”: Chain rule for partial derivatives

For $f(x, y)$, with $f$ differentiable w.r.t. $x$ and $y$, and $x$ and $y$ differentiable w.r.t. $u$ and $v$:
$\dfrac{\partial f}{\partial u} = \dfrac{\partial f}{\partial x}\dfrac{\partial x}{\partial u} + \dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial u}$
and
$\dfrac{\partial f}{\partial v} = \dfrac{\partial f}{\partial x}\dfrac{\partial x}{\partial v} + \dfrac{\partial f}{\partial y}\dfrac{\partial y}{\partial v}$
Back-propagation derivation: hidden layer

For a node $j$ in a hidden layer, $E$ depends on $W_{k,j}$ only through the activations of the nodes in the same layer as $j$, so by the chain rule
$\dfrac{\partial E}{\partial W_{k,j}} = \dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial W_{k,j}} + \sum_i \dfrac{\partial E}{\partial a_i}\dfrac{\partial a_i}{\partial W_{k,j}}$   (the sum runs over all other nodes $i$ in $j$'s layer)
$= \dfrac{\partial E}{\partial a_j}\dfrac{\partial a_j}{\partial W_{k,j}}$   since $\dfrac{\partial a_i}{\partial W_{k,j}} = 0$ for $i \neq j$
$= \dfrac{\partial E}{\partial a_j}\cdot g'(in_j)\, a_k$
In turn, $E$ depends on $a_j$ only through the activations of the nodes in the next layer that $j$ connects to, so again by the chain rule
$\dfrac{\partial E}{\partial a_j} = \sum_k \dfrac{\partial E}{\partial a_k}\dfrac{\partial a_k}{\partial a_j} = \sum_k \dfrac{\partial E}{\partial a_k}\, g'(in_k)\, W_{j,k}$
where $k$ runs over all nodes that node $j$ connects to.
If we define
$\Delta_j \equiv g'(in_j) \sum_k W_{j,k}\, \Delta_k$
then
$\dfrac{\partial E}{\partial W_{k,j}} = -\Delta_j\, a_k$
Back-propagation pseudocode

for iter = 1 to T
  for e = 1 to N (all examples)
    x = input vector for example e
    y = output vector for example e
    run x forward through the network, computing all {a_i}, {in_i}
    for all weights (j, i) (in reverse order, from the output layer back)
      compute Δ_i = (y_i − a_i) · g′(in_i)           if i is an output node
                  = g′(in_i) · Σ_k W_{i,k} Δ_k       otherwise
      W_{j,i} = W_{j,i} + α · a_j · Δ_i
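Filled out into runnable Python for one hidden layer, the pseudocode becomes the sketch below (sigmoid units, XOR as the training set, made-up learning rate and network size; a minimal illustration rather than the chapter's code). As the next slide notes, gradient descent can be slow or stuck in a local minimum, so the final outputs depend on the random seed:

```python
import math, random

def g(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_mlp(examples, n_hidden=4, alpha=0.5, epochs=5000, seed=1):
    """One hidden layer, one sigmoid output unit.
    Each weight row has a leading bias weight fed by a fixed input of -1."""
    rng = random.Random(seed)
    n_in = len(examples[0][0])
    Wh = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    Wo = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]

    for _ in range(epochs):
        for x, y in examples:
            # Forward pass: compute all activations
            xb = [-1.0] + list(x)
            a_h = [g(sum(w * xi for w, xi in zip(row, xb))) for row in Wh]
            hb = [-1.0] + a_h
            a_o = g(sum(w * ai for w, ai in zip(Wo, hb)))

            # Backward pass: output delta, then back-propagated hidden deltas
            d_o = (y - a_o) * a_o * (1.0 - a_o)
            d_h = [a * (1.0 - a) * Wo[j + 1] * d_o for j, a in enumerate(a_h)]

            # Updates: W <- W + alpha * (activation feeding the weight) * delta
            for i, ai in enumerate(hb):
                Wo[i] += alpha * ai * d_o
            for j in range(n_hidden):
                for k, xi in enumerate(xb):
                    Wh[j][k] += alpha * xi * d_h[j]
    return Wh, Wo

# XOR: not representable by a single perceptron, learnable by this MLP
# (may need more epochs or another seed if gradient descent hits a local minimum)
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
Wh, Wo = train_mlp(data)
for x, y in data:
    xb = [-1.0] + list(x)
    a_h = [g(sum(w * xi for w, xi in zip(row, xb))) for row in Wh]
    a_o = g(sum(w * ai for w, ai in zip(Wo, [-1.0] + a_h)))
    print(x, y, round(a_o, 2))
```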

Back-propagation learning contd.
At each epoch, sum the gradient updates over all examples and apply them
Restaurant data:

[Figure: total error on the training set vs. number of epochs (0–400) of back-propagation]

Usual problems with slow convergence, local minima


Back-propagation learning contd.
Restaurant data:

[Figure: proportion correct on the test set vs. training set size (0–100 examples), comparing a multilayer network with a decision tree]
Handwritten digit recognition

3-nearest-neighbor = 2.4% error


400–300–10 unit MLP = 1.6% error
LeNet: 768–192–30–10 unit MLP = 0.9% error

Summary
Most brains have lots of neurons; each neuron ≈ linear–threshold unit (?)
Perceptrons (one-layer networks) insufficiently expressive
Multi-layer networks are sufficiently expressive; can be trained by gradient
descent, i.e., error back-propagation
Many applications: speech, driving, handwriting, credit cards, etc.
