INTRODUCTION TO
Machine Learning 4e
ETHEM ALPAYDIN
The MIT Press, 2020
[email protected]
CHAPTER 11:
Multilayer Perceptrons
Neural Networks 3
\mathbf{w} = [w_0, w_1, \ldots, w_d]^T
\mathbf{x} = [1, x_1, \ldots, x_d]^T
(Rosenblatt, 1962)
What a Perceptron Does 6
[Figure: perceptrons with input x, weights w and bias weight w_0 (bias unit x_0 = +1), and output y; for classification the weighted sum is passed through a threshold/sigmoid function s]
y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
K Outputs 7
Regression:
y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}
\mathbf{y} = \mathbf{W} \mathbf{x}
Classification:
o_i = \mathbf{w}_i^T \mathbf{x}
y_i = \frac{\exp o_i}{\sum_k \exp o_k}
choose C_i if y_i = \max_k y_k
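As a concrete illustration, here is a minimal Python sketch of the K outputs above: the linear responses o_i = w_i^T x, followed by the softmax that turns them into class posteriors. The weight matrix and input below are made-up values, not from the text.

```python
import numpy as np

def k_outputs(W, x, classification=True):
    o = W @ x                        # linear outputs o_i = w_i^T x
    if not classification:
        return o                     # regression: y = W x
    y = np.exp(o - o.max())          # softmax, shifted for numerical stability
    return y / y.sum()               # y_i = exp(o_i) / sum_k exp(o_k)

# Hypothetical weights for K=2 outputs, d=2 inputs (first column is the bias w_i0)
W = np.array([[0.1, 0.5, -0.2],
              [0.0, -0.3, 0.8]])
x = np.array([1.0, 0.4, 1.2])        # x = [1, x_1, x_2]
y = k_outputs(W, x)
print("posteriors:", y, "-> choose C", np.argmax(y) + 1)
```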
Training 8
• Online (instances seen one by one) vs batch (whole
sample) learning:
• No need to store the whole sample
• Problem may change in time
• Wear and degradation in system components
• Stochastic gradient descent: update after a single
pattern
• Generic update rule (LMS rule):
\Delta w_{ij}^t = \eta\, (r_i^t - y_i^t)\, x_j^t
Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
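A minimal sketch of this online LMS update in Python; the linear target function, learning rate, and data stream below are illustrative assumptions, not from the text.

```python
import numpy as np

def lms_update(W, x, r, eta=0.1):
    y = W @ x                        # actual output (linear)
    W += eta * np.outer(r - y, x)    # update = eta * (desired - actual) * input
    return W

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(1, 3))        # 1 output, d=2 inputs plus bias
for _ in range(1000):                          # instances seen one by one, not stored
    x = np.array([1.0, *rng.uniform(-1, 1, 2)])
    r = np.array([0.5 + 2.0 * x[1] - x[2]])    # hypothetical target to be tracked
    W = lms_update(W, x, r)
print(W)                                       # approaches [0.5, 2.0, -1.0]
```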
Training a Perceptron:
Regression 9
E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\,(r^t - y^t)^2 = \frac{1}{2}\left[ r^t - \mathbf{w}^T \mathbf{x}^t \right]^2
\Delta w_j^t = \eta\, (r^t - y^t)\, x_j^t
Classification 10
\Delta w_{ij}^t = \eta\, (r_i^t - y_i^t)\, x_j^t
Learning Boolean AND 11
XOR 12
w_0 \le 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 \le 0     (Minsky and Papert, 1969)
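Adding the two strict inequalities shows why no single perceptron can satisfy all four constraints (a short derivation from the inequalities above):
(w_1 + w_0) + (w_2 + w_0) > 0 \;\Rightarrow\; w_1 + w_2 + w_0 > -w_0 \ge 0 \quad (\text{since } w_0 \le 0),
which contradicts w_1 + w_2 + w_0 \le 0; XOR is not linearly separable.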
Multilayer Perceptrons 13
y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
(Rumelhart et al., 1986)
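A minimal sketch of this forward computation in Python; the weight shapes, random initial values, and example input are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(W, V, x):
    x = np.concatenate(([1.0], x))   # prepend bias x_0 = +1
    z = sigmoid(W @ x)               # hidden units: z_h = sigmoid(w_h^T x)
    z = np.concatenate(([1.0], z))   # prepend bias z_0 = +1
    y = V @ z                        # outputs:      y_i = v_i^T z
    return y, z

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))          # H=3 hidden units, d=2 inputs (illustrative)
V = rng.normal(size=(1, 4))          # K=1 output
y, z = mlp_forward(W, V, np.array([0.2, -0.7]))
print("hidden:", z[1:], "output:", y)
```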
Table 11.3 The MLP that implements XOR, with two hidden units that implement the two ANDs and an output that takes an OR of them.
x1 x2 z1 z2 y
0 0 0 0 0
0 1 0 1 1
1 0 1 0 1
1 1 0 0 0
y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}
z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}
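The structure in the table can also be wired by hand. The sketch below uses illustrative weights (scaled so the sigmoids act almost like thresholds), not necessarily the exact values of Table 11.3: z_1 and z_2 implement the two ANDs and the output takes their OR.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Illustrative hand-picked weights (an assumption, not the book's exact values);
# the factor 10 sharpens the sigmoids toward 0/1.
W = 10 * np.array([[-0.5,  1.0, -1.0],   # z1 = x1 AND NOT x2
                   [-0.5, -1.0,  1.0]])  # z2 = NOT x1 AND x2
v = 10 * np.array([-0.5, 1.0, 1.0])      # y  = z1 OR z2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = sigmoid(W @ np.array([1.0, x1, x2]))
    y = sigmoid(v @ np.concatenate(([1.0], z)))
    print(x1, x2, "->", round(float(y)))     # reproduces the XOR truth table
```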
\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\, \frac{\partial y_i}{\partial z_h}\, \frac{\partial z_h}{\partial w_{hj}}
Regression 16
E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2} \sum_t (r^t - y^t)^2
Forward:
y^t = \sum_{h=1}^{H} v_h z_h^t + v_0
z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t)
Backward:
\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t
\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\, \frac{\partial y^t}{\partial z_h^t}\, \frac{\partial z_h^t}{\partial w_{hj}} = -\eta \sum_t -(r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t
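A minimal sketch of these forward and backward computations in Python, trained online for a single output; the toy data (fitting sin x), number of hidden units, and learning rate are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))       # toy 1-d regression data (assumption)
R = np.sin(X[:, 0])
H, eta = 10, 0.01
W = rng.normal(scale=0.1, size=(H, 2))      # first-layer weights w_hj (incl. bias)
v = rng.normal(scale=0.1, size=H + 1)       # second-layer weights v_h (incl. bias)

for epoch in range(1000):
    for x, r in zip(X, R):                  # online: update after each pattern
        x = np.concatenate(([1.0], x))      # x_0 = +1
        z = np.concatenate(([1.0], sigmoid(W @ x)))   # forward: hidden units
        y = v @ z                                     # forward: output
        dv = eta * (r - y) * z              # Delta v_h  = eta (r - y) z_h
        # Delta w_hj = eta (r - y) v_h z_h (1 - z_h) x_j
        dW = eta * (r - y) * np.outer(v[1:] * z[1:] * (1 - z[1:]), x)
        v += dv
        W += dW

Z = sigmoid(np.column_stack([np.ones(len(X)), X]) @ W.T)
print("training MSE:", np.mean((R - np.column_stack([np.ones(len(X)), Z]) @ v) ** 2))
```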
Regression with Multiple
Outputs 17
E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2} \sum_t \sum_i (r_i^t - y_i^t)^2
y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}
\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t
\Delta w_{hj} = \eta \sum_t \left[ \sum_i (r_i^t - y_i^t)\, v_{ih} \right] z_h^t (1 - z_h^t)\, x_j^t
[Figure: network with inputs x_j, hidden units z_h, first-layer weights w_{hj}, second-layer weights v_{ih}, and outputs y_i]
1d Regression: Convergence 19
Learning Hidden Representations 26
• An MLP is a generalized linear model where the hidden units are the nonlinear basis functions:
(11.35)   y = \sum_{h=1}^{H} v_h\, \phi(\mathbf{x} \mid \mathbf{w}_h), \quad \text{where} \quad \phi(\mathbf{x} \mid \mathbf{w}_h) \equiv \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})
• From the text: the basis functions \phi_h(\mathbf{x}) define an H-dimensional space where the problem can be solved using a linear model. One possibility is for us to specify the basis functions ourselves, using whatever transformations of the input we believe will help; polynomial basis functions, for example, are generally useful. Such fixed, prespecified basis functions also underlie the support vector machines that we will discuss in chapter 14. As we discussed before, an MLP can be written similarly: equation 11.34 becomes equation 11.35 above, with the hidden units acting as basis functions.
Autoencoders 28
• An interesting architecture is the autoencoder (Cottrell, Munro, and Zipser 1987), an MLP architecture where there are as many outputs as there are inputs, and the required output is defined to be equal to the input (see figure 11.15). When the number of hidden units is less than the number of inputs, this implies dimensionality reduction.
• The first layer, from the input to the hidden layer, acts as an encoder, and the values of the hidden units make up the code. The encoder is trained to find the best representation of the input in the hidden layer so as to be able to reproduce it again at the output layer:
(11.36)   z^t = \text{Enc}(\mathbf{x}^t \mid \mathbf{W})
where W are the parameters of the encoder.
• The second layer, from the hidden units to the output units, acts as the decoder, reconstructing the original signal from its encoded representation:
(11.37)   \hat{\mathbf{x}}^t = \text{Dec}(\mathbf{z}^t \mid \mathbf{V})
where V are the parameters of the decoder.
[Figure: encoder maps input x to code z; decoder maps z to reconstruction x̂]
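A minimal sketch of the encoder/decoder pair of equations 11.36 and 11.37; the layer sizes and random weights below are assumptions, and in practice W and V would be trained (e.g. with the backpropagation updates above) to minimize the reconstruction error \sum_t \|\mathbf{x}^t - \hat{\mathbf{x}}^t\|^2.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, W):                    # z^t = Enc(x^t | W)
    return sigmoid(W @ np.concatenate(([1.0], x)))

def decode(z, V):                    # x_hat^t = Dec(z^t | V)
    return V @ np.concatenate(([1.0], z))

rng = np.random.default_rng(0)
d, H = 8, 3                          # 8 inputs compressed to a 3-dimensional code
W = rng.normal(scale=0.1, size=(H, d + 1))   # untrained, illustrative weights
V = rng.normal(scale=0.1, size=(d, H + 1))
x = rng.uniform(size=d)
z = encode(x, W)                     # the code: hidden-unit values
x_hat = decode(z, V)                 # reconstruction of the input
print("code:", z)
print("reconstruction error:", np.sum((x - x_hat) ** 2))
```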
Two-class discrimination:
y^t = \text{sigmoid}\!\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)
E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[ r^t \log y^t + (1 - r^t) \log(1 - y^t) \right]
\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t
\Delta w_{hj} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t
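A minimal sketch of these two-class updates in Python; the synthetic data, number of hidden units, and learning rate are assumptions. Note the updates have the same form as in regression, even though the error is now cross-entropy.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
R = (X[:, 0] * X[:, 1] > 0).astype(float)     # hypothetical two-class labels
H, eta = 5, 0.1
W = rng.normal(scale=0.5, size=(H, 3))
v = rng.normal(scale=0.5, size=H + 1)

for epoch in range(500):
    for x, r in zip(X, R):                    # online updates
        x = np.concatenate(([1.0], x))
        z = np.concatenate(([1.0], sigmoid(W @ x)))
        y = sigmoid(v @ z)                    # y^t = sigmoid(v^T z)
        dv = eta * (r - y) * z                # Delta v_h  = eta (r - y) z_h
        # Delta w_hj = eta (r - y) v_h z_h (1 - z_h) x_j
        W += eta * (r - y) * np.outer(v[1:] * z[1:] * (1 - z[1:]), x)
        v += dv

Z = sigmoid(np.column_stack([np.ones(len(X)), X]) @ W.T)
Y = sigmoid(np.column_stack([np.ones(len(X)), Z]) @ v)
print("training accuracy:", np.mean((Y > 0.5) == (R > 0.5)))
```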