MLP Chap11

Chapter 11 of 'Introduction to Machine Learning' discusses Multilayer Perceptrons (MLPs) and their architecture, including the structure of neurons and connections that enable parallel processing and robust computation. It covers the training processes for MLPs, including online and batch learning, as well as the backpropagation algorithm for optimizing weights. The chapter also explores concepts like autoencoders and the application of MLPs in representing complex functions such as XOR.


Lecture Slides for
INTRODUCTION TO Machine Learning 4e
ETHEM ALPAYDIN
The MIT Press, 2020

[email protected]

CHAPTER 11:
Multilayer Perceptrons
Neural Networks 3

• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: ~10^10
• Large connectivity: ~10^5
• Parallel processing
• Distributed computation/memory
• Robust to noise, failures
Understanding the Brain 4

• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: From hardware to theory
• Parallel processing: SIMD vs MIMD
• Neural net: SIMD with modifiable local memory
• Learning: Update by training/experience
Perceptron 5
y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}

\mathbf{w} = [w_0, w_1, \ldots, w_d]^T
\mathbf{x} = [1, x_1, \ldots, x_d]^T

(Rosenblatt, 1962)
What a Perceptron Does 6

• Regression: y = wx + w_0
• Classification: y = 1(wx + w_0 > 0)

[Figure: the perceptron as a line fit (regression) and as a thresholded line or sigmoid discriminant (classification); the bias w_0 is handled by the extra input x_0 = +1]

y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
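A minimal NumPy sketch (not from the book) of what a single perceptron computes in the two settings above; the weights and the input are made-up numbers, and x_0 = +1 is prepended to handle the bias w_0.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear perceptron output: y = w^T x, with x[0] = 1 as the bias input."""
    return w @ x

def sigmoid(o):
    """Logistic sigmoid: 1 / (1 + exp(-o))."""
    return 1.0 / (1.0 + np.exp(-o))

# w = [w0, w1, ..., wd], x = [1, x1, ..., xd]  (x0 = +1 carries the bias w0)
w = np.array([0.5, -1.0, 2.0])      # illustrative weights
x = np.array([1.0, 0.3, 0.8])       # input with the bias term prepended

y_regression = perceptron_output(w, x)              # regression: y = w^T x
y_class = 1 if perceptron_output(w, x) > 0 else 0   # classification: y = 1(w^T x > 0)
p_class = sigmoid(perceptron_output(w, x))          # posterior estimate: sigmoid(w^T x)
print(y_regression, y_class, p_class)
```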
K Outputs 7
Regression:

y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}

\mathbf{y} = \mathbf{W}\mathbf{x}

Classification:

o_i = \mathbf{w}_i^T \mathbf{x}

y_i = \frac{\exp o_i}{\sum_k \exp o_k}

Choose C_i if y_i = \max_k y_k
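A sketch, with illustrative weights, of K linear outputs followed by softmax and the choose-the-maximum rule:

```python
import numpy as np

def softmax(o):
    """Softmax: y_i = exp(o_i) / sum_k exp(o_k), shifted for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

# W has one row w_i per output/class; x = [1, x1, ..., xd]
W = np.array([[ 0.1,  1.0, -0.5],
              [ 0.3, -0.2,  0.8],
              [-0.4,  0.5,  0.1]])   # illustrative weights for K = 3 classes
x = np.array([1.0, 0.6, -1.2])

o = W @ x                # o_i = w_i^T x (for regression we would stop here: y = Wx)
y = softmax(o)           # class posteriors
choice = np.argmax(y)    # choose C_i if y_i = max_k y_k
print(y, "-> choose class", choice)
```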
Training 8
• Online (instances seen one by one) vs batch (whole sample) learning:
  • No need to store the whole sample
  • Problem may change in time
  • Wear and degradation in system components
• Stochastic gradient-descent: Update after a single pattern
• Generic update rule (LMS rule):

\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t

Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
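A sketch of the generic online update applied to a stream of patterns; the toy targets, learning factor, and dimensions below are arbitrary illustrative choices, not anything prescribed by the slides.

```python
import numpy as np

def lms_update(W, x, r, eta=0.1):
    """Generic update rule: Delta w_ij = eta * (r_i - y_i) * x_j, applied in place."""
    y = W @ x                        # actual outputs
    W += eta * np.outer(r - y, x)    # Update = LearningFactor * (Desired - Actual) * Input
    return W

# Online learning: patterns are seen one by one, so the sample never has to be stored.
rng = np.random.default_rng(0)
W = np.zeros((2, 4))                               # 2 outputs, inputs x = [1, x1, x2, x3]
for _ in range(1000):                              # a stream of (x, r) pairs
    x = np.concatenate(([1.0], rng.normal(size=3)))
    r = np.array([x[1] + 2 * x[2], -x[3]])         # toy targets a linear model can fit
    W = lms_update(W, x, r)
print(np.round(W, 2))   # close to [[0, 1, 2, 0], [0, 0, 0, -1]]
```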
Training a Perceptron:
Regression 9

• Regression (Linear output):

E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}(r^t - y^t)^2 = \tfrac{1}{2}\left[r^t - \mathbf{w}^T\mathbf{x}^t\right]^2

\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t
Classification 10

• Single sigmoid output:

y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)

E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)

\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t

• K > 2 softmax outputs:

y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}

E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t

\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t
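A sketch of the single-sigmoid-output case trained with the update Δw_j = η(r^t − y^t)x_j^t; as an illustration it is run on the Boolean AND task of the next slide, with an arbitrary learning rate and epoch count.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Boolean AND: inputs with x0 = +1 prepended, targets r
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(3)
eta = 0.5
for epoch in range(1000):
    for x_t, r_t in zip(X, r):         # online: one pattern at a time
        y_t = sigmoid(w @ x_t)         # y^t = sigmoid(w^T x^t)
        w += eta * (r_t - y_t) * x_t   # Delta w_j = eta * (r^t - y^t) * x_j^t
print(np.round(sigmoid(X @ w)))        # approaches [0, 0, 0, 1]
```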
Learning Boolean AND 11
XOR 12

• No w_0, w_1, w_2 satisfy:

w_0 \le 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 \le 0

(Minsky and Papert, 1969)
Multilayer Perceptrons 13

y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

(Rumelhart et al., 1986)
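A sketch of the forward computation these equations describe, for one hidden layer with H sigmoid hidden units and a linear output; H and the random initial weights are placeholders.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def mlp_forward(W, v, x):
    """One hidden layer: z_h = sigmoid(w_h^T x), y = sum_h v_h z_h + v_0."""
    z = sigmoid(W @ np.concatenate(([1.0], x)))   # hidden units; column 0 of W holds w_h0
    y = v @ np.concatenate(([1.0], z))            # linear output; v[0] is the bias v_0
    return y, z

rng = np.random.default_rng(0)
d, H = 2, 3                                 # number of inputs and of hidden units
W = rng.uniform(-0.5, 0.5, (H, d + 1))      # first-layer weights w_hj
v = rng.uniform(-0.5, 0.5, H + 1)           # second-layer weights v_h
y, z = mlp_forward(W, v, np.array([0.5, -1.0]))
print(y, z)
```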
14

x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

Book excerpt shown on the slide (p. 286):

Table 11.3  The MLP that implements XOR, with two hidden units that implement the two ANDs and the output that takes an OR of them.

x1   x2   z1   z2   y
0    0    0    0    0
0    1    0    1    1
1    0    1    0    1
1    1    0    0    0

One is not limited to having one hidden layer, and more layers of hidden units with their own incoming weights can be placed after the first layer with sigmoid hidden units, thus calculating nonlinear functions of the first layer of hidden units and implementing more complex functions of the inputs. We will discuss such “deep” networks in [...]

11.6 MLP as a Universal Approximator

We can represent any Boolean function as a disjunction of [...]
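To make the table concrete, here is one hand-picked set of weights (not unique, and not taken from the book) for which two sigmoid hidden units approximate x1 AND ~x2 and ~x1 AND x2, and the output unit approximates their OR:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Rows are [w_h0, w_h1, w_h2]: z1 ~ x1 AND NOT x2, z2 ~ NOT x1 AND x2
W = np.array([[-10.0,  20.0, -20.0],
              [-10.0, -20.0,  20.0]])
v = np.array([-10.0, 20.0, 20.0])          # [v_0, v_1, v_2]: y ~ z1 OR z2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = sigmoid(W @ np.array([1.0, x1, x2]))
    y = sigmoid(v @ np.concatenate(([1.0], z)))
    print(x1, x2, np.round(z, 3), round(float(y)))   # reproduces the table: 0, 1, 1, 0
```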
Backpropagation 15

y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}
Regression 16

E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2}\sum_t (r^t - y^t)^2

Forward:

y^t = \sum_{h=1}^{H} v_h z_h^t + v_0

z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t)

Backward:

\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t

\Delta w_{hj} = -\eta\,\frac{\partial E}{\partial w_{hj}}
             = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}}
             = -\eta \sum_t -(r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
             = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
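A batch-mode sketch of this forward/backward pass for a single-output regression MLP; the noisy toy data and all hyperparameters (H, η, number of epochs) are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
N, H, eta = 50, 10, 0.005
X = np.linspace(-3, 3, N)[:, None]                 # 1-D inputs x^t
r = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)     # noisy toy targets r^t

W = rng.uniform(-0.01, 0.01, (H, 2))               # w_hj; column 0 is the bias w_h0
v = rng.uniform(-0.01, 0.01, H + 1)                # v_h; index 0 is the bias v_0
X1 = np.hstack([np.ones((N, 1)), X])               # prepend x_0 = +1

for epoch in range(10000):
    Z = sigmoid(X1 @ W.T)                          # forward: z_h^t
    Z1 = np.hstack([np.ones((N, 1)), Z])
    y = Z1 @ v                                     # forward: y^t
    err = r - y                                    # (r^t - y^t)
    dv = eta * Z1.T @ err                          # Delta v_h  = eta * sum_t (r^t - y^t) z_h^t
    dW = eta * (np.outer(err, v[1:]) * Z * (1 - Z)).T @ X1
    # Delta w_hj = eta * sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
    v += dv
    W += dW

Z = sigmoid(X1 @ W.T)
y = np.hstack([np.ones((N, 1)), Z]) @ v
print("training MSE:", np.mean((r - y) ** 2))
```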
Regression with Multiple
Outputs 17

E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2}\sum_t \sum_i (r_i^t - y_i^t)^2

y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}

\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t

\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t

[Figure: network with inputs x_j, hidden units z_h (first-layer weights w_hj), and outputs y_i (second-layer weights v_ih)]
1d Regression: Convergence 19
[Figure slide: convergence of the fit during training of a 1-D regression MLP]

Learning Hidden Representations 26

• As we discussed before, an MLP is a generalized linear model where the hidden units are the nonlinear basis functions:

(11.35)   y = \sum_{h=1}^{H} v_h\,\phi(\mathbf{x} \mid \mathbf{w}_h), \qquad \text{where } \phi(\mathbf{x} \mid \mathbf{w}_h) \equiv \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})

• The advantage is that the basis function parameters can also be learned from data.
• The hidden units, z_h, learn a code/embedding, a representation in the hidden space.
• Transfer learning: Use of the code in another task.
• Semisupervised learning: Transfer from an unsupervised to a supervised problem.

Book excerpt shown on the slide: "[...] where φ_h(x) are the basis functions and we have H of them. These basis functions define an H-dimensional space where the problem can be solved using a linear model. One possibility is for us to specify the basis functions ourselves, using whatever transformations of the input we believe will help. Polynomial basis functions, for example, are generally useful. Such fixed, prespecified basis functions also underlie support vector machines that we will discuss in chapter 14. As we discussed before, an MLP can be written similarly; equation 11.34 becomes equation 11.35 above. All the basis functions have the same functional form, but they have parameters that are modifiable, which in the case of the MLP corresponds to the incoming weights of the hidden units (in the first layer). In chapter 13, we will discuss the radial basis function network, which is also written as equation 11.35, but φ(x|w_h) is defined differently. The advantage of an MLP is that we can learn w_h (using backpropagation); that is, we can finetune the basis functions to the data instead of assuming them a priori. Geometrically speaking, we learn the dimensions of a new H-dimensional space such that the problem is linearly separable when mapped there. In the case of XOR, for example, the problem is not linearly separable in the input space, but [...]"
Autoencoders 28

[Figure: x → Encoder → z → Decoder → x̂]

• In the autoencoder, there are as many outputs as there are inputs, and the desired outputs are set to be equal to the inputs. When the number of hidden units is less than the number of inputs, the multilayer perceptron is trained to find the best coding of the inputs on the hidden units, performing dimensionality reduction.
• We use backpropagation to find the best encoder and decoder parameters so that the reconstruction error is minimized:

(11.38)   E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \sum_t \|\mathbf{x}^t - \hat{\mathbf{x}}^t\|^2 = \sum_t \|\mathbf{x}^t - \text{Dec}(\text{Enc}(\mathbf{x}^t \mid \mathbf{W}) \mid \mathbf{V})\|^2

• Variants: Denoising, sparse autoencoders

Book excerpt shown on the slide (section 11.10, Autoencoders): "An interesting architecture is the autoencoder (Cottrell, Munro, and Zipser 1987), which is an MLP architecture where there are as many outputs as there are inputs, and the required output is defined to be equal to the input (see figure 11.15). When the number of hidden units is less than the number of inputs, this implies dimensionality reduction. The first layer from the input to the hidden layer acts as the encoder, and the values of the hidden units make up the code. The MLP is trained to find the best representation of the input in the hidden layer to be able to reproduce it again at the output layer. The encoder can be written as

(11.36)   \mathbf{z}^t = \text{Enc}(\mathbf{x}^t \mid \mathbf{W})

where W are the parameters of the encoder. The second layer from the hidden units to the output units acts as the decoder, reconstructing the original signal from its encoded representation:

(11.37)   \hat{\mathbf{x}}^t = \text{Dec}(\mathbf{z}^t \mid \mathbf{V})

where V are the parameters of the decoder. [...] In designing the encoder, we choose a code of small dimensionality and/or use an encoder of low capacity to force the encoder to preserve the most important characteristics of the data. The part that is discarded is due to noise or factors that are unimportant for the task."
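A sketch of an autoencoder along these lines: a one-hidden-layer MLP with fewer hidden units than inputs, trained with backpropagation so that its output reproduces its input. The toy data, the linear decoder, and all hyperparameters are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
# Toy data: 100 points in 5-D that (almost) live on a 2-D subspace
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

d, H, eta = 5, 2, 0.005                     # code dimension H < d -> dimensionality reduction
W = rng.uniform(-0.1, 0.1, (H, d + 1))      # encoder parameters W
V = rng.uniform(-0.1, 0.1, (d, H + 1))      # decoder parameters V
X1 = np.hstack([np.ones((100, 1)), X])      # inputs with x_0 = +1

for epoch in range(3000):
    Z = sigmoid(X1 @ W.T)                   # z^t = Enc(x^t | W)
    Z1 = np.hstack([np.ones((100, 1)), Z])
    Xhat = Z1 @ V.T                         # xhat^t = Dec(z^t | V), a linear decoder here
    err = X - Xhat                          # reconstruction error, per output
    dZ = (err @ V[:, 1:]) * Z * (1 - Z)     # error backpropagated through the sigmoid code
    V += eta * err.T @ Z1                   # gradient step on sum_t ||x^t - xhat^t||^2 / 2
    W += eta * dZ.T @ X1

print("reconstruction MSE:", np.mean((X - Xhat) ** 2))
```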
Word2vec 29

• Learn an embedding for words for NLP
• Skip-gram: An autoencoder with linear encoder and decoder where the input is the center word and the output is a context word
• Similar words appear in similar contexts, so similar codes will be learned for them

[Figure: CBOW — the context words and the center word]

Book excerpt shown on the slide: "[...] This is exactly an autoencoder. Given a sentence and a center word, the input is the context and the output is the center word. The context uses a window that includes a number of words on both sides of the center word, for example, five on each side. The two models differ in the way they define the context. In CBOW, short for continuous bag of words, we average the binary representations of all the words in the window and that is given as input. That is, the d-dimensional input will have nonzero values for all the words that are part of the context. In skip-gram, we take the words in the context one at a time, and they form different training pairs where, again, the aim is to predict the center word. Thus, in skip-gram there is only one nonzero element in each input pair, but there will be multiple such pairs. For example, assume we have the sentence

  “Protesters in Paris clash with the French police.”

and “Paris” as the center word. In CBOW, the input is the bag of words representation of [“Protesters”, “in”, “clash”, “with”, “the”, “French”, “police”] and the target is the bag of words representation of “Paris”. In skip-gram, this sentence generates seven training pairs; the first has “Protesters” as input and “Paris” as target, the second has “in” as input and “Paris” as target, the third has “clash” as input and “Paris” as target, up until the last one, which has “police” as input and “Paris” as target. The center word slides, and [...]"
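A sketch of how the example sentence above is turned into training data under the two context definitions; tokenization and windowing are simplified (the "window" here is simply all other words in the sentence).

```python
sentence = "Protesters in Paris clash with the French police".split()
center = "Paris"
context = [w for w in sentence if w != center]

# CBOW: one training example; the input is the bag of all context words
# (their binary representations averaged), the target is the center word.
cbow_example = (context, center)

# Skip-gram: one (context word, center word) pair per context word,
# so this sentence and center word generate seven training pairs.
skipgram_pairs = [(w, center) for w in context]

print(cbow_example)
for pair in skipgram_pairs:
    print(pair)
```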
Vector Algebra 30
• Because they will always appear in similar contexts in pairs in a large corpus, we expect

  vec(“English”) − vec(“London”)

  to be similar to

  vec(“French”) − vec(“Paris”)

• So,

  vec(“Paris”) + vec(“English”) − vec(“London”)

  will be similar to

  vec(“French”)

[Figure: word2vec example (hypothetical) showing “English”, “French”, “London”, and “Paris” in the code space]
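A sketch of this analogy arithmetic with made-up two-dimensional codes (real word2vec embeddings are learned and much higher-dimensional); cosine similarity is used to compare the combined vector with each candidate.

```python
import numpy as np

# Hypothetical 2-D codes, chosen so the language and capital-city directions line up
vec = {
    "English": np.array([4.0, 1.0]),
    "French":  np.array([4.2, 3.0]),
    "London":  np.array([1.0, 1.1]),
    "Paris":   np.array([1.2, 3.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = vec["Paris"] + vec["English"] - vec["London"]   # expected to be near vec("French")
for word, v in vec.items():
    print(word, round(cosine(query, v), 3))             # "French" scores highest
```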


Two-Class Discrimination 21

• One sigmoid output, y^t, for P(C_1 \mid \mathbf{x}^t); then P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t

y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)

E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[r^t \log y^t + (1 - r^t)\log(1 - y^t)\right]

\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t

\Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
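A sketch of one batch update implementing these formulas for a two-class MLP; note that with the cross-entropy error the output's error term is again simply (r^t − y^t). The task (XOR), the initialization, and the learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def two_class_epoch(W, v, X1, r, eta=0.3):
    """One batch update: sigmoid output, cross-entropy error (equations above)."""
    Z = sigmoid(X1 @ W.T)                        # hidden units z_h^t
    Z1 = np.hstack([np.ones((len(X1), 1)), Z])
    y = sigmoid(Z1 @ v)                          # y^t = P(C1 | x^t); P(C2 | x^t) = 1 - y^t
    err = r - y                                  # cross-entropy gives the simple (r - y) term
    dW = eta * (np.outer(err, v[1:]) * Z * (1 - Z)).T @ X1   # Delta w_hj
    dv = eta * Z1.T @ err                        # Delta v_h = eta * sum_t (r^t - y^t) z_h^t
    return W + dW, v + dv

# Illustration on XOR (labels 0/1), with x_0 = +1 prepended; H = 4 hidden units
X1 = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0.0, 1.0, 1.0, 0.0])
rng = np.random.default_rng(1)
W, v = rng.uniform(-0.5, 0.5, (4, 3)), rng.uniform(-0.5, 0.5, 5)
for _ in range(10000):
    W, v = two_class_epoch(W, v, X1, r)

y = sigmoid(np.hstack([np.ones((4, 1)), sigmoid(X1 @ W.T)]) @ v)
print(np.round(y, 2))   # usually close to [0, 1, 1, 0]; plain gradient descent can occasionally get stuck
```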
