MLP Chap11

Chapter 11 of 'Introduction to Machine Learning' discusses Multilayer Perceptrons (MLPs) and their architecture, including the structure of neurons and connections that enable parallel processing and robust computation. It covers the training processes for MLPs, including online and batch learning, as well as the backpropagation algorithm for optimizing weights. The chapter also explores concepts like autoencoders and the application of MLPs in representing complex functions such as XOR.


Lecture Slides for
INTRODUCTION TO Machine Learning 4e
ETHEM ALPAYDIN
The MIT Press, 2020

[email protected]

CHAPTER 11:
Multilayer Perceptrons
Neural Networks 3

• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: ~10^10
• Large connectivity: ~10^5
• Parallel processing
• Distributed computation/memory
• Robust to noise, failures
Understanding the Brain 4

• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: From hardware to theory
• Parallel processing: SIMD vs MIMD
• Neural net: SIMD with modifiable local memory
• Learning: Update by training/experience
Perceptron 5
y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}

\mathbf{w} = [w_0, w_1, \ldots, w_d]^T
\mathbf{x} = [1, x_1, \ldots, x_d]^T

(Rosenblatt, 1962)
What a Perceptron Does 6

• Regression: y = wx + w_0
• Classification: y = 1(wx + w_0 > 0)

[Figure: the perceptron as a line fit (regression) and as a thresholded line or sigmoid discriminant (classification); the bias w_0 is handled by the extra input x_0 = +1]

y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}
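A minimal NumPy sketch (not from the book) of what a single perceptron computes in the two settings above; the weights and the input are made-up numbers, and x_0 = +1 is prepended to handle the bias w_0.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear perceptron output: y = w^T x, with x[0] = 1 as the bias input."""
    return w @ x

def sigmoid(o):
    """Logistic sigmoid: 1 / (1 + exp(-o))."""
    return 1.0 / (1.0 + np.exp(-o))

# w = [w0, w1, ..., wd], x = [1, x1, ..., xd]  (x0 = +1 carries the bias w0)
w = np.array([0.5, -1.0, 2.0])      # illustrative weights
x = np.array([1.0, 0.3, 0.8])       # input with the bias term prepended

y_regression = perceptron_output(w, x)              # regression: y = w^T x
y_class = 1 if perceptron_output(w, x) > 0 else 0   # classification: y = 1(w^T x > 0)
p_class = sigmoid(perceptron_output(w, x))          # posterior estimate: sigmoid(w^T x)
print(y_regression, y_class, p_class)
```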
K Outputs 7
Regression:

y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x}

\mathbf{y} = \mathbf{W}\mathbf{x}

Classification:

o_i = \mathbf{w}_i^T \mathbf{x}

y_i = \frac{\exp o_i}{\sum_k \exp o_k}

Choose C_i if y_i = \max_k y_k
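A sketch, with illustrative weights, of K linear outputs followed by softmax and the choose-the-maximum rule:

```python
import numpy as np

def softmax(o):
    """Softmax: y_i = exp(o_i) / sum_k exp(o_k), shifted for numerical stability."""
    e = np.exp(o - np.max(o))
    return e / e.sum()

# W has one row w_i per output/class; x = [1, x1, ..., xd]
W = np.array([[ 0.1,  1.0, -0.5],
              [ 0.3, -0.2,  0.8],
              [-0.4,  0.5,  0.1]])   # illustrative weights for K = 3 classes
x = np.array([1.0, 0.6, -1.2])

o = W @ x                # o_i = w_i^T x (for regression we would stop here: y = Wx)
y = softmax(o)           # class posteriors
choice = np.argmax(y)    # choose C_i if y_i = max_k y_k
print(y, "-> choose class", choice)
```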
Training 8
• Online (instances seen one by one) vs batch (whole sample) learning:
  • No need to store the whole sample
  • Problem may change in time
  • Wear and degradation in system components
• Stochastic gradient-descent: Update after a single pattern
• Generic update rule (LMS rule):

\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t

Update = LearningFactor × (DesiredOutput − ActualOutput) × Input
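A sketch of the generic online update applied to a stream of patterns; the toy targets, learning factor, and dimensions below are arbitrary illustrative choices, not anything prescribed by the slides.

```python
import numpy as np

def lms_update(W, x, r, eta=0.1):
    """Generic update rule: Delta w_ij = eta * (r_i - y_i) * x_j, applied in place."""
    y = W @ x                        # actual outputs
    W += eta * np.outer(r - y, x)    # Update = LearningFactor * (Desired - Actual) * Input
    return W

# Online learning: patterns are seen one by one, so the sample never has to be stored.
rng = np.random.default_rng(0)
W = np.zeros((2, 4))                               # 2 outputs, inputs x = [1, x1, x2, x3]
for _ in range(1000):                              # a stream of (x, r) pairs
    x = np.concatenate(([1.0], rng.normal(size=3)))
    r = np.array([x[1] + 2 * x[2], -x[3]])         # toy targets a linear model can fit
    W = lms_update(W, x, r)
print(np.round(W, 2))   # close to [[0, 1, 2, 0], [0, 0, 0, -1]]
```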
Training a Perceptron:
Regression 9

• Regression (Linear output):

E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \tfrac{1}{2}(r^t - y^t)^2 = \tfrac{1}{2}\left[r^t - \mathbf{w}^T\mathbf{x}^t\right]^2

\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t
Classification 10

• Single sigmoid output:

y^t = \text{sigmoid}(\mathbf{w}^T \mathbf{x}^t)

E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)

\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t

• K > 2 softmax outputs:

y_i^t = \frac{\exp \mathbf{w}_i^T \mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T \mathbf{x}^t}

E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t

\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t
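A sketch of the single-sigmoid-output case trained with the update Δw_j = η(r^t − y^t)x_j^t; as an illustration it is run on the Boolean AND task of the next slide, with an arbitrary learning rate and epoch count.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Boolean AND: inputs with x0 = +1 prepended, targets r
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(3)
eta = 0.5
for epoch in range(1000):
    for x_t, r_t in zip(X, r):         # online: one pattern at a time
        y_t = sigmoid(w @ x_t)         # y^t = sigmoid(w^T x^t)
        w += eta * (r_t - y_t) * x_t   # Delta w_j = eta * (r^t - y^t) * x_j^t
print(np.round(sigmoid(X @ w)))        # approaches [0, 0, 0, 1]
```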
Learning Boolean AND 11
XOR 12

• No w_0, w_1, w_2 satisfy:

w_0 \le 0
w_2 + w_0 > 0
w_1 + w_0 > 0
w_1 + w_2 + w_0 \le 0

(Minsky and Papert, 1969)
Multilayer Perceptrons 13

y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

(Rumelhart et al., 1986)
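A sketch of the forward computation these equations describe, for one hidden layer with H sigmoid hidden units and a linear output; H and the random initial weights are placeholders.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def mlp_forward(W, v, x):
    """One hidden layer: z_h = sigmoid(w_h^T x), y = sum_h v_h z_h + v_0."""
    z = sigmoid(W @ np.concatenate(([1.0], x)))   # hidden units; column 0 of W holds w_h0
    y = v @ np.concatenate(([1.0], z))            # linear output; v[0] is the bias v_0
    return y, z

rng = np.random.default_rng(0)
d, H = 2, 3                                 # number of inputs and of hidden units
W = rng.uniform(-0.5, 0.5, (H, d + 1))      # first-layer weights w_hj
v = rng.uniform(-0.5, 0.5, H + 1)           # second-layer weights v_h
y, z = mlp_forward(W, v, np.array([0.5, -1.0]))
print(y, z)
```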
14

x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

Book excerpt shown on the slide (p. 286):

Table 11.3  The MLP that implements XOR, with two hidden units that implement the two ANDs and the output that takes an OR of them.

x1   x2   z1   z2   y
0    0    0    0    0
0    1    0    1    1
1    0    1    0    1
1    1    0    0    0

One is not limited to having one hidden layer, and more layers of hidden units with their own incoming weights can be placed after the first layer with sigmoid hidden units, thus calculating nonlinear functions of the first layer of hidden units and implementing more complex functions of the inputs. We will discuss such “deep” networks in [...]

11.6 MLP as a Universal Approximator

We can represent any Boolean function as a disjunction of [...]
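To make the table concrete, here is one hand-picked set of weights (not unique, and not taken from the book) for which two sigmoid hidden units approximate x1 AND ~x2 and ~x1 AND x2, and the output unit approximates their OR:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Rows are [w_h0, w_h1, w_h2]: z1 ~ x1 AND NOT x2, z2 ~ NOT x1 AND x2
W = np.array([[-10.0,  20.0, -20.0],
              [-10.0, -20.0,  20.0]])
v = np.array([-10.0, 20.0, 20.0])          # [v_0, v_1, v_2]: y ~ z1 OR z2

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = sigmoid(W @ np.array([1.0, x1, x2]))
    y = sigmoid(v @ np.concatenate(([1.0], z)))
    print(x1, x2, np.round(z, 3), round(float(y)))   # reproduces the table: 0, 1, 1, 0
```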
Backpropagation 15

y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}

z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}

\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}
Regression 16

E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2}\sum_t (r^t - y^t)^2

Forward:

y^t = \sum_{h=1}^{H} v_h z_h^t + v_0

z_h^t = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}^t)

Backward:

\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t

\Delta w_{hj} = -\eta\,\frac{\partial E}{\partial w_{hj}}
             = -\eta \sum_t \frac{\partial E}{\partial y^t}\,\frac{\partial y^t}{\partial z_h^t}\,\frac{\partial z_h^t}{\partial w_{hj}}
             = -\eta \sum_t -(r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
             = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
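A batch-mode sketch of this forward/backward pass for a single-output regression MLP; the noisy toy data and all hyperparameters (H, η, number of epochs) are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
N, H, eta = 50, 10, 0.005
X = np.linspace(-3, 3, N)[:, None]                 # 1-D inputs x^t
r = np.sin(X[:, 0]) + 0.1 * rng.normal(size=N)     # noisy toy targets r^t

W = rng.uniform(-0.01, 0.01, (H, 2))               # w_hj; column 0 is the bias w_h0
v = rng.uniform(-0.01, 0.01, H + 1)                # v_h; index 0 is the bias v_0
X1 = np.hstack([np.ones((N, 1)), X])               # prepend x_0 = +1

for epoch in range(10000):
    Z = sigmoid(X1 @ W.T)                          # forward: z_h^t
    Z1 = np.hstack([np.ones((N, 1)), Z])
    y = Z1 @ v                                     # forward: y^t
    err = r - y                                    # (r^t - y^t)
    dv = eta * Z1.T @ err                          # Delta v_h  = eta * sum_t (r^t - y^t) z_h^t
    dW = eta * (np.outer(err, v[1:]) * Z * (1 - Z)).T @ X1
    # Delta w_hj = eta * sum_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t
    v += dv
    W += dW

Z = sigmoid(X1 @ W.T)
y = np.hstack([np.ones((N, 1)), Z]) @ v
print("training MSE:", np.mean((r - y) ** 2))
```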
Regression with Multiple
Outputs 17

E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2}\sum_t \sum_i (r_i^t - y_i^t)^2

y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}

\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\,z_h^t

\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\,v_{ih}\right] z_h^t(1 - z_h^t)\,x_j^t

[Figure: network with inputs x_j, hidden units z_h (first-layer weights w_hj), and outputs y_i (second-layer weights v_ih)]
1d Regression: Convergence 19
[Figure slide: convergence of the fit during training of a 1-D regression MLP]

Learning Hidden Representations 26

• As we discussed before, an MLP is a generalized linear model where the hidden units are the nonlinear basis functions:

(11.35)   y = \sum_{h=1}^{H} v_h\,\phi(\mathbf{x} \mid \mathbf{w}_h), \qquad \text{where } \phi(\mathbf{x} \mid \mathbf{w}_h) \equiv \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})

• The advantage is that the basis function parameters can also be learned from data.
• The hidden units, z_h, learn a code/embedding, a representation in the hidden space.
• Transfer learning: Use of the code in another task.
• Semisupervised learning: Transfer from an unsupervised to a supervised problem.

Book excerpt shown on the slide: "[...] where φ_h(x) are the basis functions and we have H of them. These basis functions define an H-dimensional space where the problem can be solved using a linear model. One possibility is for us to specify the basis functions ourselves, using whatever transformations of the input we believe will help. Polynomial basis functions, for example, are generally useful. Such fixed, prespecified basis functions also underlie support vector machines that we will discuss in chapter 14. As we discussed before, an MLP can be written similarly; equation 11.34 becomes equation 11.35 above. All the basis functions have the same functional form, but they have parameters that are modifiable, which in the case of the MLP corresponds to the incoming weights of the hidden units (in the first layer). In chapter 13, we will discuss the radial basis function network, which is also written as equation 11.35, but φ(x|w_h) is defined differently. The advantage of an MLP is that we can learn w_h (using backpropagation); that is, we can finetune the basis functions to the data instead of assuming them a priori. Geometrically speaking, we learn the dimensions of a new H-dimensional space such that the problem is linearly separable when mapped there. In the case of XOR, for example, the problem is not linearly separable in the input space, but [...]"
Autoencoders 28

[Figure: x → Encoder → z → Decoder → x̂]

• In the autoencoder, there are as many outputs as there are inputs, and the desired outputs are set to be equal to the inputs. When the number of hidden units is less than the number of inputs, the multilayer perceptron is trained to find the best coding of the inputs on the hidden units, performing dimensionality reduction.
• We use backpropagation to find the best encoder and decoder parameters so that the reconstruction error is minimized:

(11.38)   E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \sum_t \|\mathbf{x}^t - \hat{\mathbf{x}}^t\|^2 = \sum_t \|\mathbf{x}^t - \text{Dec}(\text{Enc}(\mathbf{x}^t \mid \mathbf{W}) \mid \mathbf{V})\|^2

• Variants: Denoising, sparse autoencoders

Book excerpt shown on the slide (section 11.10, Autoencoders): "An interesting architecture is the autoencoder (Cottrell, Munro, and Zipser 1987), which is an MLP architecture where there are as many outputs as there are inputs, and the required output is defined to be equal to the input (see figure 11.15). When the number of hidden units is less than the number of inputs, this implies dimensionality reduction. The first layer from the input to the hidden layer acts as the encoder, and the values of the hidden units make up the code. The MLP is trained to find the best representation of the input in the hidden layer to be able to reproduce it again at the output layer. The encoder can be written as

(11.36)   \mathbf{z}^t = \text{Enc}(\mathbf{x}^t \mid \mathbf{W})

where W are the parameters of the encoder. The second layer from the hidden units to the output units acts as the decoder, reconstructing the original signal from its encoded representation:

(11.37)   \hat{\mathbf{x}}^t = \text{Dec}(\mathbf{z}^t \mid \mathbf{V})

where V are the parameters of the decoder. [...] In designing the encoder, we choose a code of small dimensionality and/or use an encoder of low capacity to force the encoder to preserve the most important characteristics of the data. The part that is discarded is due to noise or factors that are unimportant for the task."
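A sketch of an autoencoder along these lines: a one-hidden-layer MLP with fewer hidden units than inputs, trained with backpropagation so that its output reproduces its input. The toy data, the linear decoder, and all hyperparameters are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

rng = np.random.default_rng(0)
# Toy data: 100 points in 5-D that (almost) live on a 2-D subspace
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(100, 5))

d, H, eta = 5, 2, 0.005                     # code dimension H < d -> dimensionality reduction
W = rng.uniform(-0.1, 0.1, (H, d + 1))      # encoder parameters W
V = rng.uniform(-0.1, 0.1, (d, H + 1))      # decoder parameters V
X1 = np.hstack([np.ones((100, 1)), X])      # inputs with x_0 = +1

for epoch in range(3000):
    Z = sigmoid(X1 @ W.T)                   # z^t = Enc(x^t | W)
    Z1 = np.hstack([np.ones((100, 1)), Z])
    Xhat = Z1 @ V.T                         # xhat^t = Dec(z^t | V), a linear decoder here
    err = X - Xhat                          # reconstruction error, per output
    dZ = (err @ V[:, 1:]) * Z * (1 - Z)     # error backpropagated through the sigmoid code
    V += eta * err.T @ Z1                   # gradient step on sum_t ||x^t - xhat^t||^2 / 2
    W += eta * dZ.T @ X1

print("reconstruction MSE:", np.mean((X - Xhat) ** 2))
```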
Word2vec 29

• Learn an embedding for words for NLP
• Skip-gram: An autoencoder with linear encoder and decoder where the input is the center word and the output is a context word
• Similar words appear in similar contexts, so similar codes will be learned for them

[Figure: CBOW — the context words and the center word]

Book excerpt shown on the slide: "[...] This is exactly an autoencoder. Given a sentence and a center word, the input is the context and the output is the center word. The context uses a window that includes a number of words on both sides of the center word, for example, five on each side. The two models differ in the way they define the context. In CBOW, short for continuous bag of words, we average the binary representations of all the words in the window and that is given as input. That is, the d-dimensional input will have nonzero values for all the words that are part of the context. In skip-gram, we take the words in the context one at a time, and they form different training pairs where, again, the aim is to predict the center word. Thus, in skip-gram there is only one nonzero element in each input pair, but there will be multiple such pairs. For example, assume we have the sentence

  “Protesters in Paris clash with the French police.”

and “Paris” as the center word. In CBOW, the input is the bag of words representation of [“Protesters”, “in”, “clash”, “with”, “the”, “French”, “police”] and the target is the bag of words representation of “Paris”. In skip-gram, this sentence generates seven training pairs; the first has “Protesters” as input and “Paris” as target, the second has “in” as input and “Paris” as target, the third has “clash” as input and “Paris” as target, up until the last one, which has “police” as input and “Paris” as target. The center word slides, and [...]"
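A sketch of how the example sentence above is turned into training data under the two context definitions; tokenization and windowing are simplified (the "window" here is simply all other words in the sentence).

```python
sentence = "Protesters in Paris clash with the French police".split()
center = "Paris"
context = [w for w in sentence if w != center]

# CBOW: one training example; the input is the bag of all context words
# (their binary representations averaged), the target is the center word.
cbow_example = (context, center)

# Skip-gram: one (context word, center word) pair per context word,
# so this sentence and center word generate seven training pairs.
skipgram_pairs = [(w, center) for w in context]

print(cbow_example)
for pair in skipgram_pairs:
    print(pair)
```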
Vector Algebra 30
• Because they will always appear in similar contexts in pairs in a large corpus, we expect

  vec(“English”) − vec(“London”)

  to be similar to

  vec(“French”) − vec(“Paris”)

• So,

  vec(“Paris”) + vec(“English”) − vec(“London”)

  will be similar to

  vec(“French”)

[Figure: word2vec example (hypothetical) showing “English”, “French”, “London”, and “Paris” in the code space]
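A sketch of this analogy arithmetic with made-up two-dimensional codes (real word2vec embeddings are learned and much higher-dimensional); cosine similarity is used to compare the combined vector with each candidate.

```python
import numpy as np

# Hypothetical 2-D codes, chosen so the language and capital-city directions line up
vec = {
    "English": np.array([4.0, 1.0]),
    "French":  np.array([4.2, 3.0]),
    "London":  np.array([1.0, 1.1]),
    "Paris":   np.array([1.2, 3.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = vec["Paris"] + vec["English"] - vec["London"]   # expected to be near vec("French")
for word, v in vec.items():
    print(word, round(cosine(query, v), 3))             # "French" scores highest
```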


Two-Class Discrimination 21

• One sigmoid output, y^t, for P(C_1 \mid \mathbf{x}^t); then P(C_2 \mid \mathbf{x}^t) \equiv 1 - y^t

y^t = \text{sigmoid}\left(\sum_{h=1}^{H} v_h z_h^t + v_0\right)

E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = -\sum_t \left[r^t \log y^t + (1 - r^t)\log(1 - y^t)\right]

\Delta v_h = \eta \sum_t (r^t - y^t)\,z_h^t

\Delta w_{hj} = \eta \sum_t (r^t - y^t)\,v_h\,z_h^t(1 - z_h^t)\,x_j^t
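A sketch of one batch update implementing these formulas for a two-class MLP; note that with the cross-entropy error the output's error term is again simply (r^t − y^t). The task (XOR), the initialization, and the learning rate are illustrative choices.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def two_class_epoch(W, v, X1, r, eta=0.3):
    """One batch update: sigmoid output, cross-entropy error (equations above)."""
    Z = sigmoid(X1 @ W.T)                        # hidden units z_h^t
    Z1 = np.hstack([np.ones((len(X1), 1)), Z])
    y = sigmoid(Z1 @ v)                          # y^t = P(C1 | x^t); P(C2 | x^t) = 1 - y^t
    err = r - y                                  # cross-entropy gives the simple (r - y) term
    dW = eta * (np.outer(err, v[1:]) * Z * (1 - Z)).T @ X1   # Delta w_hj
    dv = eta * Z1.T @ err                        # Delta v_h = eta * sum_t (r^t - y^t) z_h^t
    return W + dW, v + dv

# Illustration on XOR (labels 0/1), with x_0 = +1 prepended; H = 4 hidden units
X1 = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
r = np.array([0.0, 1.0, 1.0, 0.0])
rng = np.random.default_rng(1)
W, v = rng.uniform(-0.5, 0.5, (4, 3)), rng.uniform(-0.5, 0.5, 5)
for _ in range(10000):
    W, v = two_class_epoch(W, v, X1, r)

y = sigmoid(np.hstack([np.ones((4, 1)), sigmoid(X1 @ W.T)]) @ v)
print(np.round(y, 2))   # usually close to [0, 1, 1, 0]; plain gradient descent can occasionally get stuck
```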
