Lecture 16
Multilayer Perceptron
Course: DSAI 512 - Machine Learning
Instructor: Ercan Atam
List of contents for this lecture
❖ Multiple layers
❖ Universal approximation
Relevant readings for this lecture
➢ e-Chapter 7 of Yaser S. Abu-Mostafa, Malik Magdon-Ismail, Hsuan-Tien Lin, “Learning from Data”,
AMLBook, 2012.
➢ Chapter 6 (6.1-6.2) of Andreas Lindholm, Niklas Wahlström, Fredrik Lindsten, Thomas B. Schön,
“Machine Learning: A First Course for Engineers and Scientists”, Cambridge University Press, 2022.
The neural network - biologically inspired
Planes don’t flap their wings to fly.
Engineering success may start with biological inspiration, but it then takes a totally different path...
XOR: a limitation of the linear model (1)
XOR: a limitation of the linear model (2)
[Figure: the XOR target $f(\mathbf{x})$]
Why?
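The figure is omitted here, but a short algebraic answer can be sketched (assuming the usual ±1 coding of inputs and outputs for a perceptron, which is an assumption about the slide's convention). Suppose a single perceptron $\operatorname{sign}(w_0 + w_1 x_1 + w_2 x_2)$ implemented XOR. Then
\[
\begin{aligned}
f(+1,+1) = -1 &\;\Rightarrow\; w_0 + w_1 + w_2 < 0, & f(-1,-1) = -1 &\;\Rightarrow\; w_0 - w_1 - w_2 < 0,\\
f(+1,-1) = +1 &\;\Rightarrow\; w_0 + w_1 - w_2 > 0, & f(-1,+1) = +1 &\;\Rightarrow\; w_0 - w_1 + w_2 > 0.
\end{aligned}
\]
Adding the first pair gives $2w_0 < 0$; adding the second pair gives $2w_0 > 0$. Contradiction, so no single linear separator implements XOR.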
Decomposing XOR
AND → multiplication
OR → addition
Negation → bar (overline)
With this notation, the XOR of two perceptrons $h_1$ and $h_2$ is written $f = h_1\bar{h}_2 + \bar{h}_1 h_2$.
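As a quick sanity check of this notation, here is a minimal Python sketch (the helper names are illustrative, not from the slides) verifying that the OR-of-ANDs expression $h_1\bar{h}_2 + \bar{h}_1 h_2$ agrees with XOR on all four input combinations:

from itertools import product

def xor(h1, h2):
    return h1 != h2                               # true XOR of two Booleans

def or_of_ands(h1, h2):
    return (h1 and not h2) or (not h1 and h2)     # h1*h2bar + h1bar*h2 in the slide's notation

for h1, h2 in product([False, True], repeat=2):
    assert or_of_ands(h1, h2) == xor(h1, h2)
print("The OR-of-ANDs decomposition matches XOR on all four cases.")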
Perceptrons for OR and AND
[Figure: single perceptrons with inputs $u_1$, $u_2$ (and a constant bias) computing OR$(u_1, u_2)$ and AND$(u_1, u_2)$]
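A minimal numerical sketch of such perceptrons, assuming ±1-valued inputs and a constant bias input; the specific bias weights ±1.5 are an illustrative choice (any weights with the same sign pattern work):

import numpy as np

def perceptron(u1, u2, w0, w1, w2):
    # A single perceptron on inputs (1, u1, u2) with weights (w0, w1, w2).
    return int(np.sign(w0 + w1 * u1 + w2 * u2))

def OR(u1, u2):
    return perceptron(u1, u2, 1.5, 1.0, 1.0)      # +1 unless both inputs are -1

def AND(u1, u2):
    return perceptron(u1, u2, -1.5, 1.0, 1.0)     # +1 only when both inputs are +1

for u1 in (-1, 1):
    for u2 in (-1, 1):
        print(f"u1={u1:+d}, u2={u2:+d}  ->  OR={OR(u1, u2):+d}, AND={AND(u1, u2):+d}")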
How did we find that $f = h_1\bar{h}_2 + \bar{h}_1 h_2$?
We consider only the regions of $f$ that are “+” and use the “disjunctive normal form” (= OR of ANDs):
Note: you can check that the decomposition constructed from the “+” regions of $f$ alone also gives the correct value on the “−” regions of $f$.
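In general (a sketch of the construction; the cell names $r_k$ are illustrative), if the “+” region of $f$ is the union of cells $r_1, r_2, \dots$, each cell being an intersection of half-planes defined by perceptrons $h_1, h_2, \dots$, then
\[
f \;=\; r_1 + r_2 + \cdots \quad (\text{OR over the “+” cells}), \qquad
r_k \;=\; \tilde h_{k_1}\,\tilde h_{k_2}\cdots \quad (\text{AND of perceptrons, with } \tilde h \in \{h, \bar h\}),
\]
which is exactly an OR of ANDs. For XOR, the two “+” cells give $f = h_1\bar{h}_2 + \bar{h}_1 h_2$.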
Representing 𝑓 using OR and AND (1)
Step 1 (“OR”):
Representing 𝑓 using OR and AND (2)
Representing 𝑓 using OR and AND (3)
The multilayer perceptron (MLP)
MLP: [diagram of the resulting multilayer network of perceptrons implementing $f$]
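Putting the pieces together, here is a minimal Python sketch of such a multilayer perceptron for XOR (the layer functions and weights are illustrative; with ±1 values, negation is just a sign flip, so $\bar h = -h$):

def sgn(z):
    return 1.0 if z >= 0 else -1.0

# Layer 1: the two perceptrons h1, h2 (for the +-1-valued XOR they can simply pass the inputs through).
def layer1(x1, x2):
    return sgn(x1), sgn(x2)

# Layer 2: AND(h1, NOT h2) and AND(NOT h1, h2); NOT h = -h for +-1 values.
def layer2(h1, h2):
    return sgn(-1.5 + h1 - h2), sgn(-1.5 - h1 + h2)

# Layer 3 (output): OR of the two AND units.
def layer3(a1, a2):
    return sgn(1.5 + a1 + a2)

def mlp_xor(x1, x2):
    return layer3(*layer2(*layer1(x1, x2)))

for x1, x2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x1, x2), "->", mlp_xor(x1, x2))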
Universal approximation (1)
Any target function 𝑓 that can be decomposed into linear separators can be implemented by a 3-layer
perceptron.
Universal approximation (2)
If 𝑓 is not strictly decomposable into perceptrons, but has a smooth decision boundary, then a 3-layer
perceptron can come arbitrarily close to implementing it.
Pictorial proof: [figure] approximate the smooth region arbitrarily closely by a region bounded by many linear pieces, each piece realized by one perceptron; the previous construction then implements this piecewise-linear region with a 3-layer perceptron.
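A small numerical illustration of this pictorial argument (a sketch, under the assumption that the smooth region is the unit disc): AND together $m$ perceptrons, one per half-plane tangent to the disc, and check how closely the resulting region agrees with the disc as $m$ grows.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(100_000, 2))          # test points in the plane
inside_disc = (X ** 2).sum(axis=1) <= 1.0          # target region: the unit disc

for m in (4, 8, 16, 64, 256):
    angles = 2 * np.pi * np.arange(m) / m
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    # AND of m perceptrons: a point is "+" iff every half-plane  n_k . x <= 1  says "+".
    inside_polygon = (X @ normals.T <= 1.0).all(axis=1)
    agreement = (inside_polygon == inside_disc).mean()
    print(f"m = {m:3d} perceptrons: agreement with the disc = {agreement:.4f}")

The agreement approaches 1 as $m$ increases, which is the sense in which the 3-layer perceptron comes arbitrarily close.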
Approximation versus generalization
Minimizing 𝐸in for MLPs
❑ 𝐸in is not smooth (because of the “sign” function), so we cannot use gradient descent.
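Concretely (a sketch; the soft threshold below is the one used in the e-Chapter 7 reading): with the hard threshold, 𝐸in is piecewise constant in the weights, so its gradient is zero wherever it is defined. Replacing sign(𝑠) with a smooth soft threshold such as
\[
\theta(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}, \qquad \theta'(s) = 1 - \tanh^2(s),
\]
makes 𝐸in differentiable in the weights, so gradient descent becomes applicable. This soft-threshold network is the neural network of the next slide.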
The neural network
Zooming into a hidden node
$w_{ij}^{(l)}$ : the weight into node $j$ in layer $l$ from node $i$ in the previous layer.
[Figure: signal flow across layers $(l-1)$, $l$, $(l+1)$]
$\mathbf{x}^{(l-1)} \xrightarrow{W^{(l)}} \mathbf{s}^{(l)} \xrightarrow{\theta} \mathbf{x}^{(l)} \xrightarrow{W^{(l+1)}} \mathbf{s}^{(l+1)}$, i.e., $\mathbf{s}^{(l)} = \big(W^{(l)}\big)^{\!\top} \mathbf{x}^{(l-1)}$ and $\mathbf{x}^{(l)} = \theta\big(\mathbf{s}^{(l)}\big)$.
❑ Regression: replace 𝜃(𝑠) in the output node with the identity transformation (i.e., no transformation).
❑ Logistic regression: replace 𝜃(𝑠) in the output node with the logistic sigmoid.
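A minimal forward-propagation sketch in this notation (the layer sizes, the random weights, and the tanh hidden-layer nonlinearity are illustrative assumptions; the output transformation follows the two bullets above):

import numpy as np

def forward(x, weights, output="sign"):
    # Forward propagation: x^(0) = (1, x); s^(l) = (W^(l))^T x^(l-1); x^(l) = theta(s^(l)).
    xl = np.concatenate(([1.0], x))                   # prepend the constant bias coordinate
    for l, W in enumerate(weights, start=1):
        s = W.T @ xl                                  # s^(l)
        if l < len(weights):                          # hidden layers: soft threshold + bias
            xl = np.concatenate(([1.0], np.tanh(s)))
        elif output == "identity":                    # regression: no output transformation
            return s
        elif output == "sigmoid":                     # logistic regression: logistic sigmoid
            return 1.0 / (1.0 + np.exp(-s))
        else:                                         # classification: hard threshold
            return np.sign(s)

# Example: 2 inputs, one hidden layer of 3 tanh units, 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))     # (1 bias + 2 inputs) x 3 hidden units
W2 = rng.normal(size=(4, 1))     # (1 bias + 3 hidden units) x 1 output
print(forward(np.array([0.5, -1.0]), [W1, W2], output="sigmoid"))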
Summary
References
(utilized in the preparation of the lecture notes and MATLAB code)
▪ https://amlbook.com/eChapters/6-Oct2022-readeronly.pdf
▪ https://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect20.pdf