Artificial Neural Networks and Deep Learning
1
Contents
• Introduction
Motivation, Biological Background
• Threshold Logic Units
Definition, Geometric Interpretation, Limitations, Networks of TLUs, Training
• General Neural Networks
Structure, Operation, Training
• Multi-layer Perceptrons
Definition, Function Approximation, Gradient Descent, Backpropagation, Variants, Sensitivity Analysis
• Deep Learning
Many-layered Perceptrons, Rectified Linear Units, Auto-Encoders, Feature Construction, Image Analysis
• Radial Basis Function Networks
Definition, Function Approximation, Initialization, Training, Generalized Version
• Self-Organizing Maps
Definition, Learning Vector Quantization, Neighborhood of Output Neurons
2
Mathematical Background: Regression
96
Mathematical Background: Linear Regression
97
Mathematical Background: Linear Regression
98
Linear Regression: Example
x 1 2 3 4 5 6 7 8
y 1 3 2 3 4 3 5 6
y = 3/4 + (7/12) x.
(Plot: the data points and the regression line y = 3/4 + (7/12) x, for x from 0 to 8 and y from 0 to 6.)
99
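For illustration, a minimal numpy sketch that reproduces this fit; np.polyfit solves the same least-squares problem as the normal equations.

```python
import numpy as np

# Data points from the example above.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

# Least-squares fit of a straight line y = a + b*x.
b, a = np.polyfit(x, y, deg=1)   # polyfit returns coefficients from the highest degree down

print(f"y = {a:.4f} + {b:.4f} x")   # -> y = 0.7500 + 0.5833 x, i.e. y = 3/4 + (7/12) x
```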
Mathematical Background: Polynomial Regression
Generalization to polynomials
y = p(x) = a_0 + a_1 x + ... + a_m x^m
Necessary conditions for a minimum: All partial derivatives vanish, that is,
∂F/∂a_0 = 0,   ∂F/∂a_1 = 0,   ...,   ∂F/∂a_m = 0.
100
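For illustration, a small numpy sketch of the resulting computation: setting the partial derivatives to zero yields a linear system (the normal equations) in the coefficients a_0, ..., a_m. The data set and the degree m below are assumptions chosen only for this sketch.

```python
import numpy as np

# Illustrative data (an assumption; the slide states only the general model).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([1, 3, 2, 3, 4, 3, 5, 6], dtype=float)

m = 2                                      # degree of the fitted polynomial
X = np.vander(x, m + 1, increasing=True)   # columns 1, x, x^2, ..., x^m

# Vanishing partial derivatives of F lead to the normal equations
# (X^T X) a = X^T y, solved here for the coefficient vector a = (a_0, ..., a_m).
a = np.linalg.solve(X.T @ X, X.T @ y)
print("coefficients a_0 ... a_m:", a)
```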
Mathematical Background: Multilinear Regression
z = f (x, y) = a + bx + cy
Approach: Minimize the sum of squared errors, that is,
F(a, b, c) = Σ_{i=1}^n (f(x_i, y_i) − z_i)^2 = Σ_{i=1}^n (a + b x_i + c y_i − z_i)^2
Necessary conditions for a minimum: All partial derivatives vanish, that is,
∂F/∂a = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) = 0,
∂F/∂b = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) x_i = 0,
∂F/∂c = Σ_{i=1}^n 2(a + b x_i + c y_i − z_i) y_i = 0.
102
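For illustration, a numpy sketch of the same computation for z = a + bx + cy. The data values are made up for this sketch; np.linalg.lstsq solves the least-squares problem described by the three conditions above.

```python
import numpy as np

# Made-up data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 3.0, 5.0, 4.0])
z = np.array([3.1, 3.9, 7.0, 10.2, 10.8])

# Design matrix with columns 1, x, y for the model z = a + b*x + c*y.
X = np.column_stack([np.ones_like(x), x, y])

# The conditions dF/da = dF/db = dF/dc = 0 are the normal equations
# (X^T X)(a, b, c)^T = X^T z; lstsq solves the same least-squares problem.
(a, b, c), *_ = np.linalg.lstsq(X, z, rcond=None)
print(f"z = {a:.3f} + {b:.3f} x + {c:.3f} y")
```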
Mathematical Background: Multilinear Regression
103
Mathematical Background: Logistic Regression
ln( (Y − y) / y ) = a + bx.
107
Logistic Regression: Example
x 1 2 3 4 5
y 0.4 1.0 3.0 5.0 5.6
Transform the data with
z = ln( (Y − y) / y ),   Y = 6.
The transformed data points are
x 1 2 3 4 5
z 2.64 1.61 0.00 −1.61 −2.64
The resulting regression line and therefore the desired function are
z ≈ −1.3775 x + 4.133   and   y ≈ 6 / (1 + e^(−1.3775 x + 4.133)).
Attention: Note that the error is minimized only in the transformed space!
Therefore the function in the original space may not be optimal!
108
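For illustration, a short numpy sketch that reproduces this example: the logit transform, an ordinary least-squares fit in the transformed space, and the back-transformation to the logistic function.

```python
import numpy as np

# Data from the example; Y = 6 is the assumed saturation value.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([0.4, 1.0, 3.0, 5.0, 5.6])
Y = 6.0

# Logit transform z = ln((Y - y) / y) turns the logistic model into a straight line.
z = np.log((Y - y) / y)

# Ordinary least squares in the transformed space.
b, a = np.polyfit(x, z, deg=1)
print(f"z = {b:.4f} x + {a:.4f}")        # approx. -1.3775 x + 4.133

# Back-transformation: y = Y / (1 + exp(a + b*x)).
y_hat = Y / (1.0 + np.exp(a + b * x))
print(y_hat)   # note: the error was minimized only in the transformed space
```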
Logistic Regression: Example
(Plots: the regression line for the transformed values z over x (left) and the resulting logistic function y over x with saturation value Y = 6 (right).)
109
Training Multi-layer Perceptrons
113
Training Multi-layer Perceptrons: Gradient Descent
114
Gradient Descent: Formal Approach
General Idea: Approach the minimum of the error function in small steps.
Error function: e = Σ_{l∈L_fixed} e^(l), the sum of the individual errors e^(l) over all training patterns l.
115
Gradient Descent: Formal Approach
Single pattern error depends on weights only through the network input:
For the first factor we consider the error e^(l) for the training pattern l = (i^(l), o^(l)):
e^(l) = Σ_{v∈U_out} e_v^(l),
that is, the sum of the errors over all output neurons.
116
Gradient Descent: Formal Approach
Therefore we have
117
Gradient Descent: Formal Approach
118
Gradient Descent: Formal Approach
Exact formulae depend on the choice of the activation and the output function,
since it is
out_u^(l) = f_out( act_u^(l) ) = f_out( f_act( net_u^(l) ) ).
119
Gradient Descent: Formal Approach
and therefore
120
Gradient Descent: Formal Approach
(Plots: the logistic activation function over net (left) and the resulting factor out_u^(l) (1 − out_u^(l)) with maximum 1/4 at net = 0 (right), both for net from −4 to +4.)
• If a logistic activation function is used (shown on the left), the weight changes are
proportional to λ_u^(l) = out_u^(l) (1 − out_u^(l)) (shown on the right; see the preceding slide).
• Weight changes are largest, and thus the training speed highest, in the vicinity
of netu(l) = 0. Far away from netu(l) = 0, the gradient becomes (very) small
(“saturation regions”) and thus training (very) slow.
121
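For illustration, a small Python sketch that evaluates the logistic function and the factor out(1 − out) at a few net inputs, making the saturation effect described above visible.

```python
import numpy as np

def logistic(net):
    """Logistic activation function."""
    return 1.0 / (1.0 + np.exp(-net))

net = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
out = logistic(net)
factor = out * (1.0 - out)   # derivative of the logistic function

for n, d in zip(net, factor):
    print(f"net = {n:+.1f}   out*(1-out) = {d:.4f}")
# The factor peaks at 1/4 for net = 0 and almost vanishes for |net| >= 4,
# which makes training very slow in the saturation regions.
```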
Error Backpropagation
Consider now: The neuron u is a hidden neuron, that is, u ∈ U_k, 0 < k < r − 1.
The output out_v^(l) of an output neuron v depends on the network input net_u^(l)
only indirectly through the successor neurons succ(u) = { s ∈ U | (u, s) ∈ C }
= { s_1, ..., s_m } ⊆ U_{k+1}, namely through their network inputs net_s^(l).
123
Error Backpropagation
124
Error Backpropagation
125
Error Backpropagation: Cookbook Recipe
126
Gradient Descent: Examples
(Figure: a single neuron with input x, weight w, threshold θ and output y; it is to be trained to compute the negation of its input: x = 0 ↦ y = 1 and x = 1 ↦ y = 0.)
Note: error for x = 0 and x = 1 is effectively the squared logistic activation function!
127
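For illustration, a sketch of online gradient descent for this single logistic neuron; the initial values of w and θ and the learning rate η are arbitrary assumptions made only for this sketch.

```python
import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

# Training data: the neuron should compute the negation of its input.
X = np.array([0.0, 1.0])
O = np.array([1.0, 0.0])

w, theta = 0.5, 0.5   # arbitrary initial parameters (an assumption)
eta = 1.0             # learning rate (an assumption)

for epoch in range(1000):
    for x, o in zip(X, O):                       # online training: update after each pattern
        out = logistic(w * x - theta)
        delta = (o - out) * out * (1.0 - out)    # error signal with the logistic derivative
        w += eta * delta * x                     # corresponds to dw = -(eta/2) de/dw for e = (o - out)^2
        theta -= eta * delta                     # threshold moves in the opposite direction

print(f"w = {w:.3f}, theta = {theta:.3f}")
print("outputs:", logistic(w * X - theta))       # approaches (1, 0) as training proceeds
```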
Gradient Descent: Examples
128
Gradient Descent: Examples
(Plots: the error function over the parameter plane (θ, w), with θ and w from −4 to 4, and the descent trajectories for online training and for batch training.)
129
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   0.200    3.112   −11.147    0.011
 1   0.211    2.990   −10.811    0.011
 2   0.222    2.874   −10.490    0.010
 3   0.232    2.766   −10.182    0.010
 4   0.243    2.664    −9.888    0.010
 5   0.253    2.568    −9.606    0.010
 6   0.262    2.477    −9.335    0.009
 7   0.271    2.391    −9.075    0.009
 8   0.281    2.309    −8.825    0.009
 9   0.289    2.233    −8.585    0.009
10   0.298    2.160

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 0.2 and learning rate 0.001.
130
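For illustration, a minimal Python sketch of the iteration behind this table, Δx_i = −η f'(x_i) with x_0 = 0.2 and η = 0.001; the derivative f' is obtained analytically from the example function.

```python
def f(x):
    return 5/6 * x**4 - 7 * x**3 + 115/6 * x**2 - 18 * x + 6

def f_prime(x):
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

x, eta = 0.2, 0.001                  # initial value and learning rate from the slide
for i in range(11):
    dx = -eta * f_prime(x)           # gradient descent step
    print(f"{i:2d}  x={x:.3f}  f(x)={f(x):.3f}  f'(x)={f_prime(x):.3f}  dx={dx:.3f}")
    x += dx
```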
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   1.500    2.719     3.500    −0.875
 1   0.625    0.655    −1.431     0.358
 2   0.983    0.955     2.554    −0.639
 3   0.344    1.801    −7.157     1.789
 4   2.134    4.127     0.567    −0.142
 5   1.992    3.989     1.380    −0.345
 6   1.647    3.203     3.063    −0.766
 7   0.881    0.734     1.753    −0.438
 8   0.443    1.211    −4.851     1.213
 9   1.656    3.231     3.029    −0.757
10   0.898    0.766

(Plot: f(x) for x from 0 to 4 with the visited positions; the starting point is marked.)
Gradient descent with initial value 1.5 and learning rate 0.25.
131
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   2.600    3.816    −1.707    0.085
 1   2.685    3.660    −1.947    0.097
 2   2.783    3.461    −2.116    0.106
 3   2.888    3.233    −2.153    0.108
 4   2.996    3.008    −2.009    0.100
 5   3.097    2.820    −1.688    0.084
 6   3.181    2.695    −1.263    0.063
 7   3.244    2.628    −0.845    0.042
 8   3.286    2.599    −0.515    0.026
 9   3.312    2.589    −0.293    0.015
10   3.327    2.585

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 2.6 and learning rate 0.05.
132
Gradient Descent: Variants
Standard backpropagation:
    Δw(t) = −(η/2) ∇_w e(t).

Manhattan training:
    Δw(t) = −η sgn(∇_w e(t)).

Momentum term:
    Δw(t) = −(η/2) ∇_w e(t) + β Δw(t−1),
that is, part of the previous change is added, which may lead to accelerated training (β ∈ [0.5, 0.95]).
133
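For illustration, a small sketch of the three update rules for a generic parameter vector; η and β are illustrative values (β within the range given above).

```python
import numpy as np

def standard_backprop(grad, eta=0.2):
    # Delta w(t) = -(eta/2) * gradient
    return -eta / 2 * grad

def manhattan_training(grad, eta=0.2):
    # Only the sign of the gradient is used, giving fixed-size steps.
    return -eta * np.sign(grad)

def momentum_term(grad, prev_dw, eta=0.2, beta=0.9):
    # Part of the previous weight change is added to the current one.
    return -eta / 2 * grad + beta * prev_dw

grad = np.array([0.8, -0.1, 0.0])      # an example gradient
prev = np.array([0.05, 0.05, 0.05])    # previous weight change
print(standard_backprop(grad))
print(manhattan_training(grad))
print(momentum_term(grad, prev))
```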
Gradient Descent: Variants
134
Gradient Descent: Variants
Quick propagation
(Figure: the error e over the weight w, showing the error values e(t−1) and e(t), the gradients ∇_w e(t−1) and ∇_w e(t), and the apex m of an approximating parabola, which determines the next weight w(t+1).)
135
Gradient Descent: Examples
136
Gradient Descent: Examples
(Plots: the error function over the parameter plane (θ, w), with θ and w from −4 to 4, and the descent trajectories without momentum term and with momentum term.)
137
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   0.200    3.112   −11.147    0.011
 1   0.211    2.990   −10.811    0.021
 2   0.232    2.771   −10.196    0.029
 3   0.261    2.488    −9.368    0.035
 4   0.296    2.173    −8.397    0.040
 5   0.337    1.856    −7.348    0.044
 6   0.380    1.559    −6.277    0.046
 7   0.426    1.298    −5.228    0.046
 8   0.472    1.079    −4.235    0.046
 9   0.518    0.907    −3.319    0.045
10   0.562    0.777

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 0.2, learning rate 0.001, and momentum term (β = 0.9).
138
Gradient Descent: Examples
Example function: f(x) = (5/6) x^4 − 7 x^3 + (115/6) x^2 − 18 x + 6

 i    x_i     f(x_i)   f'(x_i)    Δx_i
 0   1.500    2.719     3.500    −1.050
 1   0.450    1.178    −4.699     0.705
 2   1.155    1.476     3.396    −0.509
 3   0.645    0.629    −1.110     0.083
 4   0.729    0.587     0.072    −0.005
 5   0.723    0.587     0.001     0.000
 6   0.723    0.587     0.000     0.000
 7   0.723    0.587     0.000     0.000
 8   0.723    0.587     0.000     0.000
 9   0.723    0.587     0.000     0.000
10   0.723    0.587

(Plot: f(x) for x from 0 to 4 with the positions visited by the descent.)
Gradient descent with initial value 1.5, initial learning rate 0.25,
and self-adapting learning rate (c+ = 1.2, c− = 0.5).
139
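For illustration, a simplified sketch of a self-adapting learning rate, in which η is multiplied by c+ while the gradient keeps its sign and by c− when the sign flips. This is only one common variant; the exact rule used to produce the table above may differ in detail.

```python
def f_prime(x):
    # Derivative of the example function f(x) = (5/6)x^4 - 7x^3 + (115/6)x^2 - 18x + 6.
    return 10/3 * x**3 - 21 * x**2 + 115/3 * x - 18

x, eta = 1.5, 0.25                 # initial value and initial learning rate
c_plus, c_minus = 1.2, 0.5
prev_grad = 0.0

for i in range(11):
    grad = f_prime(x)
    if grad * prev_grad > 0:       # same direction as before: increase the learning rate
        eta *= c_plus
    elif grad * prev_grad < 0:     # direction changed: decrease the learning rate
        eta *= c_minus
    x -= eta * grad
    prev_grad = grad
    print(f"{i:2d}  x={x:.3f}  eta={eta:.4f}")
```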
Other Extensions of Error Backpropagation
Flat Spot Elimination:
    Δw(t) = −(η/2) ∇_w e(t) + ζ
• Eliminates slow learning in saturation region of logistic function (ζ ≈ 0.1).
• Counteracts the decay of the error signals over the layers.
Weight Decay:
    Δw(t) = −(η/2) ∇_w e(t) − ξ w(t).
140
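For illustration, a small sketch of the two modified weight updates as written above; η, ζ and ξ are illustrative values chosen only for this sketch.

```python
import numpy as np

def flat_spot_update(grad, eta=0.2, zeta=0.1):
    # A small constant zeta keeps the weights moving even where the gradient
    # (and thus the standard update) almost vanishes in the saturation regions.
    return -eta / 2 * grad + zeta

def weight_decay_update(grad, w, eta=0.2, xi=0.001):
    # The term -xi * w additionally pulls the weights towards zero.
    return -eta / 2 * grad - xi * w

w = np.array([1.5, -0.8])
grad = np.array([0.01, -0.02])     # nearly vanishing gradient (saturation region)
print(flat_spot_update(grad))
print(weight_decay_update(grad, w))
```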
Number of Hidden Neurons
• Note that the approximation theorem only states that there exists
a number of hidden neurons and weight vectors v and w_i and thresholds θ_i,
but not how they are to be chosen for a given ε of approximation accuracy.
• For a single hidden layer the following rule of thumb is popular:
number of hidden neurons = (number of inputs + number of outputs) / 2
141
Number of Hidden Neurons
• Overfitting will usually lead to the effect that the error a multi-layer perceptron
yields on the validation data will be (possibly considerably) greater than the error
it yields on the training data.
The reason is that the validation data set is likely distorted in a different fashion
than the training data, since the errors and deviations are random.
• Minimizing the error on the validation data by properly choosing the number of
hidden neurons prevents both under- and overfitting.
142
Number of Hidden Neurons: Avoid Overfitting
(Plot: the data points with the regression line (black, 2 free parameters) and a 7th order regression polynomial (blue, 8 free parameters), for x from 0 to 8.)
• The blue curve fits the data points perfectly, but it is not a good model.
143
Number of Hidden Neurons: Cross Validation
• The described method of iteratively splitting the data into training and validation
data may be referred to as cross validation.
• However, this term is more often used for the following specific procedure:
◦ The given data set is split into n parts or subsets (also called folds)
of about equal size (so-called n-fold cross validation).
◦ If the output is nominal (also sometimes called symbolic or categorical), this
split is done in such a way that the relative frequencies of the output values in
the subsets/folds represent as well as possible the relative frequencies of these
values in the data set as a whole. This is also called stratification (derived
from the Latin stratum: layer, level, tier).
◦ Out of these n data subsets (or folds) n pairs of training and validation data
sets are formed by using one fold as the validation data set while the remaining
n − 1 folds are combined into the training data set (a short sketch follows below).
144
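For illustration, a short sketch of stratified n-fold cross validation as described above, assuming scikit-learn is available; the toy data set is made up for this sketch.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 12 patterns with a nominal output (class label).
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# 3-fold cross validation with stratification: each fold preserves the
# relative class frequencies as well as possible.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # The remaining n-1 folds form the training set, the held-out fold the validation set.
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")
```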
Number of Hidden Neurons: Cross Validation
• The advantage of the cross validation method is that one random split of the data
yields n different pairs of training and validation data sets.
• An obvious disadvantage is that (except for n = 2) the sizes of the training and the
validation data sets differ considerably, which makes the results on the validation
data statistically less reliable.
• Repeating the split (either with n = 2 or greater n) has the advantage that one
obtains many more training and validation data sets, leading to more reliable
statistics (here: for the number of hidden neurons).
145
Avoiding Overfitting: Alternatives
• Furthermore a stopping criterion may be derived from the shape of the error curve
on the training data over the training epochs, or the network is trained only for a
fixed, relatively small number of epochs (also known as early stopping).
• Disadvantage: these methods merely stop the training of a complex network early,
rather than adjusting the complexity of the network to the “correct” level.
146
Sensitivity Analysis
147
Sensitivity Analysis
Problem of Multi-Layer Perceptrons:
• The knowledge that is learned by a neural network
is encoded in matrices/vectors of real-valued numbers and
is therefore often difficult to understand or to extract.
Idea of Sensitivity Analysis: determine the influence of the individual inputs on the output of the network.
• This may also give hints which inputs are not needed and may be discarded.
148
Sensitivity Analysis
∀u ∈ U_in:   s(u) = (1 / |L_fixed|) Σ_{l∈L_fixed} Σ_{v∈U_out} ∂out_v^(l) / ∂ext_u^(l).
149
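For illustration, a numeric sketch of this sensitivity measure for a tiny multi-layer perceptron with made-up weights; the partial derivatives ∂out_v/∂ext_u are estimated by central differences and averaged over the patterns.

```python
import numpy as np

def logistic(net):
    return 1.0 / (1.0 + np.exp(-net))

def mlp_output(x, W1, W2):
    """Tiny 2-4-1 multi-layer perceptron; the bias is handled via an extra input 1."""
    hidden = logistic(W1 @ np.append(x, 1.0))
    return logistic(W2 @ np.append(hidden, 1.0))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))            # hidden layer weights (incl. bias column)
W2 = rng.normal(size=(1, 5))            # output layer weights (incl. bias column)
patterns = rng.normal(size=(10, 2))     # external inputs of the fixed learning task

# Estimate s(u): average over all patterns of the derivative of the outputs
# with respect to external input u, here via central differences.
eps = 1e-5
s = np.zeros(2)
for x in patterns:
    for u in range(2):
        xp, xm = x.copy(), x.copy()
        xp[u] += eps
        xm[u] -= eps
        s[u] += np.sum(mlp_output(xp, W1, W2) - mlp_output(xm, W1, W2)) / (2 * eps)
s /= len(patterns)
print("sensitivity per input:", s)
```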
Sensitivity Analysis
∂net_v / ∂out_u = ∂/∂out_u ( Σ_{p∈pred(v)} w_vp out_p ) = Σ_{p∈pred(v)} w_vp ∂out_p / ∂out_u.
150
Sensitivity Analysis
151
Sensitivity Analysis

attribute      |             ξ = 0              |           ξ = 0.0001
sepal length   | 0.0216  0.0399  0.0232  0.0515 | 0.0367  0.0325  0.0351  0.0395
sepal width    | 0.0423  0.0341  0.0460  0.0447 | 0.0385  0.0376  0.0421  0.0425
petal length   | 0.1789  0.2569  0.1974  0.2805 | 0.2048  0.1928  0.1838  0.1861
petal width    | 0.2017  0.1356  0.2198  0.1325 | 0.2020  0.1962  0.1750  0.1743
152
Demonstration Software: xmlp/wmlp
166
Multi-Layer Perceptron Software: mlp/mlpgui
167