4 Multilayer Perceptron
In nonlinear regression the output variable y is no longer a linear function of the regression parameters plus additive noise. This makes estimation of the parameters harder: it no longer reduces to minimizing a convex energy function, unlike the methods we described earlier.
The perceptron is an (over-simplified) analogy to the neural networks in the brain. It receives a set of inputs $x_1, \dots, x_d$ and computes the weighted sum $y = \sum_{j=1}^{d} \omega_j x_j + \omega_0$, see Figure (3).
It has a threshold function which can be hard or soft. The hard one is $\zeta(a) = 1$ if $a > 0$, and $\zeta(a) = 0$ otherwise. The soft one is $y = \sigma(\vec{\omega}^T \vec{x}) = 1/(1 + e^{-\vec{\omega}^T \vec{x}})$, where $\sigma(\cdot)$ is the sigmoid function.
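As a concrete illustration, here is a minimal Python sketch of the two threshold functions (the function names and the use of NumPy are illustrative assumptions, not part of the notes):

import numpy as np

def hard_threshold(a):
    # hard threshold: zeta(a) = 1 if a > 0, and 0 otherwise
    return 1.0 if a > 0 else 0.0

def sigmoid(a):
    # soft threshold: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def perceptron_output(omega, omega0, x, soft=True):
    # weighted sum a = sum_j omega_j x_j + omega_0, then threshold
    a = float(np.dot(omega, x)) + omega0
    return sigmoid(a) if soft else hard_threshold(a)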
There are a variety of algorithms to train a perceptron from labeled examples.
Example: the quadratic error
$$E(\vec{\omega}\,|\,\vec{x}^t, y^t) = \frac{1}{2}(y^t - \vec{\omega}\cdot\vec{x}^t)^2,$$
for which the update rule is
$$\Delta\omega_j^t = -\eta \frac{\partial E}{\partial \omega_j} = \eta\,(y^t - \vec{\omega}\cdot\vec{x}^t)\,x_j^t.$$
Introducing the sigmoid function, $r^t = \sigma(\vec{\omega}^T\vec{x}^t)$, we instead use the cross-entropy error
$$E(\vec{\omega}\,|\,\vec{x}^t, y^t) = -\{\, y^t \log r^t + (1 - y^t)\log(1 - r^t) \,\},$$
and the update rule is
$$\Delta\omega_j^t = -\eta\,(r^t - y^t)\,x_j^t = \eta\,(y^t - r^t)\,x_j^t,$$
where $\eta$ is the learning factor. I.e., the update rule is the learning factor $\times$ (desired output $-$ actual output) $\times$ input.
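A sketch of how these online updates might be implemented for a single sigmoid perceptron (the learning factor eta, the epoch count, and the data format are assumptions for illustration):

import numpy as np

def train_sigmoid_perceptron(X, y, eta=0.1, epochs=100):
    # X: (n, d) array of inputs, y: (n,) array of desired outputs in [0, 1].
    n, d = X.shape
    omega = np.zeros(d)
    omega0 = 0.0
    for _ in range(epochs):
        for t in range(n):
            r_t = 1.0 / (1.0 + np.exp(-(omega @ X[t] + omega0)))  # actual output
            delta = eta * (y[t] - r_t)   # learning factor * (desired - actual)
            omega += delta * X[t]        # ... * input
            omega0 += delta
    return omega, omega0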
4.1 Multilayer Perceptrons
Multilayer perceptrons were developed to address the limitations of perceptrons (introduced in subsection 2.1): a single perceptron can only perform a limited set of classification or regression problems. But you can do far more with multiple layers, where the outputs of the perceptrons at the first layer are input to perceptrons at the second layer, and so on.
Two ingredients: (I) A standard perceptron has a discrete output, $\mathrm{sign}(\vec{\omega}\cdot\vec{x}) \in \{\pm 1\}$. It is replaced by a graded, or soft, output
$$z_h = \sigma(\vec{\omega}_h\cdot\vec{x}) = 1/\{1 + e^{-(\sum_{j=1}^{d}\omega_{hj}x_j + \omega_{h0})}\},$$
with $h = 1, \dots, H$. See figure (4). This makes the output a differentiable function of the weights $\vec{\omega}$.
(II) Introduce hidden units, or equivalently, multiple layers, see figure (5).
The output is
$$y_i = \vec{\nu}_i^T\vec{z} = \sum_{h=1}^{H}\nu_{ih} z_h + \nu_{i0}.$$
Figure 5: A multi-layer perceptron with input x’s, hidden units z’s, and outputs y’s.
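A minimal sketch of this forward pass in Python (the matrix shapes, bias vectors, and names are assumptions for illustration):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W, w0, V, v0):
    # Hidden layer: z_h = sigma(sum_j W[h, j] x_j + w0[h]), h = 1..H
    z = sigmoid(W @ x + w0)
    # Output layer: y_i = sum_h V[i, h] z_h + v0[i]
    y = V @ z + v0
    return y, z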
Many layers can be specified. What do the hidden units represent? Many people have tried to explain them, but it remains unclear. The number of hidden units is related to the capacity of the perceptron: any continuous input-output function can be approximated arbitrarily well by a multilayer perceptron with enough hidden units.
4.2 Training Multilayer Perceptrons
For training a multilayer perceptron we have to estimate the weights $\omega_{hj}, \nu_{ih}$ of the perceptron. First we need an error function. It can be defined as:
$$E[\omega, \nu] = \sum_i \Big\{ y_i - \sum_h \nu_{ih}\,\sigma\Big(\sum_j \omega_{hj} x_j\Big) \Big\}^2.$$
The update terms are the derivatives of the error function with respect to the parameters:
$$\Delta\omega_{hj} = -\frac{\partial E}{\partial \omega_{hj}},$$
which is computed by the chain rule, and
$$\Delta\nu_{ih} = -\frac{\partial E}{\partial \nu_{ih}},$$
which is computed directly.
By defining $r_k = \sigma(\sum_j \omega_{kj} x_j)$, so that $E = \sum_i (y_i - \sum_k \nu_{ik} r_k)^2$, we can write
$$\frac{\partial E}{\partial \omega_{kj}} = \frac{\partial E}{\partial r_k}\cdot\frac{\partial r_k}{\partial \omega_{kj}},$$
where
$$\frac{\partial E}{\partial r_k} = -2\sum_i \Big(y_i - \sum_l \nu_{il} r_l\Big)\nu_{ik},$$
$$\frac{\partial r_k}{\partial \omega_{kj}} = x_j\,\sigma'\Big(\sum_j \omega_{kj} x_j\Big),$$
$$\sigma'(z) = \frac{d}{dz}\sigma(z) = \sigma(z)\{1 - \sigma(z)\}.$$
Hence,
$$\frac{\partial E}{\partial \omega_{kj}} = -2\sum_i \Big(y_i - \sum_l \nu_{il} r_l\Big)\nu_{ik}\, r_k(1 - r_k)\, x_j,$$
where $(y_i - \sum_l \nu_{il} r_l)$ is the error at the output layer and $\nu_{ik}$ is the weight from hidden unit $k$ of the middle layer to output unit $i$.
This is called backpropagation. The error at the output layer, $(y_i - \sum_l \nu_{il} r_l)$, is propagated back to the nodes at the middle layer through the weights $\nu_{ik}$, where it is multiplied by the activity $r_k(1 - r_k)$ at that node, and by the activity $x_j$ at the input.
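Putting the derivation together, here is a sketch of the gradient computation for a single input (it reuses the shapes and names from the forward-pass sketch above; the biases w0, v0 are an extra assumption for consistency with that sketch):

import numpy as np

def backprop_gradients(x, y_target, W, w0, V, v0):
    # Forward pass: hidden activities r_k and outputs y_i.
    r = 1.0 / (1.0 + np.exp(-(W @ x + w0)))
    y = V @ r + v0
    err = y_target - y                     # output-layer error (y_i - sum_l nu_il r_l)
    # Output-layer gradients of E = sum_i err_i^2 (computed directly).
    dE_dV = -2.0 * np.outer(err, r)
    dE_dv0 = -2.0 * err
    # Hidden-layer gradients via the chain rule:
    # dE/dW[k, j] = -2 sum_i err_i nu_ik * r_k (1 - r_k) * x_j
    dE_dr = -2.0 * (V.T @ err)             # error propagated back to each hidden node
    dE_dW = np.outer(dE_dr * r * (1.0 - r), x)
    dE_dw0 = dE_dr * r * (1.0 - r)
    return dE_dW, dE_dw0, dE_dV, dE_dv0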
4.2.1 Variants
One variant is learning in batch mode, which consists of putting all the data into a single energy function, i.e., summing the errors over all the training data. The weights are then updated according to the equations above, by summing over all the data.
Another variant is online learning. In this variant, at each time step you select an example $(x^t, y^t)$ at random from a dataset, or from some source that keeps producing examples, and perform one iteration of steepest descent using only that datapoint, i.e., in the update equations you remove the summation over $t$. Then you select another datapoint at random, do another iteration of steepest descent, and so on. This variant is suitable for problems in which we keep getting new input over time.
This is called stochastic gradient descent (or Robbins-Monro) and has some nice properties, including better convergence than the batch method described above. This is because selecting the datapoints at random introduces an element of stochasticity which helps prevent the algorithm from getting stuck in a local minimum (although the theorems for this require multiplying the update, i.e. the gradient, by a term that decreases slowly over time).
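A sketch of the online variant, reusing the hypothetical gradient routine from the previous sketch and a slowly decreasing learning factor (the schedule and parameter names are assumptions):

import numpy as np

def train_online(X, Y, W, w0, V, v0, eta=0.01, steps=10000, decay=1e-4):
    rng = np.random.default_rng(0)
    for t in range(steps):
        i = rng.integers(len(X))                # pick a datapoint at random
        gW, gw0, gV, gv0 = backprop_gradients(X[i], Y[i], W, w0, V, v0)
        step = eta / (1.0 + decay * t)          # learning factor decreases slowly over time
        W -= step * gW
        w0 -= step * gw0
        V -= step * gV
        v0 -= step * gv0
    return W, w0, V, v0

Batch mode would instead sum (or average) the gradients over every example before taking a single step.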
4.3 Critical issues
One big issue is the number of hidden units. This is the main design choice since the
number of input and output units is determined by the problem.
Too many hidden units means that the model will have too many parameters (the weights $\omega, \nu$) and so will fail to generalize if there is not enough training data. Conversely, too few hidden units restricts the class of input-output functions that the multilayer perceptron can represent, and hence prevents it from modeling the data correctly. This is the classic bias-variance dilemma (previous lecture).
A popular strategy is to have a large number of hidden units but to add a regularizer term that penalizes the strength of the weights. This can be done by adding an additional energy term:
$$\lambda\Big\{\sum_{h,j}\omega_{hj}^2 + \sum_{i,h}\nu_{ih}^2\Big\}.$$
This term encourages the weights to be small, and maybe even to be zero, unless the data says otherwise. Using an $L_1$-norm penalty term is even better for this.
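As a sketch, the penalty simply adds an extra term to each weight gradient; the value of lambda and the L1 option are illustrative assumptions:

import numpy as np

def add_weight_penalty(gW, gV, W, V, lam=1e-3, l1=False):
    # Gradient of lambda * (sum_{h,j} W^2 + sum_{i,h} V^2), or of an L1 penalty.
    if l1:
        return gW + lam * np.sign(W), gV + lam * np.sign(V)
    return gW + 2.0 * lam * W, gV + 2.0 * lam * V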
Still, the number of hidden units remains an open question, and in practice some of the most effective multilayer perceptrons are those whose structure was hand-designed (by trial and error).
4.4 Relation to Support Vector Machines
In a perceptron we get $y_i = \sum_h \nu_{ih} z_h$ at the output layer, and at the hidden layer we get $z_h = \sigma(\sum_j \omega_{hj} x_j)$ from the input layer.
Support Vector Machines (SVM) can also be represented in this way.
$$y = \mathrm{sign}\Big(\sum_\mu \alpha_\mu y_\mu\, \vec{x}_\mu\cdot\vec{x}\Big),$$
with $\vec{x}_\mu\cdot\vec{x} = z_\mu$ the hidden-unit responses, i.e., $y = \mathrm{sign}(\sum_\mu \alpha_\mu y_\mu z_\mu)$.
An advantage of SVM is that the number of hidden units is given by the number of support vectors. The coefficients $\{\alpha_\mu\}$ are specified by solving the SVM dual optimization problem, and there is a well-defined algorithm to perform this optimization.
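A sketch of this two-layer view of the SVM decision function (the support vectors, labels y_mu, coefficients alpha_mu, and bias b are assumed to be given by the SVM training procedure):

import numpy as np

def svm_as_two_layer(x, support_x, support_y, alpha, b=0.0):
    # Hidden layer: one unit per support vector, z_mu = x_mu . x
    z = support_x @ x
    # Output layer: y = sign(sum_mu alpha_mu y_mu z_mu + b)
    return np.sign(np.sum(alpha * support_y * z) + b)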