Lec3 Learning
1
Topics for the day
• The problem of learning
• The perceptron rule for perceptrons
– And its inapplicability to multi-layer perceptrons
• Greedy solutions for classification networks:
ADALINE and MADALINE
• Learning through Empirical Risk Minimization
• Intro to function optimization and gradient
descent
2
Recap
3
These boxes are functions
[Figure: three N.Net boxes — voice signal → transcription, image → text caption, game state → next move]
• Take an input
• Produce an output
• Can be modeled by a neural network!
4
Questions
Something Something
N.Net out
in
• Preliminaries:
– How do we represent the input?
– How do we represent the output?
• How do we compose the network that performs
the requisite function?
5
The original perceptron
[Figure: perceptron — weighted inputs summed with a bias, passed through an activation]
• Perceptron
– General setting, inputs are real valued
– A bias representing a threshold to trigger the perceptron
– Activation functions are not necessarily threshold functions
• The parameters of the perceptron (which determine how it behaves) are its weights and bias
8
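As a concrete sketch, the computation a single perceptron performs can be written in a few lines (NumPy; the threshold activation, weights, and bias below are illustrative choices, not the only options):

```python
import numpy as np

def perceptron(x, w, b, activation=lambda z: 1 if z >= 0 else 0):
    # Weighted sum of the inputs plus the bias, passed through an activation.
    z = np.dot(w, x) + b
    return activation(z)

# A unit with weights [1, 1] and bias -1.5 computes logical AND
# of two binary inputs under the threshold activation.
print(perceptron(np.array([1, 1]), np.array([1.0, 1.0]), -1.5))  # 1
print(perceptron(np.array([1, 0]), np.array([1.0, 1.0]), -1.5))  # 0
```

Changing the weights and bias changes the function the unit computes, which is exactly what "learning" will adjust.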
Preliminaries: Redrawing the neuron
[Figure: neuron with weighted inputs and a bias drawn as a weight on a constant input of 1]
• Given: the architecture of the network
• The parameters of the network: the weights and biases
– The weights associated with the blue arrows in the picture
• Learning the network: determining the values of these parameters such that the network computes the desired function
11
• Moving on..
12
The MLP can represent anything
13
Option 1: Construct by hand
[Figure: hand construction of a network for the diamond-shaped region with vertices (0,1), (1,0), (0,−1), (−1,0): one threshold unit over inputs X1, X2 for each bounding line, combined by an output unit with weights 1, 1, 1, 1 and bias −4]
Assuming simple perceptrons:
output = 1 only if all four line units fire
19
Problem is unknown
[Figure: samples (𝑋ᵢ, 𝑑ᵢ) drawn from the unknown function]
• Sample
– Basically, get input-output pairs for a number of samples of input
• Many samples (𝑋ᵢ, 𝑑ᵢ), where 𝑑ᵢ = 𝑔(𝑋ᵢ) + noise
27
Poll 1
• Since neural networks are universal approximators, any network of any
architecture can approximate any function to arbitrary precision (True or
False):
– True
– False
30
History: The original MLP
[Figure: MLP of threshold units]
• The original MLP as proposed by Minsky: a network of threshold units
– But how do you train it?
• Given only “training” instances of input-output pairs
31
The simplest MLP: a single perceptron
[Figure: a single perceptron learning a linear boundary between two classes in the (x1, x2) plane]
• Given a number of input-output pairs, learn the weights and bias
• Boundary: ∑ᵢ 𝑤ᵢ𝑥ᵢ + 𝑏 = 0
[Figure: the perceptron redrawn with the bias as weight 𝑤_{N+1} on a constant input 𝑥_{N+1} = 1]
• Restating the perceptron equation by adding another dimension to 𝑋: ∑ᵢ₌₁^{N+1} 𝑤ᵢ𝑥ᵢ = 0, where 𝑥_{N+1} = 1
– Let vector 𝑊 = [𝑤₁, 𝑤₂, … , 𝑤_{N+1}] and vector 𝑋 = [𝑥₁, 𝑥₂, … , 𝑥_N, 1]
– ∑ᵢ 𝑤ᵢ𝑥ᵢ = 𝑊ᵀ𝑋 is an inner product
– 𝑊ᵀ𝑋 = 0 is the hyperplane comprising all 𝑋s orthogonal to vector 𝑊
• Learning the perceptron = finding the weight vector 𝑊 for the separating hyperplane
• 𝑊 points in the direction of the positive class
37
The Perceptron Problem
Key: Red = +1, Blue = −1
39
Perceptron Algorithm: Summary
40
Perceptron Learning Algorithm
• Given training instances (𝑋ᵢ, 𝑦ᵢ)
– 𝑦ᵢ = +1 or 𝑦ᵢ = −1
Using a +1/−1 representation for classes to simplify notation
• Initialize 𝑊
• Cycle through the training instances:
• do
– For (𝑋ᵢ, 𝑦ᵢ) in 𝑡𝑟𝑎𝑖𝑛:
• 𝑂(𝑋ᵢ) = sign(𝑊ᵀ𝑋ᵢ)
• If 𝑂(𝑋ᵢ) ≠ 𝑦ᵢ: 𝑊 = 𝑊 + 𝑦ᵢ𝑋ᵢ
• until no more classification errors
41
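The perceptron rule can be sketched in a few lines of NumPy. This is a minimal illustration, not the only formulation: the toy data, the `max_epochs` safety cap, and the tie-breaking of sign(0) as +1 are all illustrative choices.

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron rule, with the bias folded in as an extra weight
    on a constant input x_{N+1} = 1 and classes labeled +1/-1."""
    Xa = np.hstack([X, np.ones((len(X), 1))])      # augmented inputs
    w = np.zeros(Xa.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(Xa, y):
            o = 1 if w @ xi >= 0 else -1           # O(X_i)
            if o != yi:                            # misclassified:
                w += yi * xi                       #   W = W + y_i X_i
                errors += 1
        if errors == 0:                            # perfect classification
            break
    return w

# Linearly separable toy data: the class is the sign of x1.
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.5, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
Xa = np.hstack([X, np.ones((len(X), 1))])
preds = np.where(Xa @ w >= 0, 1, -1)
print(preds)  # [ 1  1 -1 -1]
```

On separable data such as this, the loop halts after a finite number of updates, which is exactly the convergence guarantee discussed below.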
A Simple Method: The Perceptron Algorithm
[Figure: initialization — an arbitrary boundary between the +1 (red) and −1 (blue) points]
43
Perceptron Algorithm
[Figure: the new decision boundary separating the +1 (red) and −1 (blue) points]
• Perfect classification, no more updates, we are done
• If the classes are linearly separable, guaranteed to converge in a finite number of steps
52
Convergence of Perceptron Algorithm
• Guaranteed to converge if classes are linearly separable
[Figure: overlapping red and blue points that no hyperplane can separate]
• When classes are not linearly separable, not possible to find a separating hyperplane
– No “support” plane for reflected data
– Some points will always lie on the other side
• Model does not support perfect classification of this data
55
A simpler solution
Key: Red = +1, Blue = −1
• Learning the perceptron: find a plane such that all the modified (reflected, 𝑦ᵢ𝑋ᵢ) features lie on one side of the plane
– Such a plane can always be found if the classes are linearly separable
57
The Perceptron Solution:
Linearly separable case
Key: Red = +1, Blue = −1
• In the linearly separable case, such a separating plane exists and the algorithm finds it
59
History: A more complex problem
[Figure: the double-pentagon classification pattern over inputs x1, x2]
• Even using the perfect architecture…
• … can we use perceptron learning rules to learn this classification function?
61
The pattern to be learned at the lower level
[Figure: the decision pattern over x1, x2, with one linear boundary highlighted]
• Consider a single linear classifier that must be learned from the training data
– Can it be learned from this data?
63
Poll 2
• For the double-pentagon problem, given the data shown on slide 60 and
given that all but the one neuron highlighted in yellow are already
correctly learned, can we use the perceptron learning algorithm to learn
the one remaining neuron?
– Yes
– No
• What problems do you see in using the perceptron rule to learn the
remaining perceptron?
– Perceptron learning will require linearly separable classes to learn the model
that classifies the data perfectly, but the data are not linearly separable
– Perceptron learning will require relabelling the data to make them linearly
separable with the correct decision boundary
65
The pattern to be learned at the lower level
[Figure: the labelling the individual linear classifier actually requires]
• Consider a single linear classifier that must be learned from the training data
– Can it be learned from this data?
– The individual classifier actually requires the kind of labelling shown here
• Which is not given!!
66
The pattern to be learned at the lower level
[Figure: the relabelling required for each of the component lines]
• This must be done for each of the lines (perceptrons)
• Such that, when all of them are combined by the higher-level perceptrons, we get the desired pattern
– Basically an exponential search over inputs
69
Individual neurons represent one of the lines that compose the figure (linear classifiers). Must know the desired output of every neuron for every training instance, in order to learn this neuron.
The outputs should be such that the neuron individually has a linearly separable task.
The linear separators must combine to form the desired boundary.
[Figure: the double-pentagon network over inputs x1, x2]
• Training this network using the perceptron rule is a combinatorial optimization
problem
• We don’t know the outputs of the individual intermediate neurons in the network
for any training input
• Must also determine the correct output for each neuron for every training
instance
• At least exponential (in inputs) time complexity!!!!!!
71
Greedy algorithms: Adaline and
Madaline
• Perceptron learning rules cannot directly be
used to learn an MLP
– Exponential complexity of assigning intermediate
labels
• Even worse when classes are not actually separable
72
A little bit of History: Widrow
Bernie Widrow
• Scientist, Professor, Entrepreneur
• Inventor of most useful things in
signal processing and machine
learning!
75
History: Learning in ADALINE
76
The ADALINE learning rule
• Online learning rule
• After each input 𝑋ᵢ that has target (binary) output 𝑑ᵢ, compute the linear output 𝑦ᵢ = 𝑊ᵀ𝑋ᵢ and update:
𝑊 = 𝑊 + η(𝑑ᵢ − 𝑦ᵢ)𝑋ᵢ
77
The Delta Rule
• Error: 𝑒 = 𝑑 − 𝑦
• Update: 𝑤ᵢ = 𝑤ᵢ + η(𝑑 − 𝑦)𝑥ᵢ
• For both the inputs 𝑥ᵢ and the bias (treated as a weight on a constant input 𝑥 = 1)
78
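A minimal sketch of this Widrow-Hoff (delta rule) update in NumPy. The learning rate `eta`, the toy data, and the epoch count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def adaline_step(w, x, d, eta=0.1):
    """One delta-rule update: move the weights in proportion to the
    error between the target d and the linear output y = w.x."""
    y = w @ x                      # linear (pre-threshold) output
    return w + eta * (d - y) * x   # w_i <- w_i + eta * (d - y) * x_i

# Online passes over two samples with +1/-1 targets
# (bias folded in as a weight on a constant input of 1).
X = np.array([[1.0, 2.0, 1.0], [-1.0, -1.0, 1.0]])
d = np.array([1.0, -1.0])
w = np.zeros(3)
for _ in range(50):
    for xi, di in zip(X, d):
        w = adaline_step(w, xi, di)

print(X @ w)  # close to the targets [1, -1]
```

Unlike the perceptron rule, the update is driven by the real-valued error on the linear output, so every sample nudges the weights, not just misclassified ones.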
Aside: Generalized delta rule
• For any differentiable activation function 𝑦 = 𝑓(𝑧), the following update rule is used:
𝑤ᵢ = 𝑤ᵢ + η(𝑑 − 𝑦)𝑓′(𝑧)𝑥ᵢ
79
• Multiple Adaline
– A multilayer perceptron with threshold activations
– The MADALINE
80
MADALINE Training
[Figure: a MADALINE network with +/− unit outputs marked]
81
Story so far
• “Learning” a network = learning the weights and biases to compute a target function
– Will require a network with sufficient “capacity”
• In practice, we learn networks by “fitting” them to match the input-output relation of
“training” instances drawn from the target function
87
History..
• The realization that training an entire MLP was
a combinatorial optimization problem stalled
development of neural networks for well over
a decade!
88
Why this problem?
[Figure: the double-pentagon network built from threshold units]
• Let's make the neuron differentiable, with non-zero derivatives over much of the input space
– Small changes in weight can result in non-negligible changes in output
– This enables us to estimate the parameters using gradient descent techniques..
92
Differentiable activation function
[Figure: threshold activations with thresholds T1 and T2]
• Threshold activation: shifting the threshold from T1 to T2 does not change classification error
– Does not indicate if moving the threshold left was good or not
[Figure: smooth activations crossing 0.5 at T1 and T2]
• Smooth, continuously varying activation: classification based on whether the output is greater than 0.5 or less
– Can now quantify how much the output differs from the desired target value (0 or 1)
– Moving the function left or right changes this quantity, even if the classification error itself doesn’t change
93
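The contrast can be checked numerically. This is a toy 1-D sketch: the four data points, the two candidate thresholds, and the squared-error loss on a sigmoid output are all illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-2.0, -1.0, 1.0, 2.0])   # 1-D inputs
t = np.array([0, 0, 1, 1])             # target classes

def class_error(thresh):
    # 0/1 classification error of a threshold activation at `thresh`
    return np.mean((x >= thresh).astype(int) != t)

def sq_loss(thresh):
    # squared deviation of a shifted sigmoid output from the targets
    return np.mean((sigmoid(x - thresh) - t) ** 2)

# Moving the threshold between the middle samples leaves the
# classification error unchanged...
print(class_error(0.0) == class_error(0.5))  # True
# ...but the smooth loss does change, indicating which shift is better.
print(sq_loss(0.0) < sq_loss(0.5))           # True
```

The smooth loss is exactly the quantity a gradient-based learner can follow even when the 0/1 error surface is flat.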
Poll 3
• Which of the following are true of the threshold activation
– Increasing (or decreasing) the threshold will not change the overall classification error unless
the threshold moves past a misclassified training sample
– We cannot know if a change (increase or decrease) of the threshold moves it in the correct
direction that will result in a net decrease in classification error
– The derivative of the classification error with respect to the threshold gives us an indication of
whether to increase or decrease the threshold
95
Continuous Activations
[Figure: perceptron with a continuous activation]
• Replace the threshold activation with continuous graded activations
– E.g. ReLU, softplus, sigmoid, etc.
[Figure: two-dimensional data that no line separates cleanly]
• Two-dimensional example
– Blue dots (on the floor) on the “red” side
– Red dots (suspended at Y=1) on the “blue” side
– No line will cleanly separate the two colors
98
Non-linearly separable data: 1-D example
[Figure: 1-D data with y=0 and y=1 labels interleaved along x]
115
Perceptrons with differentiable activation functions
[Figure: perceptron with a differentiable activation]
• 𝑦 = 𝑓(𝑧), with 𝑧 = ∑ᵢ 𝑤ᵢ𝑥ᵢ, is a differentiable function of 𝑧
– 𝑑𝑦/𝑑𝑧 is well-defined and finite for all 𝑧
• Using the chain rule, 𝑦 is a differentiable function of both the inputs 𝑥ᵢ and the weights 𝑤ᵢ
• This means that we can compute the change in the output for small changes in either the input or the weights
116
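A quick sketch verifying the chain-rule derivative for a sigmoid unit against a finite difference (the specific weights and inputs are arbitrary illustrative values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.0, 0.25])
x = np.array([1.0, 2.0, -1.0])

z = w @ x
y = sigmoid(z)

# Chain rule: dy/dw_i = sigma'(z) * x_i, using sigma' = sigma * (1 - sigma)
analytic = sigmoid(z) * (1 - sigmoid(z)) * x

# Finite-difference check of the same derivatives
eps = 1e-6
numeric = np.array([
    (sigmoid((w + eps * np.eye(3)[i]) @ x) - y) / eps for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```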
Overall network is differentiable
Figure does not show
bias connections
120
Minimizing expected divergence
121
Recap: Sampling the function
[Figure: samples (𝑋ᵢ, 𝑑ᵢ) drawn from the target function]
• The expected divergence (or risk) is the average divergence over the entire input space
• The empirical estimate of the expected risk is the average divergence over the samples
123
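A sketch of the empirical estimate. The target function `g`, the noise level, and the squared-error divergence are illustrative assumptions standing in for the unknown function and the lecture's generic divergence:

```python
import numpy as np

def g(x):                        # the (normally unknown) target function
    return 2.0 * x + 1.0

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=1000)
d = g(X) + 0.1 * rng.standard_normal(1000)   # noisy samples (X_i, d_i)

def f(x, w, b):                  # a candidate model
    return w * x + b

def empirical_risk(w, b):
    # Average divergence (here: squared error) over the drawn samples
    return np.mean((f(X, w, b) - d) ** 2)

# The empirical risk is smaller near the true parameters than away from them.
print(empirical_risk(2.0, 1.0), empirical_risk(0.0, 0.0))
```

With enough samples, this average approximates the expected divergence over the input distribution, which is why minimizing it is a sensible surrogate.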
Empirical Risk Minimization
– I.e. minimize the empirical risk over the drawn samples
124
Empirical Risk Minimization
– I.e. minimize the empirical error over the drawn samples
125
ERM for neural networks
• Actual output of network: 𝑌 = 𝑓(𝑋; 𝑊)
• We define a differentiable divergence between the output of the network and the desired output for the training instances
– And a total error, which is the average divergence over all training instances
• We optimize the network parameters 𝑊 to minimize this error
– Empirical risk minimization
• This is an instance of function minimization
128
• A CRASH COURSE ON FUNCTION
OPTIMIZATION
– With an initial discussion of derivatives
129
A brief note on derivatives..
• Derivative: 𝑓′(𝑥) = lim_{Δ𝑥→0} (𝑓(𝑥 + Δ𝑥) − 𝑓(𝑥)) / Δ𝑥
• Often represented as 𝑑𝑓/𝑑𝑥
• Or alternately (and more reasonably) as the multiplier relating small changes: 𝑑𝑓 = 𝑓′(𝑥)𝑑𝑥
131
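The limit definition can be approximated numerically. This sketch uses a symmetric difference (an illustrative choice; a one-sided difference also works but is less accurate):

```python
def derivative(f, x, eps=1e-6):
    # Symmetric-difference approximation to df/dx
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# d/dx of x^2 at x = 3 is 2x = 6
print(derivative(lambda v: v * v, 3.0))  # ≈ 6.0
```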
Scalar function of scalar argument
[Figure: a function with the sign of its derivative marked — positive where the function increases, negative where it decreases, 0 at flat points]
The derivative will be 0 where the function is locally flat (neither increasing nor decreasing)
132
Multivariate scalar function:
Scalar function of vector argument
Note: 𝑋 is now a vector
133
Multivariate scalar function:
Scalar function of vector argument
Note: 𝑋 is now a vector
• The gradient ∇𝑓 is the transpose of the derivative
– A column vector of the same dimensionality as 𝑋
135
Gradient of a scalar function of a vector
• The gradient is the transpose of the derivative
– A column vector of the same dimensionality as 𝑋
• 𝑑𝑓 = ∇𝑓ᵀ𝑑𝑋 is a vector inner product. To understand its behavior let's consider a well-known property of inner products
136
A well-known vector property
[Figure: the inner product of two vectors is largest when they are aligned]
• For an increment 𝑑𝑋 of any given length, 𝑑𝑓 = ∇𝑓ᵀ𝑑𝑋 is max if 𝑑𝑋 is aligned with ∇𝑓ᵀ
139
Gradient
[Figure: the gradient vector ∇𝑓ᵀ at a point; moving in this direction increases 𝑓 fastest]
140
Gradient
[Figure: moving along ∇𝑓ᵀ increases 𝑓 fastest; moving along −∇𝑓ᵀ decreases 𝑓 fastest]
141
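This property can be checked numerically: among small steps of equal length in every direction, the step along the gradient increases the function the most. The quadratic `f` and the 1-degree direction grid are illustrative choices.

```python
import numpy as np

def f(v):
    return v[0] ** 2 + 3.0 * v[1] ** 2

def grad_f(v):
    return np.array([2.0 * v[0], 6.0 * v[1]])

x = np.array([1.0, -0.5])
g = grad_f(x)
eps = 1e-3

# Try many unit directions; measure the change in f for a small step.
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
changes = np.array([f(x + eps * u) - f(x) for u in dirs])

best = dirs[np.argmax(changes)]
print(best, g / np.linalg.norm(g))  # best step direction ≈ gradient direction
```

Stepping along −∇𝑓 instead gives the largest decrease, which is the idea behind gradient descent.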
Gradient
[Figure: the gradient is 0 at the local maxima and minima of the function]
142
Properties of Gradient: 2
• The Hessian: the matrix of second derivatives

$$
\nabla^2_X f(x_1, \ldots, x_n) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
$$
144
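A sketch of estimating this matrix by central finite differences (the quadratic `f` is an illustrative choice; for a quadratic the Hessian is constant, which makes the result easy to check):

```python
import numpy as np

def f(v):
    return v[0] ** 2 + 3.0 * v[0] * v[1] + 2.0 * v[1] ** 2

def hessian(f, x, eps=1e-4):
    # Central-difference estimate of all second partial derivatives
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i] * eps, np.eye(n)[j] * eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

H = hessian(f, np.array([0.5, -1.0]))
print(H)  # ≈ [[2, 3], [3, 4]]
```

Note the symmetry H[i, j] = H[j, i], which holds for any twice continuously differentiable f.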
Next up
• Continuing on function optimization
145
Poll 4
• Select all that are true about derivatives of a
scalar function f(X) of multivariate inputs
– At any location X, there may be many directions
in which we can step, such that f(X) increases
– The direction of the gradient is the direction in
which the function increases fastest
– The gradient is the derivative of f(X) w.r.t. X
147