L2 Perceptrons, Function Approximation, Classification
Learning: Perceptrons, Feed-Forward Nets and partition function specifications
PERCEPTRONS
A better model
• Frank Rosenblatt
– Psychologist, Logician
– Inventor of the solution to everything, aka the Perceptron (1958)
Rosenblatt’s perceptron
[Figure: Rosenblatt's perceptron: N inputs, each multiplied by a weight, are summed and compared against a threshold to produce the output.]
• A threshold unit
– “Fires” if the affine function of inputs is positive
• The bias is the negative of the threshold T in the previous slide (a minimal sketch follows)
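Below is a minimal sketch of such a threshold unit, written with the bias convention just described; the function name `perceptron_fires` and the example weights are illustrative, not taken from the slides.

```python
# A minimal sketch (not from the slides): a threshold unit with a bias,
# where bias = -T turns "sum of w_i * x_i >= T" into "affine function > 0".
def perceptron_fires(weights, bias, inputs):
    """Return 1 if the affine function of the inputs is positive, else 0."""
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if z > 0 else 0

# Example: two-input unit with threshold T = 1.5 (so bias = -1.5) computes AND.
print(perceptron_fires([1.0, 1.0], -1.5, [1, 1]))  # 1
print(perceptron_fires([1.0, 1.0], -1.5, [1, 0]))  # 0
```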
The standard paradigm for statistical pattern recognition; the standard Perceptron architecture
[Figure: the standard Perceptron architecture, with the input units at the bottom of the network.]
How to learn biases using the same rule
as we use for learning weights
Sequential Learning:
$d(\mathbf{x})$ is the desired output in response to input $\mathbf{x}$
$y(\mathbf{x})$ is the actual output in response to $\mathbf{x}$
• Boolean tasks
• Update the weights whenever the perceptron output is
wrong
– Update the weight by the product of the input and the
error between the desired and actual outputs
• Proved convergence for linearly separable classes
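A hedged sketch of this sequential rule, assuming a unit-step activation and a learning rate `eta`; the function names and the OR example are illustrative, not from the slides.

```python
# Sequential perceptron learning: update only when the output is wrong,
# by the product of the input and the error between desired and actual outputs.
def step(z):
    return 1 if z > 0 else 0

def train_perceptron(samples, n_inputs, eta=1.0, epochs=100):
    """samples: list of (input_vector, desired_output) pairs for a Boolean task."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        for x, d in samples:
            y = step(b + sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:                       # update only when the output is wrong
                err = d - y                  # error between desired and actual output
                w = [wi + eta * err * xi for wi, xi in zip(w, x)]
                b += eta * err               # bias learned with the same rule (input = 1)
    return w, b

# Converges for linearly separable classes, e.g. OR:
w, b = train_perceptron([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)], 2)
```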
Perceptron
[Figure: single threshold units over inputs X and Y, with example weights on the edges and thresholds inside the units, computing simple Boolean functions; the final panel, with its weights marked "?", asks whether a single unit can compute XOR.]
A single neuron is not enough
[Figure: a two-layer network of threshold units over X and Y (weights 1, -1, and 2 shown on the edges) that computes XOR via a hidden layer.]
Hidden Layer
• XOR
– The first layer is a “hidden” layer
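A hand-built sketch of this idea; the weights below are my own illustrative choice, not the ones in the figure.

```python
# XOR(X, Y) built from threshold units with one hidden layer:
# XOR = OR(X, Y) AND NOT AND(X, Y).
def unit(weights, bias, inputs):
    return 1 if bias + sum(w * x for w, x in zip(weights, inputs)) > 0 else 0

def xor(x, y):
    h1 = unit([1, 1], -0.5, [x, y])        # hidden unit: OR
    h2 = unit([1, 1], -1.5, [x, y])        # hidden unit: AND
    return unit([1, -1], -0.5, [h1, h2])   # output: OR and not AND

assert [xor(0, 0), xor(0, 1), xor(1, 0), xor(1, 1)] == [0, 1, 1, 0]
```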
A more generic model
[Figure: a multi-layer network of threshold units over inputs X, Y, Z, A; the weights and thresholds of each unit are shown in the diagram.]
• A “multi-layer” perceptron
• Can compose arbitrarily complicated Boolean functions!
– In cognitive terms: Can compute arbitrary Boolean functions over
sensory input
– More on this in the next class
Neuron model: Logistic unit
Credit: Andrew Ng
Neural Network
[Figure: a small neural network over inputs x1 and x2.]
Simple example: AND – a single logistic unit whose output is close to 1 only when both inputs are 1

x1 x2 | output
 0  0 | ~0
 0  1 | ~0
 1  0 | ~0
 1  1 | ~1
Example: OR function – bias -10, weights 20 and 20, i.e. h(x) = g(-10 + 20·x1 + 20·x2)

x1 x2 | h(x)
 0  0 | g(-10) ~0
 0  1 | g(+10) ~1
 1  0 | g(+10) ~1
 1  1 | g(+30) ~1
Negation (NOT x):

 x | output
 0 |   1
 1 |   0
Putting it together: x1 XNOR x2
Hidden unit a1 (x1 AND x2):              bias -30, weights 20, 20
Hidden unit a2 ((NOT x1) AND (NOT x2)):  bias  10, weights -20, -20
Output unit (a1 OR a2):                  bias -10, weights 20, 20

x1 x2 | a1 a2 | output
 0  0 |  0  1 |   1
 0  1 |  0  0 |   0
 1  0 |  0  0 |   0
 1  1 |  1  0 |   1

(A short sketch verifying these values follows.)
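The sketch below checks the table with logistic units; the weight values are the ones listed above, and reading the composed network as XNOR is my summary of the resulting table.

```python
# Verify the AND / NOR / OR composition of logistic units shown above.
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(bias, weights, inputs):
    return g(bias + sum(w * x for w, x in zip(weights, inputs)))

for x1 in (0, 1):
    for x2 in (0, 1):
        a1 = logistic_unit(-30, [20, 20], [x1, x2])    # x1 AND x2
        a2 = logistic_unit(10, [-20, -20], [x1, x2])   # (NOT x1) AND (NOT x2)
        out = logistic_unit(-10, [20, 20], [a1, a2])   # a1 OR a2  ->  XNOR(x1, x2)
        print(x1, x2, round(a1), round(a2), round(out))
```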
Capabilities of Threshold Neurons
• What do we do if we need a more complex function?
• Just like Threshold Logic Units, we can also combine multiple artificial neurons to form networks with increased capabilities.
• For example, we can build a two-layer network with any number of neurons in the first layer giving input to a single neuron in the second layer.
• The neuron in the second layer could, for example, implement an AND function.
Capabilities of Threshold Neurons
[Figure: a two-layer network in which first-layer threshold units over inputs x1, x2, …, xi feed a single second-layer unit oi; plots over the input plane (1st comp., 2nd comp.) show the regions carved out by the combined network.]
• Add an extra component with value 1 to each input vector. The “bias” weight on this
component is minus the threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that every training case will keep
getting picked.
– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input vector to the weight
vector.
– If the output unit incorrectly outputs a 1, subtract the input vector from the
weight vector.
• This is guaranteed to find a set of weights that gets the right answer for all the
training cases if any such set exists.
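A hedged sketch of this procedure, with the constant-1 component appended to each input so the bias becomes an ordinary weight; the function names and the sweep schedule below are illustrative choices.

```python
# Perceptron convergence procedure: add/subtract whole (augmented) input vectors.
def predict(w, x_aug):
    return 1 if sum(wi * xi for wi, xi in zip(w, x_aug)) > 0 else 0

def perceptron_convergence(cases, n_inputs, sweeps=100):
    """cases: list of (input_vector, target) with target in {0, 1}."""
    w = [0.0] * (n_inputs + 1)
    for _ in range(sweeps):                      # every case keeps getting picked
        for x, t in cases:
            x_aug = list(x) + [1.0]              # extra component with value 1
            y = predict(w, x_aug)
            if y == t:
                continue                         # correct: leave weights alone
            if t == 1:                           # incorrectly output 0: add the input
                w = [wi + xi for wi, xi in zip(w, x_aug)]
            else:                                # incorrectly output 1: subtract the input
                w = [wi - xi for wi, xi in zip(w, x_aug)]
    return w
```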
FUNCTION APPROXIMATION,
CLASSIFICATION….
[Figure: networks of threshold units over inputs X and Y. Values in the circles are thresholds (including L and L-N+1); values on the edges are weights (1 and -1).]
The perceptron is not enough
[Figure: a single perceptron over X and Y with its weights marked "?"; no choice of weights computes XOR.]
Hidden Layer
[Figure: a two-layer network with two hidden threshold units (weights 1, -1, -2, and 2 shown) computing XOR of X and Y.]
• With 2 neurons
– 5 weights and two thresholds
Multi-layer perceptron
[Figure: a multi-layer perceptron of threshold units over inputs X, Y, Z, A, with weights and thresholds shown.]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions
MLP as Boolean Functions
[Figure: the same multi-layer perceptron of threshold units over inputs X, Y, Z, A.]

A truth table over inputs X1 … X5 (only the rows with Y = 1 are shown). Build one hidden unit that fires only for each such row, with all five inputs feeding every hidden unit, and OR the hidden outputs together (a sketch of this construction follows the bullets below):

X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
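A sketch of the row-detector construction described above; the particular thresholds and weights are one standard choice, not necessarily the ones in the figure.

```python
# One threshold hidden unit per row of the truth table with Y = 1, ORed at the output.
ROWS = [  # (X1..X5) combinations for which Y = 1, from the table above
    (0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
    (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1),
]

def row_detector(row, x):
    """Fires only on the exact pattern `row`: weight +1 where the row has a 1,
    -1 where it has a 0, threshold = number of 1s in the row."""
    score = sum((1 if r == 1 else -1) * xi for r, xi in zip(row, x))
    return 1 if score >= sum(row) else 0

def mlp(x):
    hidden = [row_detector(r, x) for r in ROWS]
    return 1 if sum(hidden) >= 1 else 0          # output unit = OR

# The MLP reproduces the table: every listed row maps to 1.
assert all(mlp(r) == 1 for r in ROWS)
```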
Recap: The MLP as a classifier
[Figure: an MLP taking 784-dimensional inputs (MNIST digit images) and classifying them.]
[Figure: a perceptron over real-valued inputs x1 … xN. It outputs 1 when $\sum_i w_i x_i \ge T$ and 0 otherwise; in two dimensions the decision boundary is the line $w_1 x_1 + w_2 x_2 = T$.]
• A perceptron operates on real-valued vectors
– This is a linear classifier
Boolean functions with a
real perceptron
[Figure: Boolean inputs as the corners (0,0), (0,1), (1,0), (1,1) of the unit square in the (X, Y) plane, with a perceptron's linear decision boundary separating the corners.]
Booleans over the reals
[Figure: a single perceptron's linear decision boundary partitions the real (x1, x2) plane; Boolean combinations of such perceptrons carve out more complex regions.]
[Figure: five perceptrons with outputs y1 … y5, each defining one edge of a pentagon, feed an AND unit. The sum of their outputs takes values 3, 4, or 5 in different regions of the plane and equals 5 only inside the pentagon; with six edges and outputs y1 … y6 the test becomes $\sum_i y_i \ge 6$.]

In general, the composed unit fires only inside the N-sided polygon:
$$\sum_{i=1}^{N} y_i \ge N\,?$$
In the limit
$$\sum_{i=1}^{N} y_i \ge N\,?$$
[Figure: as N grows, the polygon defined by the N perceptron outputs y1, y2, … approaches a circle of a given radius in the (x1, x2) plane; the sum at the output unit depends only on the radius and the distance of the input $\mathbf{x}$ from the center.]
• Value of the sum at the output unit, as a function of distance from center, as N increases
• For small radius, it's a near-perfect cylinder
– N in the cylinder, N/2 outside
Composing a circle
$$\sum_{i=1}^{N} y_i \ge N\,?$$
[Figure: the circle composed of N perceptrons; inside the circle the sum is N, outside it falls towards N/2.]
$$\sum_{i=1}^{KN} \mathbf{y}_i - \frac{N}{2} \ge 0\,?$$
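A numerical sketch of this construction, using N half-planes tangent to a unit circle; the specific geometry and all names are my own illustration.

```python
# N linear threshold units, each keeping the inner side of one edge of a regular
# N-gon around a circle, feed a unit that checks whether all of them fire (sum >= N).
import math

def circle_mlp(x1, x2, n=64, radius=1.0):
    total = 0
    for i in range(n):
        theta = 2 * math.pi * i / n
        # Unit i fires if the point is on the inner side of the tangent line
        # at angle theta: cos(theta)*x1 + sin(theta)*x2 <= radius.
        y_i = 1 if math.cos(theta) * x1 + math.sin(theta) * x2 <= radius else 0
        total += y_i
    return 1 if total >= n else 0       # fires only (approximately) inside the circle

print(circle_mlp(0.0, 0.0))   # 1: center is inside
print(circle_mlp(2.0, 0.0))   # 0: outside the unit circle
print(circle_mlp(0.9, 0.0))   # 1: inside
```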
[Figure: two threshold units over a scalar input x, with thresholds T1 and T2 and output weights +1 and -1, feed a summing unit; the result f(x) is a square pulse that is 1 for T1 ≤ x ≤ T2 and 0 elsewhere. A bias of -N/2 also appears in the figure.]
MLP as a continuous-valued
function
[Figure: an MLP composes a continuous-valued function of its input as a weighted sum of many such square pulses, one pair of threshold units per pulse (a sketch of this idea follows).]
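A sketch of this pulse-summing idea for a one-dimensional function; the target sin(x), the number of pulses, and all names are illustrative assumptions, not taken from the slides.

```python
# Approximate f(x) = sin(x) on [0, 2*pi] by a sum of square pulses, each pulse
# built from two threshold units (+1 at the left edge, -1 at the right edge),
# scaled by the function's value near the pulse center.
import math

def pulse(x, left, right):
    """Square pulse from a pair of threshold units: 1 on [left, right), else 0."""
    return (1 if x >= left else 0) - (1 if x >= right else 0)

def mlp_approx(x, n_pulses=100, lo=0.0, hi=2 * math.pi):
    width = (hi - lo) / n_pulses
    total = 0.0
    for i in range(n_pulses):
        left = lo + i * width
        height = math.sin(left + width / 2)      # scale each pulse by f at its center
        total += height * pulse(x, left, left + width)
    return total

print(round(mlp_approx(1.0), 3), round(math.sin(1.0), 3))   # close for large n_pulses
```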
• Assuming that we have eliminated the threshold, each training case can
be represented as a hyperplane through the origin.
– The weights must lie on one side of this hyper-plane to get the answer
correct.
Weight space
• Each training case defines a plane (shown as a black line in the figure)
– The plane goes through the origin and is perpendicular to the input vector.
– On one side of the plane the output is wrong, because the scalar product of the weight vector with the input vector has the wrong sign.
[Figure: weight space with the origin marked; an input vector with correct answer = 1, a good weight vector on the correct side of its plane, and a bad weight vector on the wrong side.]
Weight space
• Each training case defines a plane (shown as a black line)
– The plane goes through the origin and is perpendicular to the input vector.
– On one side of the plane the output is wrong, because the scalar product of the weight vector with the input vector has the wrong sign.
[Figure: an input vector with correct answer = 0; good weights lie on the side of the plane where the scalar product is negative, bad weights on the other side; the origin is marked.]
The cone of feasible solutions
• To get all training cases right we need to find a point on the right side of all the planes.
– There may not be any such point!
• If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin.
– So the average of two good weight vectors is a good weight vector.
• The problem is convex.
[Figure: the feasible cone at the origin, bounded by the planes of an input vector with correct answer = 0 and an input vector with correct answer = 1; good weights lie inside the cone, bad weights outside.]
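A small illustration of the convexity claim, using made-up training cases and hand-picked weight vectors; none of these numbers come from the slides.

```python
# If two weight vectors both classify every case correctly, so does their average.
cases = [  # (augmented input vector, target) pairs; last component is the constant 1
    ((1.0, 2.0, 1.0), 1),
    ((-1.0, 1.0, 1.0), 0),
    ((2.0, -1.0, 1.0), 1),
]

def correct_on_all(w):
    return all((sum(wi * xi for wi, xi in zip(w, x)) > 0) == (t == 1) for x, t in cases)

w_a = (1.0, 0.5, -0.5)     # two hand-picked "good" weight vectors
w_b = (2.0, 1.0, -1.5)
w_avg = tuple((a + b) / 2 for a, b in zip(w_a, w_b))

print(correct_on_all(w_a), correct_on_all(w_b), correct_on_all(w_avg))  # True True True
```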
Learning with hidden units
• Networks without hidden units are very limited in the input-output mappings they can
learn to model.
– More layers of linear units do not help. It's still linear.
– Fixed output non-linearities are not enough.
• We need multiple layers of adaptive, non-linear hidden units. But how can we train such
nets?
– We need an efficient way of adapting all the weights, not just the last layer. This is
hard.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.
LEARNING THE WEIGHTS OF A LOGISTIC
OUTPUT NEURON
Logistic neurons
• These give a real-valued output that is a smooth and bounded function of their total input.
– They have nice derivatives which make learning easy.
$$z = b + \sum_i x_i w_i \qquad\qquad y = \frac{1}{1 + e^{-z}}$$
[Figure: the logistic function; y rises smoothly from 0, through 0.5 at z = 0, towards 1.]
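A minimal sketch of this logistic neuron:

```python
# Logistic neuron: y = 1 / (1 + exp(-z)) with z = b + sum_i x_i * w_i.
import math

def logistic_neuron(x, w, b):
    z = b + sum(xi * wi for xi, wi in zip(x, w))
    return 1.0 / (1.0 + math.exp(-z))

print(logistic_neuron([1.0, 2.0], [0.5, -0.25], 0.0))   # sigmoid(0.0) = 0.5
```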
The derivatives of a logistic neuron
$$z = b + \sum_i x_i w_i \qquad\qquad y = \frac{1}{1 + e^{-z}}$$
$$\frac{\partial z}{\partial w_i} = x_i \qquad \frac{\partial z}{\partial x_i} = w_i \qquad \frac{dy}{dz} = y(1 - y)$$
The derivatives of a logistic neuron
$$y = \frac{1}{1 + e^{-z}} = (1 + e^{-z})^{-1}$$
$$\frac{dy}{dz} = \frac{-1\,(-e^{-z})}{(1 + e^{-z})^{2}} = \left(\frac{1}{1 + e^{-z}}\right)\left(\frac{e^{-z}}{1 + e^{-z}}\right) = y(1 - y)$$
• To learn the weights we need the derivative of the output with respect to each weight:
$$\frac{\partial y}{\partial w_i} = \frac{\partial z}{\partial w_i}\,\frac{dy}{dz} = x_i\, y(1 - y)$$
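A quick numerical check of these derivatives (my own sketch; the particular inputs and weights are arbitrary):

```python
# Compare finite differences against dy/dz = y(1-y) and dy/dw_i = x_i * y * (1-y).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, b = [1.5, -2.0], [0.3, 0.1], 0.2
z = b + sum(xi * wi for xi, wi in zip(x, w))
y = sigmoid(z)

eps = 1e-6
numeric_dy_dz = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(numeric_dy_dz, y * (1 - y))            # the two agree closely

# Perturb w_0 and compare against x_0 * y * (1 - y)
w_plus = [w[0] + eps, w[1]]
y_plus = sigmoid(b + sum(xi * wi for xi, wi in zip(x, w_plus)))
print((y_plus - y) / eps, x[0] * y * (1 - y))
```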
delta-rule
$$\frac{\partial E}{\partial w_i} = \sum_n \frac{\partial y^n}{\partial w_i}\,\frac{\partial E}{\partial y^n} = -\sum_n x_i^n\, y^n (1 - y^n)\,(t^n - y^n)$$
• This is the delta-rule with an extra term: $y^n(1 - y^n)$, the slope of the logistic.
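A hedged sketch of batch gradient descent using this rule for a single logistic neuron with squared error; the OR task, learning rate, and epoch count are illustrative choices, not from the slides.

```python
# Batch delta-rule for one logistic neuron, E = 1/2 * sum_n (t_n - y_n)^2.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(data, n_inputs, lr=0.5, epochs=2000):
    """data: list of (input_vector, target) pairs with targets in [0, 1]."""
    w = [0.0] * n_inputs
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_inputs
        grad_b = 0.0
        for x, t in data:
            y = sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))
            delta = y * (1 - y) * (t - y)          # slope of logistic * error
            grad_w = [g + xi * delta for g, xi in zip(grad_w, x)]
            grad_b += delta
        w = [wi + lr * g for wi, g in zip(w, grad_w)]   # dE/dw_i = -sum_n x_i^n delta_n
        b += lr * grad_b
    return w, b

# Example: learn OR with a single logistic neuron.
w, b = train_logistic([((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)], 2)
```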