Perceptron 2014
Volker Tresp
Summer 2014
Introduction
Prototypical Learning Task
Transformation of the Raw Data (2-D) into Pattern Vectors (1-D) as part of a Learning Matrix
Binary Classification
Data Matrix for Supervised Learning
M : number of inputs
N : number of training patterns
xi = (xi,0, . . . , xi,M−1)T : i-th input
xi,j : j-th component of xi
X = (x1, . . . , xN)T : design matrix
yi : i-th target for xi
y = (y1, . . . , yN)T : vector of targets
ŷi : prediction for xi
di = (xi,0, . . . , xi,M−1, yi)T : i-th pattern
D = {d1, . . . , dN} : training data
z : test input
t : unknown target for z
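As a small illustration (not part of the original slides), this notation maps directly onto NumPy arrays; the numeric values below are made up and the variable names are only suggestive:

```python
import numpy as np

# Toy data (made-up values): N = 4 training patterns, M = 3 inputs each.
# By convention the first component x_{i,0} = 1 can act as a constant bias input.
X = np.array([[1.0,  0.5,  1.2],    # x_1
              [1.0, -0.3,  0.8],    # x_2
              [1.0,  2.1, -0.4],    # x_3
              [1.0, -1.5, -0.9]])   # design matrix X, shape (N, M)

y = np.array([1, 1, -1, -1])        # targets y_i for binary classification

N, M = X.shape                        # N training patterns, M inputs
d_1 = np.concatenate([X[0], [y[0]]])  # first pattern d_1 = (x_{1,0}, ..., x_{1,M-1}, y_1)
```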
Model
A Biologically Motivated Model
Input-Output Models
Supervised Learning
• In supervised learning one assumes that in training both inputs and outputs are available
• For example, an input pattern might reflect the attributes of an object and the target is the class membership of this object
Supervised Learning and Learning of Decisions
• One might argue that learning is only of interest if it changes (future) behavior, at least for a biological system
• Many decisions can be reduced to a supervised learning problem: if I can read a ZIP code correctly, I know where the letter should be sent
• But who produces the targets for the intermediate task? For biological systems this is a hotly debated issue: is supervised learning biologically relevant? Is only reinforcement learning, based on rewards and punishment, biologically plausible?
The Perceptron: A Learning Machine
• The Perceptron computes a weighted sum of its inputs, hi = ∑j wj xi,j, and predicts the class
ŷi = sign(hi)
• The linear classification boundary (separating hyperplane) is defined as
hi = 0
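A minimal sketch of the resulting classifier, assuming hi is the weighted sum of the inputs as stated above; the function name and the tie-breaking convention sign(0) = +1 are my own choices:

```python
import numpy as np

def perceptron_predict(X, w):
    """Predict classes +1/-1 for all rows of the design matrix X.

    h_i = sum_j w_j x_{i,j} is the weighted sum; the decision boundary is h_i = 0.
    """
    h = X @ w                        # h_i for every pattern, shape (N,)
    return np.where(h >= 0, 1, -1)   # sign(h_i), with sign(0) mapped to +1 here
```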
Perceptron as a Weighted Voting Machine
2-D Representation of the Decision Boundary
• The class boundaries are often displayed graphically with M = 3 (next slide)
• But note that this 2-D picture can be misleading, since the Perceptron is typically employed in high-dimensional problems (M >> 1)
Two classes that are Linearly Separable
Perceptron Learning Rule
• We define a cost function that is dependent on the training data and the parameters
• In the learning process (training), one attempts to find parameters that minimize the
cost function
The Perceptron Cost Function
cost = − ∑_{i∈M} yi hi = ∑_{i=1}^{N} |−yi hi|+
where M ⊆ {1, . . . , N } is the index set of the currently misclassified patterns and
xi,j is the value of the j-th input in the i-th pattern. |arg|+ = max(arg, 0).
• Obviously, we get cost = 0 only when all patterns are correctly classified (then M = ∅); otherwise cost > 0, since yi and hi have different signs for misclassified patterns
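A sketch of this cost function in NumPy (not from the slides), under the assumption that patterns with yi hi ≤ 0 are counted as misclassified; it also checks numerically that the two forms of the sum agree:

```python
import numpy as np

def perceptron_cost(X, y, w):
    """Perceptron cost: minus the sum of y_i h_i over the misclassified patterns."""
    h = X @ w
    misclassified = y * h <= 0                               # index set M (here including h_i = 0)
    cost_set = -np.sum(y[misclassified] * h[misclassified])  # -sum_{i in M} y_i h_i
    cost_plus = np.sum(np.maximum(-y * h, 0.0))              # sum_{i=1}^{N} |-y_i h_i|_+
    assert np.isclose(cost_set, cost_plus)                   # both forms give the same value
    return cost_plus
```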
Contribution to the Cost Function of One Data Point
Gradient Descent
• In each learning step, change the parameters such that the cost function decreases
• Gradient descent: adapt the parameters in the direction of the negative gradient
• The partial derivative of the cost function with respect to a parameter (example: wj) is
∂cost/∂wj = − ∑_{i∈M} yi xi,j
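The corresponding gradient, stacked into a vector over all weights, might be computed as follows (same assumptions as in the cost sketch); a full batch gradient-descent step would then be w ← w − η ∂cost/∂w:

```python
import numpy as np

def perceptron_gradient(X, y, w):
    """Gradient of the Perceptron cost: d cost / d w_j = -sum_{i in M} y_i x_{i,j}."""
    h = X @ w
    misclassified = y * h <= 0                     # boolean mask for the index set M
    return -(y[misclassified] @ X[misclassified])  # vector of all partial derivatives
```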
Gradient Descent with One Parameter (Conceptual)
Gradient Descent with Two Parameters (Conceptual)
The Perceptron-Learning Rule
• In the actual Perceptron learning rule, one presents a randomly selected, currently misclassified pattern and adapts the weights with only that pattern (see the sketch below). This is biologically more plausible and also leads to faster convergence. Let xt and yt be the training pattern in the t-th step. One adapts, for t = 1, 2, . . .,
wj ←− wj + η yt xt,j    j = 0, . . . , M − 1
• A weight increases when (postsynaptic) yt and (presynaptic) xt,j have the same sign; different signs lead to a weight decrease (compare: Hebb learning)
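A sketch of the resulting training loop, assuming a fixed learning rate, random selection among the currently misclassified patterns, and termination once no pattern is misclassified (or after a maximum number of steps):

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_steps=10_000, seed=0):
    """Stochastic Perceptron learning rule:
    pick a misclassified pattern at random and update w_j <- w_j + eta * y_t * x_{t,j}."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    w = np.zeros(M)
    for _ in range(max_steps):
        h = X @ w
        misclassified = np.flatnonzero(y * h <= 0)  # currently misclassified patterns
        if misclassified.size == 0:                 # all patterns correctly classified
            break
        t = rng.choice(misclassified)               # randomly selected misclassified pattern
        w += eta * y[t] * X[t]                      # update all weights with this one pattern
    return w
```

For linearly separable classes this loop terminates after a finite number of steps, in line with the convergence statement on the following slides.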
Stochastic Gradient Descent (Conceptual)
Comments
• Convergence proof: with a sufficiently small learning rate η and when the problem is linearly separable, the algorithm converges and terminates after a finite number of steps
• If the classes are not linearly separable, there is no convergence for a fixed (finite) η
Example: Perceptron Learning Rule, η = 0.1
Linearly Separable Classes
Convergence and Degeneracy
Classes that Cannot be Separated with a Linear Classifier
The Classical Example of Linearly Non-Separable Classes: XOR
Classes are Separable (Convergence)
Classes are not Separable (no Convergence)
Comments on the Perceptron
• In some cases, the data are already high-dimensional with M > 10000 (e.g., the number of possible keywords in a text)
• In other cases, one first transforms the input data into a high-dimensional (sometimes even infinite-dimensional) space and applies the linear classifier in that space: kernel trick, Neural Networks (see the sketch below)
• Considering the power of a single formalized neuron: how much computational power might 100 billion neurons possess?
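As an illustration of the second point (the feature map and the weights below are hand-picked for this example, not taken from the lecture): the XOR problem from the earlier slide becomes linearly separable after a simple nonlinear feature expansion:

```python
import numpy as np

# XOR with targets in {-1, +1}: not linearly separable in the original 2-D inputs.
X_xor = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y_xor = np.array([-1, 1, 1, -1])

def feature_map(X):
    """Map (x1, x2) to (1, x1, x2, x1*x2): a simple hand-chosen expansion."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.stack([np.ones(len(X)), x1, x2, x1 * x2], axis=1)

Phi = feature_map(X_xor)
w = np.array([-0.5, 1.0, 1.0, -2.0])  # h = -0.5 + x1 + x2 - 2*x1*x2
print(np.sign(Phi @ w))               # -> [-1.  1.  1. -1.], matching y_xor
```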
Comments on the Perceptron (cont’d)
Application of a Linear Classifier: Analysis of fMRI Brain Scans
(Tom Mitchell et al., CMU)
• Goal: based on the image slices, determine whether someone thinks of tools, buildings, food, or one of a large set of other semantic concepts
• The trained linear classifier is 90% correct and can, e.g., predict whether someone reads about tools or buildings
• The figure shows the voxels that are most important for the classification task. All three test persons display similar regions
Pattern Recognition Paradigm
• von Neumann: "... the brain uses a peculiar statistical language unlike that employed in the operation of man-made computers ..."
• A classification decision is made by considering the complete input pattern, and neither as a logical decision based on a small number of attributes nor as a complex logical program
• The linearly weighted sum corresponds more to a voting scheme: each input has either a positive or a negative influence on the classification decision
• Robustness: in high dimensions, a single, possibly incorrect, input has little influence
Afterword
Why Pattern Recognition?
• One of the big mysteries in machine learning is why rule learning is not very successful
• Problems: the learned rules are either trivial, known, or extremely complex and very
difficult to interpret
• This is in contrast to the general impression that the world is governed by simple rules
Example: Birds Fly
• Define flying: using its own force, at least 20 m far, at least 1 m high, at least once every day in its adult life, ...
Pattern Recognition
• Basic problem:
Example: Predicting Buying Patterns
Where Rule-Learning Works
• Technical, human-generated worlds ("Engine A always goes with transmission B")