
The Perceptron

Volker Tresp
Summer 2014

1
Introduction

• One of the first serious learning machines

• Most important elements in learning tasks

– Collection and preprocessing of training data
– Definition of a class of learning models. Often defined by the free parameters in a learning model with a fixed structure (e.g., a Perceptron)
– Selection of a cost function
– Learning rule to find the best model in the class of learning models. Often this means learning the optimal parameters

2
Prototypical Learning Task

• Classification of printed or handwritten digits

• Application: automatic reading of Zip codes

• More general: OCR (optical character recognition)

3
Transformation of the Raw Data (2-D) into Pattern Vectors
(1-D) as part of a Learning Matrix

4
Binary Classification

5
Data Matrix for Supervised Learning

M                                       number of inputs
N                                       number of training patterns
xi = (xi,0, . . . , xi,M−1)T            i-th input
xi,j                                    j-th component of xi
X = (x1, . . . , xN)T                   design matrix
yi                                      i-th target for xi
y = (y1, . . . , yN)T                   vector of targets
ŷi                                      prediction for xi
di = (xi,0, . . . , xi,M−1, yi)T        i-th pattern
D = {d1, . . . , dN}                    training data
z                                       test input
t                                       unknown target for z
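
As an illustration (not from the original slides), a minimal NumPy sketch of how this notation maps to arrays; the toy sizes and values are made up:

import numpy as np

N, M = 4, 3                              # N training patterns, M inputs (incl. the constant xi,0 = 1)
X = np.ones((N, M))                      # design matrix X = (x1, . . . , xN)T, shape N x M
X[:, 1:] = np.random.randn(N, M - 1)     # toy values for the non-constant inputs
y = np.array([1, -1, 1, -1])             # vector of targets y = (y1, . . . , yN)T
d1 = np.concatenate([X[0], [y[0]]])      # first pattern d1 = (x1,0, . . . , x1,M-1, y1)T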

6
Model

7
A Biologically Motivated Model

8
Input-Output Models

• A biological system needs to make a decision based on available sensor information

• An OCR system classifies a handwritten digit

• A prognostic system predicts tomorrow’s energy consumption

9
Supervised Learning

• In supervised learning one assumes that in training both inputs and outputs are available

• For example, an input pattern might reflect the attributes of an object and the target
is the class membership of this object

• The goal is the correct classification for new patterns

• Linear classifier: one of the simplest but surprisingly powerful classifiers

• A linear classifier is particularly suitable when the number of inputs M is large; if this is not the case, one can transform the input data into a high-dimensional space, where a linear classifier might be able to solve the problem; this idea is central to a large portion of the lecture (basis functions, neural networks, kernel models)

• A linear classifier can be realized through a Perceptron, a single formalized neuron!

10
Supervised Learning and Learning of Decisions

• One might argue that learning is only of interest if it changes (future) behavior; at
least for a biological system

• Many decisions can be reduced to a supervised learning problem: if I can read a Zip
code correctly, I know where the letter should be sent

• Decision tasks can often be reduced to an intermediate supervised learning problem

• But who produces the targets for the intermediate task? For biological systems a
hotly debated issue: is supervised learning biologically relevant? Is only reinforcement
learning, based on rewards and punishment, biologically plausible?

11
The Perceptron: A Learning Machine

• The activation function of the Perceptron is a weighted sum of the inputs

  hi = Σj=0..M−1 wj xi,j

  (Note: xi,0 = 1 is a constant input, such that w0 can be thought of as a bias)

• The binary classification yi ∈ {1, −1} is calculated as

  ŷi = sign(hi)

• The linear classification boundary (separating hyperplane) is defined as

  hi = 0
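
A minimal sketch of this computation (NumPy; the weights and the input pattern are made-up toy values):

import numpy as np

def perceptron_predict(w, x):
    h = np.dot(w, x)        # activation hi = Σj wj xi,j; x[0] is the constant input 1
    return np.sign(h), h    # binary classification ŷi = sign(hi)

w = np.array([0.5, -1.0, 2.0])   # toy weights; w[0] plays the role of the bias w0
x = np.array([1.0, 0.3, 0.8])    # toy pattern with xi,0 = 1
yhat, h = perceptron_predict(w, x)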

12
Perceptron as a Weighted Voting machine

• The Perceptron is often displayed as a graphical model with one input node for each input variable and with one output node for the target

• The bias w0 determines the class when all inputs are zero

• When xi,j = 1, the j-th input votes with weight |wj| for class sign(wj)

• Thus, the response of the Perceptron can be thought of as a weighted voting for a class.

13
2-D Representation of the Decision Boundary

• The class boundaries are often displayed graphically with M = 3 (next slide)

• This provides some intuition

• But note that this 2-D picture can be misleading, since the Perceptron is typically employed in high-dimensional problems (M >> 1)

14
Two classes that are Linearly Separable

15
Perceptron Learning Rule

• We now need a learning rule to find optimal parameters w0, . . . , wM−1

• We define a cost function that is dependent on the training data and the parameters

• In the learning process (training), one attempts to find parameters that minimize the
cost function

16
The Perceptron Cost Function

• Goal: correct classification of the N training samples {y1, . . . , yN }

• The Perceptron cost function is

  cost = − Σi∈M yi hi = Σi=1..N |−yi hi|+

  where M ⊆ {1, . . . , N} is the index set of the currently misclassified patterns and xi,j is the value of the j-th input in the i-th pattern; |arg|+ = max(arg, 0).

• Obviously, we get cost = 0 only when all patterns are correctly classified (then M = ∅); otherwise cost > 0, since yi and hi have different signs for misclassified patterns
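
A minimal sketch of this cost function (NumPy; X is the design matrix and y the target vector, both assumed as defined earlier):

import numpy as np

def perceptron_cost(w, X, y):
    h = X @ w                                # activations hi for all N patterns
    return np.sum(np.maximum(-y * h, 0.0))   # |−yi hi|+ : only misclassified patterns contribute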

17
Contribution to the Cost Function of one Data Point

18
Gradient Descent

• Initialize parameters (typically small random values)

• In each learning step, change the parameters such that the cost function decreases

• Gradient descent: adapt the parameters in the direction of the negative gradient

• The partial derivative of the cost function with respect to a parameter is (example: wj)

  ∂cost/∂wj = − Σi∈M yi xi,j

• Thus, a sensible adaptation rule is

  wj ←− wj + η Σi∈M yi xi,j
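
A minimal sketch of one such batch gradient step (my own sketch, assuming NumPy, the design matrix and target vector from before, and a toy learning rate):

import numpy as np

def gradient_step(w, X, y, eta=0.1):
    h = X @ w
    miss = y * h <= 0                        # index set M of misclassified patterns (hi = 0 counted as wrong)
    return w + eta * X[miss].T @ y[miss]     # wj <- wj + eta * Σ_{i in M} yi xi,j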

19
Gradient Descent with One Parameter (Conceptual)

20
Gradient Descent with Two Parameters (Conceptual)

21
The Perceptron-Learning Rule

• In the actual Perceptron learning rule, one presents a randomly selected, currently misclassified pattern and adapts with only that pattern. This is biologically more plausible and also leads to faster convergence. Let xt and yt be the training pattern in the t-th step. One adapts for t = 1, 2, . . .

  wj ←− wj + η yt xt,j      j = 0, . . . , M − 1

• A weight increases when (postsynaptic) yt and (presynaptic) xt,j have the same sign; different signs lead to a weight decrease (compare: Hebb learning)

• η > 0 is the learning rate, typically 0 < η << 1

• Pattern-based learning is also called stochastic gradient descent (SGD)
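
A minimal sketch of this pattern-based rule (NumPy; the function name, toy learning rate, and step limit are my own choices):

import numpy as np

def perceptron_sgd(X, y, eta=0.1, max_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(max_steps):
        miss = np.where(y * (X @ w) <= 0)[0]   # currently misclassified patterns (hi = 0 counts as wrong)
        if miss.size == 0:                     # all patterns classified correctly: stop
            break
        i = rng.choice(miss)                   # present a randomly selected misclassified pattern
        w = w + eta * y[i] * X[i]              # wj <- wj + eta * yt * xt,j
    return w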

22
Stochastic Gradient Descent (Conceptual)

23
Comments

• Convergence proof: with sufficiently small learning rate η and when the problem is
linearly separable, the algorithm converges and terminates after a finite number of
steps

• If the classes are not linearly separable, there is no convergence for finite η

24
Example: Perceptron Learning Rule, η = 0.1

25
Linearly Separable Classes

26
Convergence and Degeneracy

27
Classes that Cannot be Separated with a Linear Classifier

28
The classical Example for Linearly Non-Separable Classes: XOR

29
Classes are Separable (Convergence)

30
Classes are not Separable (no Convergence)

31
Comments on the Perceptron

• Convergence can be very fast

• A linear classifier is a very important basic building block: with M → ∞ most problems become linearly separable!

• In some cases, the data are already high-dimensional with M > 10000 (e.g., the number of possible keywords in a text)

• In other cases, one first transforms the input data into a high-dimensional (sometimes even infinite-dimensional) space and applies the linear classifier in that space: kernel trick, Neural Networks (see the sketch after this list)

• Considering the power of a single formalized neuron: how much computational power might 100 billion neurons possess?

• Are there grandmother cells in the brain? Or grandmother areas?
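
A toy illustration of this idea (my own sketch; it reuses the hypothetical perceptron_sgd from the earlier slide and a hand-picked product feature): XOR is not linearly separable in the original inputs, but becomes separable after adding x1·x2 as an extra dimension:

import numpy as np

# XOR patterns with the constant input xi,0 = 1; not linearly separable as given
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

Phi = np.hstack([X, (X[:, 1] * X[:, 2])[:, None]])   # append the product x1*x2 as a new input

w = perceptron_sgd(Phi, y)      # the learning rule now converges
print(np.sign(Phi @ w))         # reproduces y: [-1, 1, 1, -1]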

32
Comments on the Perceptron (cont’d)

• The Perceptron learning rule is not used much any more

– No convergence, when classes are not separable


– Classification boundary is not unique

• Alternative learning rules:

– Linear Support Vector Machine


– Fisher Linear Discriminant
– Logistic Regression

33
Application for a Linear Classifier; Analysis of fMRI Brain Scans
(Tom Mitchell et al., CMU)

• Goal: based on the image slices determine if someone thinks of tools, buildings, food,
or a large set of other semantic concepts

• The trained linear classifier is 90% correct and can, e.g., predict whether someone reads about tools or buildings

• The figure shows the voxels that are most important for the classification task. All three test subjects display similar regions

34
Pattern Recognition Paradigm

• von Neumann: ... the brain uses a peculiar statistical language unlike that employed
in the operation of man-made computers...

• A classification decision is made by considering the complete input pattern, neither as a logical decision based on a small number of attributes nor as a complex logical program

• The linearly weighted sum corresponds more to a voting: each input has either a
positive or a negative influence on the classification decision

• Robustness: in high dimensions a single, possibly incorrect, input has little influence

35
Afterword

36
Why Pattern Recognition?

• Alternative approach to pattern recognition: learning of simple, close-to-deterministic rules (naive expectation)

• One of the big mysteries in machine learning is why rule learning is not very successful

• Problems: the learned rules are either trivial, known, or extremely complex and very
difficult to interpret

• This is in contrast to the general impression that the world is governed by simple rules

• Also: computer programs, machines, ... follow simple deterministic if-then rules?

37
Example: Birds Fly

• Define flying: using its own force, at least 20 m, at least 1 m high, at least once every day in its adult life, ...

• A bird can fly if

– it is not a penguin, or ...
– it is not seriously injured or dead
– it is not too old
– the wings have not been clipped
– it does not have a number of diseases
– it does not live only in a stable
– it does not carry heavy weights
– ...

38
Pattern Recognition

• 90% of all birds fly

• Of all birds which do not belong to a flightless class, 94% fly

• ... and which are not domesticated, 96% ...

• Basic problem:

– Complexity of the underlying (deterministic) system


– Incomplete information

• Thus: success of statistical machine learning!

39
Example: Predicting Buying Pattern

40
Where Rule-Learning Works

• Technical, human-generated worlds (“Engine A always goes with transmission B”).

41
