
Neural Networks and Deep Learning
Perceptrons, Feed-Forward Nets, Feature-Space Partitioning, Function Approximation and Classification
PERCEPTRONS
A better model

• Frank Rosenblatt
– Psychologist, Logician
– Inventor of the solution to everything, aka the Perceptron (1958)

Rosenblatt’s perceptron

• Original perceptron model


– Groups of sensors (S) on retina combine onto cells in association
area A1
– Groups of A1 cells combine into Association cells A2
– Signals from A2 cells combine into response cells R
– All connections may be excitatory or inhibitory
Rosenblatt’s perceptron

• Even included feedback between A and R cells


– Ensures mutually exclusive outputs

Rosenblatt’s perceptron

• Simplified perceptron model


– Association units combine sensory input with fixed
weights
– Response units combine associative units with
learnable weights
Perceptron: Simplified model

• A number of inputs combine linearly
  – Threshold logic: fire if the combined input exceeds a threshold T, i.e. output 1 if Σi wi·xi ≥ T, else 0
The Universal Model
• Originally assumed it could represent any Boolean circuit and perform any logic
– “the embryo of an electronic computer that [the Navy] expects
will be able to walk, talk, see, write, reproduce itself and be
conscious of its existence,” New York Times (8 July) 1958
– “Frankenstein Monster Designed by Navy That Thinks,” Tulsa,
Oklahoma Times 1958
Linearity or affinity
Linear vs. Affine?

[Diagram: a unit over inputs 1 … N computing an affine combination of its inputs]

  z = Σ_{i=1..N} wi·xi + b

• A threshold unit
  – "Fires" if the affine function of the inputs is positive
• The bias is the negative of the threshold T in the previous slide
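
A minimal sketch of such a threshold unit in Python (NumPy is assumed; the example weights and bias are illustrative, not taken from the slides):

import numpy as np

def threshold_unit(x, w, b):
    """Affine threshold unit: outputs 1 ("fires") if w.x + b > 0, else 0.
    The bias b plays the role of -T in the threshold formulation."""
    z = np.dot(w, x) + b
    return 1 if z > 0 else 0

# Example: two inputs with threshold T = 1.5, i.e. bias b = -1.5
print(threshold_unit(np.array([1, 1]), np.array([1.0, 1.0]), -1.5))  # 1 (fires)
print(threshold_unit(np.array([1, 0]), np.array([1.0, 1.0]), -1.5))  # 0
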
The standard paradigm for statistical pattern recognition
The standard Perceptron architecture

1. Convert the raw input vector into a vector of feature activations. Use hand-written programs based on common sense to define the features.
2. Learn how to weight each of the feature activations to get a single scalar quantity.
3. If this quantity is above some threshold, decide that the input vector is a positive example of the target class.

[Architecture diagram, top to bottom: decision unit, learned weights, feature units, hand-coded weights or programs, input units]
How to learn biases using the same rule as we use for learning weights

• A threshold is equivalent to having a negative bias.
• We can avoid having to figure out a separate learning rule for the bias by using a trick:
  – A bias is exactly equivalent to a weight on an extra input line that always has an activity of 1.
  – We can now learn a bias as if it were a weight.

[Diagram: a unit with weights b, w1, w2 on inputs 1, x1, x2]
Also provided a learning algorithm

Sequential learning:
  w ← w + η·(d(x) - y(x))·x
where d(x) is the desired output in response to input x, y(x) is the actual output in response to x, and η is a learning rate.

• Boolean tasks
• Update the weights whenever the perceptron output is wrong
  – Update the weight by the product of the input and the error between the desired and actual outputs
• Proved convergence for linearly separable classes
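
A minimal sketch of this learning rule in Python, using the bias-as-extra-input trick from the earlier slide (NumPy is assumed; the OR training set and the learning rate are illustrative):

import numpy as np

def train_perceptron(X, d, eta=1.0, epochs=100):
    """Sequential perceptron learning with the bias as an extra weight.

    X: (n_samples, n_features) inputs; d: desired outputs in {0, 1}.
    Whenever the output is wrong, apply w <- w + eta * (d - y) * x.
    """
    X = np.hstack([X, np.ones((len(X), 1))])       # constant input 1 carries the bias
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x, target in zip(X, d):
            y = 1 if np.dot(w, x) > 0 else 0       # threshold unit
            if y != target:
                w += eta * (target - y) * x        # update only on errors
                mistakes += 1
        if mistakes == 0:                          # converged (guaranteed if linearly separable)
            break
    return w

# Example: learn the (linearly separable) OR function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 1, 1, 1])
w = train_perceptron(X, d)
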
Perceptron
[Diagrams: single perceptrons implementing Boolean gates, e.g. X AND Y (weights 1, 1; threshold 2), NOT X (weight -1; threshold 0), X OR Y (weights 1, 1; threshold 1)]

Values shown on edges are weights; numbers in the circles are thresholds

• Easily shown to mimic any Boolean gate


• But…
Individual units

No solution for XOR!

[Diagram: a single unit over X and Y with unknown weights and threshold; no assignment of values implements XOR]
A single neuron is not enough

• Individual elements are weak computational elements


– Marvin Minsky and Seymour Papert, 1969, Perceptrons:
An Introduction to Computational Geometry

• Networked elements are required


Multi-layer Perceptron!
[Diagram: an MLP computing X XOR Y; two hidden threshold units feed an output unit, edge values are weights, circle values are thresholds]

Hidden Layer

• XOR
– The first layer is a “hidden” layer
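
A minimal sketch of an XOR MLP built from threshold units (the specific weights and thresholds below are one standard choice, not necessarily those in the diagram):

def step(z):
    """Heaviside threshold: fire (1) if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def xor_mlp(x, y):
    """XOR via a one-hidden-layer MLP of threshold units.

    h1 fires for (x AND NOT y), h2 for (NOT x AND y); the output unit ORs them.
    """
    h1 = step(1 * x - 1 * y - 1)    # x AND (NOT y): weights (1, -1), threshold 1
    h2 = step(-1 * x + 1 * y - 1)   # (NOT x) AND y: weights (-1, 1), threshold 1
    return step(h1 + h2 - 1)        # OR of the hidden units: weights (1, 1), threshold 1

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor_mlp(x, y))  # prints the XOR truth table
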
A more generic model

[Diagram: a deeper MLP of threshold units over inputs X, Y, Z, A; edge values are weights, circle values are thresholds]

• A “multi-layer” perceptron
• Can compose arbitrarily complicated Boolean functions!
– In cognitive terms: Can compute arbitrary Boolean functions over
sensory input
– More on this in the next class
Neuron model: Logistic unit

Sigmoid (logistic) activation function: σ(z) = 1 / (1 + e^(-z)).

Credit: Andrew Ng
Neural Network

Layer 1 Layer 2 Layer 3


Other network architectures

Layer 1 Layer 2 Layer 3 Layer 4


Deep Structures
• In any directed graph with input source nodes and
output sink nodes, “depth” is the length of the longest
path from a source to a sink
– A “source” node in a directed graph is a node that has only
outgoing edges
– A “sink” node is a node that has only incoming edges

• Left: Depth = 2. Right: Depth = 3


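A small sketch of this definition in Python (the adjacency-dict graph encoding and the example network are my own, for illustration):

from functools import lru_cache

def depth(graph):
    """Depth of a DAG = length (in edges) of the longest source-to-sink path.

    graph: dict mapping each node to a list of its successors.
    """
    @lru_cache(maxsize=None)
    def longest_from(node):
        succs = graph.get(node, [])
        if not succs:                       # sink: no outgoing edges
            return 0
        return 1 + max(longest_from(s) for s in succs)

    # sources: nodes that never appear as anyone's successor
    sources = set(graph) - {v for succs in graph.values() for v in succs}
    return max(longest_from(s) for s in sources)

# Example: x1, x2 -> h -> y has depth 2
g = {"x1": ["h"], "x2": ["h"], "h": ["y"], "y": []}
print(depth(g))  # 2
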
FEATURE-SPACE PARTITIONING
TOY EXAMPLE WITH HUMAN-LEARNING OF WEIGHTS

Non-linear classification example: XOR/XNOR
x1, x2 are binary (0 or 1).

[Plots: the XOR/XNOR class labels in the (x1, x2) plane; the two classes are not separable by a single line]
Simple example: AND
  h(x) = σ(-30 + 20·x1 + 20·x2)
  x1 x2 → h(x): (0,0) ≈ 0, (0,1) ≈ 0, (1,0) ≈ 0, (1,1) ≈ 1

Example: OR function
  h(x) = σ(-10 + 20·x1 + 20·x2)
  x1 x2 → h(x): (0,0) ≈ 0, (0,1) ≈ 1, (1,0) ≈ 1, (1,1) ≈ 1

Negation (NOT x1):
  h(x) = σ(10 - 20·x1)
  x1 → h(x): 0 ≈ 1, 1 ≈ 0

Putting it together (XNOR):
  a1 = σ(-30 + 20·x1 + 20·x2)   (x1 AND x2)
  a2 = σ( 10 - 20·x1 - 20·x2)   ((NOT x1) AND (NOT x2))
  h  = σ(-10 + 20·a1 + 20·a2)   (a1 OR a2)
  x1 x2 → h: (0,0) → 1, (0,1) → 0, (1,0) → 0, (1,1) → 1
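
A small Python sketch of this composition (the gate weights are those reconstructed above; the helper names are mine, not from the slides):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_and(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)       # ~1 only when both inputs are 1

def logistic_nor_like(x1, x2):
    return sigmoid(10 - 20 * x1 - 20 * x2)        # ~1 only when both inputs are 0

def logistic_or(a1, a2):
    return sigmoid(-10 + 20 * a1 + 20 * a2)       # ~1 when either input is ~1

def xnor(x1, x2):
    """Two-layer network: OR of (x1 AND x2) and ((NOT x1) AND (NOT x2))."""
    return logistic_or(logistic_and(x1, x2), logistic_nor_like(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))        # 1, 0, 0, 1: the XNOR truth table
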
Capabilities of Threshold Neurons
•What do we do if we need a more complex function?
•Just like Threshold Logic Units, we can also combine
multiple artificial neurons to form networks with increased
capabilities.
•For example, we can build a two-layer network with any
number of neurons in the first layer giving input to a single
neuron in the second layer.
•The neuron in the second layer could, for example,
implement an AND function.
Capabilities of Threshold Neurons
[Diagram: a two-layer network; several threshold neurons, each over inputs x1 and x2, feed a single output neuron]

• What kind of function can such a network realize?

Capabilities of Threshold Neurons
• Assume that the dotted lines in the diagram represent the input-dividing lines implemented by the neurons in the first layer:

[Plot: lines partitioning the input plane; the axes are the 1st and 2nd input components]

Then, for example, the second-layer neuron could output 1 if the input is within a polygon, and 0 otherwise.
Capabilities of Threshold Neurons
•However, we still may want to implement functions that
are more complex than that.
•An obvious idea is to extend our network even further.
•Let us build a network that has three layers, with
arbitrary numbers of neurons in the first and second layers
and one neuron in the third layer.
•The first and second layers are completely connected,
that is, each neuron in the first layer sends its output to
every neuron in the second layer.
Capabilities of Threshold Neurons
[Diagram: a three-layer network; first-layer threshold neurons over x1 and x2 feed second-layer neurons oi, which feed a single third-layer neuron]

• What type of function can a three-layer network realize?

Capabilities of Threshold Neurons
• Assume that the polygons in the diagram indicate the input regions for which each of the second-layer neurons yields output 1:

[Plot: several polygons in the input plane; the axes are the 1st and 2nd input components]

Then, for example, the third-layer neuron could output 1 if the input is within any of the polygons, and 0 otherwise.
Capabilities of Threshold Neurons
•The more neurons there are in the first layer, the more
vertices can the polygons have.
•With a sufficient number of first-layer neurons, the
polygons can approximate any given shape.
•The more neurons there are in the second layer, the more
of these polygons can be combined to form the output
function of the network.
•With a sufficient number of neurons and appropriate
weight vectors wi, a three-layer network of threshold
neurons can realize any function Rn → {0, 1}.
Pattern Separation and NN architecture
The perceptron convergence procedure:
Training binary output neurons as classifiers

• Add an extra component with value 1 to each input vector. The “bias” weight on this
component is minus the threshold. Now we can forget the threshold.
• Pick training cases using any policy that ensures that every training case will keep
getting picked.
– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input vector to the weight
vector.
– If the output unit incorrectly outputs a 1, subtract the input vector from the
weight vector.
• This is guaranteed to find a set of weights that gets the right answer for all the
training cases if any such set exists.
FUNCTION APPROXIMATION,
CLASSIFICATION….

• Multi-layer Perceptrons as universal Boolean


functions
• MLPs as universal classifiers
• MLPs as universal approximators
The perceptron as a Boolean gate
[Diagrams: single perceptrons implementing Boolean gates, e.g. X AND Y (weights 1, 1; threshold 2), NOT X (weight -1; threshold 0), X OR Y (weights 1, 1; threshold 1)]

Values in the circles are thresholds; values on edges are weights

• A perceptron can model any simple binary Boolean gate
Perceptron as a Boolean gate
[Diagram: a unit with weights +1 on inputs X1 … XL, weights -1 on XL+1 … XN, and threshold L]

Will fire only if X1 … XL are all 1 and XL+1 … XN are all 0

• The universal AND gate
  – AND over any number of inputs
    • Any subset of which may be negated
Perceptron as a Boolean gate
[Diagram: a unit with weights +1 on inputs X1 … XL, weights -1 on XL+1 … XN, and threshold L - N + 1]

Will fire if any of X1 … XL is 1 or any of XL+1 … XN is 0

• The universal OR gate
  – OR over any number of inputs
    • Any subset of which may be negated
Perceptron as a Boolean Gate
[Diagram: a unit with weight +1 on every input and threshold K]

Will fire only if at least K inputs are 1

• Generalized majority gate
  – Fire if at least K inputs are of the desired polarity
Perceptron as a Boolean Gate
[Diagram: a unit with weights +1 on X1 … XL, weights -1 on XL+1 … XN, and threshold L - N + K]

Will fire only if the total number of X1 … XL that are 1 and XL+1 … XN that are 0 is at least K

• Generalized majority gate
  – Fire if at least K inputs are of the desired polarity
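A sketch of the generalized majority gate as a single threshold unit (the polarity encoding and the example inputs are mine, for illustration):

def generalized_majority(xs, polarity, K):
    """Generalized majority gate as one threshold unit.

    xs: list of 0/1 inputs; polarity[i] is +1 if input i is desired to be 1,
    -1 if it is desired to be 0 (i.e. that input is negated).
    Fires (returns 1) if at least K inputs have their desired polarity.
    Weights = polarity, threshold = L - N + K, with L = number of +1 polarities.
    """
    N = len(xs)
    L = sum(1 for p in polarity if p == +1)
    z = sum(p * x for p, x in zip(polarity, xs))   # weighted sum with +/-1 weights
    return 1 if z >= L - N + K else 0

# Example: fire if at least 2 of (X1, X2, NOT X3) hold
print(generalized_majority([1, 0, 0], [+1, +1, -1], K=2))  # 1 (X1 and NOT X3 hold)
print(generalized_majority([0, 0, 1], [+1, +1, -1], K=2))  # 0 (none hold)
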
The perceptron is not enough

[Diagram: a single unit over X and Y with unknown weights and threshold]
• Cannot compute an XOR


Multi-layer perceptron
[Diagram: an MLP computing X XOR Y; two hidden threshold units feed an output unit, edge values are weights, circle values are thresholds]

Hidden Layer

• MLPs can compute the XOR


Multi-layer perceptron XOR
[Diagram: an XOR MLP over X and Y using only 2 neurons; both inputs feed the hidden unit and the output unit, and the hidden unit feeds the output with a large negative weight. Thanks to Gerald Friedland]

• With 2 neurons
– 5 weights and two thresholds
Multi-layer perceptron
[Diagram: a deeper MLP of threshold units over inputs X, Y, Z, A]
• MLPs can compute more complex Boolean functions
• MLPs can compute any Boolean function
– Since they can emulate individual gates
• MLPs are universal Boolean functions
MLP as Boolean Functions
[Diagram: the same MLP of threshold units over inputs X, Y, Z, A]

• MLPs are universal Boolean functions


– Any function over any number of inputs and any number of outputs
• But how many “layers” will they need?
How many layers for a Boolean MLP?

Truth table (shows all the input combinations for which the output is 1):

X1 X2 X3 X4 X5 | Y
 0  0  1  1  0 | 1
 0  1  0  1  1 | 1
 0  1  1  0  0 | 1
 1  0  0  0  1 | 1
 1  0  1  1  1 | 1
 1  1  0  0  1 | 1

• A Boolean function is just a truth table


How many layers for a Boolean MLP?

The truth table can be expressed in disjunctive normal form, one AND term per row with output 1:

Y = (¬X1·¬X2·X3·X4·¬X5) + (¬X1·X2·¬X3·X4·X5) + (¬X1·X2·X3·¬X4·¬X5) + (X1·¬X2·¬X3·¬X4·X5) + (X1·¬X2·X3·X4·X5) + (X1·X2·¬X3·¬X4·X5)

• Expressed in disjunctive normal form


How many layers for a Boolean MLP?

[Diagram: a one-hidden-layer MLP over X1 … X5 with one hidden AND unit per row of the truth table, all feeding a single OR output unit]
• Any truth table can be expressed in this manner!
• A one-hidden-layer MLP is a Universal Boolean Function
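
A sketch of this DNF construction in Python: one hidden AND unit per row of the truth table and an OR unit on top (the helper name and the example calls are mine):

def truth_table_mlp(true_rows):
    """One-hidden-layer MLP of threshold units built from a truth table.

    true_rows: the input tuples for which the output is 1. Each true row becomes
    a hidden AND unit (weight +1 where the row has a 1, -1 where it has a 0,
    threshold = number of 1s in the row); the output unit ORs the hidden units.
    """
    def predict(x):
        hidden = []
        for row in true_rows:
            w = [1 if r == 1 else -1 for r in row]
            z = sum(wi * xi for wi, xi in zip(w, x))
            hidden.append(1 if z >= sum(row) else 0)   # fires only on an exact match
        return 1 if sum(hidden) >= 1 else 0            # OR of the hidden units
    return predict

# The six rows with output 1 from the truth table above
rows = [(0, 0, 1, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0),
        (1, 0, 0, 0, 1), (1, 0, 1, 1, 1), (1, 1, 0, 0, 1)]
f = truth_table_mlp(rows)
print(f((0, 0, 1, 1, 0)), f((1, 1, 1, 1, 1)))   # 1 0
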
Recap: The MLP as a classifier

[Diagram: an MLP taking 784-dimensional (MNIST) inputs]

• MLP as a function over real inputs


• MLP as a function that finds a complex “decision
boundary” over a space of reals
A Perceptron on Reals
[Diagram: a perceptron over real inputs x1 … xN; in the (x1, x2) plane it outputs 1 on one side of the line w1·x1 + w2·x2 = T and 0 on the other]

  y = 1 if Σi wi·xi ≥ T, else 0

• A perceptron operates on real-valued vectors
  – This is a linear classifier
Boolean functions with a
real perceptron
[Plots: three Boolean functions of X and Y drawn on the unit square with corners (0,0), (0,1), (1,0), (1,1); each is separated by a straight line]

• Boolean perceptrons are also linear classifiers
  – Purple regions are 1
Composing complicated "decision" boundaries

[Plot: an arbitrary coloured region in the (x1, x2) plane]

Can now be composed into "networks" to compute arbitrary classification "boundaries"

• Build a network of units with a single output that fires if the input is in the coloured area
Booleans over the reals

[Plots: a pentagonal region in the (x1, x2) plane is built up one bounding line at a time, each line being the decision boundary of one threshold unit over (x1, x2)]

• The network must fire if the input is in the coloured area
Booleans over the reals

[Diagram: five threshold units y1 … y5, one per side of the pentagon, feed an AND unit that tests Σ_{i=1..N} yi ≥ 5; the numbers in the plot give the value of the sum in each region: 5 inside the pentagon, 4 or 3 outside]

• The network must fire if the input is in the coloured area
  – The AND compares the sum of the hidden outputs to 5
• NB: What would the pattern be if it compared it to 4?
Composing a hexagon

[Diagram: six threshold units y1 … y6 over (x1, x2), one per side, feed an output unit that tests Σ_{i=1..N} yi ≥ 6; the sum is 6 inside the hexagon and drops to 5, 4, 3 in the regions outside]

• The polygon net
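
A sketch of the polygon net in Python (assumes a convex polygon given counter-clockwise; NumPy is assumed, and the unit-square example is illustrative):

import numpy as np

def polygon_net(vertices):
    """Polygon net: one threshold unit per polygon side, plus a summing output.

    vertices: the polygon's corners in order (assumed convex, counter-clockwise).
    Each hidden unit fires when the point is on the inner side of one edge;
    the output fires only when all N of them fire (sum >= N).
    """
    V = np.asarray(vertices, dtype=float)
    N = len(V)

    def predict(p):
        p = np.asarray(p, dtype=float)
        total = 0
        for i in range(N):
            a, b = V[i], V[(i + 1) % N]
            edge, to_p = b - a, p - a
            # unit fires if p is to the left of edge a->b (inside for CCW polygons);
            # the cross product is an affine function of p, so this is a threshold unit
            total += 1 if edge[0] * to_p[1] - edge[1] * to_p[0] >= 0 else 0
        return 1 if total >= N else 0
    return predict

# Example: the unit square as a 4-sided polygon net
square = polygon_net([(0, 0), (1, 0), (1, 1), (0, 1)])
print(square((0.5, 0.5)), square((2.0, 0.5)))   # 1 0
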
How about a heptagon? 16 sides? 64 sides? 1000 sides?

[Plots: polygons with increasing numbers of sides, and the value of Σ yi in each region]

• What are the sums in the different regions?
  – A pattern emerges as we consider N > 6…
  – N is the number of sides of the polygon
Polygon net

[Diagram: N threshold units y1 … yN over (x1, x2) feed an output unit that tests Σ_{i=1..N} yi ≥ N?]
In the limit

[Diagram: N threshold units over (x1, x2) feed an output that tests Σ_{i=1..N} yi ≥ N?; a plot shows the value of the sum as a function of the distance of x from the polygon's center, for increasing N]

• Value of the sum at the output unit, as a function of distance from the center, as N increases
• For a small radius, it's a near-perfect cylinder
  – N in the cylinder, N/2 outside
Composing a circle

[Diagram: a very large number N of threshold units feed an output that tests Σ_{i=1..N} yi ≥ N?; the sum is N inside the circle and N/2 outside]

• The circle net
  – Very large number of neurons
  – Sum is N inside the circle, N/2 outside almost everywhere
  – Circle can be at any location
Adding circles

  Σ_{i=1..2N} yi - N/2 ≥ 0?

• The "sum" of two circle sub-nets is exactly N/2 inside either circle, and 0 almost everywhere outside
Composing an arbitrary figure

  Σ_{i=1..KN} yi - N/2 ≥ 0?

• Just fit in an arbitrary number of circles
  – More accurate approximation with a greater number of smaller circles
  – Can achieve arbitrary precision
MLP: Universal classifier

  Σ_{i=1..KN} yi - N/2 ≥ 0?

• MLPs can capture any classification boundary
• A one-hidden-layer MLP can model any classification boundary
• MLPs are universal classifiers
MLP as a continuous-valued regression

[Diagram: input x feeds two threshold units with thresholds T1 and T2; their outputs are combined with weights +1 and -1 by a summing output unit, producing f(x), a square pulse between T1 and T2]

• A simple 3-unit MLP with a "summing" output unit can generate a "square pulse" over an input
  – Output is 1 only if the input lies between T1 and T2
  – T1 and T2 can be arbitrarily specified
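
A minimal sketch of the square-pulse construction (the function names and example values are mine):

def step(z):
    return 1.0 if z >= 0 else 0.0

def square_pulse(x, T1, T2):
    """Two threshold units (thresholds T1, T2) combined with weights +1 and -1
    by a summing output unit: 1 for T1 <= x < T2, 0 elsewhere."""
    return step(x - T1) - step(x - T2)

print(square_pulse(0.5, T1=0.0, T2=1.0))   # 1.0 (inside the pulse)
print(square_pulse(1.5, T1=0.0, T2=1.0))   # 0.0 (outside)
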
MLP as a continuous-valued regression

[Diagram: many pulse-generating pairs of threshold units over the input x; each pulse is scaled and all are summed to approximate f(x)]

• A simple 3-unit MLP can generate a "square pulse" over an input
• An MLP with many units can model an arbitrary function over an input
  – To arbitrary precision
    • Simply make the individual pulses narrower
• A one-hidden-layer MLP can model an arbitrary function of a single input
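
A sketch of the approximation argument, reusing square_pulse from the previous sketch (the sine example and the pulse count are illustrative):

import math

def approximate(f, lo, hi, n_pulses):
    """Approximate f on [lo, hi] with a sum of scaled square pulses.

    Each pulse covers one sub-interval and is scaled by the value of f at its
    centre; narrower pulses (larger n_pulses) give a finer approximation.
    """
    width = (hi - lo) / n_pulses
    edges = [lo + i * width for i in range(n_pulses)]
    heights = [f(t + width / 2) for t in edges]

    def f_hat(x):
        # square_pulse is the 3-unit construction from the previous sketch
        return sum(h * square_pulse(x, t, t + width) for h, t in zip(heights, edges))
    return f_hat

sin_hat = approximate(math.sin, 0.0, math.pi, n_pulses=200)
print(round(sin_hat(math.pi / 2), 3), round(math.sin(math.pi / 2), 3))  # ~1.0 vs 1.0
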
For higher dimensions

[Diagram: a circle net (output N inside the circle, N/2 outside) followed by a summing unit that adds a bias of -N/2, producing a cylinder]

• An MLP can compose a cylinder
  – N/2 in the circle, 0 outside
MLP as a continuous-valued function

[Diagram: many cylinders over the input space, scaled and summed by an additive output unit]

• MLPs can actually compose arbitrary functions in any number of dimensions!
  – Even with only one hidden layer
• As sums of scaled and shifted cylinders
  – To arbitrary precision
    • By making the cylinders thinner
• The MLP is a universal approximator!
MLPs with additive output units are universal approximators

[Diagram: the output unit computes a weighted sum over N groups of K hidden units (indices (i-1)·K + j, for i = 1…N, j = 1…K), one group per cylinder]

• MLPs can actually compose arbitrary functions
• But the explanation so far only holds if the output unit only performs summation
  – i.e. does not have an additional "activation"
The network as a function

• Output unit with activation function
  – Threshold or sigmoid, or any other
• The network is actually a universal map from the entire domain of input values to the entire range of the output activation
  – i.e. all the values the activation function of the output neuron can take
A GEOMETRICAL VIEW OF
PERCEPTRONS
Weight-space

• This space has one dimension per weight.

• A point in the space represents a particular setting of all the weights.

• Assuming that we have eliminated the threshold, each training case can
be represented as a hyperplane through the origin.
– The weights must lie on one side of this hyper-plane to get the answer
correct.
Weight space

• Each training case defines a plane (shown as a black line in the diagram)
  – The plane goes through the origin and is perpendicular to the input vector.
  – On one side of the plane the output is wrong because the scalar product of the weight vector with the input vector has the wrong sign.

[Diagram: an input vector with correct answer = 1; a good weight vector on the correct side of the plane through the origin, and a bad weight vector on the wrong side]
Weight space

• Each training case defines a plane (shown as a black line in the diagram)
  – The plane goes through the origin and is perpendicular to the input vector.
  – On one side of the plane the output is wrong because the scalar product of the weight vector with the input vector has the wrong sign.

[Diagram: an input vector with correct answer = 0; good weights and bad weights on either side of the plane through the origin]
The cone of feasible solutions

• To get all training cases right we need to find a point on the right side of all the planes.
  – There may not be any such point!
• If there are any weight vectors that get the right answer for all cases, they lie in a hyper-cone with its apex at the origin.
  – So the average of two good weight vectors is a good weight vector.
• The problem is convex.

[Diagram: two input vectors, one with correct answer = 0 and one with correct answer = 1, their planes through the origin, and the cone of good weight vectors between them; bad weights lie outside the cone]
Learning with hidden units
• Networks without hidden units are very limited in the input-output mappings they can
learn to model.
– More layers of linear units do not help. It's still linear.
– Fixed output non-linearities are not enough.
• We need multiple layers of adaptive, non-linear hidden units. But how can we train such
nets?
– We need an efficient way of adapting all the weights, not just the last layer. This is
hard.
– Learning the weights going into hidden units is equivalent to learning features.
– This is difficult because nobody is telling us directly what the hidden units should do.
LEARNING THE WEIGHTS OF A LOGISTIC
OUTPUT NEURON
Logistic neurons

  z = b + Σi xi·wi
  y = 1 / (1 + e^(-z))

• These give a real-valued output that is a smooth and bounded function of their total input.
  – They have nice derivatives which make learning easy.

[Plot: y = 1/(1 + e^(-z)), rising from 0 through y = 0.5 at z = 0 toward 1]
The derivatives of a logistic neuron

• The derivatives of the logit, z, with respect to the inputs and the weights are very simple:

  z = b + Σi xi·wi
  ∂z/∂wi = xi,   ∂z/∂xi = wi

• The derivative of the output with respect to the logit is simple if you express it in terms of the output:

  y = 1 / (1 + e^(-z))
  dy/dz = y(1 - y)
The derivatives of a logistic neuron

  y = 1 / (1 + e^(-z)) = (1 + e^(-z))^(-1)

  dy/dz = -1·(-e^(-z)) / (1 + e^(-z))^2 = [1 / (1 + e^(-z))] · [e^(-z) / (1 + e^(-z))] = y(1 - y)

because

  e^(-z) / (1 + e^(-z)) = [(1 + e^(-z)) - 1] / (1 + e^(-z)) = 1 - 1/(1 + e^(-z)) = 1 - y
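
A small numerical check of dy/dz = y(1 - y) (plain Python; the test point is arbitrary):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """dy/dz expressed in terms of the output: y * (1 - y)."""
    y = sigmoid(z)
    return y * (1.0 - y)

# Compare the analytic derivative with a central finite difference at z = 0.7
z, eps = 0.7, 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(round(sigmoid_grad(z), 6), round(numeric, 6))   # the two values should agree
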
Using the chain rule to get the derivatives needed for learning the weights of a logistic unit

• To learn the weights we need the derivative of the output with respect to each weight:

  ∂y/∂wi = (∂z/∂wi)·(dy/dz) = xi·y(1 - y)

• Delta rule:

  ∂E/∂wi = Σn (∂y^n/∂wi)·(∂E/∂y^n) = -Σn xi^n · y^n(1 - y^n) · (t^n - y^n)

  The extra term y^n(1 - y^n) is the slope of the logistic.
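
A sketch of this gradient for a squared-error loss E = ½ Σn (t^n - y^n)^2, in plain Python (the function names, toy data, and learning rate are mine):

import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_gradient(X, t, w, b):
    """Gradient of E = 0.5 * sum_n (t_n - y_n)^2 for a single logistic unit.

    Implements dE/dw_i = -sum_n x_i^n * y^n (1 - y^n) * (t^n - y^n), i.e. the
    delta rule above with the extra slope-of-the-logistic term.
    X is a list of input lists, t a list of targets in [0, 1].
    """
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, target in zip(X, t):
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        y = sigmoid(z)
        delta = -(target - y) * y * (1.0 - y)          # dE/dz for this case
        for i, xi in enumerate(x):
            grad_w[i] += delta * xi
        grad_b += delta
    return grad_w, grad_b

# One gradient-descent step on a tiny illustrative dataset
X, t = [[0.0, 1.0], [1.0, 1.0]], [0.0, 1.0]
w, b, lr = [0.1, -0.2], 0.0, 0.5
gw, gb = logistic_gradient(X, t, w, b)
w = [wi - lr * gi for wi, gi in zip(w, gw)]
b = b - lr * gb
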
