Lec3 Learning

This document discusses the fundamentals of neural networks, focusing on the learning process, perceptron rules, and the architecture of multi-layer perceptrons (MLPs). It explains how neural networks can approximate any function given sufficient capacity and highlights the importance of learning parameters through empirical risk minimization and gradient descent. The document also covers the perceptron learning algorithm and its convergence properties, particularly in relation to linearly separable data.

Uploaded by

namanziez


Neural Networks

Learning the network: Part 1

11-785, Fall 2021
Lecture 3

1
Topics for the day
• The problem of learning
• The perceptron rule for perceptrons
– And its inapplicability to multi-layer perceptrons
• Greedy solutions for classification networks:
ADALINE and MADALINE
• Learning through Empirical Risk Minimization
• Intro to function optimization and gradient
descent
2
Recap

• Neural networks are universal function approximators

– Can model any Boolean function
– Can model any classification boundary
– Can model any continuous valued function

• Provided the network satisfies minimal architecture constraints

– Networks with fewer than the required number of parameters can be very
poor approximators

3
These boxes are functions
[Figure: three boxes labelled N.Net, each mapping an input to an output. Voice signal → Transcription; Image → Text caption; Game state → Next move]


• Take an input
• Produce an output
• Can be modeled by a neural network!
4
Questions

[Figure: Something in → N.Net → Something out]

• Preliminaries:
– How do we represent the input?
– How do we represent the output?
• How do we compose the network that performs
the requisite function?
5
The original perceptron

• Simple threshold unit

– Unit comprises a set of weights and a threshold
7
Preliminaries: The units in the network – the perceptron

[Figure: a perceptron computing a weighted sum of its inputs plus a bias, followed by an activation]
• Perceptron
– General setting, inputs are real valued
– A bias representing a threshold to trigger the perceptron
– Activation functions are not necessarily threshold functions
• The parameters of the perceptron (which determine how it behaves) are its weights and bias
8
Preliminaries: Redrawing the neuron

[Figure: the perceptron redrawn with the bias as the weight of an extra input fixed at 1]

• The bias can also be viewed as the weight of another input component that is always set to 1
– If the bias is not explicitly mentioned, we will implicitly be assuming
that every perceptron has an additional input that is always fixed at 1
9
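The equivalence between an explicit bias and an extra input fixed at 1 can be checked with a minimal sketch. The function names, the weights and the test points are assumptions chosen for illustration; they are not from the lecture.

```python
def perceptron(x, w, b):
    """Threshold unit: outputs 1 when the weighted sum plus bias is non-negative."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0


def perceptron_extended(x, w_ext):
    """Same unit, with the bias stored as the last weight and a constant 1 appended to x."""
    x_ext = list(x) + [1.0]
    z = sum(wi * xi for wi, xi in zip(w_ext, x_ext))
    return 1 if z >= 0 else 0


# Hypothetical weights, bias and inputs, for illustration only
w, b = [2.0, -1.0], -0.5
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.25, 0.0)]

# Both formulations should agree on every point
same = all(perceptron(p, w, b) == perceptron_extended(p, w + [b]) for p in points)
```

Folding the bias into the weight vector is what later lets the decision boundary be treated as a hyperplane through the origin.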
First: the structure of the network

• We will assume a feed-forward network

– No loops: Neuron outputs do not feed back to their inputs directly or
indirectly
– Loopy networks are a future topic
• Part of the design of a network: The architecture
– How many layers/neurons, which neuron connects to which and how, etc.
• For now, assume the architecture of the network is capable of
representing the needed function
10
What we learn: The parameters of the network
[Figure: a feed-forward network with bias inputs fixed at 1. Callout: The network is a function f() with parameters W which must be set to the appropriate values to get the desired behavior from the net]

• Given: the architecture of the network
• The parameters of the network: The weights and biases
– The weights associated with the blue arrows in the picture
• Learning the network : Determining the values of these
parameters such that the network computes the desired function
11
• Moving on..

12
The MLP can represent anything

• The MLP can be constructed to represent anything

• But how do we construct it?

13
Option 1: Construct by hand
[Figure: a diamond-shaped decision region with vertices (0,1), (1,0), (0,-1), (-1,0)]

• Given a function, handcraft a network to satisfy it

• E.g.: Build an MLP to classify this decision boundary
• Not possible for all but the simplest problems..
14
Option 1: Construct by hand
[Figure sequence, slides 15-19: the diamond with vertices (0,1), (1,0), (0,-1), (-1,0) in the (X1, X2) plane, built up one edge at a time. Each edge gets one perceptron over inputs X1, X2:]
• Upper-left edge: weights (1, -1), bias 1
• Upper-right edge: weights (-1, -1), bias 1
• Lower-right edge: weights (-1, 1), bias 1
• Lower-left edge: weights (1, 1), bias 1
• Output perceptron: weight 1 on each of the four edge perceptrons, bias -4
Assuming simple perceptrons: output = 1 if $\sum_i w_i x_i + b \ge 0$
15-19
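The hand-built network for the diamond with vertices (0,1), (1,0), (0,-1), (-1,0) can be sketched as follows. This is a minimal sketch assuming threshold units that fire when the weighted sum plus bias is non-negative; the weights are read off the slide figures, and the function names are illustrative.

```python
def threshold_unit(inputs, weights, bias):
    """Simple perceptron: outputs 1 if sum_i w_i x_i + b >= 0, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else 0


# One hidden unit per edge of the diamond: ((weight on X1, weight on X2), bias)
EDGES = [
    ((1, -1), 1),    # upper-left edge:   X1 - X2 + 1 >= 0
    ((-1, -1), 1),   # upper-right edge: -X1 - X2 + 1 >= 0
    ((-1, 1), 1),    # lower-right edge: -X1 + X2 + 1 >= 0
    ((1, 1), 1),     # lower-left edge:   X1 + X2 + 1 >= 0
]


def in_diamond(x1, x2):
    """Two-layer network: four edge detectors followed by an AND unit."""
    hidden = [threshold_unit((x1, x2), w, b) for w, b in EDGES]
    # Output unit with weights 1,1,1,1 and bias -4 fires only if all four hidden units fire
    return threshold_unit(hidden, (1, 1, 1, 1), -4)
```

Every weight here had to be worked out from the geometry of the boundary, which is why this approach does not scale beyond the simplest problems.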
Option 2: Automatic estimation of an MLP

• More generally, given the function $g(X)$ to model, we can derive the parameters of the network to model it, through computation
21
How to learn a network?

• When $f(X; W)$ has the capacity to exactly represent $g(X)$:
$\hat{W} = \arg\min_W \int_X div(f(X; W), g(X))\, dX$
• $div()$ is a divergence function that goes to zero when $f(X; W) = g(X)$

22
Problem: $g(X)$ is unknown

• Function $g(X)$ must be fully specified
– Known everywhere, i.e. for every input $X$
• In practice we will not have such specification
23
Sampling the function

• Sample $g(X)$
– Basically, get input-output pairs for a number of samples of input $X$
• Many samples $(X_i, d_i)$, where $d_i = g(X_i) + noise$
• Very easy to do in most problems: just gather training data
– E.g. set of images and their class labels
– E.g. speech recordings and their transcription
24
Drawing samples

• We must learn the entire function from these few examples
– The “training” samples
25
Learning the function

• Estimate the network parameters to “fit” the training points exactly
– Assuming network architecture is sufficient for such a fit
– Assuming unique output d at any X
• And hopefully the resulting function is also correct where we don’t have training samples
26
Story so far
• “Learning” a neural network == determining the parameters of the
network (weights and biases) required for it to model a desired
function
– The network must have sufficient capacity to model the function

• Ideally, we would like to optimize the network to represent the desired function everywhere
• However this requires knowledge of the function everywhere
• Instead, we draw “input-output” training instances from the
function and estimate network parameters to “fit” the input-output
relation at these instances
– And hope it fits the function elsewhere as well

27
Poll 1

28
Poll 1
• Since neural networks are universal approximators, any network of any
architecture can approximate any function to arbitrary precision (True or
False):
– True
– False

• Which of the following are true regarding how to compose a network to approximate a given function?
– The network architecture must have sufficient capacity to model the
function
– The network is actually a parametric function, whose parameters are its
weights and biases
– The parameters must be learned to best approximate the target function
– The parameters can be perfectly learned from just a few training samples of
the target function, even if the actual target function is unknown.
29
Let’s begin with a simple task
• Learning a classifier
– Simpler than regression
• This was among the earliest problems addressed using MLPs
• Specifically, consider binary classification
– Generalizes to multi-class

30
History: The original MLP

[Figure: a network of threshold units]
• The original MLP as proposed by Minsky: a
network of threshold units
– But how do you train it?
• Given only “training” instances of input-output pairs
31
The simplest MLP: a single perceptron

[Figure: a step function over the (x1, x2) plane, 1 on one side of a line and 0 on the other]
• Learn this function
– A step function across a hyperplane
32
The simplest MLP: a single perceptron

[Figure: sample points drawn from the step function over the (x1, x2) plane]
• Learn this function
– A step function across a hyperplane
– Given only samples from it
33
Learning the perceptron

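[Figure: a threshold perceptron over inputs x1, x2]
• Given a number of input output pairs, learn the weights and bias
– $y = 1$ if $\sum_{i=1}^{N} w_i x_i + b \ge 0$, and $0$ otherwise
– Boundary: $\sum_{i=1}^{N} w_i x_i + b = 0$
– Learn $W = [w_1, \dots, w_N]$ and $b$, given several $(X, y)$ pairs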

34
Restating the perceptron
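[Figure: perceptron with inputs x1, x2, …, xN and an extra input xN+1 = 1 carrying weight WN+1]
• Restating the perceptron equation by adding another dimension to $X$:
$y = 1$ if $\sum_{i=1}^{N+1} w_i x_i \ge 0$, and $0$ otherwise, where $x_{N+1} = 1$
• Note that the boundary $\sum_{i=1}^{N+1} w_i x_i = 0$ is now a hyperplane through origin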

35
The Perceptron Problem

• Find the hyperplane $\sum_{i=1}^{N+1} w_i x_i = 0$ that perfectly separates the two groups of points
– Note: $W = [w_1, w_2, \dots, w_{N+1}]$ is a vector that is orthogonal to the hyperplane
• In fact the equation for the hyperplane itself means “the set of all Xs that are orthogonal to $W$”
36
The Perceptron Problem
Key: Red = +1, Blue = -1

• Find the hyperplane that perfectly separates the two groups of points
– Let vector $W = [w_1, w_2, \dots, w_{N+1}]$ and vector $X = [x_1, x_2, \dots, x_N, 1]$
– $\sum_i w_i x_i = W^T X$ is an inner product
– $W^T X = 0$ is the hyperplane comprising all $X$s orthogonal to vector $W$
• Learning the perceptron = finding the weight vector $W$ for the separating hyperplane
• $W$ points in the direction of the positive class
37
The Perceptron Problem
Key: Red = +1, Blue = -1

• Learning the perceptron: Find the weights vector $W$ such that the plane described by $W^T X = 0$ perfectly separates the classes
– $W^T X$ is positive for all red dots and negative for all blue ones
38
The online perceptron solution
• The popular solution, originally proposed by Rosenblatt, is an online algorithm
– The famous “perceptron” algorithm
• Initializes $W$ and incrementally updates it each time we encounter an instance that is incorrectly classified
– Guaranteed to find the correct solution for linearly separable data
– On following slides, but will not cover in class

39
Perceptron Algorithm: Summary

• Cycle through the training instances
• Only update $W$ on misclassified instances
• If instance misclassified:
– If instance is positive class (positive misclassified as negative): $W = W + X_i$
– If instance is negative class (negative misclassified as positive): $W = W - X_i$

40
Perceptron Learning Algorithm
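• Given $N$ training instances $(X_1, y_1), (X_2, y_2), \dots, (X_N, y_N)$
– $y_i = +1$ or $-1$ (using a +1/-1 representation for classes to simplify notation)
• Initialize $W$
• Cycle through the training instances:
• do
– For $i = 1 \dots N_{train}$
• $O(X_i) = sign(W^T X_i)$
• If $O(X_i) \ne y_i$: $W = W + y_i X_i$
• until no more classification errors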

41
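The perceptron learning algorithm can be sketched in a few lines. This is a minimal sketch, assuming +1/-1 labels, a weight vector initialized to zero, and inputs extended with a constant 1 for the bias; the toy data set, the function names and the `max_epochs` safeguard are assumptions added for illustration.

```python
def sign(z):
    """Threshold output in the +1/-1 representation (0 is treated as +1)."""
    return 1 if z >= 0 else -1


def predict(w, x):
    """Classify x with weights w over the 1-extended input [x..., 1]."""
    return sign(sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])))


def train_perceptron(samples, max_epochs=1000):
    """samples: list of (x, y) with y in {+1, -1}. Returns the learned weight vector."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):          # safeguard: stop even if data are not separable
        errors = 0
        for x, y in samples:
            if predict(w, x) != y:
                # Add misclassified positive instances, subtract misclassified negative ones
                x_ext = list(x) + [1.0]
                w = [wi + y * xi for wi, xi in zip(w, x_ext)]
                errors += 1
        if errors == 0:                  # until no more classification errors
            break
    return w


# Hypothetical linearly separable toy data
data = [((2.0, 1.0), 1), ((1.5, 2.0), 1), ((3.0, 3.0), 1),
        ((-1.0, -1.5), -1), ((-2.0, 0.0), -1), ((0.0, -2.0), -1)]
w = train_perceptron(data)
```

If the classes are not linearly separable the inner loop never reaches zero errors, so the sketch simply stops after `max_epochs` passes.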
A Simple Method: The Perceptron
Algorithm

[Figure: red (+1) and blue (-1) points with a randomly placed hyperplane and its normal vector W]
• Initialize: Randomly initialize the hyperplane
– I.e. randomly initialize the normal vector $W$
• Classification rule $sign(W^T X)$
– Vectors on the same side of the hyperplane as $W$ will be assigned +1 class, and those on the other side will be assigned -1
• The random initial plane will make mistakes
42
Perceptron Algorithm
[Figure sequence, slides 43-52: red (+1) and blue (-1) points with the current weight vector W and its boundary. The captions, in order:]
• Initialization
• Misclassified negative instance: subtract it from W
• The new weight (and boundary)
• Misclassified positive instance: add it to W
• The new weight vector
• The new decision boundary: perfect classification, no more updates, we are done
• If the classes are linearly separable, guaranteed to converge in a finite number of steps
43-52
Convergence of Perceptron Algorithm
• Guaranteed to converge if classes are linearly separable
– After no more than $(R/\gamma)^2$ misclassifications
• Specifically when W is initialized to 0
– $R$ is length of longest training point
– $\gamma$ is the best case closest distance of a training point from the classifier
• Same as the margin in an SVM
– Intuitively: takes many increments of size $\gamma$ to undo an error resulting from a step of size $R$
53
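The mistake bound $(R/\gamma)^2$ can be checked numerically on a toy problem. This is a hedged sketch: the data are made up, inputs are already 1-extended, and instead of the best-case margin it uses the margin of one hand-picked unit-length separator `u` (an assumption), which gives a valid but possibly looser bound.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def count_updates(samples, max_epochs=1000):
    """Run the perceptron rule from W = 0 and return the number of weight updates."""
    w = [0.0] * len(samples[0][0])
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for x, y in samples:
            if y * dot(w, x) <= 0:       # misclassified, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                updates += 1
                changed = True
        if not changed:
            break
    return updates


# Hypothetical 1-extended training points with +1/-1 labels
samples = [((2.0, 1.0, 1.0), 1), ((1.5, 2.0, 1.0), 1), ((3.0, 3.0, 1.0), 1),
           ((-1.0, -1.5, 1.0), -1), ((-2.0, 0.0, 1.0), -1), ((0.0, -2.0, 1.0), -1)]

u = [1 / math.sqrt(2), 1 / math.sqrt(2), 0.0]          # assumed unit-length separator
R = max(math.sqrt(dot(x, x)) for x, _ in samples)      # length of the longest point
gamma = min(y * dot(u, x) for x, y in samples)         # margin achieved by u
bound = (R / gamma) ** 2
updates = count_updates(samples)
```

A larger margin or shorter training vectors both tighten the bound, which matches the intuition given on the slide.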
Perceptron Algorithm
[Figure: separating hyperplane with red and blue points; key as drawn on this slide: -1 (red), +1 (blue)]
$\gamma$ is the best-case margin
$R$ is the length of the longest vector
54
The Perceptron Solution:
when classes are not linearly separable
Key: Red = +1, Blue = -1

• When classes are not linearly separable, not possible to find a separating hyperplane
– No “support” plane for reflected data
– Some points will always lie on the other side
• Model does not support perfect classification of this data
55
A simpler solution
Key: Red = +1, Blue = -1

• Reflect all the negative instances across the origin
– Negate every component of vector $X$
• If we use class $y \in \{+1, -1\}$ notation for the labels (instead of $y \in \{0, 1\}$), we can simply write the “reflected” values as $X' = yX$
– Will retain the features $X$ for the positive class, but reflect/negate them for the negative class
56
The Perceptron Solution
Key: Red = +1, Blue = -1

• Learning the perceptron: Find a plane such that all the modified ($X' = yX$) features lie on one side of the plane
– Such a plane can always be found if the classes are linearly separable
57
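The reflection trick can be illustrated with a short sketch: after multiplying each 1-extended input by its +1/-1 label, a separating weight vector is one for which every reflected point has a positive inner product. The data and both candidate weight vectors are hypothetical.

```python
def reflect(x, y):
    """Keep x for the positive class (y = +1), negate every component for y = -1."""
    return [y * xi for xi in x]


def all_on_positive_side(w, samples):
    """True if every reflected point lies strictly on the positive side of the plane W.X = 0."""
    return all(
        sum(wi * xi for wi, xi in zip(w, reflect(x, y))) > 0
        for x, y in samples
    )


# Hypothetical 1-extended points with +1/-1 labels
samples = [((2.0, 1.0, 1.0), 1), ((-1.0, -1.5, 1.0), -1), ((-2.0, 0.0, 1.0), -1)]

good_w = [1.0, 1.0, 0.0]     # separates the classes
bad_w = [-1.0, -1.0, 0.0]    # points the wrong way
```

With this view the learning problem has a single condition to satisfy for all instances, rather than one condition per class.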
The Perceptron Solution:
Linearly separable case
Key: Red = +1, Blue = -1

• When classes are linearly separable: a trivial solution
• Other solutions are also possible, e.g. max-margin solution
58

The Perceptron Solution:
when classes are not linearly separable
Key: Red = +1, Blue = -1

• Learn an MLP for this function
– 1 in the yellow regions, 0 outside
• Using just the samples
• We know this can be perfectly represented using an MLP
60
More complex decision boundaries

[Figure: the double-pentagon decision region and an MLP over inputs x1, x2]
• Even using the perfect architecture…
• … can we use perceptron learning rules to learn
this classification function?
61
The pattern to be learned at the
lower level

[Figure: the double-pentagon data and an MLP over inputs x1, x2, with the lower-level neurons highlighted]
• The lower-level neurons are linear classifiers
– They require linearly separated labels to be learned
– The actually provided labels are not linearly separated
– Challenge: Must also learn the labels for the lowest units!
62
The pattern to be learned at the
lower level

[Figure: the double-pentagon data and an MLP over inputs x1, x2, with one lower-level neuron highlighted]
• Consider a single linear classifier that must be learned from the training data
– Can it be learned from this data?
63
Poll 2

64
Poll 2
• For the double-pentagon problem, given the data shown on slide 60 and
given that all but the one neuron highlighted in yellow are already
correctly learned, can we use the perceptron learning algorithm to learn
the one remaining neuron?
– Yes
– No

• What problems do you see in using the perceptron rule to learn the
remaining perceptron?
– Perceptron learning will require linearly separable classes to learn the model
that classifies the data perfectly, but the data are not linearly separable
– Perceptron learning will require relabelling the data to make them linearly
separable with the correct decision boundary

65
The pattern to be learned at the
lower level

[Figure: the data relabelled as required by the single highlighted linear classifier]
• Consider a single linear classifier that must be learned from the training data
– Can it be learned from this data?
– The individual classifier actually requires the kind of labelling shown here
• Which is not given!!
66
The pattern to be learned at the
lower level

[Figure: the double-pentagon data with one candidate line and an MLP over inputs x1, x2]
• For a single line:
– Try out every possible way of relabeling the blue dots such that we can learn a line that keeps all the red dots on one side!
68
The pattern to be learned at the
lower level

[Figure: the double-pentagon data with all candidate lines and an MLP over inputs x1, x2]
• This must be done for each of the lines (perceptrons)
• Such that, when all of them are combined by the higher-level perceptrons we get the desired pattern
– Basically an exponential search over inputs
69
[Figure: an MLP over inputs x1, x2 next to the double-pentagon boundary it must form]
• Individual neurons represent one of the lines that compose the figure (linear classifiers)
• Must know the desired output of every neuron for every training instance, in order to learn this neuron
• The outputs should be such that the neuron individually has a linearly separable task
• The linear separators must combine to form the desired boundary
• This must be done for every neuron
• Getting any of them wrong will result in incorrect output!
70
Learning a multilayer perceptron

[Figure: an MLP over inputs x1, x2, with two callouts:]
Training data only specifies input and output of network
Intermediate outputs (outputs of individual neurons) are not specified
• Training this network using the perceptron rule is a combinatorial optimization
problem
• We don’t know the outputs of the individual intermediate neurons in the network
for any training input
• Must also determine the correct output for each neuron for every training
instance
• At least exponential (in inputs) time complexity!!!!!!
71
Greedy algorithms: Adaline and
Madaline
• Perceptron learning rules cannot directly be
used to learn an MLP
– Exponential complexity of assigning intermediate
labels
• Even worse when classes are not actually separable

• Can we use a greedy algorithm instead?

– Adaline / Madaline
– On slides, will skip in class (check the quiz)

72
A little bit of History: Widrow
Bernie Widrow
• Scientist, Professor, Entrepreneur
• Inventor of most useful things in
signal processing and machine
learning!

• First known attempt at an analytical solution to training the perceptron and the MLP
• Now famous as the LMS algorithm
– Used everywhere
– Also known as the “delta rule”
73
History: ADALINE
[Figure callout: Using 1-extended vector notation to account for bias]

• Adaptive linear element (Hoff and Widrow, 1960)

• Actually just a regular perceptron

– Weighted sum on inputs and bias passed
through a thresholding function
• ADALINE differs in the learning rule
74
History: Learning in ADALINE

• During learning, minimize the squared error assuming $z$ to be real output
• The desired output is still binary!

Error for a single input: $Err(x) = \frac{1}{2}(d - z)^2$

75
History: Learning in ADALINE

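Error for a single input: $Err(x) = \frac{1}{2}(d - z)^2$, so $\frac{dErr(x)}{dw_i} = -(d - z)x_i$

• If we just have a single training input, the gradient descent update rule is $w_i = w_i + \eta(d - z)x_i$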

76
The ADALINE learning rule
• Online learning rule
• After each input $X$, that has target (binary) output $d$, compute $\delta = d - z$ and update: $W = W + \eta\delta X$
• This is the famous delta rule
– Also called the LMS update rule

77
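The ADALINE (delta / LMS) rule differs from the perceptron rule only in using the pre-threshold value z to compute the error. A minimal sketch follows, assuming +1/-1 targets, 1-extended inputs, and a hypothetical learning rate `eta` and epoch count chosen small enough for this toy data.

```python
def adaline_train(samples, eta=0.05, epochs=50):
    """samples: list of (x, d) with binary target d in {+1, -1}. Returns the weights."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(epochs):
        for x, d in samples:
            x_ext = list(x) + [1.0]
            z = sum(wi * xi for wi, xi in zip(w, x_ext))   # linear output, before thresholding
            delta = d - z                                   # error measured on z, not on the class
            w = [wi + eta * delta * xi for wi, xi in zip(w, x_ext)]
    return w


def adaline_classify(w, x):
    """The unit itself is still a threshold perceptron at classification time."""
    z = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if z >= 0 else -1


# Hypothetical symmetric toy data
data = [((1.0, 1.0), 1), ((2.0, 2.0), 1), ((-1.0, -1.0), -1), ((-2.0, -2.0), -1)]
w = adaline_train(data)
```

Unlike the perceptron rule, this update is applied on every instance, including correctly classified ones, because z rarely equals the target exactly.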
The Delta Rule
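[Figure: Perceptron and ADALINE units side by side, each with inputs x and 1, linear output z, thresholded output y, target d and error δ]
• In fact both the Perceptron and ADALINE use variants of the delta rule!
– Perceptron: Output used in delta rule is $y$: $\delta = d - y$
– ADALINE: Output used to estimate weights is $z$: $\delta = d - z$
• For both: $W = W + \eta\delta X$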

78
Aside: Generalized delta rule
• For any differentiable activation function $f(z)$ the following update rule is used, with $y = f(z)$ and $\delta = d - y$: $W = W + \eta\delta f'(z)X$
• This is the famous Widrow-Hoff update rule
– Lookahead: Note that this is exactly backpropagation in multilayer nets if we let $f(z)$ represent the entire network between $z$ and $y$
• It is possibly the most-used update rule in machine learning and signal processing
– Variants of it appear in almost every problem
79
Multilayer perceptron: MADALINE
[Figure: a layered network of threshold units]

• Multiple Adaline
– A multilayer perceptron with threshold activations
– The MADALINE

80
MADALINE Training
[Figure: the MADALINE network, with the output error marked]

• Update only on error
– On inputs for which output and target values differ
81
MADALINE Training
[Figure sequence, slides 82-86: the same network, stepping through the loop below]

• While stopping criterion not met do:
– Classify an input
– If error, find the z that is closest to 0
– Flip the output of corresponding unit and compute new output
– If error reduces:
• Set the desired output of the unit to the flipped value
• Apply ADALINE rule to update weights of the unit
82-86
Story so far
• “Learning” a network = learning the weights and biases to compute a target function
– Will require a network with sufficient “capacity”
• In practice, we learn networks by “fitting” them to match the input-output relation of
“training” instances drawn from the target function

• A linear decision boundary can be learned by a single perceptron (with a threshold-function activation) in linear time if classes are linearly separable

• Non-linear decision boundaries require networks of perceptrons

• Training an MLP with threshold-function activation perceptrons will require knowledge of the input-output relation for every training instance, for every perceptron in the network
– These must be determined as part of training
– For threshold activations, this is an NP-complexity combinatorial optimization problem

87
History..
• The realization that training an entire MLP was
a combinatorial optimization problem stalled
development of neural networks for well over
a decade!

88
Why this problem?

• The perceptron is a flat function with zero derivative everywhere, except at 0 where it is non-differentiable
– You can vary the weights a lot without changing the error
– There is no indication of which direction to change the weights to reduce error
89
This only compounds on larger
problems

x1 x2

• Individual neurons’ weights can change significantly without changing overall error
• The simple MLP is a flat, non-differentiable function
– Actually a function with 0 derivative nearly everywhere, and no derivatives at
the boundaries
90
A second problem: What we actually
model

• Real-life data are rarely clean

– Not linearly separable
– Rosenblatt’s perceptron wouldn’t work in the first
place
91
Solution

[Figure: a perceptron with a smooth activation]
• Let’s make the neuron differentiable, with non-zero derivatives over much of the input space
– Small changes in weight can result in non-negligible changes in output
– This enables us to estimate the parameters using gradient descent techniques
92
Differentiable activation function
[Figure, top: two threshold activations with thresholds T1 and T2 on the x axis]
• Threshold activation: shifting the threshold from T1 to T2 does not change classification error
– Does not indicate if moving the threshold left was good or not
[Figure, bottom: two smooth activations crossing 0.5 at T1 and T2]
• Smooth, continuously varying activation: Classification based on whether the output is greater than 0.5 or less
– Can now quantify how much the output differs from the desired target value (0 or 1)
– Moving the function left or right changes this quantity, even if the classification error itself doesn’t change
93
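The contrast between the two activations can be made concrete with a small numerical sketch. The 1-D data, the two thresholds and the squared distance used to measure how far the output is from its target are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 1-D data with 0/1 targets; no threshold separates them perfectly
xs = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ds = [0, 0, 1, 0, 1, 1]


def classification_errors(T):
    """Number of mistakes made by a threshold activation placed at T."""
    return sum((1 if x >= T else 0) != d for x, d in zip(xs, ds))


def squared_distance(T):
    """Total squared distance between a sigmoid centered at T and the targets."""
    return sum((sigmoid(x - T) - d) ** 2 for x, d in zip(xs, ds))


T1, T2 = 1.2, 1.8   # both thresholds lie between the samples at x = 1 and x = 2
```

Moving the threshold from `T1` to `T2` leaves `classification_errors` unchanged, while `squared_distance` changes and therefore indicates which of the two positions is better.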
Poll 3

94
Poll 3
• Which of the following are true of the threshold activation
– Increasing (or decreasing) the threshold will not change the overall classification error unless
the threshold moves past a misclassified training sample
– We cannot know if a change (increase or decrease) of the threshold moves it in the correct direction that will result in a net decrease in classification error
– The derivative of the classification error with respect to the threshold gives us an indication of
whether to increase or decrease the threshold

• Which of the following are true of the continuous activation (sigmoid)

– Shifting the function left or right will not change the overall classification error unless the
crossover point (where the function crosses 0.5) moves past a misclassified training sample
– Shifting the function will change the total distance of the value of the function from its
target value at the training instances
– The derivative of the total distance with respect to the shift of the function gives us an
indication of which direction to shift the function to improve classification error

95
Continuous Activations

[Figure: a perceptron with a continuous activation]
• Replace the threshold activation with continuous graded activations
– E.g. ReLU, softplus, sigmoid etc.
• The activations are differentiable almost everywhere
– Have derivatives almost everywhere
– And have “subderivatives” at non-differentiable corners
• Bounds on the derivative that can substitute for derivatives in our setting
• More on these later
96
The sigmoid activation is special

[Figure: a perceptron with a sigmoid activation]
• This particular one has a nice interpretation
• It can be interpreted as $P(y = 1 \mid x)$
97
Non-linearly separable data

[Figure: points over the (x1, x2) plane, with class shown as height]
• Two-dimensional example
– Blue dots (on the floor) on the “red” side
– Red dots (suspended at Y=1) on the “blue” side
– No line will cleanly separate the two colors
98
Non-linearly separable data: 1-D example
[Figure: 1-D data, with red dots at Y=1 and blue dots at Y=0 interleaved along the x axis]
• One-dimensional example for visualization
– All (red) dots at Y=1 represent instances of class Y=1
– All (blue) dots at Y=0 are from class Y=0
– The data are not linearly separable
• In this 1-D example, a linear separator is a threshold
• No threshold will cleanly separate red and blue dots
99
The probability of y=1
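[Figure sequence: the 1-D data with a window sliding along the x axis; the curve plots the average value of Y within the window]
• Consider this differently: at each point look at a small window around that point
• Plot the average value within the window
– This is an approximation of the probability of Y=1 at that point
100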
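The windowed average can be sketched directly. The data, the window centers and the window half-width below are hypothetical; the function returns the fraction of class-1 samples that fall inside the window.

```python
def windowed_probability(xs, ys, center, half_width):
    """Average label of the samples whose x lies within half_width of center."""
    inside = [y for x, y in zip(xs, ys) if abs(x - center) <= half_width]
    return sum(inside) / len(inside) if inside else None


# Hypothetical 1-D data: class 1 becomes more common going left to right
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

# Estimated probability of Y=1 at three window centers
estimates = [windowed_probability(xs, ys, c, 1.5) for c in (1.0, 3.5, 6.0)]
```

Sliding the window along the axis traces out a curve that rises from near 0 to near 1, which is the shape the logistic model is meant to capture.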
The logistic regression model

[Figure: a sigmoid-shaped curve of the probability of class 1 against x, rising from y=0 to y=1]

• Class 1 becomes increasingly probable going left to right
– Very typical in many problems
113
Logistic regression
Decision: y > 0.5?
[Figure: the sigmoid surface when X is a 2-D variable]

• This is the perceptron with a sigmoid activation
– It actually computes the probability that the input belongs to class 1
114
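A perceptron with a sigmoid activation, used as a logistic regression classifier on a 2-D input, can be sketched as below. The weights, bias and input points are hypothetical; only the `y > 0.5` decision rule comes from the slide.

```python
import math


def logistic_perceptron(x, w, b):
    """Sigmoid of the weighted sum: interpreted as the probability of class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters for a 2-D input
w, b = [1.0, 1.0], -1.0

p = logistic_perceptron((2.0, 1.0), w, b)            # z = 2, so p is above 0.5
decision = 1 if p > 0.5 else 0                        # decision rule: y > 0.5?
p_boundary = logistic_perceptron((0.5, 0.5), w, b)    # z = 0, exactly on the boundary
```

Points on the line where the weighted sum is zero get probability 0.5, so the decision boundary is the same hyperplane as for the threshold perceptron.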
Perceptrons and probabilities
• We will return to the fact that perceptrons
with sigmoidal activations actually model class
probabilities in a later lecture

• But for now moving on..

115
Perceptrons with differentiable
activation functions

[Figure: a perceptron computing $z = \sum_i w_i x_i$ and $y = f(z)$]

• $y = f(z)$ is a differentiable function of $z$
– $\frac{dy}{dz}$ is well-defined and finite for all $z$
• Using the chain rule, $y$ is a differentiable function of both inputs $x_i$ and weights $w_i$
• This means that we can compute the change in the output for small changes in either the input or the weights
116
Overall network is differentiable
[Figure does not show bias connections]

• Every individual perceptron is differentiable w.r.t. its inputs and its weights (including “bias” weight)
– Small changes in the parameters result in measurable changes in output
• Using the chain rule can compute how small perturbations of a parameter change the output of the network
– The network output is differentiable with respect to the parameter
117
Overall function is differentiable
[Figure does not show bias connections]

• By extension, the overall function is differentiable w.r.t. every parameter in the network
– We can compute how small changes in the parameters change the output
• For non-threshold activations the derivatives are finite and generally non-zero
• We will derive the actual derivatives using the chain rule later
118
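For a single sigmoid perceptron, the chain-rule derivative of the output with respect to each weight can be compared against a finite-difference estimate. This is a sketch under assumed weights and inputs; the identity dy/dz = y(1 - y) is the standard derivative of the sigmoid.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def output(w, x):
    """y = sigmoid(z), with z = sum_i w_i x_i."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def analytic_grad(w, x):
    """Chain rule: dy/dw_i = dy/dz * dz/dw_i = y (1 - y) x_i."""
    y = output(w, x)
    return [y * (1.0 - y) * xi for xi in x]


def numeric_grad(w, x, eps=1e-6):
    """Central differences: perturb one weight at a time and observe the output."""
    grads = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += eps
        w_minus[i] -= eps
        grads.append((output(w_plus, x) - output(w_minus, x)) / (2 * eps))
    return grads


# Hypothetical weights and a 1-extended input
w, x = [0.5, -0.3, 0.1], [1.0, 2.0, 1.0]
max_gap = max(abs(a - n) for a, n in zip(analytic_grad(w, x), numeric_grad(w, x)))
```

The same perturbation test with a threshold activation would return zero almost everywhere, which is the problem the differentiable activation removes.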
Overall setting for “Learning” the MLP

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$
– $d_i$ is the desired output of the network in response to $X_i$
– $X_i$ and $d_i$ may both be vectors
• …we must find the network parameters such that the network produces the desired output for each training input
– Or a close approximation of it
– The architecture of the network must be specified by us
119
Recap: Learning the function

• When $f(X; W)$ has the capacity to exactly represent $g(X)$:
$\hat{W} = \arg\min_W \int_X div(f(X; W), g(X))\, dX$
• $div()$ is a divergence function that goes to zero when $f(X; W) = g(X)$

120
Minimizing expected divergence

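• More generally, assuming $X$ is a random variable:
$\hat{W} = \arg\min_W \int_X div(f(X; W), g(X))\, P(X)\, dX = \arg\min_W E[div(f(X; W), g(X))]$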

121
Recap: Sampling the function

• We don’t have $g(X)$, so sample $g(X)$
– Obtain input-output pairs for a number of samples of input $X_i$
– Good sampling: the samples of $X$ will be drawn from $P(X)$
• Estimate function from the samples

122
The Empirical risk

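• The expected divergence (or risk) is the average divergence over the entire input space: $E[div(f(X; W), g(X))] = \int_X div(f(X; W), g(X))\, P(X)\, dX$

• The empirical estimate of the expected risk is the average divergence over the samples: $E[div(f(X; W), g(X))] \approx \frac{1}{N}\sum_{i=1}^{N} div(f(X_i; W), d_i)$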

123
Empirical Risk Minimization

• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$
– Quantification of error on the ith instance: $div(f(X_i; W), d_i)$
– Empirical average divergence (Empirical Risk) on all training data: $Loss(W) = \frac{1}{N}\sum_{i} div(f(X_i; W), d_i)$
• Estimate the parameters to minimize the empirical estimate of expected divergence (empirical risk): $\hat{W} = \arg\min_W Loss(W)$
– I.e. minimize the empirical risk over the drawn samples
124
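The empirical risk is just an average of per-instance divergences, which a short sketch makes explicit. The linear model standing in for the network and the squared-error divergence are assumptions for illustration; the lecture leaves the exact form of the divergence for later.

```python
def empirical_risk(f, w, samples, div):
    """Average divergence between f(X_i; W) and d_i over the training samples."""
    return sum(div(f(x, w), d) for x, d in samples) / len(samples)


def linear_model(x, w):
    """Hypothetical stand-in for the network f(X; W): a line with slope w[0], offset w[1]."""
    return w[0] * x + w[1]


def squared_error(y, d):
    """Assumed divergence: goes to zero when the output equals the target."""
    return (y - d) ** 2


# Hypothetical noise-free samples drawn from d = 2x + 1
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

risk_good = empirical_risk(linear_model, [2.0, 1.0], samples, squared_error)
risk_bad = empirical_risk(linear_model, [0.0, 0.0], samples, squared_error)
```

For a fixed training set the risk depends only on the parameters, so learning reduces to searching for the `w` that makes this number smallest.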
Empirical Risk Minimization
[Slide 124 repeated with “error” in place of “divergence”, overlaid with three notes:]
Note 1: It’s really a measure of error, but using standard terminology, we will call it a “Loss”
Note 2: The empirical risk is only an empirical approximation to the true risk, which is our actual minimization objective
Note 3: For a given training set the loss is only a function of W
125
ERM for neural networks
Actual output of network: $Y_i = f(X_i; W)$
Desired output of network: $d_i$
Error on i-th training input: $Div(Y_i, d_i)$
Average training error (loss): $Loss(W) = \frac{1}{N}\sum_{i} Div(Y_i, d_i)$
– What is the exact form of Div()? More on this later

• Optimize network parameters to minimize the total error over all training inputs
126
Problem Statement
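• Given a training set of input-output pairs $(X_1, d_1), (X_2, d_2), \dots, (X_N, d_N)$
• Minimize the following function w.r.t. $W$: $Loss(W) = \frac{1}{N}\sum_{i} div(f(X_i; W), d_i)$
• This is a problem of function minimization
– An instance of optimization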
127
Story so far
• We learn networks by “fitting” them to training instances drawn from a target function

• Learning networks of threshold-activation perceptrons requires solving a hard combinatorial-optimization problem
– Because we cannot compute the influence of small changes to the parameters on the overall error

• Instead we use continuous activation functions with non-zero derivatives to enable us to estimate network parameters
– This makes the output of the network differentiable w.r.t. every parameter in the network
– The logistic activation perceptron actually computes the a posteriori probability of the output given the input

• We define a differentiable divergence between the output of the network and the desired output for the training instances
– And a total error, which is the average divergence over all training instances
• We optimize network parameters to minimize this error
– Empirical risk minimization
• This is an instance of function minimization
128
• A CRASH COURSE ON FUNCTION OPTIMIZATION
– With an initial discussion of derivatives

A brief note on derivatives

• A derivative of a function at any point tells us how much a minute increment to the argument of the function will increment the value of the function
– For any $y = f(x)$, expressed as a multiplier $\alpha$ to a tiny increment $\Delta x$ to obtain the increment $\Delta y$ to the output: $\Delta y = \alpha \Delta x$
– Based on the fact that at a fine enough resolution, any smooth, continuous function is locally linear at any point
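
A small numerical check of the "locally linear" claim, assuming $f(x) = \sin(x)$ as an arbitrary smooth example: the gap between the actual change in the output and the linear prediction (derivative times increment) shrinks quickly as the increment gets smaller.

```python
import math

def f(x):
    return math.sin(x)

def f_prime(x):
    return math.cos(x)  # known derivative of sin

def linearization_error(dx, x=1.0):
    """Gap between the true change in f and the linear prediction f'(x) * dx."""
    actual_change = f(x + dx) - f(x)
    linear_prediction = f_prime(x) * dx
    return abs(actual_change - linear_prediction)

for dx in (1e-1, 1e-2, 1e-3):
    print(dx, linearization_error(dx))
```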
Scalar function of scalar argument

• When $x$ and $y$ are scalar: $y = f(x)$

– Derivative: $\Delta y = \alpha \Delta x$

– Often represented as $\frac{dy}{dx}$
– Or alternately (and more reasonably) as $f'(x)$
Scalar function of scalar argument
[Figure: plot of a curve f(x), annotated with the sign of its derivative (+, −, 0) along x]

• Derivative $f'(x)$ is the rate of change of the function at $x$

– How fast it increases with increasing $x$
– The magnitude of f'(x) gives you the steepness of the curve at x
– Larger |f'(x)| means the function is increasing or decreasing more rapidly

• It will be positive where a small increase in x results in an increase of f(x)
– Regions of positive slope
• It will be negative where a small increase in x results in a decrease of f(x)
– Regions of negative slope

• It will be 0 where the function is locally flat (neither increasing nor decreasing)
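
The sign rules above can be checked numerically. This is a sketch only: the cubic $f(x) = x^3 - 3x$, the central-difference estimate, and the tolerance used to call a slope "flat" are all illustrative assumptions.

```python
def f(x):
    return x ** 3 - 3 * x  # example curve with rising, falling and flat regions

def derivative(func, x, h=1e-6):
    """Central-difference estimate of the derivative of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)

def describe(slope, tol=1e-4):
    """Classify the local behaviour of the function from the sign of its slope."""
    if slope > tol:
        return "increasing"
    if slope < -tol:
        return "decreasing"
    return "locally flat"

for x in (-2.0, 0.0, 1.0):
    slope = derivative(f, x)
    print(x, round(slope, 4), describe(slope))
```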
Multivariate scalar function:
Scalar function of vector argument
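• $y = f(X)$, with $X = [x_1, \ldots, x_n]^T$. Note: $\Delta X = [\Delta x_1, \ldots, \Delta x_n]^T$ is now a vector

• $\Delta y = \alpha \Delta X$, giving us that $\alpha$ is a row vector: $\alpha = [\alpha_1, \ldots, \alpha_n]$, so $\Delta y = \alpha_1 \Delta x_1 + \alpha_2 \Delta x_2 + \cdots + \alpha_n \Delta x_n$

• The partial derivative $\alpha_i$ gives us how $y$ increments when only $x_i$ is incremented
• Often represented as $\frac{\partial y}{\partial x_i}$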
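
A sketch of how the row vector of partial derivatives can be estimated numerically, one component at a time, by incrementing only one input. The function $f(X) = x_1^2 + 3 x_1 x_2$ and the helper names are hypothetical.

```python
def f(X):
    x1, x2 = X
    return x1 ** 2 + 3 * x1 * x2

def partial(func, X, i, h=1e-6):
    """Estimate how func changes when only component i of X is incremented."""
    X_plus, X_minus = list(X), list(X)
    X_plus[i] += h
    X_minus[i] -= h
    return (func(X_plus) - func(X_minus)) / (2 * h)

def derivative(func, X):
    """Row vector of partial derivatives, one entry per component of X."""
    return [partial(func, X, i) for i in range(len(X))]

row = derivative(f, [1.0, 2.0])
print(row)  # analytically: [2*x1 + 3*x2, 3*x1] evaluated at (1, 2)
```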
• The derivative of $y = f(X)$ w.r.t. the vector $X$ is the row vector $\nabla_X y = \left[ \frac{\partial y}{\partial x_1}, \frac{\partial y}{\partial x_2}, \ldots, \frac{\partial y}{\partial x_n} \right]$
– We will be using this symbol, $\nabla_X$, for vector and matrix derivatives
– You may be more familiar with the term “gradient”, which is actually defined as the transpose of the derivative
Gradient of a scalar function of a vector

• The derivative $\nabla_X f(X)$ of a scalar function $f(X)$ of a multi-variate input $X$ is a multiplicative factor that gives us the change in $f(X)$ for tiny variations in $X$

– $\Delta f(X) = \nabla_X f(X) \, \Delta X$
• The gradient is the transpose of the derivative, $\nabla_X f(X)^T$
– A column vector of the same dimensionality as $X$
• The change $\Delta f(X) = \nabla_X f(X) \, \Delta X$ is a vector inner product. To understand its behavior, let's consider a well-known property of inner products
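
To make the inner-product view concrete, the sketch below compares the actual change in a hypothetical function $f(X) = x_1^2 + 3 x_1 x_2$ with the inner product of its (analytically derived) derivative and a small increment. The point and the increment are arbitrary choices.

```python
def f(X):
    x1, x2 = X
    return x1 ** 2 + 3 * x1 * x2

def derivative(X):
    """Analytic row vector of partial derivatives of the example f."""
    x1, x2 = X
    return [2 * x1 + 3 * x2, 3 * x1]

def inner(u, v):
    return sum(a * b for a, b in zip(u, v))

X = [1.0, 2.0]
dX = [0.001, -0.002]                       # a tiny increment to the input

actual_change = f([x + d for x, d in zip(X, dX)]) - f(X)
predicted_change = inner(derivative(X), dX)  # derivative times increment
print(actual_change, predicted_change)
```

The two values agree up to terms that are second order in the increment, which is why the approximation only holds for tiny variations.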
A well-known vector property

• The inner product between two vectors of fixed lengths is maximum when the two vectors are aligned
– $u^T v = |u| |v| \cos\theta$, which is largest when the angle between them is $\theta = 0$
Properties of Gradient
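[Figure: $\Delta f(X)$ vs. angle of $\Delta X$]

• $\Delta f(X) = \nabla_X f(X) \, \Delta X = |\nabla_X f(X)^T| \, |\Delta X| \cos\theta$
• For an increment $\Delta X$ of any given length, $\Delta f(X)$ is max if $\Delta X$ is aligned with $\nabla_X f(X)^T$

– The function f(X) increases most rapidly if the input increment $\Delta X$ is exactly in the direction of $\nabla_X f(X)^T$

• The gradient is the direction of fastest increase in f(X)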
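
One way to see this numerically is to take steps of the same small length in many directions and record which one increases the function most. The sketch assumes a hypothetical $f(X) = x_1^2 + 3 x_1 x_2$, a step length of 0.001, and a 1-degree sweep.

```python
import math

def f(X):
    x1, x2 = X
    return x1 ** 2 + 3 * x1 * x2

def gradient(X):
    x1, x2 = X
    return (2 * x1 + 3 * x2, 3 * x1)

X = (1.0, 2.0)
step = 1e-3  # every trial increment has this same length

def increase(angle_deg):
    """Change in f after a step of fixed length in the given direction."""
    t = math.radians(angle_deg)
    dX = (step * math.cos(t), step * math.sin(t))
    return f((X[0] + dX[0], X[1] + dX[1])) - f(X)

best_angle = max(range(360), key=increase)    # direction of largest increase
worst_angle = min(range(360), key=increase)   # direction of largest decrease

g = gradient(X)
gradient_angle = math.degrees(math.atan2(g[1], g[0]))
print(best_angle, worst_angle, gradient_angle)
```

Up to the resolution of the sweep, the best direction matches the gradient and the worst direction is its opposite.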

Gradient

[Figure: plot of f(X) with the gradient vector $\nabla_X f(X)^T$ drawn at a point]

• Moving in the direction of $\nabla_X f(X)^T$ increases f(X) fastest
• Moving in the direction of $-\nabla_X f(X)^T$ decreases f(X) fastest
• The figure also marks a point where the gradient is 0

Properties of Gradient: 2

• The gradient vector $\nabla_X f(X)^T$ is perpendicular to the level curve

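
A quick check of the perpendicularity property on an assumed example, $f(x, y) = x^2 + 4y^2$, whose level curves are ellipses that can be parameterized explicitly. The tangent to the level curve and the gradient should have a zero inner product everywhere on the curve.

```python
import math

def gradient(x, y):
    """Gradient of the example f(x, y) = x**2 + 4*y**2."""
    return (2 * x, 8 * y)

def level_curve_point(c, t):
    """Point on the level curve f(x, y) = c, parameterized by angle t."""
    r = math.sqrt(c)
    return (r * math.cos(t), 0.5 * r * math.sin(t))

def level_curve_tangent(c, t):
    """Derivative of level_curve_point w.r.t. t: a vector along the curve."""
    r = math.sqrt(c)
    return (-r * math.sin(t), 0.5 * r * math.cos(t))

dots = []
for k in range(12):
    t = 2 * math.pi * k / 12
    gx, gy = gradient(*level_curve_point(4.0, t))
    tx, ty = level_curve_tangent(4.0, t)
    dots.append(gx * tx + gy * ty)  # inner product of gradient and tangent

print(max(abs(d) for d in dots))
```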
The Hessian
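• The Hessian of a function $f(x_1, \ldots, x_n)$ is given by the second derivative

$$
\nabla_X^2 f(x_1, \ldots, x_n) :=
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}
$$
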
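
A finite-difference sketch of the Hessian, assuming a hypothetical $f(X) = x_1^2 + 3 x_1 x_2 + x_2^3$ and a step size of 0.001. Each entry is estimated from four evaluations of the function around the point of interest.

```python
def f(X):
    x1, x2 = X
    return x1 ** 2 + 3 * x1 * x2 + x2 ** 3

def shifted(X, i, di, j, dj):
    """Copy of X with component i shifted by di and component j shifted by dj."""
    Y = list(X)
    Y[i] += di
    Y[j] += dj
    return Y

def hessian(func, X, h=1e-3):
    """Matrix of second partial derivatives, estimated by central differences."""
    n = len(X)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            H[i][j] = (func(shifted(X, i, h, j, h))
                       - func(shifted(X, i, h, j, -h))
                       - func(shifted(X, i, -h, j, h))
                       + func(shifted(X, i, -h, j, -h))) / (4 * h * h)
    return H

H = hessian(f, [1.0, 2.0])
for row in H:
    print([round(v, 4) for v in row])  # analytically: [[2, 3], [3, 6*x2]] at x2 = 2
```

For smooth functions like this one, the estimated matrix comes out symmetric, since the order of the two partial derivatives does not matter.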
Next up
• Continuing on function optimization

• Gradient descent to train neural networks

• A.K.A. Back propagation

Poll 4
• Select all that are true about derivatives of a scalar function f(X) of multivariate inputs
– At any location X, there may be many directions in which we can step, such that f(X) increases
– The direction of the gradient is the direction in which the function increases fastest
– The gradient is the derivative of f(X) w.r.t. X