Intro To Neural Networks
• Neural networks.
• Training perceptrons.
• Gradient descent.
• Backpropagation.
The Perceptron
[Figure: inputs x1…xn, each multiplied by a weight w1…wn, summed together with a bias b to produce the output:  y = w1·x1 + … + wn·xn + b]
Another way to draw it…
weights
(1) Combine the sum and
activation function
inputs output
Activation Function
(e.g., Sigmoid function of weighted sum)
output
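A minimal sketch of this perceptron in NumPy, assuming a sigmoid activation over the weighted sum plus bias; the particular numbers are illustrative:

```python
import numpy as np

def sigmoid(z):
    # squashes the weighted sum into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    # weighted sum of the inputs plus the bias, passed through the activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([1.0, 2.0, 3.0])      # inputs
w = np.array([0.5, -0.25, 0.1])    # weights
b = 0.2                            # bias
print(perceptron(x, w, b))
```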
Neural Network
a collection of connected perceptrons
Connect a bunch of perceptrons together …
[Figure: build-up from one connected perceptron to six, forming a small network.]
Some terminology…
• ‘input’ layer
• ‘hidden’ layer
• ‘output’ layer
[Figure: the layers of the network labeled accordingly.]
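A sketch of how connected perceptrons stack into these layers, assuming sigmoid activations; the layer sizes and random weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # one layer of perceptrons: row i of W holds the weights of perceptron i
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = np.array([1.0, 2.0, 3.0])                    # 'input' layer: 3 values
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # 'hidden' layer: 4 perceptrons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # 'output' layer: 1 perceptron

h = layer(x, W1, b1)
y = layer(h, W2, b2)
print(y)
```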
Given example input–output pairs…
input    output
10       10.1
2        1.9
3.5      3.4
1        1.1
… and a perceptron
[Figure: the perceptron mapping each input to a predicted output]
Loss Function
defines what it means to be close to the true solution
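One common choice of loss function is the squared error between the prediction and the target; the slides do not commit to a particular loss, so this is an illustrative assumption:

```python
def squared_error(y_pred, y_true):
    # small when the prediction is close to the true solution, large when it is far
    return (y_pred - y_true) ** 2

# e.g., for the pair (input 10, target output 10.1) and a prediction of 9.8:
print(squared_error(9.8, 10.1))   # about 0.09
```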
Gradient descent:
given a point on a function, move in the direction opposite of the gradient.
update rule:   w ← w - η·(∂L/∂w)      (η is the step size)
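A sketch of the update rule in code, using a toy quadratic loss whose gradient is easy to write down; the function, starting point, and step size are illustrative:

```python
def gradient_descent(grad, w0, step_size=0.1, num_steps=50):
    # repeatedly apply the update rule: move opposite to the gradient
    w = w0
    for _ in range(num_steps):
        w = w - step_size * grad(w)
    return w

# example: minimize L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
print(gradient_descent(lambda w: 2 * (w - 3), w0=0.0))   # approaches 3
```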
Backpropagation
Training the world’s smallest perceptron:   y = w·x   (a single input and a single weight)
This is just gradient descent, that means the update rule is
   w ← w - η·(∂L/∂w)
Now where does ∂L/∂w come from? Compute the derivative of the loss with respect to w
(the ∂ notation is just shorthand for that derivative).
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation
b. Gradient update
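A sketch of this recipe applied to the world’s smallest perceptron y = w·x, assuming a squared-error loss; the training pair and step size are made up for illustration:

```python
# the world's smallest perceptron, y = w * x, trained with the recipe above
x, y_true = 2.0, 6.0        # a single training pair; the ideal weight here is 3
w = 0.0                     # initial weight
step_size = 0.05

for _ in range(100):
    # 1. Predict
    y_hat = w * x                       # a. forward pass
    loss = (y_hat - y_true) ** 2        # b. compute loss (squared error)
    # 2. Update
    dL_dw = 2 * (y_hat - y_true) * x    # a. back propagation: dL/dw = dL/dy^ * dy^/dw
    w = w - step_size * dL_dw           # b. gradient update
print(w)   # approaches 3
```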
world’s (second) smallest perceptron!
y = w1·x1 + w2·x2
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation
b. Gradient update
(we just need to compute the partial derivatives for this network)
Derivative computation
ŷ = w1·x1 + w2·x2
[Equations: the partial derivatives of the loss with respect to w1 and w2, derived from this expression.]
Gradient Update
[Equations: each weight updated by subtracting the step size times its partial derivative.]
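A sketch of the derivative computation and gradient update for this two-weight perceptron, again assuming a squared-error loss; the inputs, target, and step size are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0])        # inputs x1, x2
y_true = 5.0                    # target output
w = np.array([0.0, 0.0])        # weights w1, w2
step_size = 0.05

for _ in range(200):
    y_hat = w @ x                       # forward pass: y^ = w1*x1 + w2*x2
    dL_dy = 2 * (y_hat - y_true)        # derivative of the squared error w.r.t. y^
    dL_dw = dL_dy * x                   # chain rule: dL/dwi = dL/dy^ * xi
    w = w - step_size * dL_dw           # gradient update on both weights
print(w, w @ x)                         # w @ x approaches the target 5
```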
Gradient Descent
1. Predict
a. Forward pass
b. Compute Loss   (a side computation to track the loss; not needed for backprop)
2. Update
a. Back Propagation
b. Gradient update   (with an adjustable step size)
We haven’t seen a lot of ‘propagation’ yet
because our perceptrons only had one layer…
multi-layer perceptron
[Figure: a multi-layer perceptron. The entire network can be written out as one long equation; the inputs and target outputs are known, the weights are unknown.]
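A sketch of what “one long equation” looks like for a tiny multi-layer perceptron with sigmoid activations; the layer sizes and the random values standing in for the unknown weights are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# the whole multi-layer perceptron collapses into one nested expression:
#   y^ = sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)
def mlp(x, W1, b1, W2, b2):
    return sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0])                        # known inputs
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)    # unknown parameters to be learned
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
print(mlp(x, W1, b1, W2, b2))
```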
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation   (produces the vector of parameter partial derivatives)
b. Gradient update   (the vector of parameter update equations)
So we need to compute the partial derivatives of the loss with respect to each parameter.
Remember, a partial derivative describes how the loss changes when one parameter changes.
[Figure: the loss layer depends on the layer below it, which depends on the layer below that, and so on through the rest of the network.]
Chain Rule!   (a.k.a. backpropagation)
The chain rule says: multiply the derivatives along the chain of dependencies. The factors nearer the loss are already computed, so re-use (propagate) them backwards through the network.
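A sketch of the chain rule at work in a two-layer network with sigmoid hidden units and a squared-error loss; the sizes, the loss choice, and all values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, y_true = np.array([0.5, -1.0]), 1.0
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden layer
w2, b2 = rng.normal(size=3), 0.0                # output layer (scalar output)

# forward pass, keeping intermediate values for the backward pass
h = sigmoid(W1 @ x + b1)              # hidden activations
y_hat = w2 @ h + b2                   # network output
loss = (y_hat - y_true) ** 2

# backward pass: apply the chain rule from the loss layer backwards
dL_dy = 2 * (y_hat - y_true)          # loss layer
dL_dw2 = dL_dy * h                    # output-layer weights
dL_db2 = dL_dy
dL_dh = dL_dy * w2                    # re-use dL_dy and propagate to the hidden layer
dL_dz1 = dL_dh * h * (1 - h)          # through the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
dL_dW1 = np.outer(dL_dz1, x)          # hidden-layer weights
dL_db1 = dL_dz1
print(dL_dW1, dL_dw2)
```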
Gradient Descent
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation   (vector of parameter partial derivatives)
b. Gradient update   (vector of parameter update equations: each parameter minus the step size times its partial derivative)
Stochastic gradient descent
What we are truly minimizing: the average loss over all N training samples,
   L = (1/N) · Σ_{i=1..N} L_i
but computing the full gradient over all N samples for every update is expensive, so each update uses the gradient of the loss on a single sample.
1. Predict
a. Forward pass
b. Compute Loss
2. Update
a. Back Propagation   (vector of parameter partial derivatives, now computed on a single sample)
b. Gradient update   (vector of parameter update equations)
How do we select which sample?
• Select randomly!
• (Cycling through the samples in a fixed order tends to give bad convergence.)
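A sketch of stochastic gradient descent with random sample selection, reusing the world’s smallest perceptron and the earlier input/output pairs; the step size and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# the example input/output pairs from earlier in the slides
xs = np.array([10.0, 2.0, 3.5, 1.0])
ys = np.array([10.1, 1.9, 3.4, 1.1])

w, step_size = 0.0, 0.003               # world's smallest perceptron again: y = w * x

for _ in range(2000):
    i = rng.integers(len(xs))           # select a sample randomly
    x, y_true = xs[i], ys[i]
    y_hat = w * x                       # forward pass on that single sample
    dL_dw = 2 * (y_hat - y_true) * x    # backprop for the single-sample squared error
    w = w - step_size * dL_dw           # gradient update
print(w)   # hovers near 1, the best single weight for these pairs
```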