
Introduction to neural networks

Logistics

• Check Canvas (announcements)

• Lecture slides in Canvas Files.

• Lab 1 is out. Check it early if you are worried about the prerequisites.

• Office hours info TBA.


Overview of today’s lecture
• Perceptron.

• Neural networks.

• Training perceptrons.

• Gradient descent.

• Backpropagation.

• Stochastic gradient descent.


Perceptron
1950s Age of the Perceptron
1957 The Perceptron (Rosenblatt)
1969 Perceptrons (Minsky, Papert)

1980s Age of the Neural Network


1986 Backpropagation (Rumelhart, Hinton & Williams)

1990s Age of the Graphical Model


2000s Age of the Support Vector Machine

2010s Age of the Deep Network

deep learning = known algorithms + computing power + big data


The Perceptron

inputs → weights → sum → sign function (Heaviside step function) → output

y = sign( Σᵢ wᵢ xᵢ + b )

The bias b can be folded in as an extra weight on a constant input of 1. From here on we suppress the bias term (less clutter):

y = sign( Σᵢ wᵢ xᵢ )
Aside: Inspiration from Biology

Neural nets/perceptrons are loosely inspired by biology.
But they certainly are not a model of how the brain works, or even how neurons work.
The Perceptron

inputs → weights → sum → activation function (e.g., step, sigmoid, tanh, ReLU) → output, plus a bias

Another way to draw it…

(1) Combine the sum and activation function into a single node: inputs → weights → activation function (e.g., sigmoid of the weighted sum) → output

(2) suppress the bias term (less clutter)
Programming the 'forward pass'

Activation function (sigmoid, logistic function):

#include <cmath>
#include <vector>
using std::vector;

// Sigmoid: squashes the pre-activation into (0, 1)
float f(float a)
{
    return 1.0f / (1.0f + std::exp(-a));
}

// Weighted sum of inputs
float dot(const vector<float>& x, const vector<float>& w)
{
    float s = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) s += x[i] * w[i];
    return s;
}

Perceptron function (logistic regression):

float perceptron(const vector<float>& x, const vector<float>& w)
{
    float a = dot(x, w);   // weighted sum (pre-activation)
    return f(a);           // activation -> output
}
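A quick usage sketch (the values are illustrative, not from the slides; the bias is folded in as an extra weight on a constant input of 1):

vector<float> x = {1.0f, 2.0f, 1.0f};   // last entry: constant bias input
vector<float> w = {0.5f, -0.3f, 0.1f};  // last entry: bias weight
float y = perceptron(x, w);             // sigmoid(0.5 - 0.6 + 0.1) = sigmoid(0) = 0.5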
Connect a bunch of perceptrons together …

Neural Network
a collection of connected perceptrons

How many perceptrons in this neural network?

(counting them off in the figure one at a time: 'one perceptron' … 'six perceptrons')
Some terminology…

• 'input' layer
• 'hidden' layer
• 'output' layer

…also called a Multi-layer Perceptron (MLP)

this layer is a 'fully connected layer':
every neuron in one layer is connected to every neuron in the next

(so is the next layer)
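A fully connected layer in code might look like the sketch below (an illustration, not the slides' own code; it reuses f, dot, and perceptron from the forward-pass slide, with one weight vector per output neuron):

// One fully connected layer: every output neuron sees all inputs.
vector<float> fully_connected(const vector<float>& x,
                              const vector<vector<float>>& W)
{
    vector<float> out;
    out.reserve(W.size());
    for (const auto& w_neuron : W)               // one weight vector per neuron
        out.push_back(perceptron(x, w_neuron));  // weighted sum + activation
    return out;
}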


How many neurons (perceptrons)? 4 + 2 = 6

How many weights (edges)? (3 × 4) + (4 × 2) = 20

How many learnable parameters total? 20 + 4 + 2 = 26
(the extra 4 + 2 are the bias terms)
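As a sanity check, here is a small helper (mine, not the slides') that counts weights and biases for fully connected layers given the layer sizes; {3, 4, 2} reproduces the numbers above:

#include <cstdio>
#include <vector>

// sizes = neurons per layer, input layer first.
int count_params(const std::vector<int>& sizes)
{
    int total = 0;
    for (size_t i = 1; i < sizes.size(); ++i)
        total += sizes[i - 1] * sizes[i]   // weights into layer i
               + sizes[i];                 // one bias per neuron
    return total;
}

int main()
{
    std::printf("%d\n", count_params({3, 4, 2}));  // prints 26
    return 0;
}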
performance usually tops out at 2-3 layers;
deeper networks don't really improve performance...

...with the exception of convolutional networks for images


Training perceptrons
world's smallest perceptron!

y = w x    (a.k.a. line equation, linear regression)


Learning a Perceptron

Given a set of samples and a Perceptron

Estimate the parameters of the Perceptron


Given training data:

x      y
10     10.1
2      1.9
3.5    3.4
1      1.1

What do you think the weight parameter is?

(here it is clearly w ≈ 1, but it is not so obvious as the network gets more complicated, so we use …)


An Incremental Learning Strategy
(gradient descent)

Given several examples (x, y) and a perceptron,

modify the weight (the perceptron parameter) such that the perceptron output ŷ gets 'closer' to the true label y.

What does 'closer' mean?
Before diving into gradient descent, we need to understand …

Loss Function
defines what it means to be close to the true solution

YOU get to choose the loss function!
(some are better than others depending on what you want to do)

L1 Loss        L2 Loss
Zero-One Loss  Hinge Loss
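The formulas themselves lived in the figure; the usual definitions (standard forms, not copied from the slide) are:

L1(y, ŷ) = |y − ŷ|
L2(y, ŷ) = (y − ŷ)²
L0/1(y, ŷ) = 1 if y ≠ ŷ, else 0
Lhinge(y, ŷ) = max(0, 1 − y·ŷ)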


Learning Strategy
(gradient descent)

As before: given several examples (x, y) and a perceptron, modify the weight so the perceptron output ŷ gets 'closer' to the true label y, where 'closer' is now measured by the loss function.
Gradient descent
Two ways to think about the gradient:

1. Slope of a function: the gradient describes the slope around a point.

2. Knobs on a machine (input → machine → output): the gradient describes how each 'knob' (parameter) affects the output; a small change in a parameter changes the output by roughly the gradient times that change.
Gradient descent:

Given a fixed point on a function, move in the direction opposite of the gradient.

update rule: w ← w − η ∂L/∂w     (η is the step size, a.k.a. learning rate)
Backpropagation
Training the world's smallest perceptron

This is just gradient descent; that means the update term should be the gradient of the loss function:

∂L/∂w

Now where does this come from? Compute the derivative. For the L2 loss L = ½(y − ŷ)² with ŷ = w x (the ½ is just shorthand so the 2 cancels):

∂L/∂w = −(y − ŷ) x

That means the weight update for gradient descent is:

w ← w + η (y − ŷ) x     (move in direction of negative gradient)
Gradient Descent (world's smallest perceptron)

For each sample

1. Predict
   a. Forward pass
   b. Compute Loss

2. Update
   a. Back Propagation
   b. Gradient update
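A minimal sketch of this loop in code (my illustration, assuming the L2 loss, ŷ = w·x, a fixed step size η, and the toy data from earlier):

#include <cstdio>

int main()
{
    // (x, y) pairs from the earlier slide
    float xs[] = {10.0f, 2.0f, 3.5f, 1.0f};
    float ys[] = {10.1f, 1.9f, 3.4f, 1.1f};

    float w   = 0.0f;    // initial weight
    float eta = 0.01f;   // step size (learning rate)

    for (int epoch = 0; epoch < 100; ++epoch) {
        for (int i = 0; i < 4; ++i) {
            float y_hat = w * xs[i];                 // 1a. forward pass
            // 1b. loss = 0.5*(y - y_hat)^2, tracked only for monitoring
            float grad  = -(ys[i] - y_hat) * xs[i];  // 2a. backprop: dL/dw
            w -= eta * grad;                         // 2b. gradient update
        }
    }
    std::printf("learned w = %f\n", w);  // approaches ~1.0
    return 0;
}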
world's (second) smallest perceptron!

ŷ = w₁ x₁ + w₂ x₂

function of two parameters!
Gradient Descent

For each sample

1. Predict
   a. Forward pass
   b. Compute Loss

2. Update
   a. Back Propagation
   b. Gradient update

(we just need to compute the partial derivatives for this two-parameter network)
Derivative computation

ŷ = w₁ x₁ + w₂ x₂, so with the same L2 loss there is one partial derivative per weight:

∂L/∂w₁ = −(y − ŷ) x₁        ∂L/∂w₂ = −(y − ŷ) x₂

Gradient Update

w₁ ← w₁ + η (y − ŷ) x₁       w₂ ← w₂ + η (y − ŷ) x₂
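In code, the two-parameter step is just the one-parameter step applied per weight (a sketch under the same assumptions as before):

// One gradient step for y_hat = w1*x1 + w2*x2 with L2 loss.
void gd_step(float& w1, float& w2,
             float x1, float x2, float y, float eta)
{
    float y_hat = w1 * x1 + w2 * x2;   // forward pass
    float err   = y - y_hat;           // (y - y_hat)
    w1 += eta * err * x1;              // w1 <- w1 + eta*(y - y_hat)*x1
    w2 += eta * err * x2;              // w2 <- w2 + eta*(y - y_hat)*x2
}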
Gradient Descent

For each sample

1. Predict
   a. Forward pass
   b. Compute Loss (side computation to track loss; not needed for backprop)

2. Update
   a. Back Propagation (two lines now, one partial derivative per weight)
   b. Gradient update (with adjustable step size η)
We haven’t seen a lot of ‘propagation’ yet
because our perceptrons only had one layer…
multi-layer perceptron

function of FOUR parameters and FOUR layers!

(figure: input layer 1 → weight, sum, activation → hidden layer 2 → weight, activation → hidden layer 3 → weight, activation → output layer 4)
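A forward pass through a chain like this might look as follows (a hypothetical single-neuron-per-layer sketch reusing f from earlier; the figure's exact wiring may differ):

// Chain of layers: each stage multiplies by its weight,
// then applies the sigmoid activation f.
float mlp_forward(float x, const vector<float>& w)
{
    float h = x;
    for (float wi : w)     // one weight per layer transition
        h = f(wi * h);     // weight -> sum -> activation
    return h;
}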

Entire network can be written out as one long equation: ŷ is a nested composition of weights and activations applied to x.

We need to train the network: What is known? What is unknown?

Known: the training data (the inputs x and the true labels y).
Unknown: the weights.
Learning an MLP

Given a set of samples and an MLP,

estimate the parameters of the MLP.

Gradient Descent

For each random sample

1. Predict
   a. Forward pass
   b. Compute Loss

2. Update
   a. Back Propagation → vector of parameter partial derivatives
   b. Gradient update → vector of parameter update equations
So we need to compute the partial derivatives of the loss with respect to each weight.

Remember, a partial derivative describes how the output of the loss layer changes as one parameter changes.

So, how do you compute it?
The Chain Rule
According to the chain rule…

Intuitively, the effect of weight w on the loss function L passes through the rest of the network: L depends on the prediction ŷ, ŷ depends on the activation, and the activation depends on w.

Chain Rule! Multiply the partial derivatives along the path through the rest of the network:

• the first factor is just the partial derivative of the L2 loss: −(y − ŷ)
• let's use a Sigmoid activation; its derivative is σ'(a) = σ(a)(1 − σ(a))
• the remaining factors were already computed for the layers above: re-use (propagate)!
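Written out for one weight (a sketch; the symbol a for the pre-activation a = w x is mine, not the slide's):

∂L/∂w = (∂L/∂ŷ) · (∂ŷ/∂a) · (∂a/∂w)
      = −(y − ŷ) · σ(a)(1 − σ(a)) · x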
The Chain Rule
a.k.a. backpropagation

The chain rule says: the loss depends on the output layer, which depends on the layer before it, and so on back through every weight to the input. Working backwards, each layer's partial derivatives re-use factors already computed for the layers after it; re-use (propagate)! That re-use is the 'propagation' in backpropagation.
Gradient Descent

For each example sample

1. Predict
   a. Forward pass
   b. Compute Loss

2. Update
   a. Back Propagation → vector of parameter partial derivatives
   b. Gradient update → θ ← θ − η ∂L/∂θ (vector of parameter update equations)
Stochastic gradient descent

What we are truly minimizing:

min_θ  Σ_{i=1..N} L(yᵢ, f_MLP(xᵢ))

The gradient is:

Σ_{i=1..N} ∂L(yᵢ, f_MLP(xᵢ)) / ∂θ

What we use for the gradient update is:

∂L(yᵢ, f_MLP(xᵢ)) / ∂θ      for some i
Stochastic Gradient Descent

For each randomly selected sample i

1. Predict
   a. Forward pass
   b. Compute Loss

2. Update
   a. Back Propagation → vector of parameter partial derivatives
   b. Gradient update → θ ← θ − η ∂L(yᵢ, f_MLP(xᵢ))/∂θ (vector of parameter update equations)
How do we select which sample?

• Select randomly!

Do we need to use only one sample?

• You can use a minibatch of size B < N (sampled as in the sketch below).

Why not do gradient descent with all samples?

• Bad convergence.

Do I lose anything by using stochastic GD?

• Same convergence guarantees and complexity!
• Better generalization.
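One way to draw a random minibatch (an illustrative sketch; the function name and the use of std::shuffle are my choices, not the slides'):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Pick a random minibatch of B indices out of N samples (B < N).
std::vector<int> sample_minibatch(int N, int B, std::mt19937& rng)
{
    std::vector<int> idx(N);
    std::iota(idx.begin(), idx.end(), 0);       // 0, 1, ..., N-1
    std::shuffle(idx.begin(), idx.end(), rng);  // random order
    idx.resize(B);                              // keep the first B
    return idx;
}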
References

Basic reading: No standard textbooks yet! Some good resources:


• https://sites.google.com/site/deeplearningsummerschool/
• http://www.deeplearningbook.org/
• http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf
