
Introduction to

Machine Learning
Dr. Muhammad Amjad Iqbal
Associate Professor
University of Central Punjab, Lahore.
[email protected]

https://sites.google.com/a/ucp.edu.pk/mai/iml/
Slides adapted from Prof. Dr. Andrew Ng (Stanford) & Dr. Humayoun
Neural Networks: Learning
Cost function
• NNs are one of the most powerful learning algorithms we have
• We will study a learning algorithm for fitting the parameters of a
  neural network given a training set
• First things first: the neural network cost function
• We focus on the application of NNs to classification problems
Neural Network (Classification)
L = total no. of layers in network
s_l = no. of units (not counting bias unit) in layer l
Example (Layer 1, Layer 2, Layer 3, Layer 4): s_1 = 3, s_2 = 5, s_4 = s_L = 4

Binary classification:
y ∈ {0, 1}
1 output unit: s_L = 1, K = 1

Multi-class classification (K classes):
y ∈ R^K, e.g. [1 0 0 0]ᵀ, [0 1 0 0]ᵀ, [0 0 1 0]ᵀ, [0 0 0 1]ᵀ for
pedestrian, car, motorcycle, truck
K output units: s_L = K
Cost function: Generalization of Logistic regression
Logistic regression:
J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log h_θ(x^(i)) + (1 − y^(i)) log(1 − h_θ(x^(i))) ] + (λ/2m) Σ_{j=1}^{n} θ_j²

Neural network:
h_Θ(x) ∈ R^K,  (h_Θ(x))_k = k-th output
J(Θ) = −(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log (h_Θ(x^(i)))_k + (1 − y_k^(i)) log(1 − (h_Θ(x^(i)))_k) ] + (λ/2m) Σ_{l=1}^{L−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_ji^(l))²
Example: multi-class classification of handwritten digits ("one", "two",
"three", "four", …), with K = 10 classes.
If x^(i) is an image of the digit 5, then the corresponding y^(i) (the one you
should use with the cost function) should be a 10-dimensional vector with
y_5 = 1 and the other elements equal to 0.
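As an illustration (not part of the original slides), a minimal NumPy sketch of
building such a label vector; the function name one_hot and the 1-indexed
label convention are assumptions for this example.

import numpy as np

def one_hot(label, num_classes=10):
    """Return a one-hot vector y for a 1-indexed class label."""
    y = np.zeros(num_classes)
    y[label - 1] = 1.0  # digit 5 -> index 4 -> y_5 = 1 in 1-based notation
    return y

print(one_hot(5))  # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]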
Cost function: Generalization of Logistic regression
Note: as in regularized logistic regression, we don't include the bias units'
parameters (the i = 0 terms, i.e. Θ_j0^(l)) in the regularization sum.


Regularization term (for the 4-layer network: Layer 1, Layer 2, Layer 3, Layer 4):
(λ/2m) Σ_{l=1}^{L−1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_ji^(l))²
where s_l = no. of units (not counting bias unit) in layer l.
This is also called a weight decay term.
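A minimal NumPy sketch of this cost function, assuming a network with one
hidden layer and a one-hot label matrix Y; the names nn_cost and sigmoid are
mine, not from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_cost(Theta1, Theta2, X, Y, lam):
    """Regularized NN cost for one hidden layer.
    X: (m, n) inputs, Y: (m, K) one-hot labels,
    Theta1: (s2, n+1), Theta2: (K, s2+1)."""
    m = X.shape[0]
    A1 = np.hstack([np.ones((m, 1)), X])          # add bias column
    A2 = sigmoid(A1 @ Theta1.T)
    A2 = np.hstack([np.ones((m, 1)), A2])
    H = sigmoid(A2 @ Theta2.T)                    # (m, K) hypotheses
    # Cross-entropy summed over all examples and all K outputs
    J = -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
    # Weight decay term: skip the bias columns (first column of each Theta)
    reg = (lam / (2 * m)) * (np.sum(Theta1[:, 1:] ** 2) + np.sum(Theta2[:, 1:] ** 2))
    return J + reg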
Neural Networks: Learning
Backpropagation algorithm
For minimization of the cost function J(Θ)
Gradient computation
Need code to compute:
- J(Θ)
- ∂/∂Θ_ij^(l) J(Θ)
Gradient computation
Given one training example (x, y):
Forward propagation (Layer 1, Layer 2, Layer 3, Layer 4):
a^(1) = x
z^(2) = Θ^(1) a^(1),   a^(2) = g(z^(2))   (add a_0^(2))
z^(3) = Θ^(2) a^(2),   a^(3) = g(z^(3))   (add a_0^(3))
z^(4) = Θ^(3) a^(3),   a^(4) = h_Θ(x) = g(z^(4))
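A sketch of this forward pass in NumPy, assuming the sigmoid activation from
the slides; forward_propagate and the list-of-matrices argument are my naming
choices for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Forward pass for a single example x (1-D array).
    Thetas = [Theta1, Theta2, Theta3] for the 4-layer network above.
    Returns the activations a^(1), ..., a^(L), with a bias unit prepended to
    every layer except the output."""
    a = np.concatenate([[1.0], x])         # a^(1) with bias
    activations = [a]
    for l, Theta in enumerate(Thetas):
        z = Theta @ a                      # z^(l+1) = Θ^(l) a^(l)
        a = sigmoid(z)                     # a^(l+1) = g(z^(l+1))
        if l < len(Thetas) - 1:
            a = np.concatenate([[1.0], a]) # add bias unit a_0^(l+1)
        activations.append(a)
    return activations                     # last element is h_Θ(x)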
Gradient computation: Backpropagation algorithm
• We have already studied forward propagation
• It takes the input to the neural network and pushes it forward through
  the network
• This produces the output hypothesis h_Θ(x), which may be a single real
  number (1 output unit) but can also be a K-dimensional vector (K output units)

Now: the backpropagation algorithm
Gradient computation: Backpropagation algorithm
Given one training example (x, y):
Intuition: δ_j^(l) = "error" of node j in layer l   (Layer 1, Layer 2, Layer 3, Layer 4)
• Backpropagation takes the output your network produced, a_j^(L)
• Compares it to the real value y_j and calculates how wrong the network was
  (i.e. how wrong the parameters were):
  δ_j^(L) = a_j^(L) − y_j   (for the output layer)
Gradient computation: Backpropagation algorithm
Intuition: δ_j^(l) = "error" of node j in layer l.
For each output unit (layer L = 4):
δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))ᵀ δ^(4) .* g′(z^(3))
δ^(2) = (Θ^(2))ᵀ δ^(3) .* g′(z^(2))
(all vectors; there is no δ^(1) term)
where g′(z^(l)) = a^(l) .* (1 − a^(l)) is the derivative of the activation
function g (sigmoid).
Gradient computation: Backpropagation algorithm (updates)
δ^(2) = (Θ^(2))ᵀ δ^(3) .* (a^(2) .* (1 − a^(2)))
Ignoring regularization (λ = 0):
∂/∂Θ_ij^(l) J(Θ) = a_j^(l) δ_i^(l+1)
Mathematically, it is possible to prove that all these partial derivatives are
exactly given by the above formula: activations * deltas.
Backpropagation algorithm
Training set {(x^(1), y^(1)), …, (x^(m), y^(m))}
Set Δ_ij^(l) = 0 (for all l, i, j).   (Δ acts as an accumulator)
For i = 1 to m
  Set a^(1) = x^(i)
  Perform forward propagation to compute a^(l) for l = 2, 3, …, L
  Using y^(i), compute δ^(L) = a^(L) − y^(i)
  Compute δ^(L−1), δ^(L−2), …, δ^(2)
  Δ_ij^(l) := Δ_ij^(l) + a_j^(l) δ_i^(l+1)
D_ij^(l) := (1/m) Δ_ij^(l) + λ Θ_ij^(l)   if j ≠ 0
D_ij^(l) := (1/m) Δ_ij^(l)                if j = 0
Then ∂/∂Θ_ij^(l) J(Θ) = D_ij^(l)
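A sketch of this accumulator loop in NumPy for a 3-layer network (one hidden
layer); backprop_gradients is my name, and the example-by-example loop mirrors
the slide rather than a fully vectorized implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_gradients(Theta1, Theta2, X, Y, lam):
    """Gradients D1, D2 of the regularized cost for a 3-layer network.
    X: (m, n) inputs, Y: (m, K) one-hot labels."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for i in range(m):
        # Forward propagation for example i
        a1 = np.concatenate([[1.0], X[i]])         # a^(1) with bias
        z2 = Theta1 @ a1
        a2 = np.concatenate([[1.0], sigmoid(z2)])  # a^(2) with bias
        a3 = sigmoid(Theta2 @ a2)                  # a^(3) = h_Θ(x^(i))
        # Backward pass
        delta3 = a3 - Y[i]                                               # δ^(3)
        delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1 - sigmoid(z2))  # δ^(2), bias dropped
        # Accumulate: Δ^(l) += δ^(l+1) (a^(l))ᵀ
        Delta2 += np.outer(delta3, a2)
        Delta1 += np.outer(delta2, a1)
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]  # regularize, skipping bias column
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2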
• We have now calculated the partial derivative for every parameter
• We can use this "gradient" in gradient descent or in one of the advanced
  optimization algorithms

• Backpropagation involves a lot of details and is probably less
  mathematically clean than the algorithms we have seen so far
• If it is not yet clear, you will understand it by doing programming
  exercise 4.
Neural Networks: Learning
Backpropagation intuition
Forward Propagation
For a single example (x^(i), y^(i)):
z_1^(2) → a_1^(2),   z_1^(3) → a_1^(3),   z_1^(4) → a_1^(4)
z_2^(2) → a_2^(2),   z_2^(3) → a_2^(3)
For example, the weighted input to the first unit of layer 3 is
z_1^(3) = Θ_10^(2) × 1 + Θ_11^(2) × a_1^(2) + Θ_12^(2) × a_2^(2)
What is backpropagation doing? (When K = 1)
Focusing on a single example (x^(i), y^(i)), the case of one output unit,
and ignoring regularization (λ = 0):
cost(i) = y^(i) log h_Θ(x^(i)) + (1 − y^(i)) log(1 − h_Θ(x^(i)))
(Think of cost(i) ≈ (h_Θ(x^(i)) − y^(i))², i.e. how well is the network doing
on example i?)
Forward Propagation
δ_j^(l) = "error" of cost for a_j^(l) (unit j in layer l).
Formally, δ_j^(l) = ∂/∂z_j^(l) cost(i)   (for j ≥ 0), where
cost(i) = y^(i) log h_Θ(x^(i)) + (1 − y^(i)) log(1 − h_Θ(x^(i)))
Neural Networks: Learning
Implementation note: Unrolling parameters
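The detail slides for this section did not survive extraction. A minimal
sketch of the idea, assuming NumPy: advanced optimizers expect a single
parameter vector, so the weight matrices are flattened ("unrolled") before
optimization and reshaped back inside the cost function. The function names
and example shapes are illustrative assumptions.

import numpy as np

def unroll(Thetas):
    """Flatten a list of weight matrices into one long parameter vector."""
    return np.concatenate([Theta.ravel() for Theta in Thetas])

def reshape_params(theta_vec, shapes):
    """Recover the weight matrices from the unrolled vector.
    shapes, e.g. [(25, 401), (10, 26)], is an assumed architecture."""
    Thetas, start = [], 0
    for rows, cols in shapes:
        size = rows * cols
        Thetas.append(theta_vec[start:start + size].reshape(rows, cols))
        start += size
    return Thetas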
Neural Networks: Learning
Gradient checking
Motivation
• Backpropagation has a lot of details
  – Small bugs may get in and ruin it
• It may look like J(Θ) is decreasing, but in reality it may not be
  decreasing by as much as it should
• Gradient checking helps to make sure that an implementation is working
  correctly
Numerical estimation of gradients
Two-sided difference:  d/dθ J(θ) ≈ (J(θ + ε) − J(θ − ε)) / (2ε)
One-sided difference:  d/dθ J(θ) ≈ (J(θ + ε) − J(θ)) / ε   (less accurate)
Implement:
gradApprox = (J(theta + EPSILON) - J(theta - EPSILON)) / (2*EPSILON)
Parameter vector θ
(E.g. θ ∈ R^n is the "unrolled" version of Θ^(1), Θ^(2), Θ^(3))
∂/∂θ_1 J(θ) ≈ (J(θ_1 + ε, θ_2, …, θ_n) − J(θ_1 − ε, θ_2, …, θ_n)) / (2ε)
⋮
∂/∂θ_n J(θ) ≈ (J(θ_1, θ_2, …, θ_n + ε) − J(θ_1, θ_2, …, θ_n − ε)) / (2ε)
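A sketch of this component-wise check in NumPy; numerical_gradient and
cost_func are names I chose, and the tolerance in the usage comment is an
assumption rather than a prescribed value.

import numpy as np

def numerical_gradient(cost_func, theta, epsilon=1e-4):
    """Approximate the gradient of cost_func at theta with a two-sided
    difference, one component at a time."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy();  theta_plus[i] += epsilon
        theta_minus = theta.copy(); theta_minus[i] -= epsilon
        grad_approx[i] = (cost_func(theta_plus) - cost_func(theta_minus)) / (2 * epsilon)
    return grad_approx

# Usage idea: compare against the unrolled backprop gradient DVec
# assert np.allclose(numerical_gradient(J, thetaVec), DVec, atol=1e-7)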
Implementation Note:
- Implement backprop to compute DVec (the unrolled D^(1), D^(2), D^(3)).
- Implement a numerical gradient check to compute gradApprox.
- Make sure they give similar values.
- Turn off gradient checking. Use the backprop code for learning.

Important:
- Be sure to disable your gradient checking code before training your
  classifier. If you run the numerical gradient computation on every
  iteration of gradient descent (or in the inner loop of costFunction(…)),
  your code will be very slow.
Neural Networks: Learning
Random initialization
Zero initialization → all hidden units compute the same thing (highly
redundant features)
If we set every Θ_ij^(l) = 0, then in this network:
a_1^(2) = a_2^(2)   and   δ_1^(2) = δ_2^(2)
so the partial derivatives match as well:
∂/∂Θ_10^(1) J(Θ) = ∂/∂Θ_20^(1) J(Θ),   and also Θ_10^(1) = Θ_20^(1)
After each update, the parameters corresponding to inputs going into each of
the two hidden units are identical.
To break this symmetry, initialize each Θ_ij^(l) to a small random value in
[−ε_init, ε_init].
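A small sketch of this symmetry-breaking initialization in NumPy; the function
name and the value ε_init = 0.12 are assumptions (a common heuristic), not
values fixed by the slides.

import numpy as np

def rand_initialize_weights(l_in, l_out, epsilon_init=0.12):
    """Random weights for a layer with l_in inputs and l_out outputs,
    drawn uniformly from [-epsilon_init, epsilon_init] to break symmetry.
    The +1 column accounts for the bias unit."""
    return np.random.rand(l_out, l_in + 1) * 2 * epsilon_init - epsilon_init

Theta1 = rand_initialize_weights(400, 25)  # e.g. 400 inputs -> 25 hidden units
Theta2 = rand_initialize_weights(25, 10)   # 25 hidden units -> 10 outputs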
Neural Networks: Learning
Putting it together
Training a neural network
Pick a network architecture (connectivity pattern between neurons)
• No. of input units: dimension of the features x^(i)
• No. of output units: number of classes
• Reasonable default: 1 hidden layer, or if >1 hidden layer, use the same
  number of hidden units in every layer (usually the more the better, but
  more units are computationally expensive)
• The number of units in a hidden layer should be comparable to the number
  of input features, e.g. 1.5× or 2× the number of input features
Training a neural network
1. Randomly initialize the weights
2. Implement forward propagation to get h_Θ(x^(i)) for any x^(i)
3. Implement code to compute the cost function J(Θ)
4. Implement backprop to compute the partial derivatives ∂/∂Θ_jk^(l) J(Θ)
   for i = 1:m
     Perform forward propagation and backpropagation using example (x^(i), y^(i))
     (Get activations a^(l) and delta terms δ^(l) for l = 2, …, L).
Training a neural network
5. Use gradient checking to compare ∂/∂Θ_jk^(l) J(Θ) computed using
   backpropagation vs. the numerical estimate of the gradient of J(Θ).
   Then disable the gradient checking code.
6. Use gradient descent or an advanced optimization method with
   backpropagation to try to minimize J(Θ) as a function of the parameters Θ.

• In theory, gradient descent can get stuck in local optima, since J(Θ) is non-convex
• In practice, this is not usually a huge problem
• Even if we don't find the global optimum, gradient descent will do a decent job
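A minimal sketch that ties the earlier sketches together with plain gradient
descent (one of the simpler options the slides mention); it reuses the
illustrative functions rand_initialize_weights, backprop_gradients and
nn_cost defined above, and the learning rate, iteration count and λ are
assumptions, not values from the slides.

# X, Y: training inputs (m, 400) and one-hot labels (m, 10), assumed loaded.
alpha, num_iters, lam = 0.5, 400, 1.0
Theta1 = rand_initialize_weights(400, 25)   # step 1: random initialization
Theta2 = rand_initialize_weights(25, 10)
for it in range(num_iters):
    D1, D2 = backprop_gradients(Theta1, Theta2, X, Y, lam)  # steps 2-4
    Theta1 -= alpha * D1                                    # gradient descent update
    Theta2 -= alpha * D2
    if it % 50 == 0:
        print(it, nn_cost(Theta1, Theta2, X, Y, lam))       # J(Θ) should decrease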
END
