
Part 1: Neural Networks - Representation

Neural networks - Overview and summary

Why do we need neural networks?

 Say we have a complex supervised learning classification problem



o Can use logistic regression with many polynomial terms
o Works well when you have 1-2 features
o If you have 100 features

 e.g. our housing example


o 100 house features, predict odds of a house being sold in the next 6 months
o Here, if you included all the quadratic terms (second order)
 There are lots of them (x1², x1x2, x1x3, ..., x1x100)
 For the case of n = 100, you have about 5000 features
 The number of features grows as O(n²)
 This would be computationally expensive to work with as a feature set
 A way around this is to include only a subset of the features
o However, if you don't have enough features, often a model won't let you fit a complex dataset
 If you include the cubic terms
o e.g. (x1²x2, x1x2x3, x1x4x23, etc.)
o There are even more features - the count grows as O(n³)
o About 170,000 features for n = 100
 Not a good way to build classifiers when n is large
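As a quick sanity check on the counts quoted above, here's a small Octave sketch (nchoosek(n, k) is the binomial coefficient; the number of distinct degree-d monomials in n variables is nchoosek(n + d - 1, d)):

n = 100;
num_quadratic = nchoosek(n + 1, 2)   % 5050   - "about 5000"
num_cubic = nchoosek(n + 2, 3)       % 171700 - "about 170,000"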

Example: Problems where n is large - computer vision

 Computer vision sees a matrix of pixel intensity values


o Looking at the matrix, each number represents the intensity of one pixel
 To build a car detector
o Build a training set of
 Not cars
 Cars
o Then test against a car
 How can we do this?

o Plot two pixels (two pixel locations)
o Plot car or not car on the graph
 Need a non-linear hypothesis to separate the classes
 Feature space
o If we used 50 x 50 pixel images --> 2500 pixels, so n = 2500
o If RGB then n = 7500
o If 100 x 100 pixel images then, including all the quadratic terms --> about 50,000,000 features
 Too big - wayyy too big

o So - simple logistic regression here is not appropriate for large complex systems
o Neural networks are much better for a complex nonlinear hypothesis even when feature space is
huge

Neurons and the brain

 Neural networks (NNs) were originally motivated by looking at machines which replicate the brain's
functionality
o Looked at here as a machine learning technique
 Origins

o To build learning systems, why not mimic the brain?
o Used a lot in the 80s and 90s
o Popularity diminished in late 90s
o Recent major resurgence
 NNs are computationally expensive, so only recently have large scale neural networks become computationally feasible
 Brain

o Does loads of crazy things
 Hypothesis is that the brain has a single learning algorithm
o Evidence for hypothesis
 Auditory cortex --> takes sound signals
 If you cut the wiring from the ear to the auditory cortex
 Re-route optic nerve to the auditory cortex
 Auditory cortex learns to see
 Somatosensory cortex (touch processing)
 If you rewire the optic nerve to the somatosensory cortex then it learns to see
o With different tissue learning to see, maybe they all learn in the same way
 Brain learns by itself how to learn
 Other examples
o Seeing with your tongue
 Brainport
 Grayscale camera on head
 Run wire to array of electrodes on tongue
 Pulses onto tongue represent image signal
 Lets people see with their tongue
o Human echolocation
 Blind people being trained in schools to interpret sound and echo
 Lets them move around
o Haptic belt direction sense
 Belt which buzzes on the side facing north
 Gives you a sense of direction
 Brain can process and learn from data from any source

Model representation 1
 How do we represent neural networks (NNs)?

o Neural networks were developed as a way to simulate networks of neurones
 What does a neurone look like?

 Three things to notice


o Cell body
o Number of input wires (dendrites)
o Output wire (axon)
 Simple level
o Neurone gets one or more inputs through dendrites
o Does processing
o Sends output down axon
 Neurons communicate through electric spikes

o Pulse of electricity via axon to another neurone

Artificial neural network - representation of a neurone

 In an artificial neural network, a neurone is a logistic unit


o Feed input via input wires
o Logistic unit does computation
o Sends output down output wires
 That logistic computation is just like our previous logistic regression hypothesis calculation
 Very simple model of a neuron's computation
o Often good to include an x0 input - the bias unit
 This is equal to 1
 This is an artificial neurone with a sigmoid (logistic) activation function
o Ɵ vector may also be called the weights of a model
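Putting the above into code, here is a minimal sketch (Octave) of a single artificial neurone with a sigmoid activation; the x and theta values are illustrative placeholders, not values from the course:

g = @(z) 1 ./ (1 + exp(-z));     % sigmoid (logistic) activation function

x = [1; 2.5; -1.3; 0.7];         % inputs [x0; x1; x2; x3]; x0 = 1 is the bias unit
theta = [-1.0; 0.8; 0.4; -2.2];  % the weights (illustrative values)

a = g(theta' * x)                % the neurone's output (its activation)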
 The above diagram is a single neurone

o Below we have a group of neurones strung together

 Here, input is x1, x2 and x3


o We could also call the input the activation on the first layer - i.e. (a1(1), a2(1) and a3(1))
o Three neurones in layer 2 (a1(2), a2(2) and a3(2))
o Final fourth neurone which produces the output
 Which again we *could* call a1(3)
 First layer is the input layer
 Final layer is the output layer - produces value computed by a hypothesis
 Middle layer(s) are called the hidden layers

o You don't observe the values processed in the hidden layer
o Not a great name
o Can have many hidden layers

Neural networks - notation

 ai(j) - activation of unit i in layer j


o So, a1(2) is the activation of the 1st unit in the second layer
o By activation, we mean the value which is computed and output by that node
 Ɵ(j) - matrix of parameters controlling the function mapping from layer j to layer j + 1
o Parameters for controlling mapping from one layer to the next
o If network has
 sj units in layer j and
 sj+1 units in layer j + 1
 Then Ɵ(j) will be of dimensions [sj+1 x (sj + 1)]
 Because
 sj+1 is equal to the number of units in layer (j + 1)
 (sj + 1) is equal to the number of units in layer j, plus an additional bias unit
o Looking at the Ɵ matrix
 The number of rows is the number of units in the following layer
 The number of columns is the number of units in the current layer + 1 (because we have to map the bias unit)
 So, if we had two layers - 101 and 21 units in each
 Then Ɵ(j) would be [21 x 102]
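A quick Octave check of that dimension rule, using the 101/21 example above (the matrix itself is just zeros, for illustration):

s_j = 101;                          % units in layer j
s_jplus1 = 21;                      % units in layer j + 1
Theta_j = zeros(s_jplus1, s_j + 1); % the +1 column maps the bias unit
size(Theta_j)                       % ans = 21 102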
 What are the computations which occur?
o We have to calculate the activation for each node
o That activation depends on
 The input(s) to the node
 The parameter associated with that node (from the Ɵ vector associated with that layer)
 Below we have an example of a network, with the associated calculations for the four nodes below

 As you can see


o We calculate each of the layer-2 activations based on the input values plus the bias term (which is equal to 1)
 i.e. x0 to x3
o We then calculate the final hypothesis (i.e. the single node in layer 3) using exactly the same logic, except the input is not the x values, but the activation values from the preceding layer
 The activation value on each hidden unit (e.g. a1(2)) is equal to the sigmoid function applied to the linear combination of inputs

o Three input units
 So Ɵ(1) is the matrix of parameters governing the mapping of the input units to hidden
units
 Ɵ(1) here is a [3 x 4] dimensional matrix
o Three hidden units
 Then Ɵ(2) is the matrix of parameters governing the mapping of the hidden layer to the
output layer
 Ɵ(2) here is a [1 x 4] dimensional matrix (i.e. a row vector)
o One output unit
 Something conceptually important (that I hadn't really grasped the first time) is that
o Every input/activation goes to every node in following layer
 Which means each "layer transition" uses a matrix of parameters with the following
significance
 For the sake of consistency with later nomenclature, we're using j, i and l as our variables here (although later in this section we use j to show the layer we're on)
 Ɵji(l)
 j (first of two subscript numbers) = ranges from 1 to the number of units in layer l+1
 i (second of two subscript numbers) = ranges from 0 to the number of units in layer l
 l is the layer you're moving FROM
 This is perhaps more clearly shown in my slightly over the top example below

 For example
o Ɵ13(1) means
 1 - we're mapping to node 1 in layer l+1
 3 - we're mapping from node 3 in layer l
 (1) - we're mapping from layer 1

Model representation II
Here we'll look at how to carry out the computation efficiently through a vectorized implementation. We'll also
consider
why NNs are good and how we can use them to learn complex non-linear things

 Below is our original problem from before



o The sequence of steps to compute the output of the hypothesis is given by the equations below
 Define some additional terms
o z1(2) = Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3
o Which means that
 a1(2) = g(z1(2))
o NB, the parenthesized superscript is the layer the value is associated with
 Similarly, we define the others as
o z2(2) and z3(2)
o These z values are just linear combinations of the input values
 If we look at the block we just redefined

o We can vectorize the neural network computation
o So let's define
 x as the feature vector x
 z(2) as the vector of z values from the second layer
 z(2) is a 3x1 vector
 We can vectorize the computation of the neural network as follows, in two steps
o z(2) = Ɵ(1)x
 i.e. Ɵ(1) is the matrix defined above
 x is the feature vector
o a(2) = g(z(2))
 To be clear, z(2) is a 3x1 vector
 a(2) is also a 3x1 vector
 g() applies the sigmoid (logistic) function element-wise to each member of the z(2) vector
 To make the notation for the input layer make sense:
o a(1) = x
 a(1) is the vector of activations in the input layer
 Obviously the "activation" for the input layer is just the input!
o So we define x as a(1) for clarity
 So
 a(1) is the vector of inputs
 a(2) is the vector of values calculated by the g(z(2)) function
 Having calculated the z(2) vector, we need to add the bias unit a0(2) before the final hypothesis calculation

 To take care of the extra bias unit we set a0(2) = 1

o So we add a0(2) to a(2), making it a 4x1 vector
 So,
o z(3) = Ɵ(2)a(2)
 This is the inner term of the above equation
o hƟ(x) = a(3) = g(z(3))
 This process is also called forward propagation

o Start off with activations of input unit
 i.e. the x vector as input
o Forward propagate and calculate the activation of each layer sequentially
o This is a vectorized version of this implementation
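Here is a minimal sketch (Octave) of that vectorized forward propagation, for the 3-input, 3-hidden-unit, 1-output network discussed above; the Ɵ matrices are random placeholders rather than learned parameters:

g = @(z) 1 ./ (1 + exp(-z));   % sigmoid, applied element-wise

Theta1 = randn(3, 4);          % [3 x 4]: maps layer 1 (3 units + bias) to layer 2
Theta2 = randn(1, 4);          % [1 x 4]: maps layer 2 (3 units + bias) to layer 3

x = [0.5; -1.2; 2.0];          % feature vector
a1 = [1; x];                   % a(1) = x, with the bias unit x0 = 1 added
z2 = Theta1 * a1;              % z(2) is a 3x1 vector
a2 = [1; g(z2)];               % a(2) = g(z(2)), with bias a0(2) = 1 added (4x1)
z3 = Theta2 * a2;
h = g(z3)                      % hƟ(x) = a(3)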

Neural networks learning their own features

 Diagram below looks a lot like logistic regression

 Layer 3 is a logistic regression node


o The hypothesis output = g(Ɵ10(2)a0(2) + Ɵ11(2)a1(2) + Ɵ12(2)a2(2) + Ɵ13(2)a3(2))
o This is just logistic regression
 The only difference is that instead of the input being a feature vector, the features are the values calculated by the hidden layer
 The features a1(2), a2(2), and a3(2) are calculated/learned - not original features
 So the mapping from layer 1 to layer 2 (i.e. the calculations which generate the a(2) features) is determined by another set of parameters - Ɵ(1)
o So instead of being constrained by the original input features, a neural network can learn its own features to feed into logistic regression
o Depending on the Ɵ(1) parameters you can learn some interesting things
 Flexibility to learn whatever features it wants to feed into the final logistic regression calculation
 If we compare this to ordinary logistic regression, there you would have to calculate and engineer your own features to define the best way to classify or describe something
 Here, we're letting the hidden layers do that: we feed the hidden layers our input values, and let them learn whatever gives the best final result to feed into the final output layer
 As well as the networks already seen, other architectures (topology) are possible
o More/less nodes per layer
o More layers
o Once again: layer 2 has three hidden units and layer 3 has two hidden units; by the time you get to the output layer you can produce a very interesting non-linear hypothesis
 Some of the intuitions here are complicated and hard to understand

o In the following lectures we're going to go through a detailed example to understand how to do non-linear analysis

Neural network example - computing a complex, nonlinear function of the input
 Non-linear classification: XOR/XNOR

o x1, x2 are binary

 Example on the right shows a simplified version of the more complex problem we're dealing with (on the
left)
 We want to learn a non-linear decision boundary to separate the positive and negative examples

y = x1 XOR x2, or
y = x1 XNOR x2

Where XNOR = NOT (x1 XOR x2)

 We'll use XNOR, so positive examples are when both are true or both are false
o Let's start with something a little more straight forward...
o Don't worry about how we're determining the weights (Ɵ values) for now - just get a flavor of
how NNs work

Neural Network example 1: AND function


 Simple first example

 Can we get a one-unit neural network to compute this logical AND function? (probably...)

o Add a bias unit
o Add some weights for the network
 What are weights?
 Weights are the parameter values which multiply into the input nodes (i.e. Ɵ)

 Sometimes it's convenient to add the weights into the diagram


o These values are in fact just the Ɵ parameters, so
 Ɵ10(1) = -30
 Ɵ11(1) = 20
 Ɵ12(1) = 20
o To use our original notation
 Look at the four input values

 So, as we can see, when we evaluate each of the four possible inputs, only (1,1) gives a positive output
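We can verify that truth table with a few lines of Octave (the weights are the ones given above; g is the sigmoid):

g = @(z) 1 ./ (1 + exp(-z));
inputs = [0 0; 0 1; 1 0; 1 1];
for i = 1:4
  x1 = inputs(i, 1); x2 = inputs(i, 2);
  h = g(-30 + 20*x1 + 20*x2);           % hƟ(x) for the AND network
  printf('x1=%d x2=%d --> h=%.4f\n', x1, x2, h);
end
% Only (1,1) gives g(10) ~ 1; the others give g(-30) or g(-10) ~ 0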

Neural Network example 2: NOT function

 How about negation?


 Negation is achieved by putting a large negative weight in front of the variable you want to negate

Neural Network example 3: XNOR function

 So how do we make the XNOR function work?


o XNOR is short for NOT XOR
 i.e. NOT an exclusive or, so either go big (1,1) or go home (0,0)
o So we want to structure this so the inputs which produce a positive output are
 AND (i.e. both true)
OR
 Neither (which we can shortcut by saying not only one being true)
 So we combine these into a neural network as shown below;

 Simplez!
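As a sketch of the combined network (Octave): layer 2 computes x1 AND x2 and (NOT x1) AND (NOT x2), and the output unit ORs them together. The AND weights are the ones from example 1; the [10, -20, -20] and OR weights [-10, 20, 20] follow the same large-weight trick, though treat the specific numbers as illustrative:

g = @(z) 1 ./ (1 + exp(-z));
Theta1 = [-30  20  20;    % hidden unit 1: x1 AND x2
           10 -20 -20];   % hidden unit 2: (NOT x1) AND (NOT x2)
Theta2 = [-10  20  20];   % output unit: OR of the two hidden units

inputs = [0 0; 0 1; 1 0; 1 1];
for i = 1:4
  a1 = [1; inputs(i, :)'];        % input activations plus bias unit
  a2 = [1; g(Theta1 * a1)];       % hidden layer activations plus bias unit
  h = g(Theta2 * a2);             % hƟ(x) ~ x1 XNOR x2
  printf('x1=%d x2=%d --> XNOR=%.4f\n', inputs(i,1), inputs(i,2), h);
end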

Neural network intuition - handwritten digit classification

 Yann LeCun = machine learning pioneer


 An early machine learning application was postcode reading

o Hilarious music, impressive demonstration!

Multiclass classification
 Multiclass classification is, unsurprisingly, when you distinguish between more than two categories (i.e.
more than 1 or 0)
 With the handwritten digit recognition problem there are 10 possible categories (0-9)
o How do you do that?
o Done using an extension of one vs. all classification
 Recognizing pedestrian, car, motorbike or truck
o Build a neural network with four output units
o Output a vector of four numbers
 1 is 0/1 pedestrian
 2 is 0/1 car
 3 is 0/1 motorcycle
 4 is 0/1 truck
o When image is a pedestrian get [1,0,0,0] and so on
 Just like one vs. all described earlier

o Here we have four logistic regression classifiers

 Training set here is images of our four classifications



o While previously we'd written y as an integer {1,2,3,4}
o Now we represent y as a vector of 0/1 values, e.g. [1,0,0,0]T for a pedestrian

Part 2: Neural Networks - Learning


Neural network cost function

 NNs - one of the most powerful learning algorithms


o Is a learning algorithm for fitting the derived parameters given a training set
o Let's have a first look at a neural network cost function
 Focus on application of NNs for classification problems
 Here's the set up

o Training set is {(x1, y1), (x2, y2), (x3, y3) ... (xm, ym)}
o L = number of layers in the network
 In our example below L = 4
o sl = number of units (not counting bias unit) in layer l
 So here

o L = 4
o s1 = 3
o s2 = 5
o s3 = 5
o s4 = 4

Types of classification problems with NNs

 Two types of classification, as we've previously seen


 Binary classification
o 1 output (0 or 1)
o So a single output node - its value is going to be a real number
o k=1
 NB k is number of units in output layer
o sL = 1
 Multi-class classification

o k distinct classifications
o Typically k is greater than or equal to three
o If only two just go for binary
o sL = k
o So y is a k-dimensional vector of real numbers

Cost function for neural networks

 The (regularized) logistic regression cost function is as follows:
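For reference (the formula was an image in the original), this is the standard regularized logistic regression cost, in LaTeX:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2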

 For neural networks our cost function is a generalization of the equation above, so instead of one output we generate k outputs

 Our hypothesis now outputs a k dimensional vector


o hƟ(x) is a k dimensional vector, so hƟ(x)i refers to the ith value in that vector
 The cost function J(Ɵ) is

o [-1/m] times a sum of a similar term to the one we had for logistic regression
o But now this is also a sum from k = 1 through to K (K is the number of output nodes)
 The summation is a sum over the k output units - i.e. for each of the possible classes
 So if we had 4 output units then the sum is k = 1 to 4 of the logistic regression cost over each of the four output units in turn
o This looks really complicated, but it's not so difficult
 In the regularization term we don't sum over the bias terms (hence the summations starting at 1)
 Even if you do, and end up regularizing the bias term, this is not a big problem
 It is just a summation over all the parameter terms
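Written out in LaTeX (the formula was an image in the original), the cost function just described is:

J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)}\log\left(h_\Theta(x^{(i)})\right)_k + (1-y_k^{(i)})\log\left(1-\left(h_\Theta(x^{(i)})\right)_k\right)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2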

Woah there - lets take a second to try and understand this!

 There are basically two halves to the neural network logistic regression cost function

First half

 This is just saying


o For each training data example (i.e. 1 to m - the first summation)
 Sum for each position in the output vector
 This is an averaged sum of logistic regression costs

Second half

 This is a massive regularization summation term, which I'm not going to walk through, but it's a
fairly straightforward triple nested summation
 This is also called a weight decay term
 As before, the lambda value determines the relative importance of the two halves

 The regularization term is similar to that in logistic regression


 So, we have a cost function, but how do we minimize this bad boy?!

Summary of what's about to go down


The following section is, I think, the most complicated thing in the course, so I'm going to take a second to explain
the general idea of what we're going to do;

 We've already described forward propagation


o This is the algorithm which takes your neural network and the initial input into that network and
pushes the input through the network
 It leads to the generation of an output hypothesis, which may be a single real number, but
can also be a vector
 We're now going to describe back propagation
o Back propagation basically takes the output you got from your network, compares it to the real
value (y) and calculates how wrong the network was (i.e. how wrong the parameters were)
o It then, using the error you've just calculated, back-calculates the error associated with each unit
from the preceding layer (i.e. layer L - 1)
o This goes on until you reach the input layer (where obviously there is no error, as the activation is
the input)
o These "error" measurements for each unit can be used to calculate the partial derivatives
 Partial derivatives are the bomb, because gradient descent needs them to minimize the
cost function
o We use the partial derivatives with gradient descent to try minimize the cost function and update
all the Ɵ values
o This repeats until gradient descent reports convergence

 A few things which are good to realize from the get go


o There is a Ɵ matrix for each layer in the network
 This has each node in layer l+1 as one dimension and each node in layer l (plus the bias unit) as the other dimension
o Similarly, there is going to be a Δ matrix for each layer
 This has the same dimensions as the corresponding Ɵ matrix - it accumulates the gradient contributions from every training example

Back propagation algorithm


 We previously spoke about the neural network cost function
 Now we're going to deal with back propagation
o Algorithm used to minimize the cost function, as it allows us to calculate partial derivatives!

 The cost function used is shown above



o We want to find parameters Ɵ which minimize J(Ɵ)
o To do so we can use one of the algorithms already described such as
 Gradient descent
 Advanced optimization algorithms
 To minimize a cost function we just write code which computes the following
o J(Ɵ)
 i.e. the cost function itself!
 Use the formula above to calculate this value, so we've done that
o Partial derivative terms
 So now we need some way to do that
 This is not trivial! Ɵ is indexed in three dimensions, because we have separate parameter values for each node in each layer going to each node in the following layer
 i.e. each layer has a Ɵ matrix associated with it!
 We want to calculate the partial derivative of J(Ɵ) with respect to a single parameter

 Remember that the partial derivative term we calculate above is a REAL number (not a
vector or a matrix)
 Ɵ is the input parameters
 Ɵ(1) is the matrix of weights which define the function mapping from layer 1 to layer 2
 Ɵ10(1) is the real number parameter which you multiply the bias unit (i.e. 1) by, for the bias unit's input into the first unit in the second layer
 Ɵ11(1) is the real number parameter which you multiply the first (real) unit by, for the first input into the first unit in the second layer
 Ɵ21(1) is the real number parameter which you multiply the first (real) unit by, for the first input into the second unit in the second layer
 As discussed, for Ɵij(l)
 i here represents the unit in layer l+1 you're mapping to (destination node)
 j is the unit in layer l you're mapping from (origin node)
 l is the layer you're mapping from (to layer l+1) (origin layer)
 NB
 The terms destination node, origin node and origin layer are terms I've made up!
 So - this partial derivative term is
 The partial derivative of the cost function with respect to a real number (one of the values in that 3-way indexed parameter set)
o Gradient computation
 One training example
 Imagine we just have a single pair (x, y) as our entire training set
 How would we deal with this example?
 The forward propagation algorithm operates as follows

 Layer 1
 a(1) = x
 z(2) = Ɵ(1)a(1)
 Layer 2
 a(2) = g(z(2)) (add a0(2))
 z(3) = Ɵ(2)a(2)
 Layer 3
 a(3) = g(z(3)) (add a0(3))
 z(4) = Ɵ(3)a(3)
 Output
 a(4) = hƟ(x) = g(z(4))


o This is the vectorized implementation of forward propagation
 Let's compute the activation values sequentially (below just re-iterates what we had above!)
What is back propagation?

 Use it to compute the partial derivatives


 Before we dive into the mechanics, let's get an idea regarding the intuition of the algorithm
o For each node we can calculate δj(l) - this is the error of node j in layer l
 If we remember, aj(l) is the activation of node j in layer l
 Remember the activation is a totally calculated value, so we'd expect there to be some
error compared to the "real" value
 The delta term captures this error
 But the problem here is, "what is this 'real' value, and how do we calculate it?!"
 The NN is a totally artificial construct
 The only "real" value we have is our actual classification (our y value) -
so that's where we start
 If we use our example and look at the fourth (output) layer, we can first calculate
o δj(4) = aj(4) - yj
 [Activation of the unit] - [the actual value observed in the training example]
 We could also write aj(4) as hƟ(x)j
 Although I'm not sure why we would?
o This is an individual example implementation
 Instead of focussing on each node, let's think about this as a vectorized problem
o δ(4) = a(4) - y
 So here δ(4) is the vector of errors for the 4th layer
 a(4) is the vector of activation values for the 4th layer
 With δ(4) calculated, we can determine the error terms for the other layers as follows:

δ(3) = (Ɵ(3))T δ(4) .* g'(z(3))

 Taking a second to break this down


o Ɵ(3) is the matrix of parameters for the 3->4 layer mapping
o δ(4) is (as calculated) the error vector for the 4th layer
o g'(z(3)) is the first derivative of the activation function g, evaluated at the input values given by z(3)
 You can do the calculus if you want (...), but when you calculate this derivative you get
 g'(z(3)) = a(3) .* (1 - a(3))
o So, more simply
 δ(3) = (Ɵ(3))T δ(4) .* (a(3) .* (1 - a(3)))
o .* is the element-wise multiplication between the two vectors
 Why element-wise? Because this is essentially an extension of individual values in a vectorized implementation, so element-wise multiplication gives that effect
 We highlighted it just in case you think it's a typo!
Analyzing the mathematics

 And if we take a second to consider the vector dimensionality (with our example above [3-5-5-4])
o Ɵ(3) is a matrix which is [4 x 5] (if we don't include the bias term; [4 x 6] if we do)
 (Ɵ(3))T, therefore, is a [5 x 4] matrix
o δ(4) is a 4x1 vector
o So when we multiply a [5 x 4] matrix by a [4 x 1] vector we get a [5 x 1] vector
o Which, lo and behold, is the same dimensionality as the a(3) vector, meaning we can run our element-wise multiplication

 For δ(3), when you calculate the derivative term you get
a(3) .* (1 - a(3))
 Similarly, for δ(2) when you calculate the derivative term you get
a(2) .* (1 - a(2))
o So to calculate δ(2) we do
δ(2) = (Ɵ(2))T δ(3) .* (a(2) .* (1 - a(2)))
 There's no δ(1) term
o Because that was the input!
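Tying the delta equations together, here is a small Octave sketch for the [3-5-5-4] example; the activations and parameters are random placeholders (in a real run they come from forward propagation), and the bias columns are excluded from Ɵ(2) and Ɵ(3) to match the dimensional analysis above:

g = @(z) 1 ./ (1 + exp(-z));
a2 = g(randn(5, 1));  a3 = g(randn(5, 1));  a4 = g(randn(4, 1));  % placeholder activations
y = [1; 0; 0; 0];                                                 % one-hot label
Theta2 = randn(5, 5);  Theta3 = randn(4, 5);                      % bias columns excluded

delta4 = a4 - y;                                   % 4x1 output-layer error
delta3 = (Theta3' * delta4) .* (a3 .* (1 - a3));   % [5x4][4x1] .* [5x1] = 5x1
delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));   % 5x1
% No delta1 - the input layer is just the input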

Why do we do this?

 We do all this to get all the δ terms, and we want the δ terms because through a very complicated derivation you can use δ to get the partial derivative of J(Ɵ) with respect to the individual parameters (if you ignore regularization, or regularization is 0, which we deal with later)

 ∂J(Ɵ)/∂Ɵij(l) = aj(l)δi(l+1)
 By doing back propagation and computing the delta terms you can then compute the partial derivative
terms
o We need the partial derivatives to minimize the cost function!

Putting it all together to get the partial derivatives!

 What is really happening? Let's look at a more complex example


 Training set of m examples

 First, set the delta accumulator values

o Set Δij(l) = 0 for all values of l, i and j


o Eventually these Δ values will be used to compute the partial derivative
 Will be used as accumulators for computing the partial derivatives
 Next, loop through the training set

o i.e. for each example in the training set (dealing with each example as (x, y))
o Set a(1) (activation of input layer) = xi
o Perform forward propagation to compute a(l) for each layer (l = 1, 2, ... L)
 i.e. run forward propagation
o Then, use the output label for the specific example we're looking at to calculate δ(L), where δ(L) = a(L) - yi
 So we initially calculate the delta value for the output layer
 Then, using back propagation, we move back through the network from layer L-1 down to layer 2
o Finally, use Δ to accumulate the partial derivative terms

Δij(l) := Δij(l) + aj(l)δi(l+1)

o Note here
 l = layer
 j = node in that layer (the origin node in layer l)
 i = node in layer l+1 (the destination node)
o You can vectorize the Δ expression too, as

Δ(l) := Δ(l) + δ(l+1)(a(l))T

 Finally

o After executing the body of the loop, exit the for loop and compute

Dij(l) = (1/m)Δij(l) + λƟij(l) (when j ≠ 0)
Dij(l) = (1/m)Δij(l) (when j = 0)

 When j = 0 we have no regularization term, because j = 0 corresponds to the bias unit
 At the end of ALL this

o You've calculated all the D terms above using Δ
 NB - each D term above is a real number!
o We can show that each D is equal to the following partial derivative

Dij(l) = ∂J(Ɵ)/∂Ɵij(l)

o So we have calculated the partial derivative for each parameter
 We can then use these in gradient descent or one of the advanced optimization algorithms
 Phew!
o What a load of hassle!
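To make the whole procedure concrete, here is a sketch (Octave) of the algorithm just described for a general L-layer network. It assumes X is an m x n matrix (one example per row), Y is m x K (one-hot rows), the parameters live in a cell array Theta{1..L-1} with bias columns included, and lambda is set; it is illustrative rather than production code:

g = @(z) 1 ./ (1 + exp(-z));
L = numel(Theta) + 1;
Delta = cell(L - 1, 1);
for l = 1:L-1, Delta{l} = zeros(size(Theta{l})); end   % accumulators set to 0

for i = 1:m
  % Forward propagation, keeping every layer's activation (with bias unit)
  a = cell(L, 1);
  a{1} = [1; X(i, :)'];
  for l = 2:L
    a{l} = [1; g(Theta{l-1} * a{l-1})];
  end
  a{L} = a{L}(2:end);                % no bias unit on the output layer

  % Back propagation: delta for the output layer, then backwards to layer 2
  delta = cell(L, 1);
  delta{L} = a{L} - Y(i, :)';
  for l = L-1:-1:2
    al = a{l}(2:end);                % drop the bias before computing the error
    delta{l} = (Theta{l}(:, 2:end)' * delta{l+1}) .* (al .* (1 - al));
  end

  % Accumulate the gradient contribution of this example
  for l = 1:L-1
    Delta{l} = Delta{l} + delta{l+1} * a{l}';
  end
end

% Average, and regularize every column except the bias column (j = 0)
D = cell(L - 1, 1);
for l = 1:L-1
  D{l} = Delta{l} / m;
  D{l}(:, 2:end) = D{l}(:, 2:end) + (lambda / m) * Theta{l}(:, 2:end);
end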

Back propagation intuition

 Some additional back propagation notes


o In case you found the preceding unclear, which it shouldn't be as it's fairly heavily modified with
my own explanatory notes
 Back propagation is hard(ish...)
o But don't let that discourage you
o It's hard in as much as it's confusing - it's not difficult, just complex
 Looking at mechanical steps of back propagation

Forward propagation with pictures!


 Feeding input into the input layer (xi, yi)
o Note that xi here is a vector with one element per feature
 So above, our data has two features (hence x1 and x2)
 With our input data present we use forward propagation

 The sigmoid function applied to the z values gives the activation values
o Below we show exactly how the z value is calculated for an example

Back propagation

 With forward propagation done we move on to back propagation


 Back propagation is doing something very similar to forward propagation, but backwards
o Very similar though
 Let's look at the cost function again...
o Below we have the cost function if there is a single output (i.e. binary classification)
 This function cycles over each example, so the cost for one example really boils down to this

 Which, we can think of as a sigmoidal version of the squared difference (check out the derivation if you
don't believe me)
o So, basically saying, "how well is the network doing on example i "?
 We can think about a δ term on a unit as the "error" in cost for the activation value associated with that unit
o More formally (don't worry about this...), δ is the partial derivative of the cost for example i with respect to the unit's weighted input

δj(l) = ∂/∂zj(l) cost(i)

 Where cost is as defined above


 Cost function is a function of y value and the hypothesis function
 So - for the output layer, back propagation sets the δ value as [a - y]
o Difference between activation and actual value
 We then propagate these values backwards;

 Looking at another example to see how we actually calculate the delta value;

 So, in effect,
o Back propagation calculates the δ values, and those δ values are the weighted sums of the next layer's delta values, weighted by the parameters associated with the links
o Forward propagation calculates the activation (a) values, which are similarly the weighted sums of the previous layer's activations (passed through the sigmoid)
 Depending on how you implement it, you may compute the delta values of the bias units
o However, these aren't actually used, so computing them is a bit inefficient - but not by much!
Implementation notes - unrolling parameters (matrices)

 Needed for using advanced optimization routines

 The calling code is MATLAB/Octave - something along the lines of optTheta = fminunc(@costFunction, initialTheta, options)

o But our theta is going to be matrices
 fminunc takes the cost function and the initial theta values
o These routines assume theta is a parameter vector
o They also assume the gradient created by costFunction is a vector
 For NNs, our parameters are matrices
o e.g.

Example

 Use the thetaVec = [ Theta1(:); Theta2(:); Theta3(:)]; notation to unroll the matrices into a long vector
 To go back you use

o Theta1 = reshape(thetaVec(1:110), 10, 11)
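A fuller version of that example (Octave), assuming the course's layer sizes s1 = 10, s2 = 10, s3 = 1, so Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = ones(10, 11);
Theta2 = 2 * ones(10, 11);
Theta3 = 3 * ones(1, 11);

thetaVec = [Theta1(:); Theta2(:); Theta3(:)];    % unroll into a 231x1 vector

Theta1 = reshape(thetaVec(1:110), 10, 11);       % roll back up again
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);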

Gradient checking

 Backpropagation has a lot of details, small bugs can be present and ruin it :-(
o This may mean it looks like J(Ɵ) is decreasing, but in reality it may not be decreasing by as much
as it should
 So using a numeric method to check the gradient can help diagnose a bug
o Gradient checking helps make sure an implementation is working correctly
 Example

o Have a function J(Ɵ)
o Estimate derivative of function at point Ɵ (where Ɵ is a real number)
o How?
 Numerically

 Compute J(Ɵ + ε)
 Compute J(Ɵ - ε)
 Join them by a straight line
 Use the slope of that line as an approximation to the derivative

 Usually, epsilon is pretty small (0.0001)


o If epsilon becomes REALLY small then the term approaches the true derivative
 This is the two-sided difference (as opposed to the one-sided difference, which would be (J(Ɵ + ε) - J(Ɵ))/ε)
 If Ɵ is a vector with n elements we can use a similar approach to look at the partial derivatives

 So, in Octave we use the following code to numerically compute the derivatives
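(The code was an image in the original; reconstructed here from the description below, with theta as the unrolled parameter vector and J assumed to be a function computing the cost:)

EPSILON = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;   thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end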

 So on each loop iteration, thetaPlus = theta except for element i

o thetaPlus is reset at the start of each iteration
 Create a vector of partial derivative approximations
 Using the vector of gradients from backprop (DVec)

o Check that gradApprox is basically equal to DVec
o Gives confidence that the backprop implementation is correct
 Implementation note

o Implement back propagation to compute DVec
o Implement numerical gradient checking to compute gradApprox
o Check they're basically the same (up to a few decimal places)
o Before using the code for learning, turn off gradient checking
 Why?

 The gradApprox computation is very computationally expensive
 In contrast, backprop is much more efficient (just more fiddly)

Random initialization

 Pick random small initial values for all the theta values
o If you start them all at zero (which does work for linear regression) then the algorithm fails - all activation values for each layer end up the same
 So choose random values!

o Draw values between 0 and 1, then scale and shift by epsilon so the weights lie in [-ε, ε] (where epsilon is a small constant)
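A sketch of that in Octave (the 0.12 value for INIT_EPSILON and the matrix dimensions are illustrative; any small constant works):

INIT_EPSILON = 0.12;
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % values in [-ε, ε]
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;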

Putting it all together

 1) - pick a network architecture



o Number of
 Input units - the dimension of the feature vector x
 Output units - the number of classes in the classification problem
 Hidden units

 Default might be
 1 hidden layer
 If you use more than 1 hidden layer, you should probably have
 The same number of units in each hidden layer
 A number of hidden units around 1.5-2 x the number of input features is reasonable
 Normally
 More hidden units is better
 But more units is more computationally expensive
o We'll discuss architecture more later

 2) - Training a neural network



o 2.1) Randomly initialize the weights
 Small values near 0
o 2.2) Implement forward propagation to get hƟ(x)i for any xi
o 2.3) Implement code to compute the cost function J(Ɵ)
o 2.4) Implement back propagation to compute the partial derivatives
o General implementation below

for i = 1:m {
  Forward propagation on (xi, yi) --> get activation (a) terms
  Back propagation on (xi, yi) --> get delta (δ) terms
  Compute Δ(l) := Δ(l) + δ(l+1)(a(l))T
}

With this done compute the partial derivative terms


o Notes on implementation
 Usually done with a for loop over training examples (for forward and back propagation)
 Can be done without a for loop, but this is a much more complicated way of doing things
 Be careful

 2.5) Use gradient checking to compare the partial derivatives computed using the above algorithm and
numerical estimation of gradient of J(Ɵ)
o Disable the gradient checking code for when you actually run it
 2.6) Use gradient descent or an advanced optimization method with back propagation to try to minimize
J(Ɵ) as a function of parameters Ɵ
o Here J(Ɵ) is non-convex
 Can be susceptible to local minimum
 In practice this is not usually a huge problem
 We can't guarantee these programs will find the global optimum, but they should find a good local optimum at least

 e.g. the plot above pretends the data only has two features, to easily display what's going on
o Our minimum here represents a hypothesis output which is pretty close to y
o If you took one of the peaks, the hypothesis is far from y
 Gradient descent will start from some random point and move downhill
o Back propagation calculates gradient down that hill
