Neural Networks
Neural networks (NNs) were originally motivated by the goal of building machines that replicate the brain's functionality
o Looked at here as a machine learning technique
Origins
o To build learning systems, why not mimic the brain?
o Used a lot in the 80s and 90s
o Popularity diminished in late 90s
o Recent major resurgence
o NNs are computationally expensive, so only recently have large-scale neural networks become computationally feasible
Brain
o Does loads of crazy things
o Hypothesis is that the brain has a single learning algorithm
o Evidence for hypothesis
Auditory cortex --> takes sound signals
If you cut the wiring from the ear to the auditory cortex
And re-route the optic nerve to the auditory cortex
The auditory cortex learns to see
Somatosensory cortex (touch processing)
If you rewire the optic nerve to the somatosensory cortex then it learns to see
o With different tissue learning to see, maybe they all learn in the same way
o The brain learns by itself how to learn
Other examples
o Seeing with your tongue
Brainport
Grayscale camera on head
Run wire to array of electrodes on tongue
Pulses onto tongue represent image signal
Lets people see with their tongue
o Human echolocation
Blind people being trained in schools to interpret sound and echo
Lets them move around
o Haptic belt direction sense
Belt which buzzes towards north
Gives you a sense of direction
Brain can process and learn from data from any source
Model representation I
How do we represent neural networks (NNs)?
o Neural networks were developed as a way to simulate networks of neurones
What does a neurone look like?
For example
o Ɵ_13^(1) means
1 (first subscript) - we're mapping to node 1 in layer l+1
3 (second subscript) - we're mapping from node 3 in layer l
1 (superscript) - we're mapping from layer 1
Model representation II
Here we'll look at how to carry out the computation efficiently through a vectorized implementation. We'll also consider why NNs are good and how we can use them to learn complex non-linear things
We can vectorize the computation of the neural network as follows, in two steps
o z^(2) = Ɵ^(1)x
i.e. Ɵ^(1) is the matrix defined above
x is the feature vector
o a^(2) = g(z^(2))
To be clear, z^(2) is a 3x1 vector
a^(2) is also a 3x1 vector
g() applies the sigmoid (logistic) function element-wise to each member of the z^(2) vector
To make the notation for the input layer make sense:
o a^(1) = x
a^(1) is the activations in the input layer
Obviously the "activation" for the input layer is just the input!
o So we define x as a^(1) for clarity
So
a^(1) is the vector of inputs
a^(2) is the vector of values calculated by the g(z^(2)) function
Having calculated the z^(2) vector, we also need to add the bias unit a_0^(2) for the final hypothesis calculation
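As a concrete Octave sketch of that single vectorized step (the dimensions and values here are made up for illustration: 3 features, 3 hidden units):
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid, applied element-wise
x = [1; 0.5; -1.2; 3.0];       % hypothetical feature vector with bias unit a0 = 1
Theta1 = ones(3, 4) * 0.1;     % hypothetical 3x4 weight matrix mapping layer 1 to layer 2
z2 = Theta1 * x;               % z^(2), a 3x1 vector
a2 = g(z2);                    % a^(2), a 3x1 vector
a2 = [1; a2];                  % add the bias unit a0^(2) = 1 for the next layer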
Example on the right shows a simplified version of the more complex problem we're dealing with (on the left)
We want to learn a non-linear decision boundary to separate the positive and negative examples
y = x1 XNOR x2 (i.e. NOT (x1 XOR x2))
Positive examples when both are true or both are false
o Let's start with something a little more straight forward...
o Don't worry about how we're determining the weights (Ɵ values) for now - just get a flavor of how NNs work
Can we get a one-unit neural network to compute this logical AND function? (probably...)
o Add a bias unit
o Add some weights for the networks
o What are weights?
Weights are the parameter values (i.e. Ɵ) which multiply the input node values
So, as we can see, when we evaluate each of the four possible inputs, only (1,1) gives a positive output
Simplez!
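As a quick Octave sketch, we can check this network. The weights Ɵ = [-30; 20; 20] are a standard choice for implementing AND with a single sigmoid unit:
g = @(z) 1 ./ (1 + exp(-z));   % sigmoid
Theta = [-30; 20; 20];         % bias weight, weight for x1, weight for x2
for x1 = 0:1
  for x2 = 0:1
    h = g(Theta' * [1; x1; x2]);
    printf("x1=%d x2=%d -> h = %.4f\n", x1, x2, h);
  end
end
% Only (1,1) gives h close to 1; the other three inputs give h close to 0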
Multiclass classification
Multiclass classification is, unsurprisingly, when you distinguish between more than two categories (i.e. more than 1 or 0)
With the handwritten digit recognition problem there are 10 possible categories (0-9)
o How do you do that?
o Done using an extension of one vs. all classification
Recognizing pedestrian, car, motorbike or truck
o Build a neural network with four output units
o Output a vector of four numbers
1 is 0/1 pedestrian
2 is 0/1 car
3 is 0/1 motorcycle
4 is 0/1 truck
o When image is a pedestrian get [1,0,0,0] and so on
Just like one vs. all described earlier
o Here we have four logistic regression classifiers
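A small Octave sketch of recoding a class label into one of these output vectors (the variable names are illustrative):
numClasses = 4;                % pedestrian, car, motorcycle, truck
y = 3;                         % hypothetical label: class 3 (motorcycle)
yVec = zeros(numClasses, 1);
yVec(y) = 1;                   % yVec is now [0; 0; 1; 0]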
For neural networks our cost function is a generalization of the logistic regression cost function above, so instead of one output we generate k outputs
There are basically two halves to the neural network logistic regression cost function
First half
Second half
This is a massive regularization summation term, which I'm not going to walk through, but it's a fairly straightforward triple-nested summation
This is also called a weight decay term
As before, the lambda value determines the relative importance of the two halves
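For reference, here is the full cost function being described, reconstructed in the standard form used in the course (m examples, K output units, L layers, s_l units in layer l):

J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(h_\Theta(x^{(i)})\right)_k + \left(1 - y_k^{(i)}\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left(\Theta_{ji}^{(l)}\right)^2

The first half sums the logistic cost over all K output units; the second half is the triple-nested regularization sum over every non-bias weight.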
Remember that the partial derivative term we calculate above is a REAL number (not a vector or a matrix)
Ɵ is the set of input parameters
Ɵ^(1) is the matrix of weights which define the function mapping from layer 1 to layer 2
Ɵ_10^(1) is the real number parameter which you multiply the bias unit (i.e. 1) with for the bias unit input into the first unit in the second layer
Ɵ_11^(1) is the real number parameter which you multiply the first (real) unit with for the first input into the first unit in the second layer
Ɵ_21^(1) is the real number parameter which you multiply the first (real) unit with for the first input into the second unit in the second layer
As discussed, for Ɵ_ij^(l)
i here represents the unit in layer l+1 you're mapping to (destination node)
j is the unit in layer l you're mapping from (origin node)
l is the layer you're mapping from (to layer l+1) (origin layer)
NB
The terms destination node, origin node and origin layer are terms I've made up!
So - this partial derivative term is
The partial derivative of J(Ɵ) with respect to a single real number Ɵ_ij^(l) (one of the values in that 3-way indexed set of parameters)
o Gradient computation
o One training example
Imagine we just have a single pair (x,y) as the entire training set
How would we deal with this example?
The forward propagation algorithm operates as follows
Layer 1
a^(1) = x
z^(2) = Ɵ^(1)a^(1)
Layer 2
a^(2) = g(z^(2)) (add a_0^(2))
z^(3) = Ɵ^(2)a^(2)
Layer 3
a^(3) = g(z^(3)) (add a_0^(3))
z^(4) = Ɵ^(3)a^(3)
Output
a^(4) = h_Ɵ(x) = g(z^(4))
o This is the vectorized implementation of forward propagation
Let's compute the activation values sequentially (the sketch below just re-iterates what we had above!)
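A minimal Octave sketch of this forward pass, assuming the [3-5-5-4] architecture discussed below, with random weights purely for illustration:
g = @(z) 1 ./ (1 + exp(-z));        % sigmoid
x = [0.5; -1.0; 2.0];               % hypothetical 3-feature input
Theta1 = randn(5, 4);               % layer 1 -> 2 (5 units, 3 inputs + bias)
Theta2 = randn(5, 6);               % layer 2 -> 3 (5 units, 5 inputs + bias)
Theta3 = randn(4, 6);               % layer 3 -> 4 (4 outputs, 5 inputs + bias)
a1 = [1; x];                        % input activations plus bias
z2 = Theta1 * a1;  a2 = [1; g(z2)]; % layer 2 activations plus bias
z3 = Theta2 * a2;  a3 = [1; g(z3)]; % layer 3 activations plus bias
z4 = Theta3 * a3;  a4 = g(z4);      % output layer: h(x), a 4x1 vector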
What is back propagation?
And if we take a second to consider the vector dimensionality (with our example above, [3-5-5-4])
o Ɵ^(3) is a matrix which is [4 x 5] (if we don't include the bias term; [4 x 6] if we do)
(Ɵ^(3))^T is therefore a [5 x 4] matrix
o δ^(4) is a [4 x 1] vector
o So when we multiply a [5 x 4] matrix with a [4 x 1] vector we get a [5 x 1] vector
o Which, lo and behold, is the same dimensionality as the a^(3) vector, meaning we can run our element-wise multiplication
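Concretely, the step whose dimensions we just checked is:

δ^(3) = (Ɵ^(3))^T δ^(4) .* g'(z^(3)), where g'(z^(3)) = a^(3) .* (1 - a^(3))

(.* is element-wise multiplication; g' is the sigmoid derivative)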
Why do we do this?
We do all this to get all the δ terms, and we want the δ terms because, through a very complicated derivation, you can use δ to get the partial derivative of J(Ɵ) with respect to individual parameters (if you ignore regularization, or regularization is 0, which we deal with later)
∂J(Ɵ)/∂Ɵ_ij^(l) = a_j^(l) δ_i^(l+1)
By doing back propagation and computing the delta terms you can then compute the partial derivative
terms
o We need the partial derivatives to minimize the cost function!
o i.e. for each example in the training set (dealing with each example as (x^(i), y^(i)))
o Set a^(1) (activation of input layer) = x^(i)
o Perform forward propagation to compute a^(l) for each layer (l = 1, 2, ..., L)
i.e. run forward propagation
o Then, use the output label for the specific example we're looking at to calculate δ^(L), where δ^(L) = a^(L) - y^(i)
So we initially calculate the delta value for the output layer
Then, using back propagation, we move back through the network from layer L-1 down to layer 2
o Finally, use Δ to accumulate the partial derivative terms
o Note here, for Δ_ij^(l)
l = layer
j = node in that layer (origin node)
i = node in layer l+1 (matching the δ_i^(l+1) term)
o You can vectorize the Δ expression too, as Δ^(l) := Δ^(l) + δ^(l+1)(a^(l))^T
Finally
o After executing the body of the loop, exit the for loop and compute the D terms
When j = 0 we have no regularization term
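Reconstructing that computation in the standard form, the D terms are:

D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)} \quad \text{if } j \neq 0

D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)} \quad \text{if } j = 0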
At the end of ALL this
o You've calculated all the D terms above using Δ
NB - each D term above is a real number!
o We can show that each D term is equal to the corresponding partial derivative: D_ij^(l) = ∂J(Ɵ)/∂Ɵ_ij^(l)
o We have calculated the partial derivative for each parameter
o We can then use these in gradient descent or one of the advanced optimization algorithms
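For plain gradient descent with learning rate α, the update would be the usual one:

Ɵ_ij^(l) := Ɵ_ij^(l) - α D_ij^(l)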
Phew!
o What a load of hassle!
The sigmoid function applied to the z values gives the activation values
o Below we show exactly how the z value is calculated for an example
Back propagation
Which we can think of as a sigmoidal version of the squared difference (check out the derivation if you don't believe me)
o So, basically saying, "how well is the network doing on example i "?
We can think about a δ term on a unit as the "error" of cost for the activation value associated with a unit
o More formally (don't worry about this...), δ is
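The formal definition (for the unregularized cost of a single example i) is:

\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \, \text{cost}(i)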
Looking at another example to see how we actually calculate the delta value:
So, in effect,
o Back propagation calculates the δ values, and those δ values are the weighted sums of the next layer's delta values, weighted by the parameters associated with the links
o Forward propagation calculates the activation (a) values, which are, analogously, the weighted sums of the previous layer's activations passed through the sigmoid
Depending on how you implement it, you may also compute delta values for the bias units
o However, these aren't actually used, so computing them is a bit inefficient, but not by much!
Implementation notes - unrolling parameters (matrices)
Example
Use the thetaVec = [ Theta1(:); Theta2(:); Theta3(:)]; notation to unroll the matrices into a long vector
To go back you use
o Theta1 = reshape(thetaVec(1:110), 10, 11)
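A fuller sketch, assuming (as in the course's example) that Theta1 and Theta2 are 10x11 and Theta3 is 1x11:
% Unroll the three weight matrices into one long vector
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];   % 231x1
% Recover the matrices from the vector
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);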
Gradient checking
Backpropagation has a lot of details, and small bugs can be present and ruin it :-(
o This may mean it looks like J(Ɵ) is decreasing, but in reality it may not be decreasing by as much as it should
So using a numeric method to check the gradient can help diagnose a bug
o Gradient checking helps make sure an implementation is working correctly
Example
o Have a function J(Ɵ)
o Estimate derivative of function at point Ɵ (where Ɵ is a real number)
o How? Numerically:
Compute J(Ɵ + ε)
Compute J(Ɵ - ε)
Join the two points by a straight line
Use the slope of that line as an approximation to the derivative
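That slope is the two-sided difference approximation:

dJ(Ɵ)/dƟ ≈ (J(Ɵ + ε) - J(Ɵ - ε)) / (2ε)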
So, in Octave we use something like the following code to numerically compute the derivatives
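A sketch, assuming theta is the unrolled n-dimensional parameter vector and J is a function handle that computes the cost:
EPSILON = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end
% Check gradApprox is approximately equal to the derivatives from back propagation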
Random initialization
Pick random small initial values for all the theta values
o If you start them all at zero (which does work for linear regression) then the algorithm fails, because all activation values for each layer end up the same
So choose random values!
o Between 0 and 1, then scale by epsilon (where epsilon is a constant)
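An Octave sketch, assuming a 10x11 Theta1 and a small constant INIT_EPSILON:
INIT_EPSILON = 0.01;   % hypothetical small constant
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % values in [-INIT_EPSILON, INIT_EPSILON]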
for i = 1:m {
Forward propagation on (x^(i), y^(i)) --> get activation (a) terms
Back propagation on (x^(i), y^(i)) --> get delta (δ) terms
Compute Δ^(l) := Δ^(l) + δ^(l+1)(a^(l))^T
}
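A runnable Octave sketch of this loop for a network with one hidden layer; X (m x n examples), Y (K x m one-hot labels), Theta1 and Theta2 are assumed to already exist, so this is an illustration rather than the course's exact code:
g = @(z) 1 ./ (1 + exp(-z));       % sigmoid
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  a1 = [1; X(i, :)'];              % input activations plus bias
  z2 = Theta1 * a1;
  a2 = [1; g(z2)];                 % hidden activations plus bias
  a3 = g(Theta2 * a2);             % output activations, h(x)
  delta3 = a3 - Y(:, i);           % output-layer error
  delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));
  delta2 = delta2(2:end);          % drop the bias unit's delta
  Delta2 = Delta2 + delta3 * a2';
  Delta1 = Delta1 + delta2 * a1';
end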
o Notes on implementation
Usually done with a for loop over training examples (for forward and back propagation)
Can be done without a for loop, but this is a much more complicated way of doing things
Be careful
2.5) Use gradient checking to compare the partial derivatives computed using the above algorithm with a numerical estimation of the gradient of J(Ɵ)
o Disable the gradient checking code when you actually train the network (the numerical check is very slow)
2.6) Use gradient descent or an advanced optimization method with back propagation to try to minimize J(Ɵ) as a function of parameters Ɵ
o Here J(Ɵ) is non-convex
Can be susceptible to local minimum
In practice this is not usually a huge problem
Can't guarantee it will find the global optimum, but it should find a good local optimum at least
e.g. above, we pretend the data only has two features to easily display what's going on
o Our minimum here represents a hypothesis output which is pretty close to y
o If you took one of the peaks hypothesis is far from y
Gradient descent will start from some random point and move downhill
o Back propagation calculates gradient down that hill