1.4 Computing Gradient Using Backpropagation
Now let's understand how to apply gradient descent to a deep neural network. We
have a cost function that we'd like to minimize using gradient descent.
So far, when we talked about perceptrons, we described them as a linear function followed by a non-linearity.
We took the threshold function to be our non-linearity, but there are many other
reasonable choices, such as the sigmoid function. The calculation is similar in the case of
the sigmoid: the perceptron computes a linear function of its inputs using its weights,
adds in a constant term called the bias, and feeds this value into a sigmoid function to
get the output.
[email protected]
ZV0GDF798E
One of the reasons it will be important to work with sigmoids is that these functions
are smooth and differentiable, whereas the derivative of the threshold function is either zero or
undefined. The value of the sigmoid ranges from 0 to 1, and the derivative of the
sigmoid lies in the range 0 to 0.25.
This small derivative, together with the fact that the output is squashed into a fraction between 0 and 1, is one of the
reasons for not using the sigmoid function in the hidden layers of a deep neural
network: when many such small factors are multiplied together during backpropagation, the gradients reaching the early layers become vanishingly small.
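As a rough illustration (a minimal sketch in Python using NumPy; the library choice and the sample points are assumptions, not from the text), the sigmoid and its derivative can be computed directly, and chaining a few of these small derivative values shows why gradients shrink as they flow backwards through many sigmoid layers:

import numpy as np

def sigmoid(x):
    # Squash any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); its largest value is 0.25, at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-5.0, 5.0, 11)
print(sigmoid(xs))              # all values lie strictly between 0 and 1
print(sigmoid_derivative(xs))   # all values lie between 0 and 0.25
# Multiplying several such factors together, as backpropagation does across
# layers, shrinks the result quickly: even in the best case 0.25**5 is ~0.001.
print(sigmoid_derivative(0.0) ** 5)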
The gradient is a measure of how changes to the weights and biases change the value
of the cost function, and we move in the direction opposite to the
gradient because we want to minimize that cost.
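Written out (with η standing for a small, assumed step size that the text has not introduced), moving opposite to the gradient means updating each weight as w ← w − η · ∂C/∂w and each bias as b ← b − η · ∂C/∂b, where C is the cost; a positive partial derivative pushes the parameter down and a negative one pushes it up.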
[email protected]
ZV0GDF798E
2
This file is meant for personal use by [email protected] only.
Sharing or publishing the contents in part or full is liable for legal action.
1. Initialize the weights and biases (typically with small random values).
2. Compute the gradient of the cost function with respect to the weights and biases.
3. Adjust the weights and biases by a small step in the direction opposite to the gradient.
4. Use the new weights for prediction and to calculate the new cost.
5. Repeat steps 2 and 3 until further adjustments to the weights don't significantly
reduce the error (a minimal sketch of this loop follows below).
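Here is a minimal sketch of that loop in Python for a one-weight, one-bias model with a quadratic cost (the toy data, the learning rate of 0.1, and the stopping threshold are illustrative assumptions, not values from the text):

import numpy as np

# Toy data for a one-weight, one-bias model y_hat = w * x + b (made up for this sketch).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0          # initialize the weight and bias (zeros here for simplicity)
learning_rate = 0.1      # assumed step size
previous_cost = np.inf

for _ in range(1000):
    y_hat = w * x + b                        # predict with the current parameters
    cost = np.mean((y_hat - y) ** 2) / 2.0   # quadratic cost on the predictions
    grad_w = np.mean((y_hat - y) * x)        # partial derivative of the cost w.r.t. w
    grad_b = np.mean(y_hat - y)              # partial derivative of the cost w.r.t. b
    w -= learning_rate * grad_w              # move opposite to the gradient
    b -= learning_rate * grad_b
    if previous_cost - cost < 1e-9:          # stop once the error barely improves
        break
    previous_cost = cost

print(w, b)  # approaches w = 2, b = 1 for this data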
[email protected]
ZV0GDF798E
Let's get into the intuition. Imagine the m outputs of our neural network are a₁, a₂, ..., aₘ,
and these are the quantities we feed into the quadratic cost function. Then it's
straightforward to compute the partial derivative of the cost function with respect to
any one of them. The partial derivative of a function of several variables is its
derivative with respect to one of those variables, with the others held constant (as
opposed to the total derivative, in which all variables are allowed to vary). In this way we
find a gradient component for each of these quantities, and it tells us in which direction,
and by how much, each one needs to change in order to reduce the cost.
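To make this concrete, suppose the quadratic cost compares each output aᵢ against a target value yᵢ (the target notation and the exact form below, including the ½ factor, are assumed conventions rather than a formula given in the text):

C = ½ · Σᵢ (aᵢ − yᵢ)²

Then the partial derivative with respect to a single output is simply ∂C/∂aⱼ = aⱼ − yⱼ, because every other term in the sum is held constant.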
Now the idea is that these m outputs are themselves based on the weights in the
previous layer. If we think about the i-th output, it's a function of the weights coming into
the i-th perceptron in the last layer, and also of its bias, so we can again compute how
changes in these weights and biases affect the i-th output. This is exactly where we need
the non-linear function to be differentiable, which the sigmoid is. Backpropagation
continues in this manner, computing the partial derivatives of how the cost function
changes as we vary the weights and biases in each earlier layer, based on the partial derivatives
that we've computed for the weights and biases in the layer above. That's
backpropagation.
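As a rough sketch of one such pass (the two-layer network below, its layer sizes, the random weights, the single made-up training example, and the learning rate of 0.5 are all illustrative assumptions, and the code uses NumPy, which the text does not mention):

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A tiny network: 2 inputs -> 3 hidden sigmoid units -> 1 sigmoid output.
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

x = np.array([0.5, -1.0])   # a single training input (made up for the sketch)
y = np.array([1.0])         # its target output

# Forward pass: each layer is "linear plus sigmoid".
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
cost = 0.5 * np.sum((a2 - y) ** 2)

# Backward pass: start from dC/da at the output and walk the chain rule
# back through each layer, reusing the derivatives already computed above it.
delta2 = (a2 - y) * a2 * (1 - a2)         # dC/dz2 at the output layer
grad_W2 = np.outer(delta2, a1)            # dC/dW2
grad_b2 = delta2                          # dC/db2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)  # dC/dz1, built from the layer above
grad_W1 = np.outer(delta1, x)             # dC/dW1
grad_b1 = delta1                          # dC/db1

# One gradient-descent step with an assumed learning rate of 0.5.
eta = 0.5
W2 -= eta * grad_W2; b2 -= eta * grad_b2
W1 -= eta * grad_W1; b1 -= eta * grad_b1

The important line is the one computing delta1: the gradient for the earlier layer is assembled from delta2, the quantity already computed for the layer above it, which is exactly the layer-by-layer recursion described in the paragraph above.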
Now that we have all the tools we need to apply deep learning, let’s understand all the
steps involved in creating neural networks:
2. Another factor may be that what it means for a point to be a local minimum is much
more stringent in higher dimensions than in lower dimensions. A point is a local
minimum if in every direction you try to move, the function starts to increase.
Intuitively, it seems much harder to be a local minimum in higher dimensions
because there are so many directions along which you could escape.
3. Yet another takeaway is that when you apply backpropagation to fit the
parameters, you might get very different answers depending on where you
start (a small demonstration follows this list). This just doesn't happen with convex functions, because wherever you
start from you'll always end up in the same place, namely a globally optimal
solution.
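To see point 3 concretely, here is a small sketch in Python that runs plain gradient descent on a made-up non-convex function f(w) = w⁴ − 3w² + w from two different starting points (the function, step size, and iteration count are invented for the demonstration, not taken from the text):

# Gradient descent on a made-up non-convex function f(w) = w**4 - 3*w**2 + w.
# Starting on different sides of the central "hill" lands in different local
# minima, which never happens when the function being minimized is convex.
def f(w):
    return w**4 - 3*w**2 + w

def df(w):
    return 4*w**3 - 6*w + 1

def descend(w, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        w -= learning_rate * df(w)   # move opposite to the derivative
    return w

for start in (-2.0, 2.0):
    w = descend(start)
    print(f"start {start:+.1f} -> w = {w:+.3f}, f(w) = {f(w):+.3f}")
# The two runs converge to different minima with different cost values.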
Also, neural networks first became popular in the 1980s, but computers just weren't
powerful enough back then for us to implement and understand the power of deep
neural networks, so it was only possible to work with fairly small neural networks. The
truth is that if you're stuck with a small neural network, and you want to solve some
classification task, there are much better machine learning approaches such as
Ensemble methods and Support Vector Machines. But in recent times, vast advances in
computing power have made it feasible to work with truly huge neural networks, and
this has been a major driving force behind research into them.