
Computing Gradient using Backpropagation

Now let's understand how to apply gradient descent to a deep neural network. We
have a cost function that we'd like to minimize using gradient descent.

So far, when we talked about perceptrons, we described them as a linear function followed by a non-linearity. We took the threshold function to be our non-linearity, but there are many other reasonable choices, such as the Sigmoid function. The calculation is similar in the case of the sigmoid: the perceptron computes a linear function of its inputs using its weights, adds a constant term called the bias, and feeds this value into the sigmoid function to get the output.
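
As a minimal sketch of that computation in Python (the variable names and values here are illustrative, not from the original text):

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def perceptron_output(x, w, b):
    """Linear function of the inputs plus a bias, fed through a sigmoid."""
    z = np.dot(w, x) + b   # weighted sum of the inputs plus the bias
    return sigmoid(z)

# Example: three inputs, three weights, one bias
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.3
print(perceptron_output(x, w, b))  # a value strictly between 0 and 1
```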

[email protected]
ZV0GDF798E

One of the reasons it is important to work with Sigmoids is that these functions are smooth and differentiable, whereas the derivative of the threshold function is either zero or undefined. The value of the Sigmoid ranges from 0 to 1, and its derivative lies in the range 0 to 0.25.
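
To see where this bound comes from: the sigmoid is σ(x) = 1 / (1 + e^(−x)), and its derivative works out to σ′(x) = σ(x)(1 − σ(x)). Since σ(x) always lies strictly between 0 and 1, the product σ(1 − σ) is largest when σ(x) = 0.5, which happens at x = 0 and gives a maximum derivative of 0.25.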

This small derivative, always a fraction well below 1, is one of the reasons for not using the Sigmoid function in the hidden layers of a deep neural network: when many such fractions are multiplied together across layers, the gradient shrinks rapidly (the vanishing gradient problem).

Using gradient descent, we need to minimize (at least approximately) the non-convex function that comes from computing the quadratic cost function on the network's output. This cost function depends on all of the weights and biases inside the network.

The gradient is a measure of how changes to the weights and biases change the value
of the cost function, and we know we want to move in the direction opposite to the
gradient because we want to minimize the cost function.

Let's quickly discuss what gradient descent actually is.


Gradient descent is an optimization algorithm used to find the values of the parameters (coefficients) of a function (f) that minimize a cost function (cost). You can think of it as a large bowl, where the bowl is a plot of the cost function. A random position on the surface of the bowl is the cost of the current values of the coefficients. The bottom of the bowl is the cost of the best set of coefficients, the minimum of the function. The goal is to keep trying different values for the coefficients, evaluate their cost, and select new coefficients that have a slightly better (lower) cost. Repeating this process enough times will lead to the bottom of the bowl, and you will know the values of the coefficients that result in the minimum cost.

[email protected]
ZV0GDF798E

Steps in Gradient Descent:


1. Initialize the weights with random values and calculate the cost function.
2. Calculate the gradient. We need to know the gradient so that we know the
direction (sign) to move the coefficient values in order to get a lower cost on the
next iteration.
3. Adjust the weights with the gradients to reach the optimal values where cost is
minimized.

new_coefficient = old_coefficient - (learning_rate * gradient)

4. Use the new weights for prediction and to calculate the new cost.
5. Repeat steps 2 to 4 until further adjustments to the weights no longer significantly reduce the error (a minimal sketch of this loop follows this list).
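
Here is that loop applied to a simple one-dimensional quadratic cost (the cost function, learning rate, and stopping threshold are illustrative assumptions, not from the original text):

```python
import numpy as np

def cost(w):
    """Illustrative quadratic cost with its minimum at w = 3."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the cost with respect to w."""
    return 2.0 * (w - 3.0)

w = np.random.rand()      # step 1: random initialization
learning_rate = 0.1

for step in range(100):
    g = gradient(w)                # step 2: compute the gradient
    if abs(g) < 1e-6:              # step 5: stop when adjustments are tiny
        break
    w = w - learning_rate * g      # step 3: new = old - (lr * gradient)

print(w, cost(w))  # w should be close to 3, cost close to 0
```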

How do we compute the gradient?


In the context of neural networks, we can compute the gradient using what's called backpropagation. This method calculates the gradient of a loss function with respect to all the weights in the network. The intuition behind it is simpler than the mechanics: what backpropagation is really doing is using the chain rule to compute the gradient layer by layer.

[email protected]
ZV0GDF798E

Let's get into the intuition. Imagine the m outputs of our neural network are 𝑎1, 𝑎2, ..., 𝑎𝑚, and these are the values we feed into the quadratic cost function. Then it's straightforward to compute the partial derivative of the cost function with respect to any of these values. The partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant (as opposed to the total derivative, in which all variables are allowed to vary). Here we find a partial derivative for each of these values; together they form a gradient that tells us how much, and in which direction, each value should change to reduce the cost.
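
Concretely, if y1, ..., ym are the target outputs, one standard form of the quadratic cost (written out here as an assumption; the text itself does not give the formula) is C = ½ Σᵢ (𝑎ᵢ − yᵢ)², and its partial derivative with respect to the ith output is simply ∂C/∂𝑎ᵢ = 𝑎ᵢ − yᵢ: the difference between that output and its target.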

Now, the idea is that these m outputs are themselves determined by the weights in the previous layer. If we think about the ith output, it's a function of the weights coming into the ith perceptron in the last layer, and also of its bias. We can again compute how changes in these weights and biases affect the ith output; this is exactly where we need the non-linearity to be differentiable, which the Sigmoid is. Backpropagation continues in this manner, computing the partial derivatives of the cost function with respect to the weights and biases in each layer from the partial derivatives already computed for the layer above. That's backpropagation.
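
As a minimal sketch of this layer-by-layer computation for a tiny two-layer sigmoid network with a quadratic cost (the network shape, data, and variable names are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # derivative of the sigmoid, at most 0.25

# Tiny network: 2 inputs -> 3 hidden perceptrons -> 1 output, all sigmoid
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)

x, y = np.array([0.5, -0.2]), np.array([1.0])

# Forward pass: keep the pre-activations z, we need them for derivatives
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
cost = 0.5 * np.sum((a2 - y) ** 2)   # quadratic cost on the output

# Backward pass: chain rule, output layer first
delta2 = (a2 - y) * sigmoid_prime(z2)         # dC/dz2
grad_W2 = np.outer(delta2, a1)                # dC/dW2
grad_b2 = delta2                              # dC/db2

delta1 = (W2.T @ delta2) * sigmoid_prime(z1)  # push the error back one layer
grad_W1 = np.outer(delta1, x)                 # dC/dW1
grad_b1 = delta1                              # dC/db1
```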

Now that we have all the tools we need to apply deep learning, let’s understand all the
steps involved in creating neural networks:

1. Pick a network architecture: Decide the number of layers, the number of perceptrons in each layer, the activation functions, and the number of output perceptrons according to the problem statement.
2. Random Initialization of Weights: The weights are randomly initialized to small values between 0 and 1, very close to zero.
3. Implementation of Forward Propagation to compute the activations of each hidden layer and the cost for a set of input vectors. The cost function helps determine how well the neural network fits the training data.
4. Implementation of Backpropagation to compute the gradient.
5. Use of gradient descent together with backpropagation to minimize the cost function as a function of the parameters (the weights and biases).
6. Repeat steps 3 to 5 until the model finds the optimal parameters (a minimal end-to-end sketch follows this list).
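
Tying these steps together, here is a minimal end-to-end sketch that continues the previous snippet (it reuses sigmoid, sigmoid_prime, W1, b1, W2, b2, x, and y from there; the learning rate and iteration count are illustrative assumptions):

```python
learning_rate = 0.5

for epoch in range(1000):
    # Steps 3-4: forward pass, then backpropagated gradients
    z1 = W1 @ x + b1;  a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

    delta2 = (a2 - y) * sigmoid_prime(z2)
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)

    # Step 5: gradient descent update, new = old - (lr * gradient)
    W2 -= learning_rate * np.outer(delta2, a1); b2 -= learning_rate * delta2
    W1 -= learning_rate * np.outer(delta1, x);  b1 -= learning_rate * delta1

print(sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2))  # output should approach y
```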

Why does gradient descent on a non-convex function work at all? In low dimensions, it seems obvious that it would get stuck in local minima. But in high dimensions, it seems to actually work, and the truth is that no one knows why.

There are several possible explanations:


1. Maybe these functions are closer to convex than we think, and are at least convex on a large region of the parameter space.

2. Another factor may be that what it means for a point to be a local minimum is much more stringent in higher dimensions than in lower dimensions. A point is a local minimum if in every direction you try to move, the function starts to increase. Intuitively, it seems much harder to be a local minimum in higher dimensions because there are so many directions in which you could escape.
3. Yet another takeaway is that when you apply backpropagation to fit the parameters, you might get very different answers depending on where you start. This just doesn't happen with convex functions, because wherever you start from, you'll always end up in the same place: the globally optimal solution.

Also, neural networks first became popular in the 1980s, but computers just weren't powerful enough back then for us to implement deep neural networks and understand their power, so it was only possible to work with fairly small neural networks. The truth is that if you're stuck with a small neural network and you want to solve some classification task, there are much better machine learning approaches, such as Ensemble methods and Support Vector Machines. But in recent times, vast advances in computing power have made it feasible to work with truly huge neural networks, and this has been a major driving force behind their research.

Lastly, about hierarchical representations: we talked about some of the philosophy and connections to neuroscience. There is, at best, a very loose parallel between these two domains, and you're advised not to take this comparison too literally. In the early days, there was so much focus on doing exactly what neuroscience tells us happens in the visual cortex that researchers actually stayed away from gradient descent, because it wasn't, and still isn't, clear that the visual cortex can implement these types of algorithms.
