
Gradient Descent

The gradient descent algorithm is used all over the machine learning landscape, from optimizing linear regression to training some of the most advanced neural network models. The goal of gradient descent is to minimize a function, and that function can take any number of parameters.

Defining the Model


The Algorithm’s Goal
The goal of the algorithm is to minimize a general function J with n parameters,

$$\min_{w_1, \dots, w_n, b} J(w_1, w_2, \dots, w_n, b)$$

As the number of parameters increases, the surface of J becomes more complex and can have multiple minima. The algorithm works by starting at some (possibly very inefficient) parameter selection (the red peak spots on the graph), then moving in small increments downhill toward the minimum J values (the blue valleys). This process is repeated many times over for different starting points.

Notice how starting at slightly different spots on the peaks can result in completely different paths and resulting minimums. The lowest J value reached along any individual path is called a local minimum.

Defining the Algorithm
The gradient descent algorithm is defined as a repeated update of each input parameter until convergence,

$$w = w - \alpha \frac{\partial}{\partial w} J(w, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b)$$

$\alpha$ = Learning rate, a value that controls how big of a step is taken. This is usually a small positive number between 0 and 1. A larger alpha value corresponds to a bigger step and vice-versa.

$\frac{\partial}{\partial w} J(w, b)$, $\frac{\partial}{\partial b} J(w, b)$ = Partial derivatives of the cost function, which determine the direction of each step.

Notice how the product of these two values gives the size and direction of the next step along the path. Subtracting it from the previous parameter value w or b yields a more optimized value, which is then assigned. When the algorithm finally reaches the lowest point on a path, the derivative is 0 and there is nothing left to subtract. The point where the updates stop changing every input parameter is where the algorithm ends.

Since every parameter must be updated, it is extremely important to update w and b simultaneously. Otherwise, an already-updated value will incorrectly be fed into the cost function when computing the update for the other parameter.
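As a minimal sketch of what a simultaneous update looks like in code (using the made-up cost function J(w, b) = (w + b)² purely for illustration):

```python
# Minimal example on J(w, b) = (w + b)^2, whose partial
# derivatives with respect to w and b are both 2(w + b).
w, b, alpha = 3.0, 4.0, 0.1

# Correct: compute both steps from the *old* values, then assign.
tmp_w = w - alpha * 2 * (w + b)
tmp_b = b - alpha * 2 * (w + b)
w, b = tmp_w, tmp_b

# Incorrect: overwriting w first means the update for b
# would be computed with the new w instead of the old one:
#   w = w - alpha * 2 * (w + b)
#   b = b - alpha * 2 * (w + b)   # uses the updated w; wrong
```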

As for the derivative of the cost function, it is used because the derivative is the slope of the tangent line at point w on the cost function curve. This is more easily understood if the algorithm is simplified to plotting the cost function J(w) against w. As the tangent line approaches the bottom of the curve, its slope approaches 0, which results in a smaller decrease of w in each step of the gradient descent path.

Choosing a Learning Rate


The choice of learning rate α will have a huge impact on the efficiency of the overall algorithm. If the learning rate is too small, many more steps will be taken than required, so the total time for the algorithm grows dramatically. On the other hand, if the learning rate is too large, a step can overshoot the true minimum of the cost function, and the algorithm can actually diverge away from it.

Now, the examples above are for simple parabolic graphs. Optimizing α on more complex graphs can be trickier. For example, consider a graph where several local minima are separated by local maxima.

There is also another important question to answer, which is why the algorithm works with a fixed learning rate. This is because as the steps approach a local minimum, they automatically become smaller due to the derivative itself becoming smaller (each subsequent slope shrinks toward 0), as the sketch below demonstrates.
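A minimal sketch of this behavior, using the simple cost function J(w) = w² (so its derivative is 2w) and a fixed learning rate, shows the step sizes shrinking on their own as w approaches the minimum:

```python
# Fixed learning rate on J(w) = w^2, whose derivative is 2w.
w = 10.0
alpha = 0.1  # fixed for every iteration

for i in range(10):
    step = alpha * 2 * w   # step shrinks as the derivative 2w shrinks
    w = w - step
    print(f"iteration {i}: step = {step:.4f}, w = {w:.4f}")
```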

Optimizations
Learning curve
Recall that the goal of gradient descent is to minimize the cost function J(w, b). Plotting the cost function against the number of iterations shows at a quick glance how effective the algorithm is. This curve is specifically called the learning curve. Ideally, the curve should flatten out and converge as quickly as possible. If J ever increases after an iteration, it often signifies a poor learning rate α selection or a bug in the code.

Moreover, the number of iterations can vary significantly depending on the application, which often makes it difficult to know in advance how many iterations are needed for convergence.
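A minimal sketch of how a learning curve might be recorded; `compute_cost` and `compute_gradients` here are hypothetical placeholders for whichever model is being trained:

```python
def gradient_descent_with_history(w, b, alpha, num_iters,
                                  compute_cost, compute_gradients):
    """Run gradient descent while recording J at every iteration."""
    cost_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradients(w, b)
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
        cost_history.append(compute_cost(w, b))
    return w, b, cost_history

# Plotting cost_history against iteration number yields the learning
# curve; if it ever rises, suspect the learning rate or a bug.
```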

Automatic convergence test

An automatic convergence test can also be used. This essentially sets a baseline value ϵ for how much the cost function should decrease before the algorithm is stopped. If J(w, b) decreases by ≤ ϵ in one iteration, declare convergence: the parameters w, b have been found.

Determining an optimal baseline ϵ can be very tricky, so this test is often best used in conjunction with a learning curve graph.
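As a minimal sketch, the test amounts to comparing consecutive costs inside the loop, shown here again on J(w) = w² with an assumed, problem-dependent baseline ϵ:

```python
# Automatic convergence test on J(w) = w^2 (derivative 2w).
epsilon = 1e-6  # assumed baseline; in practice problem-dependent
alpha = 0.1
w = 10.0

prev_cost = float("inf")
for i in range(10_000):
    w = w - alpha * 2 * w            # gradient descent step
    cost = w ** 2
    if prev_cost - cost <= epsilon:  # J decreased by <= epsilon
        print(f"converged after {i + 1} iterations")
        break
    prev_cost = cost
```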

Debugging
While a diverging cost function, or one that moves up and down, is often attributed to a learning rate that is too large, it can also be caused by a bug in the code. Choosing an extremely small learning rate can help identify whether there is a bug: with a small enough α, J should decrease on every iteration, so if the algorithm still behaves abnormally, the code is at fault.

Gradient Descent for Linear Regression

Given all the equations and definitions above, almost everything is present to programmatically compute gradient descent for a linear regression. However, there is one final piece missing, which is expressing the derivatives of the cost function, $\frac{\partial}{\partial w} J(w, b)$ and $\frac{\partial}{\partial b} J(w, b)$.

Since the cost function and the linear regression model have already been expressed in terms of the input variables x and y, they are substituted into the gradient descent algorithm and the derivative is then taken. Consider the following equations, where m is the number of training examples in the dataset.

Linear regression model:

$$f_{w,b}(x) = wx + b$$

Cost function:

$$J(w, b) = \frac{1}{2m} \sum_{i=1}^{m} \left(f_{w,b}(x_i) - y_i\right)^2$$

Pre-derived gradient descent algorithm

$$w = w - \alpha \frac{\partial}{\partial w} J(w, b) \;\Rightarrow\; \frac{\partial}{\partial w} J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x_i) - y_i\right) x_i$$

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b) \;\Rightarrow\; \frac{\partial}{\partial b} J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x_i) - y_i\right)$$

Final gradient descent algorithm

$$w = w - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x_i) - y_i\right) x_i$$

$$b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{w,b}(x_i) - y_i\right)$$
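A minimal, self-contained NumPy sketch of these update rules; the data here is made up purely for illustration:

```python
import numpy as np

def gradient_descent_linear(x, y, alpha=0.1, num_iters=1000):
    """Univariate linear regression via the update rules above."""
    m = len(x)
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        err = (w * x + b) - y               # f_wb(x_i) - y_i for all i
        dj_dw = (1 / m) * np.sum(err * x)   # derivative w.r.t. w
        dj_db = (1 / m) * np.sum(err)       # derivative w.r.t. b
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b

# Illustrative data drawn from y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
w, b = gradient_descent_linear(x, y)
print(w, b)  # should approach w = 2, b = 1
```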

Gradient Descent for Multiple Linear Regression

Utilizing gradient descent for multiple features follows a very similar process to extending linear regression to multiple features: vector notation is added. Consider the following pre-derived gradient descent algorithm, where a vector of weights $\vec{w}$ spans the n features, across m training examples in the dataset,

Pre-derived gradient descent algorithm

$$\vec{w} = \vec{w} - \alpha \frac{\partial}{\partial \vec{w}} J(\vec{w}, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

Final gradient descent algorithm

$$w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right) x_j^{(i)}$$

$$b = b - \alpha \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right)$$

simultaneously updating $w_j$ (for $j = 1, \dots, n$) and $b$.

Notice how the update for b does not involve the n features, because b is a single value that does not change based on the feature being used. Also, the $x_j^{(i)}$ in the equation for $w_j$ does not use vector notation, because it is a scalar: the value of feature j in training example i.
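A minimal NumPy sketch of these vectorized updates, assuming X is an m × n matrix of training examples:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, num_iters=1000):
    """Multiple linear regression: X is (m, n), y is (m,)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = X @ w + b - y            # f_wb(x_i) - y_i for all i
        dj_dw = (1 / m) * (X.T @ err)  # one component per feature j
        dj_db = (1 / m) * np.sum(err)
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b
```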

Normal Equation - Alternative to Gradient Descent

The normal equation is a technique that serves as an alternative to gradient descent, but it only applies to linear regression. It solves for w and b directly, without iterations. While it can be a simple alternative to gradient descent, it is slow when the number of features is large (> 10,000).

An advanced linear algebra library is required, so engineers hardly ever use the normal equation directly. It is usually used on the backend of machine learning libraries that implement linear regression. For most learning algorithms, gradient descent is the better way to get the job done.
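For reference, a minimal sketch of the closed-form solution, using the standard least-squares formula with a column of ones appended so that b is solved alongside w:

```python
import numpy as np

def normal_equation(X, y):
    """Solve least squares directly: theta = (A^T A)^(-1) A^T y,
    where A is X with a column of ones for the intercept b."""
    A = np.c_[X, np.ones(X.shape[0])]          # append intercept column
    theta = np.linalg.solve(A.T @ A, A.T @ y)  # avoids an explicit inverse
    w, b = theta[:-1], theta[-1]
    return w, b
```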

Scaling and Engineering Features

Choosing a reasonable slope w for each parameter is essential for running an accurate gradient descent algorithm. Some features will have relatively small values while others will have large values, and the slopes must accommodate this relativity. Usually, a bigger w value is chosen for smaller feature sizes $x_j$ and vice versa, which tends to balance the values out. In addition, the features themselves can be scaled. An example of this is using 1000 ft² instead of ft² when used in conjunction with the number of bedrooms. After this scaling, both feature values should lie in a range from around 0 to 20.

What happens if the features are not scaled? Well, this is probably best shown on a plot. If carefully selected w values are not chosen, the algorithm can often overshoot its target, which leads to divergence or much longer training times. Properly scaled features should fill an entire graph when plotted relative to each other.

Scaling Techniques
There are several techniques for scaling features. Considering the following range for each feature $x_j$, where $min \le x_j \le max$, the goal is to aim for a range of roughly $-1 \le x_j \le 1$.

When using the mean and z-score normalization techniques below, it is often necessary to store the normalization values for future use. Once the parameters of the model have been learned and predictions for new data are needed, the new input data x must be normalized using the mean and standard deviation previously computed from the training set.

Relative maximum
The first technique is to take each feature and divide by the maximum of its range, which normalizes to a maximum of 1.

$$x_{j,scaled} = \frac{x_j}{max} \qquad \frac{min}{max} \le x_{j,scaled} \le 1$$

Mean normalization
Here, each feature is normalized around 0, which when plotted will center the data around the origin of the graph. This will usually produce values between -1 and 1 on both the x and y axes of the graph. The mean value $\mu_j$ of all values of feature $x_j$ is necessary for this calculation.

$$\mu_j = \frac{1}{m} \sum_{i=1}^{m} x_j^{(i)} \qquad x_j = \frac{x_j - \mu_j}{max - min}$$

Z-score normalization
A z-score normalization utilizes the standard deviation σ of each feature. The standard deviation measures the spread of a normal distribution, also referred to as a Gaussian distribution (bell-shaped curve).

Just like mean normalization, the mean value $\mu_j$ is used. The results of this technique will produce values centered around the origin, but they will not necessarily be constrained to -1 and 1. Consider the following equations, for m training examples of each feature.

$$\sigma_j = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2} \qquad x_j = \frac{x_j - \mu_j}{\sigma_j}$$
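The code further below calls a function named zscore_normalize_features; a minimal sketch of such a function is shown here (the name is chosen to match that later call; in practice μ and σ would also be returned and stored for normalizing future inputs, as noted above):

```python
import numpy as np

def zscore_normalize_features(X):
    """Z-score normalize each column (feature) of X."""
    mu = np.mean(X, axis=0)    # per-feature mean
    sigma = np.std(X, axis=0)  # per-feature standard deviation
    return (X - mu) / sigma
```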

Gradient Descent 7
Feature Engineering
The concept of engineering features is to derive new features to enhance the accuracy of the model. This is usually done by transforming or combining existing features. For example, take predicting a home price, where two of the features being used are the frontage and depth (both in feet) of the land,

$$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + b$$

Since the total area of the land can be used as well, this can be added to the model as a third feature,

$$area = frontage \times depth \qquad x_3 = x_1 x_2$$

$$f_{\vec{w},b}(\vec{x}) = w_1 x_1 + w_2 x_2 + w_3 x_3 + b$$
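In NumPy, deriving the new feature is a one-liner; the frontage and depth values here are made up purely for illustration:

```python
import numpy as np

frontage = np.array([50.0, 40.0, 60.0])  # illustrative values, in feet
depth = np.array([100.0, 80.0, 120.0])

# Engineer the third feature: total land area
X = np.c_[frontage, depth, frontage * depth]
```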

Scaling polynomial features

Feature engineering also includes modifying features to be polynomial, $x_j^n$. Remember, these engineered features can also include square roots, cube roots, etc. Sometimes it may be obvious that a feature must be a specific polynomial, like a basic quadratic, but this is not always the case.

If the wrong powers are chosen for each feature, the model function will tend to balance itself out as the iterations continue. In other words, more weight will be assigned to the slope w of the features that best fit the data.

For example, suppose the equation $y = x^2 + 1$ best fits a certain data set, and the polynomial equation $y = x + x^2 + x^3$ is chosen. After running gradient descent, the results may give values for w and b that bring the chosen equation closer to the best fitting one. This could look something like $y = 0.08x + 0.64x^2 + 0.03x^3 + 0.78$. Notice how the coefficients on $x^2$ and the intercept $b$ carry much more weight.

In code, scaling features can be a matter of utilizing the NumPy library with a specific concatenation function. This will engineer the features so that they use all degrees that are specified. It is also important to make sure the values are normalized after engineering the features, as the various powers can distort the scaling.

```python
import numpy as np

# create target data
x = np.arange(0, 20, 1)
y = x**2

# engineer polynomial features
X = np.c_[x, x**2, x**3]

# normalize the values
X = zscore_normalize_features(X)
```

Gradient Descent for Logistic Regression

Given the cost function J and the gradient descent algorithms for w and b,

$$J(\vec{w}, b) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log\left(f_{\vec{w},b}(\vec{x}_i)\right) + (1 - y_i) \log\left(1 - f_{\vec{w},b}(\vec{x}_i)\right) \right]$$

$$\vec{w} = \vec{w} - \alpha \frac{\partial}{\partial \vec{w}} J(\vec{w}, b) \qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

The derivatives of the cost function can be calculated as,

$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right) x_j^{(i)}$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right)$$

Then, solving for both w and b,

$$w_j = w_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right) x_j^{(i)} \right]$$

$$b = b - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left(f_{\vec{w},b}(\vec{x}_i) - y_i\right) \right]$$

Notice how these equations look exactly like the linear regression equations. However, the key difference lies within $f_{\vec{w},b}(\vec{x})$: for logistic regression, the model function is the sigmoid of the linear function, $f_{\vec{w},b}(\vec{x}) = \frac{1}{1 + e^{-(\vec{w} \cdot \vec{x} + b)}}$, rather than the linear function itself. Even though the update rules look the same, the two algorithms behave very differently due to the model function, as the sketch below shows.
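A minimal NumPy sketch, reusing the multiple linear regression loop from above with only the model function changed:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_logistic(X, y, alpha=0.1, num_iters=1000):
    """Logistic regression: same update rules, sigmoid model function."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        f = sigmoid(X @ w + b)         # the only change from linear regression
        err = f - y
        dj_dw = (1 / m) * (X.T @ err)
        dj_db = (1 / m) * np.sum(err)
        w, b = w - alpha * dj_dw, b - alpha * dj_db  # simultaneous update
    return w, b
```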

