Gradient Descent
Notice how starting at slightly different spots on the peaks can result in completely different paths and different resulting minimums. The lowest J value along any given individual path is called a local minimum.
Defining the Algorithm
The gradient descent algorithm is defined as a repeated update, for each input parameter, until convergence:

$$w = w - \alpha \frac{\partial}{\partial w} J(w, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b)$$
α = Learning rate, a value that controls how big of a step is taken. This is usually a small positive number, between 0 and 1. A larger α corresponds to a bigger step, and vice-versa.
$\frac{\partial}{\partial w} J(w, b)$ = Partial derivative of the cost function, which determines the direction to take each step.
Notice how the combination of both of these values gives the size and direction of the step to the next point. Moreover, when subtracting this value from the previous parameter w or b, a more optimized value is found and assigned. When the algorithm finally reaches the lowest point on a path, the derivative is 0 and there is nothing left to subtract. This convergence of the updates toward 0 for every input parameter is where the algorithm ends.
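As a concrete illustration, here is a minimal sketch of these update rules in Python for single-feature linear regression, assuming the model $f_{w,b}(x) = wx + b$ and the squared-error cost (the function names are illustrative, not from any particular library):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared-error cost J(w, b)."""
    m = len(x)
    error = (w * x + b) - y           # f_wb(x^(i)) - y^(i) for every example
    dj_dw = (error * x).sum() / m     # dJ/dw
    dj_db = error.sum() / m           # dJ/db
    return dj_dw, dj_db

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    w, b = 0.0, 0.0
    for _ in range(iterations):
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        # Update both parameters simultaneously from the same gradients;
        # updating w first and reusing it for b's gradient would be a bug.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```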
If the learning rate is too small, then many more steps are required, so the total time for the algorithm will grow dramatically. On the other hand, if the learning rate is too large, then steps can overshoot the true minimum of the cost function, and the algorithm itself will actually diverge away from it.
There is also another important question to answer: why does the algorithm work with a fixed learning rate? This is because as the steps approach the local minimum, they automatically become smaller due to the derivative itself becoming smaller (each subsequent slope is smaller as it approaches 0).
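A quick worked example makes this concrete. Take the illustrative cost $J(w) = w^2$, so $\frac{dJ}{dw} = 2w$, with a fixed $\alpha = 0.1$ and starting point $w_0 = 4$:

$$w_1 = 4 - 0.1(8) = 3.2, \quad w_2 = 3.2 - 0.1(6.4) = 2.56, \quad w_3 = 2.56 - 0.1(5.12) = 2.048, \;\ldots$$

The step sizes 0.8, 0.64, 0.512, … shrink on their own as the slope flattens, even though α never changes.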
Optimizations
Learning curve
Recall that the goal of gradient descent is to minimize the cost function J(w, b). Plotting the cost function against the number of iterations can show how effective it is at a quick glance. This curve is specifically called the learning curve. Ideally, the curve should decrease steadily and converge as quickly as possible. If J ever increases after an iteration, it can often signify a poor learning rate α selection or a bug in the code.
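A minimal sketch of producing a learning curve with NumPy and matplotlib (the toy data and hyperparameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])    # toy training data
y = np.array([2.1, 3.9, 6.2, 8.0])

w, b, alpha = 0.0, 0.0, 0.05
history = []
for _ in range(200):
    error = (w * x + b) - y
    w -= alpha * (error * x).mean()    # simultaneous updates: both use
    b -= alpha * error.mean()          # the same pre-update error
    history.append((((w * x + b) - y) ** 2).mean() / 2)   # J after this step

plt.plot(history)                      # the learning curve
plt.xlabel("iteration")
plt.ylabel("J(w, b)")
plt.show()
```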
Convergence test
Another option is an automatic convergence test, using some small threshold ϵ for the algorithm. If J(w, b) decreases by ≤ ϵ in one iteration, declare convergence: the parameters w, b have been found.
Determining an optimal baseline ϵ can be very tricky, and it is often best used in conjunction with a learning curve graph.
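A sketch of that test inside the training loop, continuing the toy setup from the previous snippet, with ϵ = 0.001 chosen purely for illustration:

```python
epsilon = 1e-3                 # illustrative threshold
prev_cost = float("inf")
for i in range(10_000):
    error = (w * x + b) - y
    w -= alpha * (error * x).mean()
    b -= alpha * error.mean()
    cost = (((w * x + b) - y) ** 2).mean() / 2
    if prev_cost - cost <= epsilon:     # J decreased by <= epsilon
        print(f"converged after {i + 1} iterations")
        break
    prev_cost = cost
```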
Debugging
While divergence or a cost function that moves up and down is often attributed to a learning rate that is too large, it can also be caused by a bug in the code. Choosing an extremely small learning rate can help identify which it is: with a sufficiently small α, J should decrease on every iteration, so if the algorithm still behaves abnormally, the cause is a bug in the code.
For linear regression, substituting the derivative of the squared-error cost expands the update. For example, for b,

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b) \;\Rightarrow\; b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
With multiple features, the fully expanded update rules for each parameter $w_j$ and $b$ are,

$$w_j = w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$
Notice how the update for b does not require utilizing the n features, because b is a value that does not change based on the feature being used. Also, the trailing $x_j^{(i)}$ in the equation for $w_j$ does not use vector notation, and this is because that value depends on the current feature j.
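A vectorized NumPy sketch of these multi-feature updates (shapes and names are illustrative):

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One gradient descent step for multi-feature linear regression.

    X: (m, n) matrix of m examples and n features
    y: (m,) targets, w: (n,) weights, b: scalar bias
    """
    m = X.shape[0]
    error = X @ w + b - y         # f_wb(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ error / m       # one partial derivative per feature j
    dj_db = error.sum() / m       # the bias derivative uses no features
    return w - alpha * dj_dw, b - alpha * dj_db
```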
Normal equation
The normal equation is a technique that serves as an alternative to gradient descent, but it only works for linear regression. It solves for w, b without iterations. While it can be a simple alternative to gradient descent, it is slow when the number of features is large (> 10,000).
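A minimal sketch of the normal equation in NumPy, using the common formulation that folds b into the weights via a column of ones (an assumption here, since the notes do not show the derivation):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (A^T A)^(-1) A^T y."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1s so theta[0] = b
    theta = np.linalg.pinv(A.T @ A) @ A.T @ y      # pinv tolerates singular A^T A
    b, w = theta[0], theta[1:]
    return w, b
```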
Scaling Techniques
There are several techniques for scaling features. Given the following range for each feature $x_j$, where $\min \le x_j \le \max$, the goal is to aim for a range of roughly $-1 \le x_j \le 1$.
When using the mean and z-score normalization techniques below, it is often
necessary to store these values for future use. Once the parameters from the
model have been learned, and predictions for new data are needed, the new input
data x must be normalized. This normalization uses the mean and standard
deviation previously computed from the training set.
Relative maximum
The first technique is to take each feature and divide it by the maximum of its range, which will normalize to a maximum of 1.
$$x_{j,\text{scaled}} = \frac{x_j}{\max}, \qquad \frac{\min}{\max} \le x_{j,\text{scaled}} \le 1$$
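As a one-line NumPy sketch over a feature matrix X (names illustrative):

```python
X_scaled = X / X.max(axis=0)   # divide each feature column by its own maximum
```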
Mean normalization
Here, each feature is normalized around a mean of 0, which, when plotted, will center the data around the origin of the graph. This will usually produce values between -1 and 1 for both the x and y axes of the graph. The mean value $\mu_j$ of all values of feature $x_j$ is necessary for this calculation.
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad x_j = \frac{x_j - \mu_j}{\max - \min}$$
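In NumPy, a sketch of mean normalization per feature column:

```python
mu = X.mean(axis=0)                                     # mean of each column
X_scaled = (X - mu) / (X.max(axis=0) - X.min(axis=0))   # (x_j - mu_j) / (max - min)
```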
Z-score normalization
A z-score normalization utilizes the standard deviation σ of each feature. The standard deviation measures the spread of a normal distribution, also referred to as a Gaussian distribution (the bell-shaped curve).
Just like mean normalization, the mean value $\mu_j$ is used. The results of this technique will produce values centered around the origin, but they will not necessarily be constrained to -1 and 1. Consider the following equations, for m training examples of each feature.
$$\sigma_j = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2}, \qquad x_j = \frac{x_j - \mu_j}{\sigma_j}$$
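A NumPy sketch, also showing the stored μ and σ being reused to normalize a new input at prediction time, as noted above:

```python
mu = X.mean(axis=0)       # store these alongside the model:
sigma = X.std(axis=0)     # new inputs must use the training-set mu and sigma
X_scaled = (X - mu) / sigma

x_new_scaled = (x_new - mu) / sigma   # normalize a new example the same way
```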
Feature Engineering
The concept of feature engineering is to derive new features that enhance the accuracy of the model. This is usually done by transforming or combining existing features. For example, take predicting a home price, where two of the features being used are the frontage and depth (both in feet) of the land,

$$f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + b$$
Since the total area of the land can be used as well, a new feature $x_3 = x_1x_2$ can be added to the model,

$$f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + b$$
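A sketch of deriving that feature with NumPy (the column indices are illustrative):

```python
frontage, depth = X[:, 0], X[:, 1]
area = frontage * depth                  # engineered feature x3 = x1 * x2
X_eng = np.column_stack([X, area])       # the model now sees three features
```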
If the wrong powers are chosen for each feature, the model will tend to balance itself as the iterations continue. In other words, more weight will be given to the slopes $w_j$ of the features that best fit the data.
For example, suppose the equation $y = x^2 + 1$ best fits a certain data set, and the polynomial $y = x + x^2 + x^3$ is chosen instead. After running gradient descent, the results may give values for $\vec{w}$ and $b$ that bring the chosen equation closer to the best-fitting one. This could look something like $y = 0.08x + 0.64x^2 + 0.03x^3 + 0.78$. Notice how the slope for $x^2$ and the intercept $b$ carry much more weight.
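A sketch of fitting those polynomial features with the earlier update rules (the data is synthetic, generated from y = x² + 1 purely for illustration):

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = x ** 2 + 1                               # the "true" relationship

X = np.column_stack([x, x ** 2, x ** 3])     # candidate features: x, x^2, x^3
X = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score scale: the powers differ wildly

w, b, alpha = np.zeros(3), 0.0, 0.1
for _ in range(1000):
    error = X @ w + b - y
    w -= alpha * X.T @ error / len(y)
    b -= alpha * error.mean()
print(w, b)   # the weight on the x^2 column should dominate
```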
Feature scaling becomes especially important with engineered polynomial features, as the various powers can distort the scaling.
Logistic Regression
For logistic regression, the cost function is defined as,

$$J(\vec{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$$
Gradient descent then repeats the same updates as before,

$$w = w - \alpha \frac{\partial}{\partial w} J(\vec{w}, b) \qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

where the derivatives expand to,

$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$

giving the full update rules,

$$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}\right]$$

$$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)\right]$$
Notice how these equations are exactly like the linear regression equations. However, the key difference lies within $f_{\vec{w},b}(\vec{x})$, where the model function changes from a linear function to the sigmoid. Even though the equations look the same, they are very different due to the model function.
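A final sketch tying this together for logistic regression, where the only structural change from the linear regression loop is applying the sigmoid (toy data and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5], [1.5], [2.5], [3.5]])   # toy single-feature data
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, alpha = np.zeros(1), 0.0, 0.1
for _ in range(1000):
    f = sigmoid(X @ w + b)              # the model function is the only change
    error = f - y
    w -= alpha * X.T @ error / len(y)   # identical update structure to before
    b -= alpha * error.mean()
```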