Gradient Descent
Notice how starting at slightly different spots on the peaks can result in completely different paths and different resulting minimums. The lowest J value along any given individual path is called a local minimum.
Defining the Algorithm
The gradient descent algorithm is defined as a repeated update, for each input parameter, until convergence:

$$w = w - \alpha \frac{\partial}{\partial w} J(w, b)$$

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b)$$
α = Learning rate, a value that controls how big of a step is taken. This is usually a small positive number, between 0 and 1. A larger α corresponds to a bigger step, and vice-versa.
$\frac{\partial}{\partial w} J(w, b)$ = Partial derivative of the cost function, which determines the direction to take each step.
Notice how the combination of both of these values gives the size and direction of the step to the next point. Moreover, when subtracting this value from the previous parameter w or b, a more optimized value is found and assigned. When the algorithm finally reaches the lowest point on a path, the derivative is 0 and there is nothing left to subtract. This convergence of the updates toward 0 for every input parameter is where the algorithm ends.
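As a concrete illustration, here is a minimal sketch of these update rules in Python for single-feature linear regression, assuming the model $f_{w,b}(x) = wx + b$ and the squared-error cost (the function names are illustrative, not from any particular library):

```python
import numpy as np

def compute_gradients(x, y, w, b):
    """Partial derivatives of the squared-error cost J(w, b)."""
    m = len(x)
    error = (w * x + b) - y           # f_wb(x^(i)) - y^(i) for every example
    dj_dw = (error * x).sum() / m     # dJ/dw
    dj_db = error.sum() / m           # dJ/db
    return dj_dw, dj_db

def gradient_descent(x, y, alpha=0.01, iterations=1000):
    w, b = 0.0, 0.0
    for _ in range(iterations):
        dj_dw, dj_db = compute_gradients(x, y, w, b)
        # Update both parameters simultaneously from the same gradients;
        # updating w first and reusing it for b's gradient would be a bug.
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b
```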
If the learning rate is too small, then many more steps are required, so the total time for the algorithm will grow dramatically. On the other hand, if the learning rate is too large, then steps can overshoot the true minimum of the cost function, and the algorithm itself will actually diverge away from it.
There is also another important question to answer: why does the algorithm work with a fixed learning rate? This is because as the steps approach the local minimum, they automatically become smaller due to the derivative itself becoming smaller (each subsequent slope is smaller as it approaches 0).
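A quick worked example makes this concrete. Take the illustrative cost $J(w) = w^2$, so $\frac{dJ}{dw} = 2w$, with a fixed $\alpha = 0.1$ and starting point $w_0 = 4$:

$$w_1 = 4 - 0.1(8) = 3.2, \quad w_2 = 3.2 - 0.1(6.4) = 2.56, \quad w_3 = 2.56 - 0.1(5.12) = 2.048, \;\ldots$$

The step sizes 0.8, 0.64, 0.512, … shrink on their own as the slope flattens, even though α never changes.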
Optimizations
Learning curve
Recall that the goal of gradient descent is to minimize the cost function J(w, b). Plotting the cost function against the number of iterations can show how effective it is at a quick glance. This curve is specifically called the learning curve. Ideally, the curve should decrease steadily and converge as quickly as possible. If J ever increases after an iteration, it can often signify a poor learning rate α selection or a bug in the code.
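A minimal sketch of producing a learning curve with NumPy and matplotlib (the toy data and hyperparameters are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0])    # toy training data
y = np.array([2.1, 3.9, 6.2, 8.0])

w, b, alpha = 0.0, 0.0, 0.05
history = []
for _ in range(200):
    error = (w * x + b) - y
    w -= alpha * (error * x).mean()    # simultaneous updates: both use
    b -= alpha * error.mean()          # the same pre-update error
    history.append((((w * x + b) - y) ** 2).mean() / 2)   # J after this step

plt.plot(history)                      # the learning curve
plt.xlabel("iteration")
plt.ylabel("J(w, b)")
plt.show()
```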
Convergence test
Another option is an automatic convergence test, using some small threshold ϵ for the algorithm. If J(w, b) decreases by ≤ ϵ in one iteration, declare convergence: the parameters w, b have been found.
Determining an optimal baseline ϵ can be very tricky, and it is often best used in conjunction with a learning curve graph.
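A sketch of that test inside the training loop, continuing the toy setup from the previous snippet, with ϵ = 0.001 chosen purely for illustration:

```python
epsilon = 1e-3                 # illustrative threshold
prev_cost = float("inf")
for i in range(10_000):
    error = (w * x + b) - y
    w -= alpha * (error * x).mean()
    b -= alpha * error.mean()
    cost = (((w * x + b) - y) ** 2).mean() / 2
    if prev_cost - cost <= epsilon:     # J decreased by <= epsilon
        print(f"converged after {i + 1} iterations")
        break
    prev_cost = cost
```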
Debugging
While divergence or a cost function that moves up and down is often attributed to a learning rate that is too large, it can also be caused by a bug in the code. Choosing an extremely small learning rate can help identify which it is: with a sufficiently small α, J should decrease on every iteration, so if the algorithm still behaves abnormally, the cause is a bug in the code.
For linear regression, substituting the derivative of the squared-error cost expands the update. For example, for b,

$$b = b - \alpha \frac{\partial}{\partial b} J(w, b) \;\Rightarrow\; b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{w,b}(x^{(i)}) - y^{(i)}\right)$$
With multiple features, the fully expanded update rules for each parameter $w_j$ and $b$ are,

$$w_j = w_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$b = b - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$
Notice how the update for b does not require utilizing the n features, because b is a value that does not change based on the feature being used. Also, the trailing $x_j^{(i)}$ in the equation for $w_j$ does not use vector notation, and this is because that value depends on the current feature j.
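A vectorized NumPy sketch of these multi-feature updates (shapes and names are illustrative):

```python
import numpy as np

def gradient_step(X, y, w, b, alpha):
    """One gradient descent step for multi-feature linear regression.

    X: (m, n) matrix of m examples and n features
    y: (m,) targets, w: (n,) weights, b: scalar bias
    """
    m = X.shape[0]
    error = X @ w + b - y         # f_wb(x^(i)) - y^(i), shape (m,)
    dj_dw = X.T @ error / m       # one partial derivative per feature j
    dj_db = error.sum() / m       # the bias derivative uses no features
    return w - alpha * dj_dw, b - alpha * dj_db
```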
Normal equation
The normal equation is a technique that serves as an alternative to gradient descent, but it only works for linear regression. It solves for w, b without iterations. While it can be a simple alternative to gradient descent, it is slow when the number of features is large (> 10,000).
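A minimal sketch of the normal equation in NumPy, using the common formulation that folds b into the weights via a column of ones (an assumption here, since the notes do not show the derivation):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (A^T A)^(-1) A^T y."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1s so theta[0] = b
    theta = np.linalg.pinv(A.T @ A) @ A.T @ y      # pinv tolerates singular A^T A
    b, w = theta[0], theta[1:]
    return w, b
```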
Scaling Techniques
There are several techniques for scaling features. Given the following range for each feature $x_j$, where $\min \le x_j \le \max$, the goal is to aim for a range of roughly $-1 \le x_j \le 1$.
When using the mean and z-score normalization techniques below, it is often
necessary to store these values for future use. Once the parameters from the
model have been learned, and predictions for new data are needed, the new input
data x must be normalized. This normalization uses the mean and standard
deviation previously computed from the training set.
Relative maximum
The first technique is to take each feature and divide it by the maximum of its range, which will normalize to a maximum of 1.
$$x_{j,\text{scaled}} = \frac{x_j}{\max}, \qquad \frac{\min}{\max} \le x_{j,\text{scaled}} \le 1$$
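As a one-line NumPy sketch over a feature matrix X (names illustrative):

```python
X_scaled = X / X.max(axis=0)   # divide each feature column by its own maximum
```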
Mean normalization
Here, each feature is normalized around a mean of 0, which, when plotted, will center the data around the origin of the graph. This will usually produce values between -1 and 1 for both the x and y axes of the graph. The mean value $\mu_j$ of all values of feature $x_j$ is necessary for this calculation.
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad x_j = \frac{x_j - \mu_j}{\max - \min}$$
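In NumPy, a sketch of mean normalization per feature column:

```python
mu = X.mean(axis=0)                                     # mean of each column
X_scaled = (X - mu) / (X.max(axis=0) - X.min(axis=0))   # (x_j - mu_j) / (max - min)
```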
Z-score normalization
A z-score normalization utilizes the standard deviation σ of each feature. The standard deviation measures the spread of a normal distribution, also referred to as a Gaussian distribution (the bell-shaped curve).
Just like mean normalization, the mean value $\mu_j$ is used. The results of this technique will produce values centered around the origin, but they will not necessarily be constrained to -1 and 1. Consider the following equations, for m training examples of each feature.
$$\sigma_j = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)} - \mu_j\right)^2}, \qquad x_j = \frac{x_j - \mu_j}{\sigma_j}$$
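A NumPy sketch, also showing the stored μ and σ being reused to normalize a new input at prediction time, as noted above:

```python
mu = X.mean(axis=0)       # store these alongside the model:
sigma = X.std(axis=0)     # new inputs must use the training-set mu and sigma
X_scaled = (X - mu) / sigma

x_new_scaled = (x_new - mu) / sigma   # normalize a new example the same way
```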
Feature Engineering
The concept of feature engineering is to derive new features that enhance the accuracy of the model. This is usually done by transforming or combining existing features. For example, take predicting a home price, where two of the features being used are the frontage and depth (both in feet) of the land,

$$f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + b$$
Since the total area of the land can be used as well, a new feature $x_3 = x_1x_2$ can be added to the model,

$$f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + w_3x_3 + b$$
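A sketch of deriving that feature with NumPy (the column indices are illustrative):

```python
frontage, depth = X[:, 0], X[:, 1]
area = frontage * depth                  # engineered feature x3 = x1 * x2
X_eng = np.column_stack([X, area])       # the model now sees three features
```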
If the wrong powers are chosen for each feature, the model will tend to balance itself as the iterations continue. In other words, more weight will be given to the slopes $w_j$ of the features that best fit the data.
For example, suppose the equation $y = x^2 + 1$ best fits a certain data set, and the polynomial $y = x + x^2 + x^3$ is chosen instead. After running gradient descent, the results may give values for $\vec{w}$ and $b$ that bring the chosen equation closer to the best-fitting one. This could look something like $y = 0.08x + 0.64x^2 + 0.03x^3 + 0.78$. Notice how the slope for $x^2$ and the intercept $b$ carry much more weight.
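A sketch of fitting those polynomial features with the earlier update rules (the data is synthetic, generated from y = x² + 1 purely for illustration):

```python
import numpy as np

x = np.linspace(-2, 2, 50)
y = x ** 2 + 1                               # the "true" relationship

X = np.column_stack([x, x ** 2, x ** 3])     # candidate features: x, x^2, x^3
X = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score scale: the powers differ wildly

w, b, alpha = np.zeros(3), 0.0, 0.1
for _ in range(1000):
    error = X @ w + b - y
    w -= alpha * X.T @ error / len(y)
    b -= alpha * error.mean()
print(w, b)   # the weight on the x^2 column should dominate
```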
Feature scaling becomes especially important with engineered polynomial features, as the various powers can distort the scaling.
Logistic Regression
For logistic regression, the cost function is defined as,

$$J(\vec{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(f_{\vec{w},b}(\vec{x}^{(i)})\right) + \left(1 - y^{(i)}\right)\log\left(1 - f_{\vec{w},b}(\vec{x}^{(i)})\right)\right]$$
Gradient descent then repeats the same updates as before,

$$w = w - \alpha \frac{\partial}{\partial w} J(\vec{w}, b) \qquad b = b - \alpha \frac{\partial}{\partial b} J(\vec{w}, b)$$

where the derivatives expand to,

$$\frac{\partial}{\partial w_j} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

$$\frac{\partial}{\partial b} J(\vec{w}, b) = \frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)$$

giving the full update rules,

$$w_j = w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)x_j^{(i)}\right]$$

$$b = b - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\right)\right]$$
Notice how these equations are exactly like the linear regression equations. However, the key difference lies within $f_{\vec{w},b}(\vec{x})$, where the model function changes from a linear function to the sigmoid. Even though the equations look the same, they are very different due to the model function.
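A final sketch tying this together for logistic regression, where the only structural change from the linear regression loop is applying the sigmoid (toy data and names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0.5], [1.5], [2.5], [3.5]])   # toy single-feature data
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b, alpha = np.zeros(1), 0.0, 0.1
for _ in range(1000):
    f = sigmoid(X @ w + b)              # the model function is the only change
    error = f - y
    w -= alpha * X.T @ error / len(y)   # identical update structure to before
    b -= alpha * error.mean()
```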