Gradient descent
• The procedure starts with initial values for the coefficient or
coefficients of the function. These could be 0.0 or a small random value.
coefficient = 0.0
• The cost of the coefficients is evaluated by plugging them into the function
and calculating the cost.
• cost = f(coefficient)
• or
• cost = evaluate(f(coefficient))
• The derivative of the cost is calculated. The derivative is a concept from
calculus and refers to the slope of the function at a given point. We need
to know the slope so that we know the direction (sign) to move the
coefficient values in order to get a lower cost on the next iteration.
• delta = derivative(cost)
• Since we know from the derivative which direction is
downhill, we can now update the coefficient
values. A learning rate parameter (alpha) must be
specified that controls how much the coefficients
can change on each update.
• coefficient = coefficient - (alpha * delta)
• This process is repeated until the cost of the
coefficients (cost) is 0.0 or close enough to zero
to be good enough.
• This is the gradient descent approach.
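The steps above can be sketched as a short loop. This is a minimal, hypothetical example: the cost function f(c) = c**2 and its derivative 2*c are my own assumptions chosen only to make the loop runnable, not something from the text.

```python
# Minimal sketch of the gradient descent loop described above.
# The cost f(c) = c**2 and its derivative 2*c are hypothetical choices;
# any differentiable cost with a known derivative would work the same way.
def gradient_descent(derivative, alpha=0.1, iterations=50):
    coefficient = 0.5  # a small starting value, as described above
    for _ in range(iterations):
        delta = derivative(coefficient)            # slope at the current point
        coefficient = coefficient - alpha * delta  # step downhill
    return coefficient

minimum = gradient_descent(lambda c: 2 * c)
print(minimum)  # approaches 0.0, the minimum of f(c) = c**2
```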
• Next, we will look at how to use stochastic gradient
descent to learn the coefficients for a simple linear
regression model by minimizing the error on a
training dataset.
We begin with a simple scatter plot of x versus y.
Sample data (x is the input variable, y is the target variable):
x  y
1  1
2  3
4  3
3  2
5  5
We can see the relationship between x and y looks kind-of linear. As in, we
could probably draw a line somewhere diagonally from the bottom left of the
plot to the top right to generally describe the relationship between the data.
This is a good indication that using linear regression might be appropriate for
this little dataset.
• With simple linear regression we want to model our
data as follows:
• y = B0 + B1 * x
• This is a line where y is the output variable we want to
predict, x is the input variable we know and B0 and B1
are coefficients we need to estimate.
• B0 is called the intercept because it determines where
the line intercepts the y axis. In machine learning we
can call this the bias, because it is added to offset all
predictions that we make. The B1 term is called the
slope because it defines the slope of the line or how x
translates into a y value before we add our bias.
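The model above is a one-line function in code. The coefficient values below are arbitrary placeholders for illustration, not estimates from the data.

```python
# Sketch of the simple linear regression prediction y = B0 + B1 * x.
def predict(x, b0, b1):
    return b0 + b1 * x

# Hypothetical coefficients, chosen only for illustration:
print(predict(2, 0.5, 0.75))  # 0.5 + 0.75 * 2 = 2.0
```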
Stochastic Gradient Descent
• Gradient Descent is the process of minimizing a
function by following the gradients of the cost
function.
• This involves knowing the form of the cost as well as
the derivative so that from a given point you know the
gradient and can move in that direction, e.g. downhill
towards the minimum value.
• The way stochastic gradient descent works is that each training instance
is shown to the model one at a time. The model makes
a prediction for a training instance, the error is
calculated and the model is updated in order to reduce
the error for the next prediction.
This procedure can be used to find the set of coefficients in a
model that result in the smallest error for the model on the
training data. Each iteration, the coefficients, called weights
(w) in machine learning language, are updated using the
equation:
w = w - alpha * delta
Where w is the coefficient or weight being optimized, alpha
is a learning rate that you must configure (e.g. 0.1) and
delta is the error for the model on the training data
attributed to the weight.
Simple Linear Regression with Stochastic Gradient
Descent
The simple linear regression model is defined as follows:
y = B0 + B1 * x
Gradient Descent Iteration #1
Let’s start with values of 0.0 for both coefficients.
B0 = 0.0
B1 = 0.0
y = 0.0 + 0.0 * x
We can calculate the error for a prediction as follows:
error = p(i) - y(i)
Where p(i) is the prediction for the i’th instance in our dataset and y(i) is the
i’th output variable for the instance in the dataset.
We can now calculate the predicted value for y using our starting point
coefficients for the first training instance:
x=1, y=1
p(i) = 0.0 + 0.0 * 1
p(i) = 0
Using the predicted output, we can calculate our error:
error = 0 – 1
error = -1
We can now use this error in our equation for gradient descent to update the
weights. We will start with updating the intercept first, because it is easier.
We can say that B0 is accountable for all of the error. This is to say that
updating the weight will use just the error as the gradient. We can calculate
the update for the B0 coefficient as follows:
B0(t+1) = B0(t) - alpha * error
Where B0(t+1) is the updated version of the coefficient we will use on the
next training instance, B0(t) is the current value for B0, alpha is our learning
rate and error is the error we calculate for the training instance. Let’s use a
small learning rate of 0.01 and plug the values into the equation to work out
what the new and slightly optimized value of B0 will be:
B0(t+1) = 0.0 - 0.01 * -1.0
B0(t+1) = 0.01
Now, let’s look at updating the value for B1. We use the same equation with
one small change. The error is filtered by the input that caused it. We can
update B1 using the equation:
B1(t+1) = B1(t) - alpha * error * x
Where B1(t+1) is the updated coefficient, B1(t) is the current version of the
coefficient, alpha is the same learning rate described above, error is the
same error calculated above and x is the input value.
We can plug our numbers into the equation and calculate the updated
value for B1:
B1(t+1) = 0.0 - 0.01 * -1 * 1
B1(t+1) = 0.01
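The first-iteration arithmetic above can be checked directly; this sketch simply replays the numbers from the text (alpha = 0.01, first instance x=1, y=1).

```python
# Replaying the first gradient descent iteration from the text.
alpha = 0.01
b0, b1 = 0.0, 0.0          # starting coefficients
x, y = 1, 1                # first training instance

p = b0 + b1 * x            # prediction: 0.0
error = p - y              # error: -1.0
b0 = b0 - alpha * error        # intercept update: 0.01
b1 = b1 - alpha * error * x    # slope update, scaled by x: 0.01
print(b0, b1)  # 0.01 0.01
```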
We have just finished the first iteration of gradient descent and we have
updated our weights to be B0=0.01 and B1=0.01. This process must be
repeated for the remaining 4 instances from our dataset.
One pass through the training dataset is called an epoch.
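Putting the pieces together, one possible sketch of the full procedure on the 5-instance dataset follows; 4 epochs of 5 instances give the 20 iterations discussed. The function name and structure are my own, but the update rules, learning rate, and data come from the text.

```python
# Stochastic gradient descent for simple linear regression, as described above.
def sgd_linear_regression(data, alpha, epochs):
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:          # one training instance at a time
            p = b0 + b1 * x        # prediction with current coefficients
            error = p - y          # error = p(i) - y(i)
            b0 = b0 - alpha * error        # intercept update
            b1 = b1 - alpha * error * x    # slope update, scaled by the input
    return b0, b1

dataset = [(1, 1), (2, 3), (4, 3), (3, 2), (5, 5)]
b0, b1 = sgd_linear_regression(dataset, alpha=0.01, epochs=4)
print(b0, b1)  # roughly 0.2309 and 0.7904 after 20 updates
```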
Below is a list of all of the values for the coefficients over the 20 iterations:
Iteration  B0            B1
1          0.01          0.01
2          0.0397        0.0694
3          0.066527      0.176708
4          0.08056049    0.21880847
5          0.1188144616  0.410078328
6          0.1235255337  0.4147894001
7          0.1439944904  0.4557273134
8          0.1543254529  0.4970511637
9          0.1578706635  0.5076867953
10         0.1809076171  0.6228715633
11         0.1828698253  0.6248337715
12         0.1985444516  0.6561830242
13         0.2003116861  0.6632519622
14         0.1984110104  0.657549935
15         0.2135494035  0.7332419008
16         0.2140814905  0.7337739877
17         0.2272651958  0.7601413984
18         0.2245868879  0.7494281668
19         0.219858174   0.7352420252
20         0.230897491   0.7904386102
A plot of the error for each set of coefficients as the learning process unfolded is
useful: it shows that the error was decreasing with each iteration and
starting to bounce around a bit towards the end.
We can also plot our dataset again with these predictions overlaid (x vs y and
x vs prediction). Drawing a line through the 5 predictions gives us an idea of
how well the model fits the training data.
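The predictions for the 5 instances can be computed directly, assuming the final coefficients from the last row of the table above (B0 ≈ 0.2309, B1 ≈ 0.7904); plotting these against x would trace the fitted line.

```python
# Sketch of the predictions overlaid on the data, assuming the final
# coefficients from the coefficient table above.
b0, b1 = 0.230897491, 0.7904386102
xs = [1, 2, 4, 3, 5]
predictions = [b0 + b1 * x for x in xs]
for x, p in zip(xs, predictions):
    print(x, round(p, 4))  # each p lies close to the diagonal trend in the data
```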