Ch2 - Lec3 - Linear Regression and Gradient Descent
By Kassahun Tamir
Outline
Review
Linear Models
Optimization
Loss Function
Gradient Descent
Review
Supervised learning focuses on making predictions using labeled data to train a model.
Review
Regression
Linear Models
Linear models predict the target variable using a linear function of the input features.
Linear models:
• Linear Regression
• Logistic Regression
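To make this concrete, here is a minimal sketch in Python of what "a linear function of the input features" looks like; the feature values, weights, and bias below are made-up illustrative numbers, not fitted parameters.

```python
# A linear model predicts the target as a weighted sum of the features plus a bias.
features = [3.0, 2.0]        # two input features for one example (illustrative values)
weights  = [1.5, -0.5]       # one weight per feature (illustrative values)
bias     = 4.0

prediction = sum(w * x for w, x in zip(weights, features)) + bias
print(prediction)            # 1.5*3.0 + (-0.5)*2.0 + 4.0 = 7.5
```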
Good Model?!
Accurate Prediction
Optimization
The process of adjusting a model’s internal settings (parameters) to get the best accuracy.
Involves minimizing the error between the model’s prediction and the actual data.
Example
Goal of Optimization
Minimize the Loss (Cost) Function
Linear Regression Example
Imagine you have data on study hours and test scores of 10 students.
By using linear regression, we will draw a straight line that shows how much scores tend to increase as study hours go up.
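As a rough sketch of where such a line comes from, the snippet below fits the example data by ordinary least squares; the (hours, score) pairs are taken from the worked example on the later slides, and the use of numpy’s polyfit is an assumption about tooling, not part of the lecture.

```python
import numpy as np

# Study hours and test scores of the 10 students (values from the later slides).
hours  = np.array([1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6])
scores = np.array([55, 60, 90, 85, 100, 95, 75, 70, 85, 100])

# Fit a straight line: score ≈ slope * hours + intercept.
slope, intercept = np.polyfit(hours, scores, deg=1)
print(slope, intercept)   # roughly 4.73 and 60.24
```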
Loss Function
The loss measures how far each predicted value ŷi is from the actual value yi.
Gradient Descent
Gradient descent is an optimization technique used in machine learning to minimize the loss function by iteratively descending (moving in the direction opposite to the gradient) towards the function’s minimum.
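In code, a single gradient descent step is just the parameter update below; this is a minimal sketch where grad_loss is assumed to be a function (not defined in the lecture) that returns the derivative of the loss at the current parameter value.

```python
# One gradient descent step: move the parameter against the gradient.
def gradient_descent_step(param, grad_loss, learning_rate):
    step_size = learning_rate * grad_loss(param)  # how far to move
    return param - step_size                      # subtracting the step moves downhill
```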
Gradient Descent: the previous example
Y = αx + b
where
Y – predicted value
x – input value
α – slope
b – intercept
Gradient Descent
So if our goal is to optimize this model, we are going to have to optimize its parameters, i.e. the slope and intercept.
Optimizing One Parameter
Let’s fix the slope α and adjust only the intercept b.
Assume α = 4.725, so
Y = 4.725x + b
Model Prediction
Loss Function: Mean Squared Error
MSE = Σ(yi - ŷi)² / n
Loss Function: Mean Squared Error
Intercept Loss
50 233.463
Loss Function: Mean Squared Error
Let us update b
b = 60
Loss Function: Mean Squared Error
Intercept Loss
50 233.463
60 128.713
Still large! So we continue updating.
Loss Function: Mean Squared Error
Let us update b
b = 70
Loss Function: Mean Squared Error
Intercept Loss
50 233.463
60 128.713
70 223.963
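These loss values can be checked with a short script; this is a sketch assuming the study-hours data from the later slides and the fixed slope 4.725, with the loss taken as the squared errors averaged over the 10 students.

```python
# Mean squared error of the line y = 4.725*x + b on the study-hours data.
hours  = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
scores = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

def mse(intercept, slope=4.725):
    errors = [(y - (slope * x + intercept)) ** 2 for x, y in zip(hours, scores)]
    return sum(errors) / len(errors)

for b in (50, 60, 70):
    print(b, round(mse(b), 3))   # 233.463, 128.713, 223.963 as in the table
```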
Intercept vs. Loss
From the above example we might conclude that the step we took (step size = 10) is too large, resulting in overshooting.
Intercept   Loss
50          233.463
51          213.988
52          196.513
53          181.038
…           …
59          130.188
60          128.713
61          129.238
…           …
70          223.963
This is better but takes more computation time.
A much smaller step size, such as 0.1 or 0.01, would take practically forever to reach the minimum.
Intercept vs. Loss
So how do we determine the right step size?
As a principle,
"Take big steps when far from the optimal value and baby steps when closer to the optimal value."
How?
Back to our loss function:
Loss = Sum of squared residuals = ∑(Actual - Predicted)²
where Predicted = slope × input + intercept
What does this mean?
By taking the derivative of this function, we can determine the slope of the loss curve at any value of the intercept.
Result of Derivation
d(Loss)/d(intercept)
= -2 (55 - ((4.725 × 1) + intercept))
+ -2 (60 - ((4.725 × 2) + intercept))
+ -2 (90 - ((4.725 × 3.5) + intercept))
+ -2 (85 - ((4.725 × 6) + intercept))
+ -2 (100 - ((4.725 × 8) + intercept))
+ -2 (95 - ((4.725 × 2.5) + intercept))
+ -2 (75 - ((4.725 × 5.5) + intercept))
+ -2 (70 - ((4.725 × 4) + intercept))
+ -2 (85 - ((4.725 × 6.5) + intercept))
+ -2 (100 - ((4.725 × 6) + intercept))
Result of Derivation
How?
The optimal intercept is where the slope of the loss curve is 0 or approximately 0.
Result of Derivation
Intercept = 50
= -2 (55 - ((4.725 × 1) + 50))
+ -2 (60 - ((4.725 × 2) + 50))
+ -2 (90 - ((4.725 × 3.5) + 50))
+ -2 (85 - ((4.725 × 6) + 50))
+ -2 (100 - ((4.725 × 8) + 50))
+ -2 (95 - ((4.725 × 2.5) + 50))
+ -2 (75 - ((4.725 × 5.5) + 50))
+ -2 (70 - ((4.725 × 4) + 50))
+ -2 (85 - ((4.725 × 6.5) + 50))
+ -2 (100 - ((4.725 × 6) + 50))
Slope = -204.75
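This arithmetic can be checked with a few lines of code; the sketch below assumes the same data and the fixed slope 4.725, and differentiates the sum of squared residuals (not the mean), as on the slides.

```python
# Derivative of the sum of squared residuals with respect to the intercept,
# with the slope held fixed at 4.725.
hours  = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
scores = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

def d_ssr_d_intercept(intercept, slope=4.725):
    return sum(-2 * (y - (slope * x + intercept)) for x, y in zip(hours, scores))

print(d_ssr_d_intercept(50))   # -204.75, the slope of the loss curve at intercept = 50
```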
Result of Derivation
So what is the next value of the intercept?
Result of Derivation
Learning rate = 0.01; step size = slope × learning rate; new intercept = old intercept - step size.
Starting intercept = 50, slope = -204.75
Step size = -204.75 × 0.01 = -2.0475
New intercept = 50 - (-2.0475) = 52.0475
Iteration I: intercept = 52.0475, slope = -163.80
Step size = -163.80 × 0.01 = -1.638
New intercept = 52.0475 - (-1.638) = 53.6855
Iteration II: intercept = 53.6855, slope = -131.04
Step size = -131.04 × 0.01 = -1.3104
New intercept = 53.6855 - (-1.3104) = 54.9959
Iteration III: intercept = 54.9959, slope = -104.832
Step size = -104.832 × 0.01 = -1.04832
New intercept = 54.9959 - (-1.04832) = 56.04422
Iteration IV: intercept = 56.04422, slope = -83.8656
Step size = -83.8656 × 0.01 = -0.838656
New intercept = 56.04422 - (-0.838656) = 56.88288
Result of Derivation
How long do we repeat these steps?
Until we reach a step size that is very small (e.g. 0.001), or
until we reach the maximum number of iterations (usually 1000).
Iteration   Intercept   Slope      Step size   New intercept
…           …           …          …           …
V           56.88288    -67.0924   -0.67092    57.5538
VI          57.5538     -53.674    -0.53674    58.09054
VII         58.09054    -42.9392   -0.42939    58.51993
VIII        58.51993    -34.3514   -0.34351    58.86344
IX          58.86344    -27.4812   -0.27481    59.13825
X           59.13825    -21.985    -0.21985    59.3581
XI          59.3581     -17.588    -0.17588    59.53398
XII         59.53398    -17.2362   -0.17236    59.70634
XIII        59.70634    -10.6232   -0.10623    59.81257
XIV         59.81257    -8.4986    -0.08499    59.89756
XV          59.89756    -6.7988    -0.06799    59.96555
XVI         59.96555    -5.439     -0.05439    60.01994
XVII        60.01994    -4.3512    -0.04351    60.06345
XVIII       60.06345    -3.481     -0.03481    60.09826
Iteration   Intercept   Slope      Step size   New intercept
XIX         60.09826    -2.7848    -0.02785    60.12611
XX          60.12611    -2.2278    -0.02228    60.14839
XXI         60.14839    -1.7822    -0.01782    60.16621
XXII        60.16621    -1.4258    -0.01426    60.18047
XXIII       60.18047    -1.1406    -0.01141    60.19188
XXIV        60.19188    -0.9124    -0.00912    60.201
XXV         60.201      -0.73      -0.0073     60.2083
XXVI        60.2083     -0.584     -0.00584    60.21414
XXVII       60.21414    -0.4672    -0.00467    60.21881
XXVIII      60.21881    -0.3738    -0.00374    60.22255
XXIX        60.22255    -0.299     -0.00299    60.22554
XXX         60.22554    -0.2392    -0.00239    60.22793
XXXI        60.22793    -0.1914    -0.00191    60.22984
XXXII       60.22984    -0.1532    -0.00153    60.23137
XXXIII      60.23137    -0.1226    -0.00123    60.2326
XXXIV       60.2326     -0.098     -0.00098    60.23358
Procedure
As we can see, we stop iterating now because the step size is very small (0.00098), so our optimal intercept will be
Intercept = 60.23358 ≈ 60.24
So our equation will be Y = 4.725X + 60.24
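Putting the whole intercept search together, here is a minimal sketch of the loop that the tables above walk through: same data, slope fixed at 4.725, learning rate 0.01, starting intercept 50, and stopping when the step size drops below 0.001 or after 1000 iterations.

```python
# Gradient descent on the intercept alone, with the slope fixed at 4.725.
hours  = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
scores = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

slope = 4.725
intercept = 50.0          # starting guess, as in the example
learning_rate = 0.01

for _ in range(1000):     # maximum number of iterations
    grad = sum(-2 * (y - (slope * x + intercept)) for x, y in zip(hours, scores))
    step = learning_rate * grad
    intercept = intercept - step
    if abs(step) < 0.001: # stop once the step size is very small
        break

print(round(intercept, 5))   # about 60.23, i.e. the optimal intercept found above
```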
Optimizing Two Parameters (Intercept & Slope)
Loss function = Sum of squared residuals
= ∑(Actual - Predicted)²
= ∑(Actual - (mx + b))²
= (55 - ((slope × 1) + intercept))² + (60 - ((slope × 2) + intercept))² + (90 - ((slope × 3.5) + intercept))² + (85 - ((slope × 6) + intercept))² + (100 - ((slope × 8) + intercept))² + (95 - ((slope × 2.5) + intercept))² + (75 - ((slope × 5.5) + intercept))² + (70 - ((slope × 4) + intercept))² + (85 - ((slope × 6.5) + intercept))² + (100 - ((slope × 6) + intercept))²
When you have two or more derivatives of the
same function, they are called a Gradient.
Derivation
First take the derivative with respect to the intercept:
= -2 (55 - ((slope × 1) + intercept)) - 2 (60 - ((slope × 2) + intercept)) - 2 (90 - ((slope × 3.5) + intercept)) - 2 (85 - ((slope × 6) + intercept)) - 2 (100 - ((slope × 8) + intercept)) - 2 (95 - ((slope × 2.5) + intercept)) - 2 (75 - ((slope × 5.5) + intercept)) - 2 (70 - ((slope × 4) + intercept)) - 2 (85 - ((slope × 6.5) + intercept)) - 2 (100 - ((slope × 6) + intercept))
Steps
1. Start by picking a random number for the intercept and slope (intercept = 60, slope = 4).
2. Plug these values into the derivatives to find the slope of the curve:
i. -2 (55 - ((4 × 1) + 60)) - 2 (60 - ((4 × 2) + 60)) - 2 (90 - ((4 × 3.5) + 60)) - 2 (85 - ((4 × 6) + 60)) - 2 (100 - ((4 × 8) + 60)) - 2 (95 - ((4 × 2.5) + 60)) - 2 (75 - ((4 × 5.5) + 60)) - 2 (70 - ((4 × 4) + 60)) - 2 (85 - ((4 × 6.5) + 60)) - 2 (100 - ((4 × 6) + 60)) = -70
ii. -2 × 1 (55 - ((4 × 1) + 60)) - 2 × 2 (60 - ((4 × 2) + 60)) - 2 × 3.5 (90 - ((4 × 3.5) + 60)) - 2 × 6 (85 - ((4 × 6) + 60)) - 2 × 8 (100 - ((4 × 8) + 60)) - 2 × 2.5 (95 - ((4 × 2.5) + 60)) - 2 × 5.5 (75 - ((4 × 5.5) + 60)) - 2 × 4 (70 - ((4 × 4) + 60)) - 2 × 6.5 (85 - ((4 × 6.5) + 60)) - 2 × 6 (100 - ((4 × 6) + 60)) = 0
Steps
3. Calculate the step size for both. We assume our learning rate to be 0.001:
Step size = -70 × 0.001 = -0.07 for the intercept
Step size = 0 for the slope
4. Calculate the new intercept and slope:
New intercept = old value - step size = 60 - (-0.07) = 60.07
The slope stays the same.
Repeat these steps until the step size is super small.
Iteration  Intercept  Slope      Step size    New intercept  Slope (initial)  Slope  Step size  New slope
1          60         -70        -0.07        60.07          4                0      0          4
2          60.07      -68.6      -0.0686      60.1386        4                0      0          4
3          60.1386    -67.228    -0.067228    60.20583       4                0      0          4
4          60.20583   -65.8834   -0.0658834   60.27171       4                0      0          4
5          60.27171   -65.75164  -0.06575164  60.33746       4                0      0          4
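For completeness, a minimal sketch of updating both parameters at once: each iteration computes both partial derivatives of the sum of squared residuals and takes a step for the slope and the intercept together. The starting values (intercept 60, slope 4) and learning rate 0.001 follow the slides; the exact numbers along the way are not meant to reproduce the table above.

```python
# Gradient descent on both parameters (slope and intercept) at the same time.
hours  = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
scores = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

slope, intercept = 4.0, 60.0   # starting guesses from the slide
learning_rate = 0.001

for _ in range(1000):
    residuals = [y - (slope * x + intercept) for x, y in zip(hours, scores)]
    # The two partial derivatives of the sum of squared residuals form the gradient.
    grad_intercept = sum(-2 * r for r in residuals)
    grad_slope = sum(-2 * x * r for x, r in zip(hours, residuals))
    intercept = intercept - learning_rate * grad_intercept
    slope = slope - learning_rate * grad_slope

print(round(slope, 2), round(intercept, 2))   # drifts toward roughly 4.7 and 60.2
```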
Key Takeaways
Optimization is crucial in machine learning.
Questions?