
Introduction to Machine Learning

Ch2_Lec3: Linear Regression and Gradient Descent

By Kassahun Tamir
Outline
 Review
 Linear Models
 Optimization
 Loss Function
 Gradient Descent
Review
 Supervised learning focuses on making predictions using labeled data to train a model.

 The model learns to identify patterns between the features and the outputs. Once trained, the model can then predict the output value for new, unseen data.

Review
 Regression

Regression is a type of supervised learning that is used to predict continuous values, such as house prices, stock prices, or customer lifetime value. Regression algorithms learn a function that maps from the input features to the output value.

Linear Models

 Building block for many complex machine learning algorithms, including deep neural networks

 Linear models predict the target variable using a linear function of the input features

 Linear models:
   - Linear Regression
   - Logistic Regression

Good Model?!
 Accurate Prediction: Predicted Value ≈ Actual Value

Optimization
 The process of adjusting a model’s internal settings (parameters) to
get the best accuracy

 Involves minimizing the error between the model’s prediction and the
actual data

Example (figures)
Goal of Optimization
 Minimize the Loss (Cost) Function

 Loss Function: a function that simply quantifies the error of the model

 Loss function for Linear Regression: Mean Square Error (MSE)

Linear Regression Example
 Imagine you have data on study hours and test scores of 10 students.
By using linear regression, we will draw a straight line that shows how
much scores tend to increase as study hours go up.

Linear Regression Example (figures: study hours vs. test scores with the fitted line)
Loss Function

MSE = Σ(yi - ŷi)² / n

 yi: actual value for the i-th data point.
 ŷi (pronounced y-hat): predicted value for the i-th data point by the linear regression model.
 n: total number of data points.
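As a quick illustration (a small sketch, not code from the slides), MSE can be computed directly from this definition:

    def mse(actual, predicted):
        """Mean Squared Error: the average of the squared residuals."""
        n = len(actual)
        return sum((yi - yhat) ** 2 for yi, yhat in zip(actual, predicted)) / n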

Loss Function

(Figures: actual value yi vs. predicted value ŷi on the regression line)
Gradient Descent
 Gradient descent is an optimization technique used in machine
learning to minimize the loss function by iteratively descending
(moving in the direction opposite to the gradient) towards the
function’s minimum.
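In code, the whole idea fits in a few lines. The following is a generic sketch (the gradient function, learning rate, and stopping rule are placeholders that the rest of the lecture makes concrete):

    def gradient_descent(grad, theta, learning_rate=0.01, max_iters=1000, tol=0.001):
        """Repeatedly step opposite to the gradient until the step becomes tiny."""
        for _ in range(max_iters):
            step = learning_rate * grad(theta)   # step size = slope x learning rate
            theta = theta - step                 # move against the gradient
            if abs(step) < tol:                  # stop when the step size is very small
                break
        return theta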

Gradient Descent: the previous example

Y=
αx + b

Where,
Y – predicted value
x – input value

α - Slope
b – intercept
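In code, this model is nothing more than a one-line function (an illustrative sketch):

    def predict(x, slope, intercept):
        """Linear model: predicted value for a given input."""
        return slope * x + intercept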

Gradient Descent
 So if our goal is to optimize this model, we have to optimize its parameters, i.e. the slope and the intercept.

Optimization could involve:
- One parameter
- Two parameters

Optimizing One Parameter
 Let's fix the slope (α) and adjust only the intercept.

Assume α = 4.725:
Y = 4.725x + b

 Take a random value for the intercept (b).

 Let's take an initial hypothesis of b = 50:
Y = 4.725x + 50

Optimizing One Parameter (figure)

Model Prediction (figure)
Loss Function: Mean Squared Error
= Σ(yi - ŷi)² / n

Loss Function: Mean Squared Error

Intercept Loss
50 233.463

Loss too large!!

Let’s try again!

Loss Function: Mean Squared Error
Let us update b
b = 60

Loss Function: Mean Squared Error

Intercept Loss
50 233.463
60 128.713
Still large !!

So we continue updating

Loss Function: Mean Squared Error
Let us update b
b = 70

Loss Function: Mean Squared Error
Intercept Loss
50 233.463
60 128.713
70 223.963

Oops!! The loss increased instead of decreasing
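This manual search is easy to reproduce in a short script. The sketch below assumes the x/y lists are the study-hours/test-score pairs that appear in the expanded loss function later in the lecture:

    # Study hours (x) and test scores (y) for the 10 students
    x = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
    y = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

    def loss(intercept, slope=4.725):
        """MSE of the line y = slope*x + intercept over the 10 data points."""
        return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y)) / len(x)

    for b in (50, 60, 70):
        print(b, round(loss(b), 3))   # 233.463, 128.713, 223.963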

Intercept vs. Loss

(Figures: the loss plotted against candidate intercept values)
Intercept vs. Loss
 From the above example we might conclude that the step we took (step size = 10) is too large, resulting in overshooting.

 So let us take smaller steps this time (step size = 1).

Intercept   Loss
50          233.463
51          213.988
52          196.513
53          181.038
…           …
59          130.188
60          128.713
61          129.238
…           …
70          223.963
This is better, but it takes more steps and hence more computation time.

A much smaller step size, such as 0.1 or 0.01, would take practically forever to reach the minimum.
Intercept vs. Loss
 So how do we determine the perfect step size?

 Fortunately, Gradient Descent gives us an answer to this.

 As a principle:

“Do big steps when far from the optimal value and do baby steps when closer to the optimal value.”

How?
Back to our loss function:

Loss = Sum of squared residuals
     = ∑(Actual – Predicted)²,  where Predicted = Slope x input + intercept
     = ∑(Actual – (Slope x input + intercept))²

We plug in the values for all 10 data points and sum them:

= (55 – ((4.725 x 1) + intercept))² + (60 – ((4.725 x 2) + intercept))² + (90 – ((4.725 x 3.5) + intercept))² + (85 – ((4.725 x 6) + intercept))² + (100 – ((4.725 x 8) + intercept))² + (95 – ((4.725 x 2.5) + intercept))² + (75 – ((4.725 x 5.5) + intercept))² + (70 – ((4.725 x 4) + intercept))² + (85 – ((4.725 x 6.5) + intercept))² + (100 – ((4.725 x 6) + intercept))²

This is a parabolic function of the intercept.

What does this mean?
By taking the derivative of this function, we can determine the slope at any value of the intercept.

So let's take the derivative of the loss function we obtained with respect to the intercept.

We will apply standard differentiation rules, such as the sum rule and the chain rule.

d(loss function) / d(intercept) = ?

Result of Derivation
= - 2 (55 – ((4.725 x 1) + intercept))
+ - 2 (60 – ((4.725 x 2) + intercept))
+ - 2 (90 – ((4.725 x 3.5) + intercept))
+ - 2 (85 – ((4.725 x 6) + intercept))
+ - 2 (100 – ((4.725 x 8) + intercept))
+ - 2 (95 – ((4.725 x 2.5) + intercept))
+ - 2 (75 – ((4.725 x 5.5) + intercept))
+ - 2 (70 – ((4.725 x 4) + intercept))
+ - 2 (85 – ((4.725 x 6.5) + intercept))
+ - 2 (100 – ((4.725 x 6) + intercept))
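The same derivative can be written compactly as a function of the intercept. A sketch using the 10 data points and the fixed slope of 4.725:

    x = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
    y = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

    def d_loss_d_intercept(intercept, slope=4.725):
        """Derivative of the sum of squared residuals with respect to the intercept."""
        return sum(-2 * (yi - (slope * xi + intercept)) for xi, yi in zip(x, y))

    print(round(d_loss_d_intercept(50), 2))   # -204.75, as computed on the next slide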

Result of Derivation

Now that we have the derivative, Gradient Descent will use it to find where the loss function is at its minimum.

How?
The optimal intercept is where the slope of the loss curve equals 0, or approximately 0.

So let us start by taking a random intercept value again and plug it into the derivative.

Result of Derivation
Intercept = 50
= - 2 (55 – ((4.725 x 1) + 50))
+ - 2 (60 – ((4.725 x 2) + 50))
+ - 2 (90 – ((4.725 x 3.5) + 50))
+ - 2 (85 – ((4.725 x 6) + 50))
+ - 2 (100 – ((4.725 x 8) + 50))
+ - 2 (95 – ((4.725 x 2.5) + 50))
+ - 2 (75 – ((4.725 x 5.5) + 50))
+ - 2 (70 – ((4.725 x 4) + 50))
+ - 2 (85 – ((4.725 x 6.5) + 50))
+ - 2 (100 – ((4.725 x 6) + 50))

Slope = -204.75
Result of Derivation
So what is the next value of the intercept?

Determined by step size

Step Size = Slope x Learning Rate

New intercept = Old intercept – Step Size

Let’s take the Learning rate to be 0.01

Step Size = -204.75 x 0.01 = -2.0475


New intercept = 50 – (-2.0475) = 52.0475
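The same update, written out as a couple of lines of code (a sketch using the numbers above):

    learning_rate = 0.01
    intercept = 50
    slope_of_loss = -204.75                      # d(loss)/d(intercept) at intercept = 50

    step_size = slope_of_loss * learning_rate    # -2.0475
    intercept = intercept - step_size            # 50 - (-2.0475) = 52.0475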

Result of Derivation
Iteration I – Intercept = 52.0475

= - 2 (55 – ((4.725 x 1) + 52.0475))
+ - 2 (60 – ((4.725 x 2) + 52.0475))
+ - 2 (90 – ((4.725 x 3.5) + 52.0475))
+ - 2 (85 – ((4.725 x 6) + 52.0475))
+ - 2 (100 – ((4.725 x 8) + 52.0475))
+ - 2 (95 – ((4.725 x 2.5) + 52.0475))
+ - 2 (75 – ((4.725 x 5.5) + 52.0475))
+ - 2 (70 – ((4.725 x 4) + 52.0475))
+ - 2 (85 – ((4.725 x 6.5) + 52.0475))
+ - 2 (100 – ((4.725 x 6) + 52.0475))

Slope = -163.8
Result of Derivation
Step Size = -163.8 x 0.01 = -1.638
New intercept = 52.0475 – (-1.638) = 53.6855

Iteration II – Intercept = 53.6855
Slope = -131.04
Step Size = -131.04 x 0.01 = -1.3104
New intercept = 53.6855 – (-1.3104) = 54.9959

Iteration III – Intercept = 54.9959
Slope = -104.832
Step Size = -104.832 x 0.01 = -1.04832
New intercept = 54.9959 – (-1.04832) = 56.04422
Result of Derivation
 Until when do these steps repeat?
 Until we reach a step size that is very small (e.g. 0.001), or until we reach the maximum number of iterations (usually 1000).

Iteration IV – Intercept = 56.04422
Slope = -83.8656
Step Size = -83.8656 x 0.01 = -0.838656
New intercept = 56.04422 – (-0.838656) = 56.882876

Iteration  Intercept  Slope      Step size  New intercept
…          …          …          …          …
V          56.88288   -67.0924   -0.67092   57.5538
VI         57.5538    -53.674    -0.53674   58.09054
VII        58.09054   -42.9392   -0.42939   58.51993
VIII       58.51993   -34.3514   -0.34351   58.86344
IX         58.86344   -27.4812   -0.27481   59.13825
X          59.13825   -21.985    -0.21985   59.3581
XI         59.3581    -17.588    -0.17588   59.53398
XII        59.53398   -17.2362   -0.17236   59.70634
XIII       59.70634   -10.6232   -0.10623   59.81257
XIV        59.81257   -8.4986    -0.08499   59.89756
XV         59.89756   -6.7988    -0.06799   59.96555
XVI        59.96555   -5.439     -0.05439   60.01994
XVII       60.01994   -4.3512    -0.04351   60.06345
XVIII      60.06345   -3.481     -0.03481   60.09826
Iteration  Intercept  Slope     Step size  New intercept
XIX        60.09826   -2.7848   -0.02785   60.12611
XX         60.12611   -2.2278   -0.02228   60.14839
XXI        60.14839   -1.7822   -0.01782   60.16621
XXII       60.16621   -1.4258   -0.01426   60.18047
XXIII      60.18047   -1.1406   -0.01141   60.19188
XXIV       60.19188   -0.9124   -0.00912   60.201
XXV        60.201     -0.73     -0.0073    60.2083
XXVI       60.2083    -0.584    -0.00584   60.21414
XXVII      60.21414   -0.4672   -0.00467   60.21881
XXVIII     60.21881   -0.3738   -0.00374   60.22255
XXIX       60.22255   -0.299    -0.00299   60.22554
XXX        60.22554   -0.2392   -0.00239   60.22793
XXXI       60.22793   -0.1914   -0.00191   60.22984
XXXII      60.22984   -0.1532   -0.00153   60.23137
XXXIII     60.23137   -0.1226   -0.00123   60.2326
XXXIV      60.2326    -0.098    -0.00098   60.23358
Procedure
 As we can see, we stop the iteration now because the step size is very small (0.00098), so our optimal intercept will be:
Intercept = 60.23358 ≈ 60.24
 So our equation will be Y = 4.725*X + 60.24

 Procedure for Gradient Descent
1. Pick a random guess for the parameter to be optimized.
2. Plug the parameter value into the derivative of the loss function to find the slope.
3. Calculate the step size using: Step size = slope x Learning rate.
4. Calculate the new parameter value by subtracting the step size from the old value.
5. Repeat steps 2–4 until the step size is less than 0.001 or you have reached 1000 iterations.
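The whole procedure can be put into one short loop. Below is a sketch under the lecture's assumptions (slope fixed at 4.725, learning rate 0.01, stop when the step drops below 0.001 or after 1000 iterations):

    x = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
    y = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

    def d_loss_d_intercept(b, slope=4.725):
        return sum(-2 * (yi - (slope * xi + b)) for xi, yi in zip(x, y))

    b, learning_rate = 50.0, 0.01          # random initial guess and learning rate
    for _ in range(1000):                  # at most 1000 iterations
        step = d_loss_d_intercept(b) * learning_rate
        b = b - step
        if abs(step) < 0.001:              # stop once the step size is very small
            break
    print(round(b, 3))                     # ≈ 60.234, close to the optimal intercept ≈ 60.24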
Trained Model

(Figure: the fitted line Y = 4.725*X + 60.24 over the study-hours data)
Optimizing Two Parameters (Intercept & Slope)
Loss function = Sum of squared residuals
= Σ(Actual – Predicted)²
= Σ(Actual – (mx + b))²

= (55 – ((slope x 1) + intercept))² + (60 – ((slope x 2) + intercept))² + (90 – ((slope x 3.5) + intercept))² + (85 – ((slope x 6) + intercept))² + (100 – ((slope x 8) + intercept))² + (95 – ((slope x 2.5) + intercept))² + (75 – ((slope x 5.5) + intercept))² + (70 – ((slope x 4) + intercept))² + (85 – ((slope x 6.5) + intercept))² + (100 – ((slope x 6) + intercept))²

When you have two or more derivatives of the same function, they are called a gradient.
Derivation
First take the derivative with respect to the intercept:
= - 2 (55 – ((slope x 1) + intercept)) - 2 (60 – ((slope x 2) + intercept)) - 2 (90 – ((slope x 3.5) + intercept)) - 2 (85 – ((slope x 6) + intercept)) - 2 (100 – ((slope x 8) + intercept)) - 2 (95 – ((slope x 2.5) + intercept)) - 2 (75 – ((slope x 5.5) + intercept)) - 2 (70 – ((slope x 4) + intercept)) - 2 (85 – ((slope x 6.5) + intercept)) - 2 (100 – ((slope x 6) + intercept))

Then take the derivative with respect to the slope:
= - 2 x 1 (55 – ((slope x 1) + intercept)) - 2 x 2 (60 – ((slope x 2) + intercept)) - 2 x 3.5 (90 – ((slope x 3.5) + intercept)) - 2 x 6 (85 – ((slope x 6) + intercept)) - 2 x 8 (100 – ((slope x 8) + intercept)) - 2 x 2.5 (95 – ((slope x 2.5) + intercept)) - 2 x 5.5 (75 – ((slope x 5.5) + intercept)) - 2 x 4 (70 – ((slope x 4) + intercept)) - 2 x 6.5 (85 – ((slope x 6.5) + intercept)) - 2 x 6 (100 – ((slope x 6) + intercept))
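These two partial derivatives together form the gradient. A sketch of the same thing in code, using the 10 data points:

    x = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
    y = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

    def gradient(slope, intercept):
        """Partial derivatives of the sum of squared residuals."""
        d_intercept = sum(-2 * (yi - (slope * xi + intercept)) for xi, yi in zip(x, y))
        d_slope = sum(-2 * xi * (yi - (slope * xi + intercept)) for xi, yi in zip(x, y))
        return d_slope, d_intercept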

Steps
1. Start by picking a random number for the intercept and slope (intercept = 60, slope = 4).
2. Plug these values into the derivatives to find the slopes of the curve:
i.  - 2 (55 – ((4 x 1) + 60)) - 2 (60 – ((4 x 2) + 60)) - 2 (90 – ((4 x 3.5) + 60)) - 2 (85 – ((4 x 6) + 60)) - 2 (100 – ((4 x 8) + 60)) - 2 (95 – ((4 x 2.5) + 60)) - 2 (75 – ((4 x 5.5) + 60)) - 2 (70 – ((4 x 4) + 60)) - 2 (85 – ((4 x 6.5) + 60)) - 2 (100 – ((4 x 6) + 60))
    = -70
ii. - 2 x 1 (55 – ((4 x 1) + 60)) - 2 x 2 (60 – ((4 x 2) + 60)) - 2 x 3.5 (90 – ((4 x 3.5) + 60)) - 2 x 6 (85 – ((4 x 6) + 60)) - 2 x 8 (100 – ((4 x 8) + 60)) - 2 x 2.5 (95 – ((4 x 2.5) + 60)) - 2 x 5.5 (75 – ((4 x 5.5) + 60)) - 2 x 4 (70 – ((4 x 4) + 60)) - 2 x 6.5 (85 – ((4 x 6.5) + 60)) - 2 x 6 (100 – ((4 x 6) + 60))
    = 0

Steps
3. Calculate the step size for both.
   We assume our learning rate to be 0.001:
   Step size = -0.07 for the intercept
   Step size = 0 for the slope
4. Calculate the new intercept and slope:
   New intercept = Old value – step size = 60.07
   The slope stays the same.
Repeat these until the step size is super small.

Iteration  Intercept  Curve slope  Step size    New intercept  Slope  Curve slope  Step size  New slope
1          60         -70          -0.07        60.07          4      0            0          4
2          60.07      -68.6        -0.0686      60.1386        4      0            0          4
3          60.1386    -67.228      -0.067228    60.20583       4      0            0          4
4          60.20583   -65.8834     -0.0658834   60.27171       4      0            0          4
5          60.27171   -65.75164    -0.06575164  60.33746       4      0            0          4
Steps

Iteration  Intercept  Curve slope  Step size    New intercept  Slope  Curve slope  Step size  New slope
6          60.3376    -63.248      -0.063248    60.40085       4      0            0          4
7          60.40085   -61.983      -0.061983    60.46283       4      0            0          4
8          60.46283   -60.7434     -0.0607434   60.52357       4      0            0          4
9          60.52357   -59.5286     -0.0595286   60.5831        4      0            0          4
10         60.5831    -58.338      -0.058338    60.64144       4      0            0          4
…          …          …            …            …              …      …            …          …
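A compact sketch of the full two-parameter loop, updating both the slope and the intercept each iteration (same data as before; the learning rate of 0.001 is taken from the slides, and the small-step stopping test is omitted so the loop simply runs the maximum 1000 iterations):

    x = [1, 2, 3.5, 6, 8, 2.5, 5.5, 4, 6.5, 6]
    y = [55, 60, 90, 85, 100, 95, 75, 70, 85, 100]

    slope, intercept, learning_rate = 4.0, 60.0, 0.001   # initial guesses and learning rate
    for _ in range(1000):                                # run the maximum 1000 iterations
        d_b = sum(-2 * (yi - (slope * xi + intercept)) for xi, yi in zip(x, y))
        d_m = sum(-2 * xi * (yi - (slope * xi + intercept)) for xi, yi in zip(x, y))
        intercept = intercept - learning_rate * d_b      # update both parameters together
        slope = slope - learning_rate * d_m
    print(round(slope, 3), round(intercept, 3))          # ≈ 4.73 and ≈ 60.23, close to the best-fit line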
Gradient Descent Formula
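Consistent with the procedure above, the general update rule for any parameter θ (slope or intercept) can be written as:

θ_new = θ_old – (Learning Rate x d(Loss)/dθ)

That is, the step size is the slope of the loss curve times the learning rate, and the new parameter value is the old value minus the step size.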

Key Takeaways
 Optimization is crucial in machine learning.

 One of the most common optimization techniques is Gradient Descent, which uses the gradient to iteratively descend to the lowest point of the loss function, hence the name Gradient Descent.

Questions?

