
MULTIPLE LINEAR REGRESSION
Lecture # 04

Dr. Imran Khalil
[email protected]
Contents
• What is Multiple Regression?
• Multiple Regression Model
• Gradient Descent
• Gradient Descent Algorithm
• Feature Scaling
• Choosing the Learning Rate
Multiple Linear Regression
• Multiple linear regression refers to the application of
regression analysis techniques to predict a continuous
dependent variable based on multiple independent variables.
• It is a supervised learning algorithm that aims to learn the
relationship between the independent variables and the
dependent variable from a given training dataset, and then use
that learned model to make predictions on new, unseen data.
• The model assumes a linear relationship between the input
features and the target variable.

Linear Regression with One Variable
❑ Training set of housing prices

  Size in feet² (x)    Price ($) in 1000's (y)
  2104                 460
  1416                 232
  1534                 315
  852                  178
  ...                  ...

❑ Linear Regression model
  f_{w,b}(x) = w x + b
Multiple Linear Regression
❑ Training set of housing prices

  Size in feet² (x_1)   No. of Bedrooms (x_2)   No. of Floors (x_3)   Age of House (x_4)   Price ($) in 1000's (y)
  2104                  5                       1                     45                   460
  1416                  3                       2                     40                   232
  1534                  2                       2                     30                   315
  852                   1                       1                     36                   178
  ...                   ...                     ...                   ...                  ...

❑ Multiple Linear Regression model
  f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b
Multiple Linear Regression
  (Training set of housing prices as above)

❑ Notations
  x_j : j-th feature (e.g. x_3 = No. of Floors)
  n : number of features (here n = 4)
  x^(i) : features of the i-th training example (e.g. x^(2) = [1416  3  2  40])
  x_j^(i) : value of feature j in the i-th training example (e.g. x_3^(2) = 2)
Multiple Linear Regression Model
  (Training set of housing prices as above)

  f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b

  In general, with n features:
  f_{w,b}(x) = w_1 x_1 + w_2 x_2 + ... + w_n x_n + b

  In vector notation:
  f_{w,b}(x) = w · x + b,   where
    w = [w_1  w_2  w_3  ...  w_n]   (parameter vector)
    x = [x_1  x_2  x_3  ...  x_n]   (feature vector)
    b is a number (scalar)
Multiple Linear Regression Model
  (Training set of housing prices as above)

  f_{w,b}(x) = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4 + b

  Example:
  f_{w,b}(x) = 0.1 x_1 + 4 x_2 + 10 x_3 + (-2) x_4 + 80
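As a small added illustration (not part of the original slides), the example model above can be evaluated with a NumPy dot product, matching the vector form f_{w,b}(x) = w · x + b. Python sketch:

import numpy as np

# Example parameters from this slide: f(x) = 0.1*x1 + 4*x2 + 10*x3 + (-2)*x4 + 80
w = np.array([0.1, 4.0, 10.0, -2.0])
b = 80.0

# First training example: 2104 sq ft, 5 bedrooms, 1 floor, 45 years old
x = np.array([2104.0, 5.0, 1.0, 45.0])

# Predicted price in $1000's: w . x + b
price = np.dot(w, x) + b
print(price)  # 210.4 + 20 + 10 - 90 + 80 = 230.4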
Multiple Linear Regression Model
❑ Parameters
  Previous notation: w_1, ..., w_n, b
  Vector notation:   w = [w_1, ..., w_n],  b is still a number

❑ Model
  Previous notation: f_{w,b}(x) = w_1 x_1 + ... + w_n x_n + b
  Vector notation:   f_{w,b}(x) = w · x + b

❑ Cost Function
  Previous notation: J(w_1, ..., w_n, b)
  Vector notation:   J(w, b)
Gradient Descent Algorithm
Repeat until convergence {
  w_j = w_j - α ∂/∂w_j J(w_1, ..., w_n, b)     (for j = 1, ..., n)
  b   = b   - α ∂/∂b  J(w_1, ..., w_n, b)
}
Gradient Descent Algorithm
❑ Previous notation
  Repeat until convergence {
    w_j = w_j - α ∂/∂w_j J(w_1, ..., w_n, b)
    b   = b   - α ∂/∂b  J(w_1, ..., w_n, b)
  }

❑ Vector notation
  Repeat until convergence {
    w_j = w_j - α ∂/∂w_j J(w, b)
    b   = b   - α ∂/∂b  J(w, b)
  }
Gradient Descent Algorithm
❑ One feature
  Repeat until convergence {
    w = w - α (1/m) Σ_{i=1}^{m} ((w x^(i) + b) - y^(i)) x^(i)
    b = b - α (1/m) Σ_{i=1}^{m} ((w x^(i) + b) - y^(i))
    Simultaneously update w and b
  }

❑ n features (n ≥ 2)
  Repeat until convergence {
    w_j = w_j - α (1/m) Σ_{i=1}^{m} ((w · x^(i) + b) - y^(i)) x_j^(i)
    b   = b   - α (1/m) Σ_{i=1}^{m} ((w · x^(i) + b) - y^(i))
    Simultaneously update w_j (for j = 1, 2, 3, ..., n) and b
  }
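A minimal NumPy sketch of the n-feature update above (added here for illustration; function and variable names are not from the slides). It assumes X is an (m, n) matrix of training examples and y a length-m vector of targets. Python sketch:

import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multiple linear regression.

    X: (m, n) feature matrix, y: (m,) targets.
    Returns the learned parameters w (shape (n,)) and b (scalar).
    """
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        # Predictions f_{w,b}(x^(i)) = w . x^(i) + b for all examples at once
        preds = X @ w + b
        err = preds - y                   # (m,)
        # Partial derivatives of the squared-error cost J(w, b)
        dj_dw = (X.T @ err) / m           # (n,)
        dj_db = err.mean()
        # Simultaneous update of all w_j and b
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b

Feature scaling (covered next) helps a single learning rate α work well for every w_j at once.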
Feature Scaling
• Feature scaling, also known as data normalization or standardization,
is a preprocessing step in machine learning that aims to bring all the
features or input variables to a similar scale.
• In machine learning algorithms, features often have different scales,
units of measurement, or ranges of values. When the features have
significantly different scales, it can affect the performance and
convergence of certain machine learning models.
• Feature scaling addresses this issue by transforming the feature values
to a common scale. The most commonly used techniques for feature
scaling are: Normalization, Standardization, Robust Scaling, etc.

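For reference (an addition to the slides, assuming the scikit-learn library is available), the three techniques named above each have a ready-made scaler. Python sketch:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# Toy feature matrix: [size in sq ft, number of bedrooms]
X = np.array([[2104, 5], [1416, 3], [1534, 2], [852, 1]], dtype=float)

X_minmax = MinMaxScaler().fit_transform(X)    # normalization to [0, 1]
X_std    = StandardScaler().fit_transform(X)  # standardization: zero mean, unit variance
X_robust = RobustScaler().fit_transform(X)    # robust scaling: median / IQR based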
Features and Parameters Values
❑ Predicted price: Price^ = w_1 x_1 + w_2 x_2 + b
❑ x_1: size of the house, x_2: number of bedrooms
❑ Range(x_1): 300 → 2000, Range(x_2): 0 → 5
❑ One training example (a house): x_1 = 2000, x_2 = 5, Price = $500K
❑ What are reasonable sizes for the parameters w_1 and w_2?

❑ Suppose w_1 = 50, w_2 = 0.1, b = 50:
  Price = 50 × 2000 + 0.1 × 5 + 50 = 100,000K + 0.5K + 50K = $100,050,500 (far too high)
❑ Suppose w_1 = 0.1, w_2 = 50, b = 50:
  Price = 0.1 × 2000 + 50 × 5 + 50 = 200K + 250K + 50K = $500,000 (matches the training example)

❑ Notice that the feature with the large range (size) pairs with a small parameter, while the small-range feature (bedrooms) pairs with a large one.
Feature Scaling - Normalization
❑ 300 ≤ x_1 ≤ 2000
  x_1,scaled = x_1 / max(x_1) = x_1 / 2000  (lower limit: 300 / 2000 = 0.15)
  → 0.15 ≤ x_1,scaled ≤ 1

❑ 0 ≤ x_2 ≤ 5
  x_2,scaled = x_2 / max(x_2) = x_2 / 5  (lower limit: 0 / 5 = 0)
  → 0 ≤ x_2,scaled ≤ 1
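The division-by-maximum scaling above takes only a couple of NumPy lines (an illustrative sketch using the ranges from this slide, not part of the original deck). Python sketch:

import numpy as np

x1 = np.array([300.0, 1000.0, 2000.0])   # house sizes within the 300-2000 range
x2 = np.array([0.0, 2.0, 5.0])           # bedroom counts within the 0-5 range

x1_scaled = x1 / x1.max()   # [0.15, 0.5, 1.0]
x2_scaled = x2 / x2.max()   # [0.0, 0.4, 1.0]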
Feature Scaling - Mean Normalization
❑ 300 ≤ x_1 ≤ 2000, suppose the mean is μ_1 = 600
  x_1,scaled = (x_1 - μ_1) / (max - min) = (x_1 - 600) / (2000 - 300)
  Lower limit: (300 - 600) / 1700 ≈ -0.18
  Upper limit: (2000 - 600) / 1700 ≈ 0.82
  → -0.18 ≤ x_1,scaled ≤ 0.82

❑ 0 ≤ x_2 ≤ 5, suppose the mean is μ_2 = 2.3
  x_2,scaled = (x_2 - μ_2) / (max - min) = (x_2 - 2.3) / (5 - 0)
  Lower limit: (0 - 2.3) / 5 = -0.46
  Upper limit: (5 - 2.3) / 5 = 0.54
  → -0.46 ≤ x_2,scaled ≤ 0.54
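A small added helper that applies the mean-normalization formula (x - μ) / (max - min), computing μ from the data; this is an illustrative sketch, not part of the slides, and the worked example on the next slides uses exactly this formula. Python sketch:

import numpy as np

def mean_normalize(x):
    """Scale a 1-D feature array to roughly [-1, 1] using (x - mean) / (max - min)."""
    return (x - x.mean()) / (x.max() - x.min())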
Feature Scaling - Example
Suppose you have a dataset of exam scores and study hours. The exam scores range
from 50 to 90, and study hours range from 3 to 12. You want to perform mean
normalization on both features.
𝑬𝒙𝒂𝒎 𝑺𝒄𝒐𝒓𝒆(𝒙𝟏 ) = 𝟓𝟎, 𝟔𝟎, 𝟕𝟎, 𝟖𝟎, 𝟗𝟎
𝑺𝒕𝒖𝒅𝒚 𝑯𝒐𝒖𝒓𝒔(𝒙𝟐 ) = 𝟑, 𝟓, 𝟕, 𝟗, 𝟏𝟐
Step 1: Calculate the mean of each feature
  μ_1 = (50 + 60 + 70 + 80 + 90) / 5 = 70
  μ_2 = (3 + 5 + 7 + 9 + 12) / 5 = 7.2
Feature Scaling - Example
Step 2: Mean-normalize each data point (exam score x_1, with max - min = 90 - 50 = 40)

  x_1,scaled(50) = (50 - 70) / (90 - 50) = -0.5
  x_1,scaled(60) = (60 - 70) / (90 - 50) = -0.25
  x_1,scaled(70) = (70 - 70) / (90 - 50) = 0
  x_1,scaled(80) = (80 - 70) / (90 - 50) = 0.25
  x_1,scaled(90) = (90 - 70) / (90 - 50) = 0.5

  → -0.5 ≤ x_1,scaled ≤ 0.5
Feature Scaling - Example
Step 2 (continued): Mean-normalize each data point (study hours x_2, with max - min = 12 - 3 = 9)

  x_2,scaled(3)  = (3 - 7.2) / (12 - 3) ≈ -0.4667
  x_2,scaled(5)  = (5 - 7.2) / (12 - 3) ≈ -0.2444
  x_2,scaled(7)  = (7 - 7.2) / (12 - 3) ≈ -0.0222
  x_2,scaled(9)  = (9 - 7.2) / (12 - 3) = 0.2000
  x_2,scaled(12) = (12 - 7.2) / (12 - 3) ≈ 0.5333

  → -0.4667 ≤ x_2,scaled ≤ 0.5333
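Applying the same formula in code reproduces the values worked out above (an added check, not part of the original slides; it is the same computation as the mean_normalize helper sketched earlier). Python sketch:

import numpy as np

exam_scores = np.array([50.0, 60.0, 70.0, 80.0, 90.0])
study_hours = np.array([3.0, 5.0, 7.0, 9.0, 12.0])

x1_scaled = (exam_scores - exam_scores.mean()) / (exam_scores.max() - exam_scores.min())
x2_scaled = (study_hours - study_hours.mean()) / (study_hours.max() - study_hours.min())

print(x1_scaled)  # [-0.5  -0.25  0.    0.25  0.5 ]
print(x2_scaled)  # [-0.4667 -0.2444 -0.0222  0.2  0.5333]  (rounded)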
Feature Scaling - Practice Problem
Suppose you have a dataset of daily high temperatures in two different cities, City A
and City B, for a week. The data is as follows:
City A (in degrees Celsius) = 𝟐𝟓, 𝟐𝟖, 𝟑𝟎, 𝟑𝟐, 𝟑𝟒, 𝟑𝟓, 𝟑𝟔
City B (in degrees Celsius) = 𝟏𝟐, 𝟏𝟓, 𝟏𝟖, 𝟐𝟎, 𝟐𝟐, 𝟐𝟓, 𝟐𝟕
You want to mean normalize these temperature values for each city to compare them
more easily.

Feature Scaling
Aim for about -1 ≤ x_j ≤ 1 for each feature x_j

Acceptable ranges (no rescaling needed):
❑ -3 ≤ x_j ≤ 3
❑ -0.3 ≤ x_j ≤ 0.3
❑ 0 ≤ x_j ≤ 3
❑ -2 ≤ x_j ≤ 0.5

Ranges that call for rescaling:
❑ -100 ≤ x_1 ≤ 100      (too large)
❑ -98.5 ≤ x_1 ≤ 105     (too large)
❑ -0.001 ≤ x_1 ≤ 0.001  (too small)
Feature Scaling - Question
(question from the original slide not reproduced in this text version)
Checking Gradient Descent for Convergence
❑ Make sure gradient descent is working correctly
❑ Objective: minimize J(w, b)
❑ Plot J(w, b) against the number of iterations: J(w, b) should decrease after every iteration as the values of w and b are updated
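One way to perform this check (an added sketch with toy data and illustrative names, not from the slides): record J(w, b) after every gradient descent update and plot the resulting learning curve. Python sketch:

import numpy as np
import matplotlib.pyplot as plt

def compute_cost(X, y, w, b):
    """Squared-error cost J(w, b) with the conventional 1/(2m) factor."""
    err = X @ w + b - y
    return (err ** 2).sum() / (2 * len(y))

# Toy data and a short training run, recording J(w, b) after every iteration
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([5.0, 4.0, 8.0, 12.0])
w, b, alpha = np.zeros(2), 0.0, 0.05
cost_history = []
for _ in range(200):
    err = X @ w + b - y
    w -= alpha * (X.T @ err) / len(y)
    b -= alpha * err.mean()
    cost_history.append(compute_cost(X, y, w, b))

plt.plot(cost_history)          # the curve should go down at every iteration
plt.xlabel("Iteration")
plt.ylabel("J(w, b)")
plt.show()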
Learning Rate (𝜶)
❑ So when do we stop?
❑ Automatic convergence test:
  Let ε = 10^-3
  If J(w, b) decreases by less than ε in one iteration, declare convergence
  (the parameters w, b found are then close to the global minimum)
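A sketch of the automatic convergence test described above: stop when J(w, b) drops by less than ε between consecutive iterations (the helper name and cost history are illustrative, not from the slides). Python sketch:

def has_converged(cost_history, epsilon=1e-3):
    """Automatic convergence test: the drop in J(w, b) in the last iteration is below epsilon."""
    return len(cost_history) >= 2 and (cost_history[-2] - cost_history[-1]) < epsilon

# Usage inside a training loop (illustrative):
#   cost_history.append(compute_cost(X, y, w, b))
#   if has_converged(cost_history): break
print(has_converged([10.0, 2.0, 1.5, 1.4995]))  # True: last drop is 0.0005 < 1e-3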
Choosing the Learning Rate (𝜶)
❑ If J(w, b) increases or jumps up and down across iterations, gradient descent is not working
❑ Possible causes: a bug in the code, or the learning rate α is too large
❑ Use a smaller α
Choosing the Learning Rate (𝜶)
❑ If α is too large:
  • Gradient descent takes large steps toward the minimum and may overshoot it, never reaching the minimum
  • It may fail to converge, or even diverge
Choosing the Learning Rate (𝜶)
❑ If α is too small:
  • Gradient descent takes small steps toward the minimum and may be very slow
  • It does eventually reach the minimum
Values of Learning Rate (𝜶) to try
❑ 𝜶 = 𝟎. 𝟎𝟎𝟏
❑ 𝜶 = 𝟎. 𝟎𝟏
❑ 𝜶 = 𝟎. 𝟏
❑ 𝜶=𝟏

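One way to try these values in practice (an added sketch, not from the slides): run gradient descent for a modest number of iterations at each α and compare the resulting costs; if the cost blows up, that α is too large. Python sketch:

import numpy as np

def run_gd(X, y, alpha, num_iters=100):
    """Run batch gradient descent and return the final squared-error cost J(w, b)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        err = X @ w + b - y
        w -= alpha * (X.T @ err) / m
        b -= alpha * err.mean()
    return ((X @ w + b - y) ** 2).sum() / (2 * m)

# Toy (already scaled) data; in practice use your own scaled training set
X = np.array([[0.1, -0.3], [0.5, 0.2], [-0.2, 0.4], [0.3, -0.1]])
y = np.array([1.0, 2.0, 1.5, 1.2])

for alpha in [0.001, 0.01, 0.1, 1]:
    print(alpha, run_gd(X, y, alpha))  # prefer the largest alpha that still drives the cost down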
Acknowledgment
• Material presented in these lecture slides is obtained from Prof. Andrew Ng's course on Machine Learning.
• Dr. Iftikhar Ahmad's lecture slides were also consulted for assistance.
