ML Lecture # 04: Multiple Regression
Multiple Linear Regression
• Multiple linear regression applies regression analysis to predict a continuous dependent variable from multiple independent variables (features).
• It is a supervised learning algorithm: it learns the relationship between the independent variables and the dependent variable from a given training dataset, then uses that learned model to make predictions on new, unseen data.
• The model assumes a linear relationship between the input features and the target variable.
Linear Regression with One Variable
❑ Training set of housing prices

| Size in feet² (x) | Price ($) in 1000's (y) |
|---|---|
| 2104 | 460 |
| 1416 | 232 |
| 1534 | 315 |
| 852 | 178 |
| … | … |

❑ Linear Regression model
f_{w,b}(x) = wx + b
Multiple Linear Regression
❑ Training set of housing prices

| Size in feet² (x₁) | No. of Bedrooms (x₂) | No. of Floors (x₃) | Age of House (x₄) | Price ($) in 1000's (y) |
|---|---|---|---|---|
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 2 | 2 | 30 | 315 |
| 852 | 1 | 1 | 36 | 178 |
| … | … | … | … | … |

❑ Multiple Linear Regression model
f_{w,b}(x) = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + b
Multiple Linear Regression
❑ Training set of housing prices (as on the previous slide)
❑ Notations
x_j : j-th feature (e.g. x₃ = No. of Floors)
n : number of features (here n = 4)
x^{(i)} : features of the i-th training example (e.g. x^{(2)} = [1416 3 2 40])
x_j^{(i)} : value of feature j in the i-th training example (e.g. x₃^{(2)} = 2)
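A minimal sketch of this notation in NumPy, with the table above stored as a feature matrix X (note that Python's 0-based indexing shifts each index down by one):

```python
import numpy as np

# Training set from the table: one row per example, one column per feature
X = np.array([
    [2104, 5, 1, 45],
    [1416, 3, 2, 40],
    [1534, 2, 2, 30],
    [852,  1, 1, 36],
])
y = np.array([460, 232, 315, 178])

m, n = X.shape       # m = 4 training examples, n = 4 features
x_2 = X[1]           # x^(2): features of the 2nd example -> [1416 3 2 40]
x_3_of_2 = X[1, 2]   # x_3^(2): feature 3 of the 2nd example -> 2
```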
Multiple Linear Regression Model
❑ Training set of housing prices (as on the previous slide)
f_{w,b}(x) = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + b
f_{w,b}(x) = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b
f_{w,b}(x) = w · x + b, where
w = [w₁ w₂ w₃ ⋯ wₙ]
x = [x₁ x₂ x₃ ⋯ xₙ]
b is a number (a scalar)
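A minimal NumPy sketch of the vector form; the parameter values here are illustrative placeholders, not learned values:

```python
import numpy as np

w = np.array([0.1, 4.0, 10.0, -2.0])   # placeholder weights, one per feature
b = 80.0                               # placeholder bias

x = np.array([2104, 5, 1, 45])         # features of one house

f = np.dot(w, x) + b                   # f_{w,b}(x) = w . x + b
print(f)                               # approximately 230.4
```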
Multiple Linear Regression Model
❑ Training set of housing prices (as on the previous slide)
f_{w,b}(x) = w₁x₁ + w₂x₂ + w₃x₃ + w₄x₄ + b
Example:
f_{w,b}(x) = 0.1x₁ + 4x₂ + 10x₃ + (−2)x₄ + 80
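As a quick check of how this example model is applied, plugging in the first training example (x₁ = 2104, x₂ = 5, x₃ = 1, x₄ = 45) gives

f_{w,b}(x) = 0.1(2104) + 4(5) + 10(1) + (−2)(45) + 80 = 210.4 + 20 + 10 − 90 + 80 = 230.4

i.e. a predicted price of $230.4K. The parameter values are only illustrative, so the prediction need not match the target of 460.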
Multiple Linear Regression Model
❑ Parameters
Without vector notation: w₁, …, wₙ, b
Vector notation: w = [w₁, …, wₙ], b still a number
❑ Model
Without vector notation: f_{w,b}(x) = w₁x₁ + ⋯ + wₙxₙ + b
Vector notation: f_{w,b}(x) = w · x + b
❑ Cost Function
Without vector notation: J(w₁, …, wₙ, b)
Vector notation: J(w, b)
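The cost function itself is not restated on this slide; assuming the usual squared-error cost from the single-variable lecture, in vector notation it is

J(w, b) = (1/(2m)) Σ_{i=1}^{m} ( f_{w,b}(x^{(i)}) − y^{(i)} )²

where m is the number of training examples.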
Gradient Descent Algorithm
Repeat until convergence {
    w_j = w_j − α ∂/∂w_j J(w₁, …, wₙ, b)
    b = b − α ∂/∂b J(w₁, …, wₙ, b)
}
Gradient Descent Algorithm
❑ Without vector notation
Repeat until convergence {
    w_j = w_j − α ∂/∂w_j J(w₁, …, wₙ, b)
    b = b − α ∂/∂b J(w₁, …, wₙ, b)
}
❑ Vector notation
Repeat until convergence {
    w_j = w_j − α ∂/∂w_j J(w, b)
    b = b − α ∂/∂b J(w, b)
}
Gradient Descent Algorithm
❑ One feature (n = 1)
Repeat until convergence {
    w = w − α (1/m) Σ_{i=1}^{m} ( (w x^{(i)} + b) − y^{(i)} ) x^{(i)}
    b = b − α (1/m) Σ_{i=1}^{m} ( (w x^{(i)} + b) − y^{(i)} )
    Simultaneously update w and b
}
❑ n features (n ≥ 2)
Repeat until convergence {
    w_j = w_j − α (1/m) Σ_{i=1}^{m} ( f_{w,b}(x^{(i)}) − y^{(i)} ) x_j^{(i)}, where f_{w,b}(x^{(i)}) = w · x^{(i)} + b
    b = b − α (1/m) Σ_{i=1}^{m} ( f_{w,b}(x^{(i)}) − y^{(i)} )
    Simultaneously update w_j (for j = 1, 2, …, n) and b
}
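A minimal NumPy sketch of these update rules under the squared-error cost assumed earlier; the learning rate and iteration count are illustrative, and the training set is the one from the slides:

```python
import numpy as np

def gradient_descent(X, y, alpha=1e-7, num_iters=1000):
    """Batch gradient descent for multiple linear regression."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(num_iters):
        err = X @ w + b - y       # f_{w,b}(x^(i)) - y^(i) for all i at once
        dj_dw = (X.T @ err) / m   # (1/m) * sum of err_i * x_j^(i), per feature j
        dj_db = err.mean()        # (1/m) * sum of err_i
        w -= alpha * dj_dw        # simultaneous update of all w_j ...
        b -= alpha * dj_db        # ... and of b
    return w, b

X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 2, 2, 30],
              [852,  1, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)
w, b = gradient_descent(X, y)
```

Note how tiny α must be while the features have such different ranges; with a larger α the updates diverge. This is part of the motivation for feature scaling, discussed next.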
Feature Scaling
• Feature scaling, also known as data normalization or standardization, is a preprocessing step in machine learning that brings all the features (input variables) to a similar scale.
• In machine learning problems, features often have different scales, units of measurement, or ranges of values. When the features have significantly different scales, this can hurt the performance and convergence of certain machine learning models, such as gradient descent on linear regression.
• Feature scaling addresses this issue by transforming the feature values to a common scale. The most commonly used techniques are normalization, standardization, and robust scaling; a short sketch of all three follows.
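A minimal NumPy sketch of the three techniques named above (the sample data is illustrative):

```python
import numpy as np

x = np.array([300.0, 850.0, 1416.0, 2000.0])   # e.g. house sizes in feet^2

# Normalization (min-max): rescale values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): zero mean, unit standard deviation
x_standard = (x - x.mean()) / x.std()

# Robust scaling: median and interquartile range, less sensitive to outliers
q1, median, q3 = np.percentile(x, [25, 50, 75])
x_robust = (x - median) / (q3 - q1)
```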
Features and Parameter Values
❑ Price = w₁x₁ + w₂x₂ + b
❑ x₁: size of the house, x₂: number of bedrooms
❑ Range(x₁): 300 → 2000 and Range(x₂): 0 → 5
❑ House: x₁ = 2000, x₂ = 5, Price = $500K → one training example
❑ What should the sizes of the parameters w₁, w₂ be?
❑ Scaling by the maximum of the range:
❑ 300 ≤ x₁ ≤ 2000
❑ x₁,scaled = x₁ / Max. Range, e.g. 300/2000 = 0.15 at the lower end
❑ 0.15 ≤ x₁,scaled ≤ 1
❑ 0 ≤ x₂ ≤ 5
❑ x₂,scaled = x₂ / Max. Range, e.g. 0/5 = 0 at the lower end
❑ 0 ≤ x₂,scaled ≤ 1
Feature Scaling - Mean Normalization
❑ Mean normalization: x_scaled = (x − μ) / (Max − Min)
❑ 300 ≤ x₁ ≤ 2000, suppose mean μ₁ = 600
❑ Lower limit = (Min − μ₁)/(Max − Min) = (300 − 600)/(2000 − 300) ≈ −0.18
❑ Upper limit = (Max − μ₁)/(Max − Min) = (2000 − 600)/(2000 − 300) ≈ 0.82
❑ −0.18 ≤ x₁,scaled ≤ 0.82
❑ 0 ≤ x₂ ≤ 5, suppose mean μ₂ = 2.3
❑ Lower limit = (Min − μ₂)/(Max − Min) = (0 − 2.3)/(5 − 0) = −0.46
❑ Upper limit = (Max − μ₂)/(Max − Min) = (5 − 2.3)/(5 − 0) = 0.54
❑ −0.46 ≤ x₂,scaled ≤ 0.54
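A minimal sketch of mean normalization in code, reproducing the x₁ bounds above (the helper function simply encodes the formula on this slide, with the assumed mean μ₁ = 600):

```python
def mean_normalize(x, mu, x_min, x_max):
    """Mean normalization: (x - mu) / (max - min)."""
    return (x - mu) / (x_max - x_min)

# Bounds of x1 (house size), using the slide's assumed mean mu1 = 600
print(round(mean_normalize(300, 600, 300, 2000), 2))    # -0.18
print(round(mean_normalize(2000, 600, 300, 2000), 2))   #  0.82
```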
Feature Scaling - Example
Suppose you have a dataset of exam scores and study hours. The exam scores range from 50 to 90, and the study hours range from 3 to 12. You want to perform mean normalization on both features.
Exam Score (x₁) = 50, 60, 70, 80, 90
Study Hours (x₂) = 3, 5, 7, 9, 12
Step 1: Calculate the mean
μ₁ = (50 + 60 + 70 + 80 + 90) / 5 = 70
μ₂ = (3 + 5 + 7 + 9 + 12) / 5 = 7.2
Feature Scaling - Example
Step 2: Normalize each data point (for exam score x₁)
x₁,scaled(50) = (50 − 70)/(90 − 50) = −0.5
x₁,scaled(60) = (60 − 70)/(90 − 50) = −0.25
x₁,scaled(70) = (70 − 70)/(90 − 50) = 0
x₁,scaled(80) = (80 − 70)/(90 − 50) = 0.25
x₁,scaled(90) = (90 − 70)/(90 − 50) = 0.5
−0.5 ≤ x₁,scaled ≤ 0.5
Feature Scaling - Example
Step 2: Normalize each data point (for study hours x₂)
x₂,scaled(3) = (3 − 7.2)/(12 − 3) = −0.4667
x₂,scaled(5) = (5 − 7.2)/(12 − 3) = −0.2444
x₂,scaled(7) = (7 − 7.2)/(12 − 3) = −0.0222
x₂,scaled(9) = (9 − 7.2)/(12 − 3) = 0.2000
x₂,scaled(12) = (12 − 7.2)/(12 − 3) = 0.5333
−0.4667 ≤ x₂,scaled ≤ 0.5333
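The numbers above can be reproduced in a few lines of code (the helper is the same mean-normalization formula as before):

```python
def mean_normalize(x, mu, x_min, x_max):
    """Mean normalization: (x - mu) / (max - min)."""
    return (x - mu) / (x_max - x_min)

hours = [3, 5, 7, 9, 12]
mu2 = sum(hours) / len(hours)   # 7.2
scaled = [round(mean_normalize(h, mu2, 3, 12), 4) for h in hours]
print(scaled)                   # [-0.4667, -0.2444, -0.0222, 0.2, 0.5333]
```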
Checking Gradient Descent for Convergence
❑ Make sure gradient descent is working correctly.
❑ Objective: min J(w, b)
❑ Plot J(w, b) against the number of iterations (a learning curve).
❑ J(w, b) should decrease after every iteration as the values of w and b are updated.
Learning Rate (α)
❑ So when to stop?
❑ Automatic convergence test:
❑ Let ε = 10⁻³. If J(w, b) decreases by less than ε in one iteration, declare convergence.
❑ Convergence means we have found parameters w, b that get J(w, b) close to its global minimum.
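A minimal sketch combining the learning-curve check and the automatic convergence test; the data, α, and iteration cap are illustrative, and the features are mean-normalized first, as recommended above:

```python
import numpy as np

X = np.array([[2104, 5, 1, 45],
              [1416, 3, 2, 40],
              [1534, 2, 2, 30],
              [852,  1, 1, 36]], dtype=float)
y = np.array([460, 232, 315, 178], dtype=float)

# Mean-normalize each feature column: (x - mu) / (max - min)
X = (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

m, n = X.shape
w, b = np.zeros(n), 0.0
alpha, epsilon = 0.1, 1e-3

cost_history = []
for it in range(10000):
    err = X @ w + b - y
    cost = (err @ err) / (2 * m)   # squared-error cost J(w, b)
    # Automatic convergence test: stop when J decreases by less than epsilon
    if cost_history and cost_history[-1] - cost < epsilon:
        print(f"Converged after {it} iterations, J = {cost:.3f}")
        break
    cost_history.append(cost)
    w -= alpha * (X.T @ err) / m   # simultaneous update of w ...
    b -= alpha * err.mean()        # ... and b
# Plotting cost_history against the iteration number gives the learning curve
```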
Choosing the Learning Rate (α)
❑ If J(w, b) increases (or goes up and down) from one iteration to the next, there is either a bug in the code or α is too large.
❑ Gradient descent is not working.
❑ Use a smaller α.
Choosing the Learning Rate (α)
❑ If α is too large:
❑ Gradient descent may overshoot the minimum and never reach it.
❑ It may fail to converge, or even diverge.
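In practice one can try a small grid of learning rates and keep the largest α for which J(w, b) still decreases; a minimal sketch (the grid values, data, and run length are illustrative, and the endpoint comparison is only a crude check):

```python
import numpy as np

def run_gd(X, y, alpha, num_iters=200):
    """Run gradient descent and return the cost after each iteration."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    costs = []
    for _ in range(num_iters):
        err = X @ w + b - y
        costs.append((err @ err) / (2 * m))
        w -= alpha * (X.T @ err) / m
        b -= alpha * err.mean()
    return costs

X = np.random.rand(20, 4)   # illustrative features, already on a similar scale
y = np.random.rand(20)
for alpha in [0.001, 0.01, 0.1, 1.0]:
    costs = run_gd(X, y, alpha)
    print(alpha, "decreasing" if costs[-1] < costs[0] else "NOT decreasing")
```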
Acknowledgment
• Material presented in these lecture slides is obtained from Prof. Andrew Ng's course on Machine Learning.
• Dr. Iftikhar Ahmad's lecture slides were consulted for assistance.