Lecture 4 - Multiple Linear Regression Imran 20022025 092939am

The document provides a comprehensive overview of Multiple Linear Regression (MLR), explaining its purpose, assumptions, and methodologies for implementation. It details how MLR can be used to analyze relationships between multiple independent variables and a dependent variable, including statistical concepts like t-tests and p-values. Additionally, it outlines various strategies for selecting independent variables and includes Python code examples for practical application.


Multiple Linear Regression
Introduction to Machine Learning
Contents
1. What is multiple linear regression (MLR)?
2. What multiple linear regression can help you do
3. Assumptions of multiple linear regression
4. How to perform a multiple linear regression
   i. T-test
   ii. P-value
   iii. The model
   iv. Selecting the independent variables
   v. Python code
   vi. Example of backward elimination for selection of independent variables
What is MLR
Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.
What multiple linear regression can help you do
• You can use multiple linear regression when you want to know:
 How strong the relationship is between two or more independent variables and one dependent variable (e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
 The value of the dependent variable at certain values of the independent variables (e.g. the expected yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
Assumptions of multiple linear regression
• Multiple linear regression makes all of the same assumptions as simple linear regression:
 The probability distribution of the error e is normal.
 The mean of e is zero: E(e) = 0.
 The standard deviation of e is a constant σe for all values of X.
 The errors associated with different values of Y are all independent.
Design Requirements
• Two or more independent variables (predictor variables).
• Sample size: >= 50 (at least 10 times as many cases as independent variables).
The formula for a multiple linear regression is:

y = B0 + B1X1 + B2X2 + … + BnXn + e

• y = the predicted value of the dependent variable

• B0 = the y-intercept (value of y when all other parameters are set to 0)

• B1X1 = the regression coefficient (B1) of the first independent variable (X1) times its value (a.k.a. the effect that increasing the value of the independent variable has on the predicted y value)

• … = do the same for however many independent variables you are testing

• BnXn = the regression coefficient of the last independent variable times its value

• e = model error (a.k.a. how much variation there is in our estimate of y)

Best-fit line
• To find the best-fit line for each independent variable, multiple linear regression calculates three things:
 The regression coefficients that lead to the smallest overall model error.
 The t-statistic of the overall model.
 The associated p-value (how likely it is that the t-statistic would have occurred by chance if the null hypothesis of no relationship between the independent and dependent variables were true).
T-test
• In statistics, the t-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesized value to its standard error.

• It is used to evaluate whether two sets of data are statistically significantly different from each other.
• Q.1: Find the t-test value for the following two sets of values:

• A = 7, 2, 9, 8 and

• B = 1, 2, 3, 4

• Solution: For the first data set:

• Number of terms in the first set, n1 = 4

• Mean of the first set: (7 + 2 + 9 + 8) / 4 = 6.5; likewise the mean of the second set is 2.5. Proceeding with the sample variances and the pooled two-sample t formula gives t ≈ 2.38.
• Higher values of the t-value, also called t-score, indicate that a large
difference exists between the two sample sets. The smaller the t-value, the
more similarity exists between the two sample sets.
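The hand computation above can be checked in plain Python using the pooled two-sample t-statistic (the equal-variance form):

```python
from statistics import mean, variance  # variance() uses the sample (n-1) denominator

def two_sample_t(a, b):
    """Pooled two-sample t-statistic for two equal-variance groups."""
    n1, n2 = len(a), len(b)
    # Pooled variance: combine both sample variances, weighted by degrees of freedom
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

t = two_sample_t([7, 2, 9, 8], [1, 2, 3, 4])
print(round(t, 3))  # 2.376
```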
P-value
• P-value is the lowest significance level that results in rejecting the null
hypothesis.
Example • Coin toss
 Two possible outcomes
 H0 = This is a fair coin
 H1 = This is not a fair coin
• The P-value test assumes that the H0 hypothesis is true, i.e., the coin is fair.
• Let us assume our threshold value to be 5%, i.e., 0.05.
• Let us assume the outputs are:
 First toss is a Tail (probability = 0.5)
 First two tosses are Tails (probability = 0.25)
 First three tosses are Tails (probability = 0.125)
 First four tosses are Tails (probability = 0.0625)
 First five tosses are Tails (probability = 0.03125)
 First six tosses are Tails (probability = 0.015625)
 After the fifth toss the statistical test is significant: a P-value of less than 5% means hypothesis H0 is rejected and hypothesis H1 is accepted, i.e., the coin is not fair.
Selecting the independent variables being used
• Five strategies are available for selecting the independent variables:
 All in
 Backward elimination
 Forward selection
 Bi-directional elimination
 Score comparison (all possible combinations)
All in
• Use all features.

• Prior knowledge (a data domain expert) tells you which features to keep and which to discard.
Backward Elimination
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05).

2. Fit the model with all predictors.

3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise terminate — the remaining predictors form your feature set.

4. Remove the variable with P > SL.

5. Fit the model without that variable and return to step 3.
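The loop above can be sketched generically. Here `pvalues_fn` is a stand-in (an assumption of this sketch) for whatever routine refits the model and returns a P-value per remaining predictor, e.g. an OLS fit; the stub below returns fixed P-values purely for illustration.

```python
def backward_eliminate(features, pvalues_fn, sl=0.05):
    """Repeatedly drop the predictor with the highest P-value until all are <= sl."""
    features = list(features)
    while features:
        pvals = pvalues_fn(features)        # one P-value per remaining predictor
        worst = max(features, key=pvals.get)
        if pvals[worst] <= sl:              # every predictor is significant: stop
            break
        features.remove(worst)              # step 4: remove, then refit on next pass
    return features

# Stub P-values for illustration (a real run would refit a regression each pass):
fixed = {"x1": 0.001, "x2": 0.30, "x3": 0.04, "x4": 0.72}
kept = backward_eliminate(fixed, lambda fs: {f: fixed[f] for f in fs})
print(kept)  # ['x1', 'x3']  (x4 removed first, then x2)
```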
Forward selection
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05).

2. Fit simple models y -> xn, one predictor at a time, and select the one with the lowest P-value.

3. Keep this variable and fit all possible models with one extra predictor, i.e., add one predictor to the variables you already have.

4. Consider the new predictor with the lowest P-value. If P < SL, go to step 3; otherwise finish (keep the previous model).
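The forward direction can be sketched the same way. `pvalue_fn` is again a stand-in (an assumption of this sketch) for a routine that fits a model with the already-chosen variables plus one candidate and returns that candidate's P-value; the stub below uses fixed scores for illustration.

```python
def forward_select(candidates, pvalue_fn, sl=0.05):
    """Greedily add the candidate with the lowest P-value while it stays below sl."""
    chosen, remaining = [], list(candidates)
    while remaining:
        # Score each candidate given the variables already chosen
        pvals = {f: pvalue_fn(chosen, f) for f in remaining}
        best = min(remaining, key=pvals.get)
        if pvals[best] >= sl:               # no remaining candidate is significant
            break
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Stub scores for illustration (a real run would fit one regression per candidate):
scores = {"x1": 0.001, "x2": 0.30, "x3": 0.04}
picked = forward_select(scores, lambda chosen, f: scores[f])
print(picked)  # ['x1', 'x3']
```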
Bi-directional Elimination
1. Select a significance level to enter (SL_enter) and to stay (SL_stay) in the model.

2. Perform the next step of forward selection (new variables must have P < SL_enter to enter).

3. Perform all steps of backward elimination (old variables must have P < SL_stay to stay in the model).

4. Repeat steps 2-3 until no new variables can enter and no old variables can exit.

FIN: the model is ready.


All possible models
1. Select a criterion of goodness of fit.
2. Construct all possible models: with N variables there are 2^N - 1 models.
3. Select the model with the best criterion.
4. The model is ready.

• Very computationally intensive!

• We will be using the backward elimination strategy.
MLR Implementation in Python

Multiple Linear Regression
Python Implementation
Importing the dataset
Dataset
• Total 50 samples

• Three independent variables:
 Administration
 Marketing spend
 State (categorical data)
  One hot encoding
  Three categories, so three dummy variables

• One dependent variable:
 Profit
Code
• One hot encoding to be applied on column 3

• 80/20 split
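The two steps on this slide were presumably done with a library class in the original code (not reproduced here); the same idea in plain Python, on made-up rows with the dataset's three states, looks like:

```python
# Illustrative rows: [State, spend] -- values are made up, not the lecture's data
rows = [
    ["New York", 165349.2], ["California", 162597.7],
    ["Florida", 153441.5], ["New York", 144372.4], ["California", 142107.3],
]

# One-hot encode the categorical column (index 0): one dummy column per category
categories = sorted({r[0] for r in rows})  # ['California', 'Florida', 'New York']
encoded = [[1.0 if r[0] == c else 0.0 for c in categories] + r[1:] for r in rows]

# 80/20 train-test split (no shuffling here; library splitters usually shuffle)
split = int(0.8 * len(encoded))
train, test = encoded[:split], encoded[split:]
print(len(train), len(test))  # 4 1
```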
Training and testing the model
Evaluating the model
Some points to remember

Q: Do we need to normalize the data in MLR?
• A: No, we do not need to perform normalization for MLR, since the coefficients b0, b1, b2, … in the MLR model automatically scale each feature.

Q: Do we need to check the assumptions of linear regression?
• A: Not strictly; for a new dataset, play and experiment with it. If there are redundant features, the model will perform poorly.

Q: Do we need to use some strategy to avoid the dummy variable trap?
• A: The class used here in Python will automatically do that.

Q: Do we have to use techniques such as backward elimination before applying MLR?
• A: No, because the class we use will automatically do that.
Example of Backward Elimination
(Optional)
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05).

2. Fit the model with all predictors.

3. Consider the predictor with the highest P-value. If P > SL, go to step 4; otherwise terminate — the remaining predictors form your feature set.

4. Remove the variable with P > SL.

5. Fit the model without that variable and return to step 3.
Code
• Importing the dataset
• Dividing the dataset into independent and dependent variables
• One hot encoding for the categorical data
• We do not need to cater for missing values as there are none
• Inserting beta_0 (a column of ones for the intercept)

x6 has the highest P-value, so it is removed.

x5 has the highest P-value, so it is removed.

Now no independent variable has a P-value > 0.05, so we keep the remaining variables.
Comparison between the two approaches using RMSE

The regression model with backward elimination shows a lower RMSE.
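RMSE, the metric used for this comparison, is straightforward to compute; the arrays below are illustrative, not the lecture's results.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared residual."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # sqrt((1 + 0 + 4) / 3) ~ 1.291
```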


Plotting the output
Example of Forward Selection
(Optional)
1. Select a significance level (SL) for the P-value, e.g. 5% (0.05).

2. Fit simple models y -> xn, one predictor at a time, and select the one with the lowest P-value.

3. Keep this variable and fit all possible models with one extra predictor, i.e., add one predictor to the variables you already have.

4. Consider the new predictor with the lowest P-value. If P < SL, go to step 3; otherwise finish (keep the previous model).
