Multiple Linear Regression and Stepwise Multiple Linear Regression
Multiple Regression
An extension of simple linear regression.
Predicts the value of a variable based on the value of two or more other variables.
For example: whether exam performance can be predicted based on revision time, test anxiety, lecture attendance, and gender.
Whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoking type, income, and gender.
Multiple Regression
Multiple regression also allows us to determine the model's overall fit (variance explained) and the relative contribution of each of the predictors to the total variance explained.
For example, you might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", and the "relative contribution" of each independent variable in explaining the variance.
Assumptions
1: The dependent variable should be measured on a continuous scale (i.e., an interval or ratio variable). If the dependent variable is measured on an ordinal scale, we need to carry out ordinal regression rather than multiple regression.
2: There should be two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e., an ordinal or nominal variable). Examples of nominal variables include gender (e.g., 2 groups: male and female).
If one of the independent variables is dichotomous and considered a moderating variable, we need to run a dichotomous moderator analysis.
Assumptions
3: Independence of observations (i.e., independence of residuals), which can be checked using the Durbin-Watson statistic.
4: There needs to be a linear relationship between (a) the dependent variable and each of your independent variables, and (b) the dependent variable and the independent variables collectively.
Ways to check for these linear relationships: create scatterplots and partial regression plots and visually inspect them for linearity.
If the relationship displayed in the scatterplots and partial regression plots is not linear, we will have to either run a non-linear regression analysis or "transform" the data.
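As an illustration, the linearity check could be done in Python with statsmodels' partial regression (added-variable) plots. This is a minimal sketch on synthetic data; the column names (heart_rate, age, bmi) are placeholders, not the slides' dataset.

```python
# Sketch: visual linearity check with partial regression (added-variable) plots.
# The DataFrame and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()

# One added-variable panel per predictor; look for roughly linear point clouds.
fig = sm.graphics.plot_partregress_grid(model)
fig.tight_layout()
plt.show()
```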
Assumptions
5: Data needs to show homoscedasticity, where the variances along the line of best fit remain similar as you move along the line.
6: Data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated.
Assumptions
7: There should be no significant outliers, high leverage points or highly influential points. Check for influential points using a measure of influence known as Cook's Distance.
8: Finally, check that the residuals (errors) are approximately normally distributed.
Two standard methods to check this assumption include using:
(a) a histogram (with a superimposed normal curve) and a Normal P-P Plot
(b) a Normal Q-Q Plot of the studentized residuals (a sketch follows below).
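A minimal sketch of these normality checks, assuming Python with statsmodels and synthetic data (the column names heart_rate, age, bmi are placeholders):

```python
# Sketch: histogram and Normal Q-Q plot of the studentized residuals.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()
studentized = model.get_influence().resid_studentized_internal

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(studentized, bins=20)              # histogram of studentized residuals
axes[0].set_title("Studentized residuals")
sm.qqplot(studentized, line="45", ax=axes[1])   # Normal Q-Q plot
axes[1].set_title("Normal Q-Q plot")
plt.show()
```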


Results
Assumptions
Normality check: the data significantly deviates from a normal distribution.


Assumption #5: Your data needs to show homoscedasticity, where the variances along the line of best fit remain similar as you move along the line.
The p-value associated with the Breusch-Pagan test is 0.540, which is greater than the common significance level of 0.05, so we fail to reject the null hypothesis of homoscedasticity. This suggests that there is no significant evidence of heteroscedasticity in the data according to the Breusch-Pagan test. Therefore, the data is more likely to exhibit homoscedasticity, meaning that the variance of the errors is constant across observations.
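For reference, a Breusch-Pagan test like the one reported above could be run as follows; this is a sketch on synthetic data, not the slides' dataset:

```python
# Sketch: Breusch-Pagan test for heteroscedasticity with statsmodels.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()

# Returns (LM statistic, LM p-value, F statistic, F p-value).
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.3f}")
# A p-value above 0.05 means we fail to reject the null of homoscedasticity.
```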
Assumption #3: Independence of observations (i.e., independence of residuals), checked using the Durbin-Watson statistic.
Since the Durbin-Watson statistic is close to 2 (1.99), there appears to be no first-order autocorrelation in the residuals.
Additionally, the p-value of 0.720 is greater than the commonly chosen significance level of 0.05.
There is therefore no significant evidence of autocorrelation in the residuals of the model.
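A sketch of computing the Durbin-Watson statistic (statsmodels reports the statistic itself, without a p-value), again on synthetic placeholder data:

```python
# Sketch: Durbin-Watson statistic for independence of residuals.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()

# Values near 2 indicate no first-order autocorrelation; values toward 0 or 4
# indicate positive or negative autocorrelation, respectively.
print(f"Durbin-Watson statistic: {durbin_watson(model.resid):.2f}")
```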
Assumption #6: Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated.
1. VIF (Variance Inflation Factor): VIF measures how much the variance of an estimated regression coefficient is increased due to collinearity. It quantifies the extent to which the variance of the estimated regression coefficients is inflated compared to when the predictors are not correlated. Typically, a VIF value above 10 (some sources suggest 5) indicates a problematic level of collinearity.
2. Tolerance: Tolerance is the reciprocal of VIF (1/VIF). It indicates the proportion of variance in the predictor variable that is not explained by the other predictor variables. Tolerance values close to 1 indicate little multicollinearity, while values close to 0 indicate that the predictor is largely redundant with the others (a sketch for computing both follows below).
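A sketch of computing VIF and Tolerance for each predictor, assuming Python/statsmodels and placeholder column names:

```python
# Sketch: VIF and Tolerance for each predictor.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})

X = sm.add_constant(df[["age", "bmi"]])          # design matrix with intercept
for i, name in enumerate(X.columns):
    if name == "const":
        continue                                 # skip the intercept term
    vif = variance_inflation_factor(X.values, i)
    print(f"{name}: VIF = {vif:.2f}, Tolerance = {1 / vif:.2f}")
```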
Results
Model Fit Measures are commonly used to assess the goodness-of-fit of a regression model.
The value of R is 0.0713, indicating a weak positive linear relationship between the predictor variables and the response variable.
The value of R² is 0.00509, which means that only approximately 0.51% of the variance in the dependent variable is explained by the independent variables in the model. This suggests that the model's explanatory power is very low.
The value of Adjusted R² is 0.00461. This value is similar to R² but takes into account the number of predictors in the model. The adjusted R² is slightly smaller than R², indicating that including the predictor variables did not substantially improve the model's fit after adjusting for the number of predictors.
Overall, based on these model fit measures, the regression model appears to have very weak explanatory power.
Model Fit Measures
1. R: The multiple correlation coefficient measures the strength of the linear relationship between the predictor variables, taken together, and the response variable. It is the correlation between the observed and model-predicted values of the dependent variable and ranges from 0 to 1, where 1 indicates a perfect linear relationship and 0 indicates no linear relationship. (With a single predictor, R is the absolute value of Pearson's r, which ranges from -1 to 1.)
2. R² (R-squared): R-squared represents the proportion of variance in the
dependent variable that is explained by the independent variables in the
regression model. It ranges from 0 to 1, with higher values indicating that the
independent variables explain a larger proportion of the variance in the
dependent variable. In other words, it quantifies the goodness-of-fit of the
regression model.
3. Adjusted R²: Adjusted R-squared is a modified version of R-squared that
adjusts for the number of predictor variables in the model. It penalizes adding
additional predictors that do not significantly improve the model's fit. Adjusted
R-squared can be particularly useful when comparing models with different
numbers of predictors.
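A sketch showing where these three measures come from in a fitted model, assuming Python/statsmodels and synthetic placeholder data:

```python
# Sketch: reading R, R-squared and adjusted R-squared from a fitted OLS model.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 500), "bmi": rng.uniform(18, 35, 500)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 10, 500)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()
r_squared = model.rsquared
print(f"R = {np.sqrt(r_squared):.4f}")           # multiple correlation coefficient
print(f"R² = {r_squared:.5f}")
print(f"Adjusted R² = {model.rsquared_adj:.5f}")
```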
Predictor: The predictors are 'age' and 'BMI'.
Estimate: This column provides each predictor variable's estimated coefficients (or slopes).
These coefficients represent the expected change in the dependent variable for a one-unit
change in the corresponding predictor variable, holding all other predictors constant.
Intercept (72.0783): This represents the estimated heart rate when all other predictors are
zero. In this case, it suggests that when age and BMI are zero, the estimated heart rate is
approximately 72.0783 beats per minute.
The estimated coefficient for 'age' is -0.0319. This indicates that, on average, heart rate decreases by 0.0319 units for every one-unit increase in age, holding the other variables constant. Since the p-value associated with age is 0.143, which is greater than the common significance level of 0.05, we do not have enough evidence to conclude that age has a significant effect on heart rate in this model.
The estimated coefficient for 'BMI' is 0.2086.
The coefficient for BMI suggests that for every one-unit increase in BMI, the estimated
heart rate increases by approximately 0.2086 beats per minute.
Since the p-value associated with BMI is less than 0.001, we can conclude that BMI has a
statistically significant effect on heart rate in this model.
Standardized Estimate: These indicate the relative importance of each predictor in the model after standardizing the variables. A standardized coefficient represents the change in the response variable (heart rate), in standard deviations, for a one-standard-deviation change in the predictor variable. It is useful for comparing the relative importance of predictors in the model.
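One common way to obtain standardized (beta) estimates is to z-score all variables before fitting; a sketch on synthetic placeholder data:

```python
# Sketch: standardized (beta) coefficients by z-scoring every variable first.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 300), "bmi": rng.uniform(18, 35, 300)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 300)

z = (df - df.mean()) / df.std()                  # z-score every column
beta = smf.ols("heart_rate ~ age + bmi", data=z).fit().params.drop("Intercept")
print(beta)   # change in heart rate (in SDs) per one-SD change in each predictor
```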
Check the Cook's Distance statistic in the regression output. Cook's Distance measures the influence of each observation on the regression coefficients. High Cook's Distance values (> 1) indicate influential outliers that may significantly affect the regression model.
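A sketch of extracting Cook's Distance for every observation and flagging values above the common cut-off of 1 (synthetic placeholder data):

```python
# Sketch: Cook's Distance for each observation from a fitted OLS model.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.uniform(20, 70, 200), "bmi": rng.uniform(18, 35, 200)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, 200)

model = smf.ols("heart_rate ~ age + bmi", data=df).fit()
cooks_d, _ = model.get_influence().cooks_distance

# Flag observations whose Cook's Distance exceeds the common cut-off of 1.
influential = np.where(cooks_d > 1)[0]
print(f"Max Cook's Distance: {cooks_d.max():.2e}")
print(f"Influential observations (D > 1): {influential}")
```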

Results
e-4 notation
The notation "e-4" represents a number in scientific notation, where "e" stands for "exponent" and "-4" indicates that the decimal point is moved four places to the left. Specifically, "e-4" is equivalent to multiplying the number by 10⁻⁴, or dividing it by 10,000.
For example:
7.68e-5 is equivalent to 7.68 × 10⁻⁵, which is 0.0000768.
6.83e-4 is equivalent to 6.83 × 10⁻⁴, which is 0.000683.
So, in the context of Cook's Distance:
7.68e-5 means 0.0000768.
6.83e-4 means 0.000683.
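The same notation can be checked directly in Python, which uses "e" notation for scientific notation as well:

```python
# Sketch: "e" notation is ordinary scientific notation.
print(f"{7.68e-5:.7f}")   # prints 0.0000768
print(f"{6.83e-4:.6f}")   # prints 0.000683
```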
Multiple Regression with a Categorical Variable
The coefficient for gender indicates the difference in estimated heart rate
between males and females.
Specifically, the coefficient of -3.0294 bpm suggests that, on average, males
have a heart rate approximately 3.0294 bpm lower than females when
controlling for other predictors in the model.
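A sketch of fitting such a model with a dummy-coded gender variable, assuming Python/statsmodels formulas and synthetic placeholder data (the reference level here is 'female', so the gender coefficient is the male-female difference):

```python
# Sketch: multiple regression with a categorical predictor via dummy coding.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "age": rng.uniform(20, 70, n),
    "bmi": rng.uniform(18, 35, n),
    "gender": rng.choice(["female", "male"], n),
})
df["heart_rate"] = (72 - 0.03 * df["age"] + 0.2 * df["bmi"]
                    - 3.0 * (df["gender"] == "male") + rng.normal(0, 5, n))

# C(gender) creates a dummy variable; 'female' is the reference level, so the
# gender coefficient is the estimated male-female difference in heart rate.
model = smf.ols("heart_rate ~ age + bmi + C(gender)", data=df).fit()
print(model.params)
```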
Hierarchical Regression
Hierarchical regression is a statistical method used to examine the incremental contribution of predictor variables to the variance explained in an outcome variable. It involves entering predictor variables into the regression equation in a pre-specified sequence of blocks, typically based on theoretical or logical considerations.
In hierarchical regression, predictors are added to the model in
separate blocks or steps, allowing researchers to assess the unique
contribution of each set of predictors while controlling for the effects of
previous sets of predictors.
This technique is commonly used to understand how additional
predictors improve the prediction of an outcome variable beyond what
is accounted for by earlier predictors.
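Hierarchical (blockwise) entry can be sketched as a comparison of nested models, reporting the change in R² when a block of predictors is added; the data and column names below are placeholders:

```python
# Sketch: hierarchical regression as a comparison of nested OLS models.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"age": rng.uniform(20, 70, n), "bmi": rng.uniform(18, 35, n)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, n)

block1 = smf.ols("heart_rate ~ age", data=df).fit()          # step 1: control variable
block2 = smf.ols("heart_rate ~ age + bmi", data=df).fit()    # step 2: add BMI

print(f"Delta R² = {block2.rsquared - block1.rsquared:.4f}")
print(anova_lm(block1, block2))   # F test for the incremental contribution
```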
Stepwise Multiple Linear Regression
Stepwise regression is a step-by-step construction of a regression model in which the independent variables to include are selected automatically.
It involves systematically adding or removing predictor variables from the model based on their statistical significance.
The process typically proceeds in one of two directions: forward selection or backward elimination.
Stepwise regression is a useful tool for identifying a parsimonious set of predictor variables that best explain the variability in the outcome variable.
Forward Selection
1. Start with an empty model (i.e., no predictor variables included).
2. Add one predictor variable at a time based on a predefined criterion, such as the highest correlation with the outcome variable or the lowest p-value from a univariate regression.
3. At each step, assess the statistical significance of the added variable using a predefined significance level (e.g., p-value < 0.05).
4. Continue adding variables until no more variables meet the inclusion criterion (a minimal sketch follows this list).
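The forward-selection sketch below uses a p-value entry criterion; the data, column names and alpha level are illustrative assumptions:

```python
# Sketch: forward selection with a p-value entry criterion (alpha = 0.05).
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"age": rng.uniform(20, 70, n),
                   "bmi": rng.uniform(18, 35, n),
                   "noise": rng.normal(size=n)})
y = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, n)

selected, remaining, alpha = [], list(df.columns), 0.05
while remaining:
    # p-value of each candidate when added to the predictors already selected
    pvals = {}
    for cand in remaining:
        X = sm.add_constant(df[selected + [cand]])
        pvals[cand] = sm.OLS(y, X).fit().pvalues[cand]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= alpha:          # no candidate meets the entry criterion
        break
    selected.append(best)
    remaining.remove(best)

print("Selected predictors:", selected)
```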
Backward Elimination
1. Start with the full model (i.e., all predictor variables included).
2. Remove one predictor variable at a time based on a predefined criterion, typically the variable with the highest p-value in the current multivariable model.
3. At each step, remove a variable only if it fails a predefined significance level (e.g., p-value > 0.05).
4. Continue removing variables until no remaining variable meets the exclusion criterion (a minimal sketch follows this list).
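A matching backward-elimination sketch using a p-value removal criterion; again, the data and threshold are illustrative:

```python
# Sketch: backward elimination with a p-value removal criterion (alpha = 0.05).
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({"age": rng.uniform(20, 70, n),
                   "bmi": rng.uniform(18, 35, n),
                   "noise": rng.normal(size=n)})
y = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, n)

kept, alpha = list(df.columns), 0.05
while kept:
    model = sm.OLS(y, sm.add_constant(df[kept])).fit()
    pvals = model.pvalues.drop("const")     # ignore the intercept
    worst = pvals.idxmax()
    if pvals[worst] <= alpha:               # every remaining predictor is significant
        break
    kept.remove(worst)                      # drop the least significant predictor

print("Remaining predictors:", kept)
```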
Stepwise Procedure
Stepwise regression alternates between forward selection and
backward elimination steps until a stopping criterion is met.
Common stopping criteria include:
A predefined number of steps.
No more variables meet the inclusion or exclusion criteria.
The model's performance (e.g., adjusted R-squared) no longer
improves significantly with additional variables.
Model Evaluation
Evaluate the final model using appropriate model fit measures
(e.g., R-squared, AIC, BIC) and diagnostic tests (e.g., residual
analysis, multicollinearity assessment).
Validate the model's performance on an independent dataset, if
available, to assess its generalizability.
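A sketch of reading AIC/BIC from a fitted model and checking performance on a held-out split; the data and the 75/25 split are illustrative assumptions:

```python
# Sketch: model fit measures plus a simple hold-out check of generalizability.
# Data and column names are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({"age": rng.uniform(20, 70, n), "bmi": rng.uniform(18, 35, n)})
df["heart_rate"] = 72 - 0.03 * df["age"] + 0.2 * df["bmi"] + rng.normal(0, 5, n)

train, test = df.iloc[:300], df.iloc[300:]
model = smf.ols("heart_rate ~ age + bmi", data=train).fit()

print(f"R² = {model.rsquared:.3f}, AIC = {model.aic:.1f}, BIC = {model.bic:.1f}")

# Out-of-sample check: root-mean-squared error on the held-out data.
pred = model.predict(test)
rmse = np.sqrt(np.mean((test["heart_rate"] - pred) ** 2))
print(f"Hold-out RMSE = {rmse:.2f}")
```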
