Regression Analysis Presentation
PRESENTED BY:
SHAYAN AHMED
ROLL#0052-MSCGEC-21
Definition
Regression analysis, also called multivariate analysis, is a statistical technique that is
used to measure and describe the (nature of the) function relating two (or more)
variables.
When paired with assumptions in the form of a statistical model, regression can be used
for prediction, inference, hypothesis testing, and modeling of causal relationships.
Uses of Regression Analysis
Simple linear regression (one IV predicts the value of the DV).
Multiple regression (two or more IVs predict the value of the DV).
Assumption #4:
There should be no significant outliers. An outlier is an observed data point that has a dependent variable
value that is very different from the value predicted by the regression equation. As such, an outlier will be a point
on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual, as
highlighted below:
The problem with outliers is that they can have a negative effect on the regression analysis (e.g., reduce the fit of
the regression equation) that is used to predict the value of the dependent (outcome) variable based on the
independent (predictor) variable. This will change the output that SPSS produces and reduce the predictive
accuracy of your results. Fortunately, when using SPSS Statistics to run a linear regression on your data, you can
easily include criteria to help you detect possible outliers.
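If you want to screen for outliers outside SPSS, the sketch below is one minimal way to do it in Python with statsmodels; the file and column names are made up for illustration. Cases whose standardized residuals fall beyond ±3 are the usual candidates:

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data set: 'x' is the independent variable, 'y' the dependent variable
df = pd.read_csv("data.csv")
model = sm.OLS(df["y"], sm.add_constant(df["x"])).fit()

# Standardized residual = residual / residual standard error.
# SPSS's casewise diagnostics flag cases beyond +/-3 by default.
std_resid = model.resid / np.sqrt(model.mse_resid)
print(df[np.abs(std_resid) > 3])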
Assumption #5:
You should have independence of observations, which you can easily check using the Durbin-Watson statistic, a simple test
to run using SPSS Statistics.
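Outside SPSS, the same statistic is available in statsmodels; a one-line sketch, assuming the fitted results object model from the earlier outlier sketch:

from statsmodels.stats.stattools import durbin_watson

# Values near 2 suggest independent residuals; values toward 0 or 4
# suggest positive or negative autocorrelation, respectively.
print(durbin_watson(model.resid))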
Assumption #6:
Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along
the line. Take a look at the three scatterplots below, which provide three simple examples: two of data that fail the assumption (called
heteroscedasticity) and one of data that meets this assumption (called homoscedasticity):
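The equivalent visual check outside SPSS is a residuals-versus-fitted plot; a minimal sketch, again assuming the fitted model from the earlier sketch. A roughly constant vertical spread suggests homoscedasticity, while a funnel shape suggests heteroscedasticity:

import matplotlib.pyplot as plt

plt.scatter(model.fittedvalues, model.resid)  # residuals against predicted values
plt.axhline(0, linestyle="--")                # reference line at zero residual
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()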
Assumption #7:
Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed. Two common
methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot.
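Outside SPSS, a Q-Q plot (a close relative of the Normal P-P Plot) serves the same purpose; a sketch with scipy, assuming the fitted model from the earlier sketches:

import matplotlib.pyplot as plt
from scipy import stats

# Points should lie close to the diagonal if the residuals are normal.
stats.probplot(model.resid, dist="norm", plot=plt)
plt.show()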
NOTE:
You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. Assumption #3 should be checked first, before moving
onto assumptions #4, #5, #6 and #7. We test the assumptions in this order because assumptions #4, #5, #6 and #7 require you to run
the linear regression procedure in SPSS Statistics first, so it is easier to deal with these after checking assumptions #1 and #2.
Importance of linear regression
Example
For example, we might want to determine whether there is a relationship between an individual's income and the
price they pay for a car. As such, the individual's "income" is the independent variable and the "price" they pay
for a car is the dependent variable.
SPSS Setup
In SPSS, a third variable (a case number) is used to make it easy for you
to eliminate cases (e.g., significant outliers) that
you have identified when checking for
assumptions. However, we do not include it in
the SPSS Statistics procedure that follows
because we have already checked these
assumptions.
1. Click Analyze > Regression > Linear... on the top menu, as shown below:
You will be presented with the Linear Regression dialogue box:
2. Transfer the independent variable, income, into the Independent(s): box and the
dependent variable, price, into the Dependent: box. You can do this by either
drag-and-dropping the variables or by using the appropriate buttons. You will end up
with the following screen:
3. You now need to check four of the assumptions discussed in
the Assumptions section above: no significant outliers (assumption #4);
independence of observations (assumption #5); homoscedasticity (assumption #6);
and normal distribution of errors/residuals (assumption #7). You can do this by
using the Statistics and Plots
features, and then selecting the appropriate options within these two dialogue
boxes.
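For comparison, the whole procedure can be reproduced outside SPSS in a few lines of Python; the file name is made up, and the column names follow the income/price example:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("car_prices.csv")      # hypothetical data file
X = sm.add_constant(df["income"])       # add_constant supplies the intercept term
model = sm.OLS(df["price"], X).fit()
print(model.summary())                  # R-squared, ANOVA F-test and coefficients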
Model summary
The first table of interest is the Model Summary table, as shown below:
This table provides the R and R2 values. The R value represents the simple correlation and is 0.873
(the "R" Column), which indicates a high degree of correlation. The R2 value (the "R Square"
column) indicates how much of the total variation in the dependent variable, price, can be explained
by the independent variable, income. In this case, 76.2% can be explained, which is very large.
ANOVA table
The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts
the dependent variable) and is shown below:
This table indicates that the regression model predicts the dependent variable significantly well. How
do we know this? Look at the "Regression" row and go to the "Sig." column. This indicates the
statistical significance of the regression model that was run. Here, p < 0.0005, which is less than
0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome
variable (i.e., it is a good fit for the data).
Coefficients table
The Coefficients table provides us with the necessary information to predict price
from income, as well as determine whether income contributes statistically
significantly to the model (by looking at the "Sig." column). Furthermore, we can
use the values in the "B" column under the "Unstandardized Coefficients" column,
as shown below:
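In general form, the B column supplies the intercept (the "(Constant)" row) and the slope (the "income" row), giving the prediction equation:

$$\widehat{\text{price}} = b_0 + b_1 \times \text{income}$$

where b0 is the unstandardized coefficient for the constant and b1 is the unstandardized coefficient for income.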
APA table and Interpretation
Multiple regression
What is multiple regression?
►Multiple regression is an extension of simple
linear regression.
►It is used when we want to predict the value of a
variable based on the value of two or more other
variables.
►The variable we want to predict is called the
dependent variable (or sometimes, the outcome,
target or criterion variable).
►The variables we are using to predict the value
of the dependent variable are called the
independent variables (or sometimes, the
predictor, explanatory or regressor variables).
Example
For example, we could use multiple regression to
understand whether exam performance can be predicted
based on revision time, test anxiety, lecture attendance and
gender. Alternatively, we could use multiple regression to
understand whether daily cigarette consumption can be
predicted based on smoking duration, age when started
smoking, smoker type, income and gender.
Why multiple regression?
Assumption #1:
Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion
include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and
so forth. If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples
of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories
(e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").
Assumption #2:
You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e.,
an ordinal or nominal variable). Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African
American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist,
therapist), and so forth.
Assumption #3:
You should have independence of observations (i.e., independence of residuals), which you can easily
check using the Durbin-Watson statistic, a simple test to run using SPSS Statistics.
Assumption #4:
There needs to be a linear relationship between (a) the dependent variable and each of your independent
variables, and (b) the dependent variable and the independent variables collectively. Whilst there are a
number of ways to check for these linear relationships, we suggest creating scatterplots and partial
regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression
plots to check for linearity.
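statsmodels can produce comparable partial regression plots; a minimal sketch, assuming a fitted multiple regression results object named model (such as the one in the VO2max sketch later in this deck):

import matplotlib.pyplot as plt
import statsmodels.api as sm

# Draws one partial regression plot per independent variable.
sm.graphics.plot_partregress_grid(model)
plt.show()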
Assumption #5:
Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move
along the line.
Assumption #6:
Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly
correlated with each other.
Assumption #7:
There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and
influential points are different terms used to represent observations in your data set that are in some way unusual when you
wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact
they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these
points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable
based on the independent variables. This can change the output that SPSS produces and reduce the predictive accuracy of
your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your
data, you can detect possible outliers, high leverage points and highly influential points.
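If you are screening for these points outside SPSS, leverage and Cook's distance are available from statsmodels' influence diagnostics; a sketch, again assuming a fitted results object named model:

import numpy as np

influence = model.get_influence()
leverage = influence.hat_matrix_diag     # flags high leverage points
cooks_d = influence.cooks_distance[0]    # flags highly influential points

# A common rule of thumb: Cook's distance above 1 deserves a closer look.
print(np.where(cooks_d > 1))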
Assumption #8:
Finally, you need to check that the residuals (errors) are approximately normally distributed. Two common methods to
check this assumption include using:
•(a) a histogram (with a superimposed normal curve) and a Normal P-P Plot or
•(b) a Normal Q-Q Plot of the studentized residuals.
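Option (b) can be sketched outside SPSS with scipy, using statsmodels' externally studentized residuals (assuming a fitted results object named model):

import matplotlib.pyplot as plt
from scipy import stats

stud_resid = model.get_influence().resid_studentized_external
stats.probplot(stud_resid, dist="norm", plot=plt)  # Normal Q-Q plot
plt.show()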
NOTE
You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be
checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8.
Example
A health researcher wants to be able to predict "VO2max", an
indicator of fitness and health. Normally, to perform this
procedure requires expensive laboratory equipment and
necessitates that an individual exercise to their maximum. This
can put off those individuals who are not very active/fit and
those individuals who might be at higher risk of ill health. For
these reasons, it has been desirable to find a way of predicting
an individual's VO2max based on attributes that can be
measured more easily and cheaply. To this end, a researcher
recruited 100 participants to perform a maximum VO2max test,
but also recorded their "age", "weight", "heart rate" and
"gender". Heart rate is the average of the last 5 minutes of a 20
minute, much easier, lower workload cycling test. The
researcher's goal is to be able to predict VO2max based on these
four attributes: age, weight, heart rate and gender.
SPSS Setup
In SPSS, we created six variables: (1) VO2max, which is the maximal aerobic
capacity, (2) age, which is the participant's age, (3) weight, which is the
participant's weight, (4) heart rate, which is the participant's heart rate, (5) gender,
which is the participant's gender, and (6) caseno, which is the case number.
This variable is used to make it easy for you to eliminate cases (e.g., "significant
outliers", "high leverage points" and "highly influential points") that you have
identified when checking for assumptions.
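The equivalent setup outside SPSS might look like the sketch below; the file name is made up, the column names follow the variables just described (with heart rate renamed heart_rate), and gender is assumed to be coded 0/1:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("vo2max.csv")  # hypothetical data file
X = sm.add_constant(df[["age", "weight", "heart_rate", "gender"]])
model = sm.OLS(df["VO2max"], X).fit()
print(model.summary())          # R, R-squared, adjusted R-squared and coefficients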
Steps in SPSS
The seven steps below show you how to analyse
your data using multiple regression in SPSS Statistics.
The first table of interest is the Model Summary table. This table provides the R, R2, adjusted R2, and the standard
error of the estimate, which can be used to determine how well a regression model fits the data:
The "R" column represents the value of R, the multiple correlation coefficient. R can be considered to be one
measure of the quality of the prediction of the dependent variable; in this case, VO2max. A value of 0.760, in this
example, indicates a good level of prediction. The "R Square" column represents the R2 value (also called the
coefficient of determination), which is the proportion of variance in the dependent variable that can be explained
by the independent variables (it is the proportion of variation accounted for by the regression model above and
beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the
variability of our dependent variable. However, you also need to be able to interpret "Adjusted R Square" (adj.
R2) to accurately report your data.
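For reference, adjusted R2 penalizes R2 for the number of predictors; with n cases and p independent variables:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

With R2 = .577, n = 100 and p = 4 from this example, this works out to roughly .559.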
ANOVA table
Unstandardized coefficients
indicate how much the dependent
variable varies with an independent
variable when all other independent
variables are held constant.
Consider the effect of age in this
example. The unstandardized
coefficient, B1, for age is equal to
-0.165. This means that for each one
year increase in age, there is a
decrease in VO2max of 0.165
ml/min/kg.
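As a quick worked example of reading this coefficient, a participant who is 10 years older is predicted to have a lower VO2max, all other variables held constant:

$$\Delta \text{VO2max} = B_1 \times \Delta \text{age} = -0.165 \times 10 = -1.65 \text{ ml/min/kg}$$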
Statistical significance of the independent
variables
You can test for the statistical significance of each of the independent variables. This tests whether the
unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can
conclude that the coefficients are statistically significantly different to 0 (zero). The t-value and
corresponding p-value are located in the "t" and "Sig." columns, respectively, as highlighted below:
You can see from the "Sig." column that all independent variable coefficients are statistically significantly
different from 0 (zero). Although the intercept, B0, is tested for statistical significance, this is rarely an important
or interesting finding.
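The t-value in this table is simply each unstandardized coefficient divided by its standard error, evaluated on n - p - 1 = 95 degrees of freedom in this example:

$$t = \frac{B}{SE_B}$$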
APA table
Interpretation
The multiple regression model statistically significantly predicted
VO2max, F(4, 95) = 32.393, p < .0005, R2 = .577. All four variables added
statistically significantly to the prediction, p < .05.
SPSS command:
Analyze > Regression > Linear... > Method: Stepwise > OK
Hierarchical regression
Hierarchical regression is a way to show whether variables of your interest explain a statistically significant amount of variance in your Dependent Variable
(DV) after accounting for all other variables. It is a framework for model comparison rather than a statistical method in its own right.
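A minimal sketch of this model-comparison idea in Python, using statsmodels' anova_lm to test the R2 change between a control-variables-only model and a model that adds the predictor of interest; all file and variable names here are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("study.csv")  # hypothetical data set

# Block 1: control variables only; Block 2: adds the predictor of interest.
m1 = smf.ols("outcome ~ age + gender", data=df).fit()
m2 = smf.ols("outcome ~ age + gender + predictor", data=df).fit()

print(anova_lm(m1, m2))              # F-test for the improvement of Block 2 over Block 1
print(m2.rsquared - m1.rsquared)     # the R-squared change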
References
• https://lms.su.edu.pk/download?filename=1588697869-julie-pallant-spss-survival-manual-mcgraw-hill-house-2016-1.pdf&lesson=17247
•https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
•https://www.spss-tutorials.com/basics/
•https://www.ibm.com/support/pages/hierarchical-regression-spss
•https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php