
Regression Analysis

PRESENTED BY:
SHAYAN AHMED
ROLL#0052-MSCGEC-21
Definition

Regression analysis, also called multivariate analysis, is a statistical technique used to measure and describe the (nature of the) function relating two or more variables.

Regression analysis is a powerful statistical method that allows you to examine the relationship between two or more variables of interest. While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.

In statistics, regression analysis is a technique which examines the relation of a dependent variable (response variable) to specified independent variables (explanatory variables). Regression analysis can be used as a descriptive method of data analysis without relying on any assumptions about the underlying processes generating the data.

When paired with assumptions in the form of a statistical model, regression can be used for prediction, inference, hypothesis testing, and modeling of causal relationships.
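
In its simplest form, this function is a straight line plus an error term. A minimal sketch of the model in standard notation (our addition, not taken from the slides):

Y = \beta_0 + \beta_1 X + \varepsilon

where Y is the dependent variable, X is the independent variable, \beta_0 is the intercept, \beta_1 is the slope, and \varepsilon is the random error.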
Uses of Regression Analysis

While there are many types of regression analysis, at their core they all examine the influence of one or more independent variables on a dependent variable.

The main uses of regression analysis are forecasting, time-series modeling, and finding the cause-and-effect relationship between variables.
Understanding regression analysis

• Regression analysis is a reliable method of identifying which variables have an impact on a topic of interest. The process of performing a regression allows you to confidently determine which factors matter most, which factors can be ignored, and how these factors influence each other.
• In order to understand regression analysis fully, it's essential to comprehend the following terms:
• Dependent Variable: This is the main factor that you're trying to understand or predict.
• Independent Variables: These are the factors that you hypothesize have an impact on your dependent variable.
Difference between correlation and regression

Correlation describes the relationship between variables, but it does not predict one variable from another.

For prediction, regression analysis is used.
Types of regression analysis

The following are the types of regression analysis:

 Linear regression (one IV is used to predict the value of the DV).
 Multiple regression (two or more IVs are used to predict the value of the DV).
 Logistic regression (used when the outcome variable is categorical).
Linear regression

 Simple linear regression assesses the relationship between a dependent variable and an independent variable.

 Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
When to use?

Linear regression is the next step up after correlation. It is used when we want to predict the value of a variable based on the value of another variable. The variable we want to predict is called the dependent variable (or sometimes, the outcome variable). The variable we are using to predict the other variable's value is called the independent variable (or sometimes, the predictor variable). For example, you could use linear regression to understand whether exam performance can be predicted based on revision time, or whether cigarette consumption can be predicted based on smoking duration, and so forth.
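
To make the exam-performance example concrete, here is a minimal sketch in Python using statsmodels; the data values and variable names are invented for illustration, not taken from the slides:

import numpy as np
import statsmodels.api as sm

# Invented example data: revision time (hours) and exam score (0-100)
revision_hours = np.array([2, 4, 5, 7, 8, 10, 12, 14])
exam_score = np.array([45, 52, 55, 62, 64, 71, 78, 83])

# Fit the simple linear regression: score = b0 + b1 * hours
X = sm.add_constant(revision_hours)   # adds the intercept column
model = sm.OLS(exam_score, X).fit()

print(model.params)     # b0 (intercept) and b1 (slope)
print(model.rsquared)   # proportion of variance explained (R^2)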
Assumptions

When you choose to analyse your data using linear regression, part of the process involves checking to make sure that the data you want to analyse can actually be analysed using linear regression. You need to do this because it is only appropriate to use linear regression if your data "passes" seven assumptions that are required for linear regression to give you a valid result.

Before we come to these seven assumptions, note that when analysing real data in SPSS, one or more of these assumptions may be violated (i.e., not met). This is not uncommon when working with real-world data rather than textbook examples, which often only show you how to carry out linear regression when everything goes well.
Assumption #1
Your dependent variable should be measured at the continuous level (i.e., it is either
an interval or ratio variable). Examples of continuous variables include revision time (measured in hours),
intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg),
and so forth.
Assumption #2
Your independent variable should also be measured at the continuous level (i.e., it is either
an interval or ratio variable).
Assumption #3
There needs to be a linear relationship between the two variables. Whilst there are a number of ways to check
whether a linear relationship exists between your two variables, we suggest creating a scatterplot using SPSS
where you can plot the dependent variable against your independent variable and then visually inspect the
scatterplot to check for linearity.
Assumption #4

There should be no significant outliers. An outlier is an observed data point that has a dependent variable value that is very different from the value predicted by the regression equation. As such, an outlier will be a point on a scatterplot that is (vertically) far away from the regression line, indicating that it has a large residual.

The problem with outliers is that they can have a negative effect on the regression analysis (e.g., reduce the fit of
the regression equation) that is used to predict the value of the dependent (outcome) variable based on the
independent (predictor) variable. This will change the output that SPSS produces and reduce the predictive
accuracy of your results. Fortunately, when using SPSS Statistics to run a linear regression on your data, you can
easily include criteria to help you detect possible outliers.
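
In SPSS this is done through the Casewise diagnostics option; as a rough equivalent outside SPSS, here is a sketch in Python that flags large standardized residuals (the |z| > 3 cutoff is a common convention we are assuming, not something stated on the slides):

import numpy as np
import statsmodels.api as sm

def flag_outliers(y, x, threshold=3.0):
    """Flag observations whose standardized residual exceeds the threshold."""
    X = sm.add_constant(np.asarray(x))
    results = sm.OLS(np.asarray(y), X).fit()
    influence = results.get_influence()
    z = influence.resid_studentized_internal   # internally studentized residuals
    return np.where(np.abs(z) > threshold)[0]  # indices of possible outliers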
Assumption #5
You should have independence of observations, which you can easily check using the Durbin-Watson statistic, which is a simple test
to run using SPSS Statistics.

Assumption #6
Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line. Data that fail this assumption are called heteroscedastic; data that meet it are called homoscedastic.

Assumption #7:
Finally, you need to check that the residuals (errors) of the regression line are approximately normally distributed. Two common
methods to check this assumption include using either a histogram (with a superimposed normal curve) or a Normal P-P Plot.

NOTE:
You can check assumptions #3, #4, #5, #6 and #7 using SPSS Statistics. Assumptions #1 and #2 should be checked first, before moving on to assumptions #3, #4, #5, #6 and #7. Test the assumptions in this order because assumptions #3, #4, #5, #6 and #7 require you to run the linear regression procedure in SPSS Statistics first, so it is easier to deal with these after checking assumptions #1 and #2.
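
For readers scripting these checks outside SPSS, here is a rough Python sketch of the independence and residual-normality checks (assumptions #5 and #7); the function and variable names are our own:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from scipy import stats
import matplotlib.pyplot as plt

def check_assumptions(y, x):
    X = sm.add_constant(np.asarray(x))
    results = sm.OLS(np.asarray(y), X).fit()

    # Assumption #5: independence of observations (values near 2 suggest no autocorrelation)
    print("Durbin-Watson:", durbin_watson(results.resid))

    # Assumption #7: approximate normality of the residuals
    plt.hist(results.resid, bins=15)          # histogram of residuals
    plt.title("Residual histogram")
    plt.show()
    stats.probplot(results.resid, plot=plt)   # probability plot of residuals
    plt.show()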
Importance of linear regression

• Linear-regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions. Linear regression can be applied to various areas of academic study.
• You'll find that linear regression is used in everything from biological, behavioral, environmental and social sciences to business. Linear-regression models have become a proven way to scientifically and reliably predict the future.
Example

A salesperson for a large car brand wants to determine whether there is a relationship between an individual's income and the price they pay for a car. As such, the individual's "income" is the independent variable and the "price" they pay for a car is the dependent variable. The salesperson wants to use this information to determine which cars to offer potential customers in new areas where average income is known.
SPSS Setup

In SPSS Statistics, we created two variables so that we could enter our data: income (the independent variable) and price (the dependent variable). It can also be useful to create a third variable to act as a chronological case number. This third variable is used to make it easy for you to eliminate cases (e.g., significant outliers) that you have identified when checking the assumptions. However, we do not include it in the SPSS Statistics procedure that follows because we have already checked these assumptions.
1. Click Analyze > Regression > Linear... on the top menu.

2. You will be presented with the Linear Regression dialogue box. Transfer the independent variable, income, into the Independent(s): box and the dependent variable, price, into the Dependent: box. You can do this by either drag-and-dropping the variables or by using the appropriate buttons.

3. You now need to check four of the assumptions discussed in the Assumptions section above: no significant outliers (assumption #4); independence of observations (assumption #5); homoscedasticity (assumption #6); and normal distribution of errors/residuals (assumption #7). You can do this by using the Statistics and Plots features, and then selecting the appropriate options within these two dialogue boxes.

4. Click on the OK button. This will generate the results.

Output of linear regression

SPSS will generate three tables of output for a linear regression.

Model summary

The first table of interest is the Model Summary table. This table provides the R and R2 values. The R value represents the simple correlation and is 0.873 (the "R" column), which indicates a high degree of correlation. The R2 value (the "R Square" column) indicates how much of the total variation in the dependent variable, price, can be explained by the independent variable, income. In this case, 76.2% can be explained, which is very large.
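
As a quick arithmetic check (our computation, not shown on the slide), R2 is simply the square of R:

R^2 = 0.873^2 \approx 0.762

which is where the 76.2% figure comes from.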
ANOVA table

The next table is the ANOVA table, which reports how well the regression equation fits the data (i.e., predicts the dependent variable).

This table indicates that the regression model predicts the dependent variable significantly well. How
do we know this? Look at the "Regression" row and go to the "Sig." column. This indicates the
statistical significance of the regression model that was run. Here, p < 0.0005, which is less than
0.05, and indicates that, overall, the regression model statistically significantly predicts the outcome
variable (i.e., it is a good fit for the data).
Coefficients table

The Coefficients table provides us with the necessary information to predict price from income, as well as to determine whether income contributes statistically significantly to the model (by looking at the "Sig." column). Furthermore, we can use the values in the "B" column under the "Unstandardized Coefficients" heading to build the prediction equation.
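
The coefficient values themselves are not reproduced here, but in general the two B values define the regression equation (our notation):

\text{price} = b_0 + b_1 \times \text{income}

where b_0 is the intercept (the "(Constant)" row) and b_1 is the slope for income.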
APA table and Interpretation
Multiple regression

What is multiple regression?

►Multiple regression is an extension of simple linear regression.
►It is used when we want to predict the value of a variable based on the value of two or more other variables.
►The variable we want to predict is called the dependent variable (or sometimes, the outcome, target or criterion variable).
►The variables we are using to predict the value of the dependent variable are called the independent variables (or sometimes, the predictor, explanatory or regressor variables).
Example

For example, we could use multiple regression to understand whether exam performance can be predicted based on revision time, test anxiety, lecture attendance and gender. Alternately, we could use multiple regression to understand whether daily cigarette consumption can be predicted based on smoking duration, age when started smoking, smoker type, income and gender.
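
In equation form (standard notation, our addition), a multiple regression with k predictors is:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon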
Why multiple regression?

Multiple regression also allows us to determine the overall fit (variance explained) of the model and the relative contribution of each of the predictors to the total variance explained.

For example, we might want to know how much of the variation in exam performance can be explained by revision time, test anxiety, lecture attendance and gender "as a whole", but also the "relative contribution" of each independent variable in explaining the variance.
Assumptions

When we choose to analyse our data using multiple regression, part of the process involves checking to make sure that the data we want to analyse can actually be analysed using multiple regression.

Assumption #1:

Your dependent variable should be measured on a continuous scale (i.e., it is either an interval or ratio variable). Examples of variables that meet this criterion
include revision time (measured in hours), intelligence (measured using IQ score), exam performance (measured from 0 to 100), weight (measured in kg), and
so forth. If your dependent variable was measured on an ordinal scale, you will need to carry out ordinal regression rather than multiple regression. Examples
of ordinal variables include Likert items (e.g., a 7-point scale from "strongly agree" through to "strongly disagree"), amongst other ways of ranking categories
(e.g., a 3-point scale explaining how much a customer liked a product, ranging from "Not very much" to "Yes, a lot").

Assumption #2:

You have two or more independent variables, which can be either continuous (i.e., an interval or ratio variable) or categorical (i.e.,
an ordinal or nominal variable). Examples of nominal variables include gender (e.g., 2 groups: male and female), ethnicity (e.g., 3 groups: Caucasian, African
American and Hispanic), physical activity level (e.g., 4 groups: sedentary, low, moderate and high), profession (e.g., 5 groups: surgeon, doctor, nurse, dentist,
therapist), and so forth.
Assumption #3:

You should have independence of observations (i.e., independence of residuals), which you can easily
check using the Durbin-Watson statistic, which is a simple test to run using SPSS Statistics.

Assumption #4:

There needs to be a linear relationship between (a) the dependent variable and each of your independent
variables, and (b) the dependent variable and the independent variables collectively. Whilst there are a
number of ways to check for these linear relationships, we suggest creating scatterplots and partial
regression plots using SPSS Statistics, and then visually inspecting these scatterplots and partial regression
plots to check for linearity.
Assumption #5:

Your data needs to show homoscedasticity, which is where the variances along the line of best fit remain similar as you move along the line.

Assumption #6:

Your data must not show multicollinearity, which occurs when you have two or more independent variables that are highly correlated with each other.
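
A standard way to quantify multicollinearity is the variance inflation factor (VIF), which SPSS offers through its collinearity diagnostics. Here is a minimal Python sketch (the VIF > 10 warning threshold is a common rule of thumb we are assuming, not something stated on the slides):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    """Compute the VIF for each predictor column in X (2-D array, no constant)."""
    X = sm.add_constant(np.asarray(X, dtype=float))
    # Skip index 0 (the constant); VIF > 10 is a common warning sign
    return [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]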
Assumption #7:

There should be no significant outliers, high leverage points or highly influential points. Outliers, leverage and
influential points are different terms used to represent observations in your data set that are in some way unusual when you
wish to perform a multiple regression analysis. These different classifications of unusual points reflect the different impact
they have on the regression line. An observation can be classified as more than one type of unusual point. However, all these
points can have a very negative effect on the regression equation that is used to predict the value of the dependent variable
based on the independent variables. This can change the output that SPSS produces and reduce the predictive accuracy of
your results as well as the statistical significance. Fortunately, when using SPSS Statistics to run multiple regression on your
data, you can detect possible outliers, high leverage points and highly influential points.
Assumption #8:
Finally, you need to check that the residuals (errors) are approximately normally distributed. Two common methods to
check this assumption include using:
•(a) a histogram (with a superimposed normal curve) and a Normal P-P Plot or
•(b) a Normal Q-Q Plot of the studentized residuals.

NOTE

You can check assumptions #3, #4, #5, #6, #7 and #8 using SPSS Statistics. Assumptions #1 and #2 should be
checked first, before moving onto assumptions #3, #4, #5, #6, #7 and #8.
Example

A health researcher wants to be able to predict "VO2max", an indicator of fitness and health. Normally, performing this procedure requires expensive laboratory equipment and necessitates that an individual exercise to their maximum. This can put off those individuals who are not very active/fit and those individuals who might be at higher risk of ill health. For these reasons, it has been desirable to find a way of predicting an individual's VO2max based on attributes that can be measured more easily and cheaply. To this end, a researcher recruited 100 participants to perform a maximum VO2max test, but also recorded their "age", "weight", "heart rate" and "gender". Heart rate is the average of the last 5 minutes of a 20 minute, much easier, lower-workload cycling test. The researcher's goal is to be able to predict VO2max based on these four attributes: age, weight, heart rate and gender.
SPSS Setup

In SPSS, we created six variables: (1) VO2max, which is the maximal aerobic capacity; (2) age, which is the participant's age; (3) weight, which is the participant's weight; (4) heart rate, which is the participant's heart rate; (5) gender, which is the participant's gender; and (6) caseno, which is the case number. This last variable is used to make it easy for you to eliminate cases (e.g., "significant outliers", "high leverage points" and "highly influential points") that you have identified when checking the assumptions.
Steps in SPSS

The seven steps below show you how to analyse your data using multiple regression in SPSS:

1. Click Analyze > Regression > Linear on the main menu.

2. You will be presented with the Linear Regression dialogue box.

3. Transfer the dependent variable, VO2max, into the Dependent: box and the independent variables, age, weight, heart rate and gender, into the Independent(s): box, using the appropriate buttons (all other boxes can be ignored).

4. Click on the Statistics button. You will be presented with the Linear Regression: Statistics dialogue box.

5. In addition to the options that are selected by default, select Confidence intervals in the Regression Coefficients area, leaving the Level(%): option at "95".

6. Click on the Continue button. You will be returned to the Linear Regression dialogue box.

7. Click on the OK button. It will generate the output.


Reporting of the output
SPSS will generate three tables of output for a multiple regression analysis.
Model Summary table

The first table of interest is the Model Summary table. This table provides the R, R2, adjusted R2, and the standard
error of the estimate, which can be used to determine how well a regression model fits the data.

The "R" column represents the value of R, the multiple correlation coefficient. R can be considered to be one
measure of the quality of the prediction of the dependent variable; in this case, VO2max. A value of 0.760, in this
example, indicates a good level of prediction. The "R Square" column represents the R2 value (also called the
coefficient of determination), which is the proportion of variance in the dependent variable that can be explained
by the independent variables (it is the proportion of variation accounted for by the regression model above and
beyond the mean model). You can see from our value of 0.577 that our independent variables explain 57.7% of the
variability of our dependent variable. However, you also need to be able to interpret "Adjusted R Square" (adj. R2) to accurately report your data.
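
Adjusted R2 corrects R2 for the number of predictors. Using the standard formula with n = 100 participants and k = 4 predictors (our computation, not shown on the slide):

\text{adj. } R^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} = 1 - (1 - 0.577)\times\frac{99}{95} \approx 0.559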
ANOVA table

The F-ratio in the ANOVA table tests whether the overall regression model is a good fit for the data. The table shows that the independent variables statistically significantly predict the dependent variable, F(4, 95) = 32.393, p < .0005 (i.e., the regression model is a good fit for the data).
Coefficients table

Unstandardized coefficients indicate how much the dependent variable varies with an independent variable when all other independent variables are held constant. Consider the effect of age in this example. The unstandardized coefficient, B1, for age is equal to -0.165. This means that for each one-year increase in age, there is a decrease in VO2max of 0.165 ml/min/kg.
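
Putting the unstandardized coefficients together gives the prediction equation (general form only; the remaining B values are not reproduced on the slide):

\text{VO2max} = b_0 + b_1(\text{age}) + b_2(\text{weight}) + b_3(\text{heart rate}) + b_4(\text{gender})

with b_1 = -0.165 for age.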
Statistical significance of the independent variables

You can test for the statistical significance of each of the independent variables. This tests whether the unstandardized (or standardized) coefficients are equal to 0 (zero) in the population. If p < .05, you can conclude that the coefficients are statistically significantly different from 0 (zero). The t-value and corresponding p-value are located in the "t" and "Sig." columns, respectively.

You can see from the "Sig." column that all independent variable coefficients are statistically significantly different from 0 (zero). Although the intercept, B0, is tested for statistical significance, this is rarely an important or interesting finding.
APA table

Interpretation

A multiple regression was run to predict VO2max from gender, age, weight and heart rate. These variables statistically significantly predicted VO2max, F(4, 95) = 32.393, p < .0005, R2 = .577. All four variables added statistically significantly to the prediction, p < .05.


Different methods of multiple regression

Standard multiple regression

 In standard multiple regression, all the independent (or predictor) variables are entered into the equation simultaneously.
 Each independent variable is evaluated in terms of its predictive power, over and above that offered by all the other independent variables. This is the most commonly used multiple regression analysis.
 We can use this approach if we have a set of variables (e.g. various personality scales) and want to know how much variance in a dependent variable (e.g. anxiety) they are able to explain as a group or block. This approach would also tell you how much unique variance in the dependent variable each of the independent variables explained.
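
A minimal, self-contained sketch of simultaneous entry in Python with statsmodels; the variable names mirror the VO2max example, but the data are synthetic stand-ins we generate for illustration:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the study data (100 participants, invented values)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 60, 100),
    "weight": rng.normal(75, 10, 100),
    "heart_rate": rng.normal(130, 12, 100),
    "gender": rng.integers(0, 2, 100),
})
df["vo2max"] = (88 - 0.17 * df["age"] - 0.10 * df["weight"]
                - 0.08 * df["heart_rate"] + 3.0 * df["gender"]
                + rng.normal(0, 3, 100))

# Standard multiple regression: all predictors entered simultaneously
model = smf.ols("vo2max ~ age + weight + heart_rate + gender", data=df).fit()
print(model.summary())   # R^2, adjusted R^2, F-test, B coefficients, t and Sig.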
Stepwise multiple regression

It is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of explanatory variables based on some pre-specified criterion.

In stepwise regression, the researcher provides a list of independent variables and then allows the program to select which variables it will enter, and in which order they go into the equation, based on a set of statistical criteria.

There are three different versions of this approach: forward selection, backward deletion and stepwise regression. There are a number of problems with these approaches and some controversy in the literature concerning their use and abuse.

SPSS command:
Analyze > Regression > Linear > Method: Stepwise > OK
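
SPSS automates this selection internally; for intuition, here is a simplified sketch of forward selection in Python (X is assumed to be a pandas DataFrame of candidate predictors and y the outcome; the entry criterion p < 0.05 is an assumed convention, and real stepwise procedures also consider removal):

import statsmodels.api as sm

def forward_selection(y, X, p_enter=0.05):
    """Greedy forward selection: at each step, add the candidate predictor
    with the smallest p-value, as long as it is below p_enter."""
    remaining = list(X.columns)
    selected = []
    while remaining:
        pvalues = {}
        for candidate in remaining:
            cols = selected + [candidate]
            fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
            pvalues[candidate] = fit.pvalues[candidate]
        best = min(pvalues, key=pvalues.get)
        if pvalues[best] >= p_enter:
            break          # no remaining variable meets the entry criterion
        selected.append(best)
        remaining.remove(best)
    return selected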
Hierarchical regression

Hierarchical regression is a way to show whether variables of your interest explain a statistically significant amount of variance in your dependent variable (DV) after accounting for all other variables. This is a framework for model comparison (in theory) rather than a statistical method.

Analysis steps in SPSS

Run a hierarchical regression by entering the predictors in a set of blocks with Method = Enter, as follows:
1. Enter the predictor(s) for the first block into the IV box in the dialog box. Leave Method set at 'Enter'. Then click the 'Next' button at the top of the 'Independent(s)' box. This clears that box.
2. Enter the variable(s) for block 2 in your model. If there is a third block, click the 'Next' button again to clear the second-block variables from the box and enter the 3rd block variable(s).
3. Continue this sequence for as many blocks as needed. You don't need to click 'Next' after entering the variable(s) for the last block.
4. Click the 'Statistics' button in the main Linear Regression dialog box. In the Statistics dialog, check R squared change. This will request a test of the significance of the change in R squared at each successive block in the model.
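
Outside SPSS, the same block comparison can be sketched by fitting the nested models and testing the R squared change with an F-test; statsmodels provides this via compare_f_test (the variable names below are illustrative, and X1/X2 are assumed to be pandas DataFrames aligned with the outcome y):

import statsmodels.api as sm

# X1: block 1 predictors (control variables); X2: block 2 predictors of interest
def r_squared_change(y, X1, X2):
    block1 = sm.OLS(y, sm.add_constant(X1)).fit()
    block2 = sm.OLS(y, sm.add_constant(X1.join(X2))).fit()
    f, p, df_diff = block2.compare_f_test(block1)   # F-test of the R^2 change
    print("R^2 change:", block2.rsquared - block1.rsquared)
    print("F =", f, ", p =", p, ", df =", df_diff)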
Example

To evaluate the ability of the model (which includes Total Mastery and Total PCOISS) to predict perceived stress scores, after controlling for a number of additional variables (age, social desirability).
Reporting the results:
APA Table format
References

• https://lms.su.edu.pk/download?filename=1588697869-julie-pallant-spss-survival-manual-mcgraw-hill-house-2016-1.pdf&lesson=17247
• https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
• https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
• https://www.spss-tutorials.com/basics/
• https://www.ibm.com/support/pages/hierarchical-regression-spss
