Multiple Regression Analysis
Multiple Regression – Basic Relationships
Purpose of multiple regression
Different types of multiple regression
Standard multiple regression
Hierarchical multiple regression
Stepwise multiple regression
Steps in solving regression problems
Purpose of multiple regression
• The purpose of multiple regression is to analyze
the relationship between metric or
dichotomous independent variables and a
metric dependent variable.
• If there is a relationship, using the information
in the independent variables will improve our
accuracy in predicting values for the dependent
variable.
Types of multiple regression
• There are three types of multiple regression, each
of which is designed to answer a different
question:
– Standard multiple regression is used to evaluate the
relationships between a set of independent variables and a
dependent variable.
– Hierarchical, or sequential, regression is used to examine the
relationships between a set of independent variables and a
dependent variable, after controlling for the effects of some
other independent variables on the dependent variable.
– Stepwise, or statistical, regression is used to identify the subset
of independent variables that has the strongest relationship to a
dependent variable.
Standard multiple regression
• In standard multiple regression, all of the independent
variables are entered into the regression equation at
the same time
• Multiple R and R² measure the strength of the
relationship between the set of independent variables
and the dependent variable. An F test is used to
determine if the relationship can be generalized to the
population represented by the sample.
• A t-test is used to evaluate the individual relationship
between each independent variable and the
dependent variable.
Hierarchical multiple regression
• In hierarchical multiple regression, the independent
variables are entered in two stages.
• In the first stage, the independent variables that we
want to control for are entered into the regression. In
the second stage, the independent variables whose
relationships we want to examine, after applying those
controls, are entered.
• A statistical test of the change in R² from the first
stage is used to evaluate the importance of the
variables entered in the second stage.
Stepwise multiple regression
• Stepwise regression is designed to find the most
parsimonious set of predictors that are most
effective in predicting the dependent variable.
• Variables are added to the regression equation
one at a time, using the statistical criterion of
maximizing the R² of the included variables.
• When none of the possible additions can make a
statistically significant improvement in R², the
analysis stops (a minimal sketch of this procedure follows).
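The selection logic above can be illustrated in code. The sketch below is a hypothetical, minimal forward-stepwise routine in Python (statsmodels); it is not the exact algorithm SPSS uses (SPSS applies F-to-enter/F-to-remove criteria), and the function and variable names are illustrative.

```python
# Minimal sketch of forward stepwise selection: at each step the candidate
# predictor that raises R-squared the most is added, and the process stops
# when no addition is statistically significant at the chosen alpha.
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        best = None
        for var in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [var]])).fit()
            # keep the candidate with the highest R-squared whose new
            # coefficient is significant
            if model.pvalues[var] <= alpha:
                if best is None or model.rsquared > best[1]:
                    best = (var, model.rsquared)
        if best is None:          # no significant improvement -> stop
            break
        selected.append(best[0])
        remaining.remove(best[0])
    return selected
```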
Assumptions of MRA
• There are certain assumptions about the error
terms that ought to hold good for the regression
equation to be useful for drawing conclusions
from it or using it for prediction purposes.
• These are: (i) the distribution of the eᵢ is normal
• (ii) E(eᵢ) = 0
• (iii) Var(eᵢ) = σ² for all values of i
• (iv) r(eᵢ, eⱼ) = 0
Assumptions of MRA
• (i) The distribution of the eᵢ is normal
• The implication of this assumption is that the
errors are symmetrical with both positive and
negative values.
• (ii) E(eᵢ) = 0
• This assumption implies that the sum of the
positive and negative errors is zero, so that
they cancel each other out.
Assumptions of MRA
• (iii) Var(eᵢ) = σ² for all values of i
• This assumption means that the variance or
fluctuations in all error terms are of the same
magnitude. (Homoscedasticity)
• (iv) r(eᵢ, eⱼ) = 0
• This assumption implies that the error terms
are uncorrelated with each other, i.e. one error
term does not influence the other error term.
Assumptions of MRA
• Heteroscedasticity
• The term means "differing variance" and comes from the Greek "hetero"
('different') and "skedasis" ('dispersion').
• When using some statistical techniques, such as ordinary least squares
(OLS), a number of assumptions are typically made. One of these is that
the error term has a constant variance. This will be true if the
observations of the error term are assumed to be drawn from identical
distributions. Heteroscedasticity is a violation of this assumption.
• For example, the error term could vary or increase with each
observation, something that is often the case with cross-sectional or
time series measurements. Heteroscedasticity is often studied as part of
econometrics, which frequently deals with data exhibiting it.
• With the advent of robust standard errors, which allow inference without
specifying the conditional second moment of the error term, testing for
conditional homoscedasticity is not as important as it was in the past.
Assumptions of MRA
• The econometrician Robert Engle won the 2003
Nobel Memorial Prize in Economic Sciences for his methods of
analyzing time series with time-varying volatility (heteroscedasticity).
• Consequences
• Heteroscedasticity does not cause OLS coefficient
estimates to be biased. However, the variance (and, thus,
standard errors) of the coefficients tends to be
underestimated, inflating t-scores and sometimes making
insignificant variables appear to be statistically
significant.
Assumptions of MRA
• Examples
• Heteroscedasticity often occurs when there is a large
difference among the sizes of the observations.
• The classic example of heteroscedasticity is that of income
versus food consumption. As one's income increases, the
variability of food consumption will increase. A poorer
person will spend a rather constant amount by always
eating essential food; a wealthier person may, in addition
to essential food, occasionally spend on an expensive meal.
Those with higher incomes display a greater variability of
food consumption.
Ideally, residuals are randomly scattered around 0 (the
horizontal line) providing a relatively even distribution.
Heteroscedasticity is indicated when the residuals are
not evenly scattered around the line.
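The residual-plot check above is visual. Where a formal test is wanted, the Breusch–Pagan test is one common choice; the sketch below is a minimal, hypothetical Python (statsmodels) example on simulated data and is not part of the SPSS workflow described in these slides.

```python
# Hedged sketch: a Breusch-Pagan check for heteroscedasticity on simulated
# data whose error variance grows with x, complementing the visual inspection.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)   # error spread increases with x
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small p -> heteroscedasticity
```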
Assumptions of MRA
• Detection of Multicollinearity
• Indicators that multicollinearity may be present in a model:
• 1) Large changes in the estimated regression coefficients when a
predictor variable is added or deleted
• 2) Insignificant regression coefficients for the affected variables
individually in the multiple regression, but rejection of the null
hypothesis that those coefficients are jointly zero (using an F-test)
• 3) Large changes in the estimated regression coefficients when
an observation is added or deleted
• A formal detection measure for multicollinearity uses tolerance and the
variance inflation factor (VIF):
• Tolerance = 1 – Rᵢ² and VIF = 1 / Tolerance, where Rᵢ² is obtained by
regressing predictor i on the other predictors
• A tolerance of less than 0.20 and/or a VIF of 5 or above
indicates a multicollinearity problem (a computational sketch follows)
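As a complement to the formula above, tolerance and VIF can be computed directly. The sketch below is a minimal Python (statsmodels) illustration, assuming the predictors are supplied as a pandas DataFrame; the function name is illustrative.

```python
# Hedged sketch: tolerance and VIF for each predictor, using the slide's rule
# of thumb (tolerance < 0.20 or VIF >= 5 flags a multicollinearity problem).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def collinearity_table(X: pd.DataFrame) -> pd.DataFrame:
    Xc = sm.add_constant(X)
    rows = []
    for i, col in enumerate(X.columns, start=1):   # index 0 is the constant
        vif = variance_inflation_factor(Xc.values, i)
        rows.append({"variable": col, "VIF": vif, "tolerance": 1.0 / vif})
    return pd.DataFrame(rows)
```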
Assumptions of MRA
• Consequences of Multicollinearity
• In the presence of multicollinearity, the estimate of one variable's
impact on y while controlling for the others tends to be less precise
than if predictors were uncorrelated with one another. The usual
interpretation of a regression coefficient is that it provides an
estimate of the effect of a one unit change in an independent
variable, X1, holding the other variables constant. If X1 is highly
correlated with another independent variable, X2, in the given data
set, then we only have observations for which X1 and X2 have a
particular relationship (either positive or negative). We don't have
observations for which X1 changes independently of X2, so we have
an imprecise estimate of the effect of independent changes in X1.
Assumptions of MRA
• Remedy to Multicollinearity
• Multicollinearity does not actually bias results; it just produces large standard
errors for the related independent variables. With enough data, these errors will be
reduced.[1]
• One may:
• 1) Leave the model as is, despite multicollinearity. The presence of multicollinearity
doesn't affect the fitted model provided that the predictor variables follow the
same pattern of multicollinearity as the data on which the regression model is
based.
• 2) Drop one of the variables. An explanatory variable may be dropped to produce a
model with significant coefficients. However, you lose information (because you've
dropped a variable). Omission of a relevant variable results in biased coefficient
estimates for the remaining explanatory variables.
• 3) Obtain more data. This is the preferred solution. More data can produce more
precise parameter estimates (with lower standard errors).
• Note: Multicollinearity does not impact the reliability of the forecast, but rather
impacts the interpretation of the explanatory variables. As long as the collinear
relationships in your independent variables remain stable over time,
multicollinearity will not affect the forecast.
Assumptions of MRA
• Autocorrelation
• In regression analysis using time series data,
autocorrelation of the residuals (the "error terms")
is a problem.
• Autocorrelation violates the ordinary least squares
(OLS) assumption that the error terms are uncorrelated.
While it does not bias the OLS coefficient estimates, the
standard errors tend to be underestimated (and the t-
scores overestimated) when the autocorrelations of the
errors at low lags are positive.
• The traditional test for the presence of first-order
autocorrelation is the Durbin–Watson statistic (a computational sketch follows).
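For reference, the Durbin–Watson statistic is easy to compute from OLS residuals. The sketch below is a minimal Python (statsmodels) illustration on simulated data; it is not part of the SPSS output shown later, where the statistic appears in the Model Summary table.

```python
# Hedged sketch: Durbin-Watson statistic on OLS residuals; values near 2
# suggest no first-order autocorrelation (the slides accept 1.5 to 2.5).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)
x = np.arange(50, dtype=float)
y = 3 + 0.8 * x + rng.normal(0, 2, 50)
fit = sm.OLS(y, sm.add_constant(x)).fit()
print(f"Durbin-Watson: {durbin_watson(fit.resid):.3f}")
```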
Problem 1 - standard multiple regression
Supermarket | Net Profit | Sales of Food Items | Sales of Non-Food Items
1 | 5.6 | 20 | 5
2 | 4.7 | 15 | 5
3 | 5.4 | 18 | 6
4 | 5.5 | 20 | 5
5 | 5.1 | 16 | 6
6 | 6.8 | 25 | 6
7 | 5.8 | 22 | 4
8 | 8.2 | 30 | 7
9 | 5.8 | 24 | 3
10 | 6.2 | 25 | 4
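The slides carry out this analysis through the SPSS menus. For readers who prefer code, the sketch below is a hedged Python (statsmodels) equivalent using the ten observations above; column names are illustrative, and the output should broadly match the SPSS tables that follow.

```python
# Hedged sketch: standard multiple regression for Problem 1 in Python,
# using the ten supermarket observations from the table above.
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "food":     [20, 15, 18, 20, 16, 25, 22, 30, 24, 25],
    "non_food": [5, 5, 6, 5, 6, 6, 4, 7, 3, 4],
    "profit":   [5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2],
})
X = sm.add_constant(data[["food", "non_food"]])
fit = sm.OLS(data["profit"], X).fit()
print(fit.summary())   # should broadly reproduce the SPSS output shown below
```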
Request a standard multiple regression
To compute a multiple
regression in SPSS, select
the Regression | Linear
command from the Analyze
menu.
Specify the statistics output options
First, mark the checkbox for Estimates on the Regression Coefficients panel.
Second, mark the checkboxes for Model Fit and Descriptives.
Third, click on the Continue button to close the dialog box.
SAMPLE SIZE
Descriptive Statistics
Variable | Mean | Std. Deviation | N
Net Profit | 5.910 | .9882 | 10
Sales of Food Items | 21.50 | 4.625 | 10
Sales of Non-Food Items | 5.10 | 1.197 | 10
The minimum ratio of valid cases to independent variables for multiple
regression is 5 to 1; with 10 cases and 2 independent variables, the ratio
here is exactly 5 to 1, so the requirement is just met.
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND
DEPENDENT VARIABLES - 1
The probability of the F statistic (606.367) for the
overall regression relationship is <0.001, less than or
equal to the level of significance of 0.05. We reject
the null hypothesis that there is no relationship
between the set of independent variables and the
dependent variable (R² = 0). We support the
research hypothesis that there is a statistically
significant relationship between the set of
independent variables and the dependent variable.
ANOVA (Dependent Variable: Net Profit)
Model 1 | Sum of Squares | df | Mean Square | F | Sig.
Regression | 8.739 | 2 | 4.369 | 606.367 | .000
Residual | .050 | 7 | .007 | |
Total | 8.789 | 9 | | |
a. Predictors: (Constant), Sales of Non-Food Items, Sales of Food Items
OVERALL RELATIONSHIP BETWEEN
INDEPENDENT AND DEPENDENT VARIABLES - 2
The Multiple R for the relationship between the set of
independent variables and the dependent variable is 0.997,
which would be characterized as very strong using the rule of
thumb that a correlation less than or equal to 0.20 is
characterized as very weak; greater than 0.20 and less than
or equal to 0.40 is weak; greater than 0.40 and less than or
equal to 0.60 is moderate; greater than 0.60 and less than or
equal to 0.80 is strong; and greater than 0.80 is very strong.
Model Summary (Dependent Variable: Net Profit)
Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson
1 | .997 | .994 | .993 | .0849 | 1.790
a. Predictors: (Constant), Sales of Non-Food Items, Sales of Food Items
The Durbin-Watson statistic for these data is 1.790, which is within
the required range (1.5 to 2.5). This means that the errors are
uncorrelated; there is no autocorrelation.
RELATIONSHIP OF INDIVIDUAL INDEPENDENT
VARIABLES TO DEPENDENT VARIABLE - 1
For the independent variable Sales of Food Items, the probability of the
t statistic (31.998) for the b coefficient is <0.001, which is less than or
equal to the level of significance of 0.05. We reject the null hypothesis
that the slope associated with sales of food items is equal to zero (b = 0)
and conclude that there is a statistically significant relationship between
net profit and sales of food items. The same conclusion holds for sales of
non-food items (t = 12.120, p < 0.001).
Coefficients (Dependent Variable: Net Profit)
Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound
(Constant) | .233 | .176 | | 1.322 | .228 | -.184 | .649
Sales of Food Items | .196 | .006 | .917 | 31.998 | .000 | .182 | .211
Sales of Non-Food Items | .287 | .024 | .347 | 12.120 | .000 | .231 | .343
The equation in this case is
Net Profit = 0.233 + 0.196 * Sales of food items + 0.287 * Sales of non-food items
The coefficient for sales of food items is 0.196, which means that for every
one-crore increase in sales of food items, net profit increases by 0.196 crore.
The coefficient for sales of non-food items is 0.287, which means that for every
one-crore increase in sales of non-food items, net profit increases by 0.287 crore.
Both coefficients are significant (sig. < 0.05), implying that the coefficients
are nonzero.
The 95% confidence intervals for the coefficients are given in the table. Here
they are used to construct a 95% interval for the dependent variable, Net Profit.
For example, suppose we want a 95% interval for net profit given that sales of
food items are 30 crores and sales of non-food items are 6 crores:
Lower bound = -0.184 + 30*0.182 + 6*0.231 = 6.662 crores
Upper bound = 0.649 + 30*0.211 + 6*0.343 = 9.037 crores
This means that, for these values of the IVs, net profit is expected to lie in
the range (6.662, 9.037).
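The bound arithmetic above can be verified with a few lines of Python. Note that this simply follows the slide's approach of plugging the coefficient confidence limits into the equation; a model-based 95% prediction interval (for example, from statsmodels' get_prediction) is computed differently and will generally not coincide with these bounds.

```python
# Sketch reproducing the slide's bound arithmetic for food sales = 30 crores
# and non-food sales = 6 crores, plus the point prediction from the equation.
lower = -0.184 + 30 * 0.182 + 6 * 0.231
upper =  0.649 + 30 * 0.211 + 6 * 0.343
point =  0.233 + 30 * 0.196 + 6 * 0.287
print(f"point = {point:.3f}, interval = ({lower:.3f}, {upper:.3f})")
# point = 7.835, interval = (6.662, 9.037)
```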
Answer to problem 1
• The independent and dependent variables were
metric
• The ratio of cases to independent variables was 5
to 1.
• The overall relationship was statistically significant
and its strength was characterized correctly.
• The b coefficients for all variables were statistically
significant, and the directions of the relationships
were characterized correctly.
Assumptions
• Assumption of Linearity
[Partial regression plot — Dependent Variable: Net Profit; horizontal axis: Sales of Food Items.]
Assumption of Linearity
[Partial regression plot — Dependent Variable: Net Profit; horizontal axis: Sales of Non-Food Items.]
• Assumption of Normality of Variables
One-Sample Kolmogorov-Smirnov Test
Statistic | Net Profit | Sales of Food Items | Sales of Non-Food Items
N | 10 | 10 | 10
Mean | 5.910 | 21.50 | 5.10
Std. Deviation | .9882 | 4.625 | 1.197
Most Extreme Differences: Absolute | .244 | .127 | .174
Most Extreme Differences: Positive | .244 | .127 | .133
Most Extreme Differences: Negative | -.110 | -.106 | -.174
Kolmogorov-Smirnov Z | .773 | .402 | .550
Asymp. Sig. (2-tailed) | .589 | .997 | .923
a. Test distribution is Normal. b. Calculated from data.
• Assumption of Normality of Residual
One-Sample Kolmogorov-Smirnov Test (Unstandardized Residual)
N | 10
Mean | .0000000
Std. Deviation | .07486257
Most Extreme Differences: Absolute | .165
Most Extreme Differences: Positive | .165
Most Extreme Differences: Negative | -.154
Kolmogorov-Smirnov Z | .523
Asymp. Sig. (2-tailed) | .947
a. Test distribution is Normal. b. Calculated from data.
• Assumption of Homoscedasticity
[Scatterplot — Dependent Variable: Net Profit; Regression Standardized Residual vs. Net Profit.]
• Multicollinearity
Coefficients (Dependent Variable: Net Profit)
Model 1 | B | Std. Error | Beta | t | Sig. | 95% CI Lower Bound | 95% CI Upper Bound | Tolerance | VIF
(Constant) | .233 | .176 | | 1.322 | .228 | -.184 | .649 | |
Sales of Food Items | .196 | .006 | .917 | 31.998 | .000 | .182 | .211 | .997 | 1.003
Sales of Non-Food Items | .287 | .024 | .347 | 12.120 | .000 | .231 | .343 | .997 | 1.003
Tolerance > 0.2 ⇒ no multicollinearity.
The tolerance of variable Xᵢ is 1 − Rᵢ², where Rᵢ² is the coefficient of
determination for the prediction of variable i by the other predictor
variables. A tolerance value near 0 indicates that the variable is highly
predicted by (collinear with) the other predictor variables.
VIF (Variance Inflation Factor) should be less than 5.
Problem 2 – Stepwise Regression
Sr. No | Company | M-Cap '05 (Amount) | Net Sales '05 | Net Profit | P/E
1 | Infosys Technologies | 68560 | 7836 | 2170.9 | 32
2 | Tata Consultancy Services | 67912 | 8051 | 1831.4 | 30
3 | Wipro | 52637 | 8211 | 1655.8 | 31
4 | Bharti Tele-Ventures * | 60923 | 9771 | 1753.5 | 128
5 | ITC | 44725 | 8422 | 2351.3 | 20
6 | Hero Honda Motors | 14171 | 8086 | 868.4 | 16
7 | Satyam Computer Services | 18878 | 3996 | 844.8 | 23
8 | HDFC | 23625 | 3758 | 1130.1 | 21
9 | Tata Motors | 18881 | 18363 | 1314.9 | 14
10 | Siemens | 7848 | 2753 | 254.7 | 38
11 | ONGC | 134571 | 37526 | 14748.1 | 9
12 | Tata Steel | 19659 | 14926 | 37686 | 5
13 | Steel Authority of India | 21775 | 29556 | 6442.8 | 3
14 | Nestle India | 8080 | 2426 | 311.9 | 27
15 | Bharat Forge Co. | 6862 | 1412 | 190.5 | 37
16 | Reliance Industries | 105634 | 74108 | 9174 | 13
17 | HDFC Bank | 19822 | 3563 | 756.5 | 27
18 | Bharat Heavy Electricals | 28006 | 11200 | 1210.1 | 25
19 | ICICI Bank | 36890 | 11195 | 2242.4 | 16
20 | Maruti Udyog | 15767 | 11601 | 988.2 | 17
21 | Sun Pharmaceuticals | 11413 | 1397 | 412.2 | 30
Assumption of Normality
One-Sample Kolmogorov-Smirnov Test
Statistic | M-Cap '05 | Net Sales '05 | Net Profit | P/E
N | 21 | 21 | 21 | 21
Mean | 37459.00 | 13245.57 | 4206.60 | 26.76
Std. Deviation | 33926.159 | 16613.797 | 8438.318 | 25.175
Most Extreme Differences: Absolute | .230 | .301 | .397 | .280
Most Extreme Differences: Positive | .230 | .301 | .397 | .280
Most Extreme Differences: Negative | -.184 | -.238 | -.317 | -.173
Kolmogorov-Smirnov Z | 1.053 | 1.381 | 1.817 | 1.283
Asymp. Sig. (2-tailed) | .218 | .044 | .003 | .074
a. Test distribution is Normal. b. Calculated from data.
Net Sales and Net Profit are not normal (Asymp. Sig. < 0.05).
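For reference, a rough Python analogue of the one-sample K-S test above is sketched below. The p-values will not match SPSS exactly, because the normal parameters are estimated from the same data (the Lilliefors correction addresses this); the function name and the example column name are illustrative.

```python
# Hedged sketch: one-sample Kolmogorov-Smirnov check of normality against a
# normal distribution with mean and standard deviation estimated from the data.
import numpy as np
from scipy import stats

def ks_normality(x):
    x = np.asarray(x, dtype=float)
    return stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

# e.g. ks_normality(df["net_profit"])  # a small p-value suggests non-normality
```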
Outlier Analysis
• Since normality is not satisfied, we will
conduct an outlier analysis.
• The most common outlier analysis for MRA
uses Cook's distance.
Request a multiple regression
To compute a multiple
regression in SPSS, select
the Regression | Linear
command from the Analyze
menu.
Specify variables and method for selecting variables
First, select the dependent variable, Market Cap.
Second, move the independent variables Net Sales, Net Profit and P/E Ratio
into the analysis.
Third, select the Enter method for entering the variables into the analysis
from the Method drop-down menu.
Then click on Save.
In the Save dialog, select Unstandardized Residuals and Cook's Distance.
Outlier Analysis
• A new variable will be created with name
“COO_1”
• The thumb rule says that any case (observation) having a Cook's distance
greater than
4 / (n – k – 1),
where n = sample size and k = number of independent variables,
can be considered an outlier for regression analysis (a computational sketch
follows below).
The accepted Cook's distance in this case is 4 / (21 – 3 – 1) = 0.235294.
Hence the cases having Cook's distance below 0.235294 are retained;
cases 4, 11, 12 and 16 exceed it and are outliers.
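The sketch below is a minimal Python (statsmodels) version of this rule, assuming the predictors X (without a constant column) and the response y are available as pandas or NumPy objects; the function name is illustrative. In the SPSS workflow described here, the same values appear as the saved variable COO_1.

```python
# Hedged sketch: flag influential cases with Cook's distance using the
# 4 / (n - k - 1) cut-off from the slide.
import statsmodels.api as sm

def cooks_outliers(y, X):
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    cooks_d = fit.get_influence().cooks_distance[0]   # one distance per case
    n, k = X.shape
    threshold = 4.0 / (n - k - 1)
    flagged = [i for i, d in enumerate(cooks_d) if d > threshold]
    return flagged, threshold
```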
• This is an iterative process: after removing the outliers, Cook's distance
is measured again for the revised data and appears as the variable "COO_2".
• Applying the thumb rule, the accepted Cook's distance is now
4 / (17 – 3 – 1) = 0.30769.
It can be verified from the diagram that case 13 is an outlier.
Removing it, the process is repeated; the accepted Cook's distance is now
4 / (16 – 3 – 1) = 0.333333.
In the third iteration, case 5 is an outlier (Cook's distance 0.34) and is
therefore removed from the analysis.
After removing case 5, the outlier analysis is complete, as all the Cook's
distances are now within the required range.
Test of Normality
After removing the outliers, the following is the test of normality
(one-sample K-S test) for the dependent as well as the independent variables.
All the variables pass the normality test, indicating that we may conduct
MRA on the data.
R² is high.
The Durbin-Watson statistic is within the range 1.5 to 2.5.
The null hypothesis of the regression ANOVA is rejected:
H0: there is no relationship between the set of independent variables
and the dependent variable
or
H0: b1 = b2 = b3 = 0
Since we have used the Enter method, testing the significance of the individual
regression coefficients is important. The coefficients table gives each
coefficient and its t-test significance.
The equation y = b0 + b1 * Net Sales + b2 * Net Profit + b3 * P/E
can be written as
y = -34348.3 – 0.196 * Net Sales + 31.582 * Net Profit + 1103.73 * P/E
The coefficient of Net Sales is not significant (its null hypothesis is not
rejected); the other two coefficients are significant (their null hypotheses
are rejected).
All the tolerance and VIF statistics are within acceptable limits.
Since not all coefficients are significant, this is not an appropriate model
and it cannot be used for estimation.
We will conduct the regression analysis ignoring Net Sales (as its
coefficient is not significant). We may also use the stepwise regression
method, in which Net Sales will be removed from the model and the
remaining model will be more appropriate.
The Durbin-Watson statistic is within the range 1.5 to 2.5.
R² is also very high, which means that this model, if used for prediction,
will have small errors.
Though the ANOVAs for both models are rejected, in stepwise regression the
last model (in this case, Model 2) is the appropriate one.
It may be noted that all the coefficients in the last model (Model 2) are
significant.
Tolerance and VIF also have acceptable values.
Hence we can interpret the 95% confidence intervals:
CI for the constant b0: (-53986.629, -19332.671)
CI for b1 (Net Profit): (24.841, 37.066)
CI for b2 (P/E): (633.825, 1701.231)
Find the 95% CI for the predicted Market Cap given Net Profit = 2000 and
P/E = 25.
Ans: (11541.544, 97329.99); point estimate = 54435.77.
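In Python, a point estimate and a 95% prediction interval of the kind quoted above can be obtained from a fitted statsmodels OLS model. The sketch below uses synthetic placeholder data and illustrative column names; for the slides' problem, X would hold Net Profit and P/E for the retained cases and y the market cap, so the exact numbers depend on that fit.

```python
# Hedged sketch: point estimate and 95% prediction interval for a new case
# from a fitted OLS model (synthetic placeholder data; names illustrative).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = pd.DataFrame({"net_profit": rng.uniform(100, 3000, 16),
                  "pe": rng.uniform(5, 40, 16)})
y = -30000 + 30 * X["net_profit"] + 1100 * X["pe"] + rng.normal(0, 5000, 16)
fit = sm.OLS(y, sm.add_constant(X)).fit()

new_case = pd.DataFrame({"const": [1.0], "net_profit": [2000.0], "pe": [25.0]})
pred = fit.get_prediction(new_case).summary_frame(alpha=0.05)
print(pred[["mean", "obs_ci_lower", "obs_ci_upper"]])  # point estimate and 95% PI
```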
Residual Analysis
[Histogram of regression standardized residuals — Dependent Variable: M-Cap '05; Mean = 9.86E-16, Std. Dev. = 0.931, N = 16.]
This graph is close to a normal distribution.
Residual Analysis
[Normal P-P plot of regression standardized residuals — Dependent Variable: M-Cap '05; expected vs. observed cumulative probability.]
This graph indicates approximately normal residuals, since the observations lie very near the diagonal.
Residual Analysis
[Scatterplot — Dependent Variable: M-Cap '05; regression standardized residual vs. regression standardized predicted value.]
This graph indicates no threat to homoscedasticity.
Residual Analysis
[Partial regression plot — Dependent Variable: M-Cap '05; horizontal axis: Net Profit.]
This graph indicates that Net Profit has a linear relationship with Market Cap.
Residual Analysis
[Partial regression plot — Dependent Variable: M-Cap '05; horizontal axis: P/E.]
This graph indicates that the P/E ratio has a linear relationship with Market Cap.
Conclusion
• Since all the assumptions of MRA have been tested
and are satisfied by the model, and the R² is
very high, we can use this model for prediction.
Thank You