

Chapter 5: Multiple Regression Analysis

For use with Hair, Black, Babin and Anderson, Multivariate Data Analysis 8e © 2018 Cengage Learning EMEA

Overview

What Is Multiple Regression Analysis?


Multiple Regression in the Era of Big Data
A Decision Process for Multiple Regression Analysis
• Stage 1: Objectives of Multiple Regression
• Stage 2: Research Design of a Multiple Regression Analysis
• Stage 3: Assumptions in Multiple Regression Analysis
• Stage 4: Estimating the Regression Model and Assessing Overall Model Fit
• Stage 5: Interpreting the Regression Variate
• Stage 6: Validation of the Results
Extending Multiple Regression
Illustration of a Regression Analysis

Learning Objectives
Upon completing this chapter, you should be able to do the following:
• Determine when regression analysis is the appropriate statistical tool.
• Understand how regression helps us make both predictions and explanations using the
least squares concept.
• Use dummy variables with an understanding of their interpretation.
• Be aware of the assumptions underlying regression analysis and how to assess them.
• Understand the implications of managing the variate and its impact on the results.
• Address the implications of user- versus software-controlled variable selection and
explain the options available in software controlled variable selection.
• Interpret regression results and variable importance, especially in light of
multicollinearity.
• Apply the diagnostic procedures necessary to assess influential observations.
• Understand the benefits gained from the extended forms of regression, namely multi-
level models and panel models.

WHAT IS MULTIPLE REGRESSION ANALYSIS?


Multiple Regression Defined

Multiple Regression Analysis


• Statistical technique that can be used to analyze the relationship between a single
dependent (criterion) variable and several independent (predictor) variables.

Key Component
• Regression variate
• Linear combination of weighted independent variables that best predicts the dependent variable.

Variable Types in Multiple Regression


• Dependent – metric
• Independent – metric or transformed non-metric (through dummy variable
coding)


The Regression Variate


Variate Specification

Y’ = b0 + b1X1 + b2X2 + . . . + bnXn + ε

Y’ – dependent variable
b0 – intercept (constant)
b1, b2, … bn – regression weights indicating the change in the dependent variable associated with a unit change in the corresponding independent variable
X1, X2, … Xn – independent variables
ε – prediction error (residual)

Example: Credit Card Usage Based on Family Size and Income

Y’ = b0 + b1X1 + b2X2 + ε

Y’ – number of credit cards
b0 – number of credit cards independent of family size and income
b1, b2 – change in the number of credit cards for a unit change in family size and income, respectively
X1, X2 – values of family size and income, respectively
ε – prediction error (residual)
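As a concrete illustration, the following is a minimal sketch of estimating the two-predictor variate for the credit card example with ordinary least squares (Python with numpy and statsmodels is assumed — the slides do not prescribe software — and the data are hypothetical, not the textbook's):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: family size (X1), income in $000s (X2), credit cards held (Y)
X1 = np.array([2, 2, 4, 4, 5, 5, 6, 6])
X2 = np.array([14, 16, 14, 17, 18, 21, 17, 25])
Y  = np.array([4, 3, 6, 6, 7, 7, 8, 10])

X = sm.add_constant(np.column_stack([X1, X2]))  # adds the b0 (intercept) column
model = sm.OLS(Y, X).fit()

print(model.params)    # b0, b1, b2 – the estimated regression weights
print(model.resid)     # ε – prediction error (residual) for each observation
print(model.rsquared)  # proportion of variance in Y explained by the variate
```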



Graphically Portraying the Variate

The Y’ value is a linear combination of the entire set of independent variables that best achieves the statistical objective.

[Figure: Venn diagram with circles for the independent variables X1, X2 and X3 overlapping the circle for Y’]

Circles represent each independent variable's association/correlation with Y:
• The entire circle for each X represents the univariate correlation with Y.
• Areas of overlap represent correlation shared with other variable(s).
• Non-overlapping areas for each IV indicate the unique impact (i.e., the regression weight).

MULTIPLE REGRESSION IN THE ERA OF BIG DATA


Multiple Regression in the Era of Big Data

Historical Relevance
• Multiple regression has historically been the dominant statistical technique for
understanding dependence relationships.
• Particularly useful in providing explanation of importance of independent variables.

Era of Big Data


• Still dominant technique for estimation of the statistical/data model (see Chapter 1).
• Yet many new applications, such as automated decision-making (e.g., instantaneous credit approval, online advertising placement, and many others), shift the focus to alternative techniques (e.g., machine learning) that emphasize prediction at the expense of explanation.


MULTIPLE REGRESSION DECISION PROCESS

Stage 1: Objectives of Multiple Regression


Stage 2: Research Design of Multiple Regression
Stage 3: Assumptions in Multiple Regression Analysis
Stage 4: Estimating the Regression Model and Assessing Overall Fit
Stage 5: Interpreting the Regression Variate
Stage 6: Validation of the Results


STAGE 1: OBJECTIVES OF MULTIPLE REGRESSION

❑ Research Problems Appropriate for Multiple Regression


❑ Specifying a Statistical Relationship
❑ Selection of Dependent and Independent Variables


Research Problems Appropriate for Multiple Regression

Prediction with Multiple Regression:


• Maximize predictive accuracy
• always crucial to ensuring the validity of the set of independent variables.
• Model comparison
• comparing two or more sets of independent variables to ascertain the predictive power of
each variate.
Explanation With Multiple Regression:
• Relative importance of independent variables
• objectively assessing the magnitude and direction (positive or negative) of each independent
variable’s relationship.
• Nature of relationships (e.g., linear versus nonlinear) with the dependent variable
• Nature of relationships among independent variables
• Impact of multicollinearity in assessing relative importance of independent variables.


Selection of Dependent and Independent Variables


The researcher should always consider three issues that can affect any
decision about variables:

• The theory that supports using the variables.

• Measurement error, especially in the dependent variable


• Only structural equation modeling (SEM) can directly accommodate measurement error, but
using summated scales can mitigate it when using multiple regression.

• Specification error – the exclusion of relevant (and inclusion of irrelevant)


independent variables
• When in doubt, include potentially irrelevant variables (at worst they only complicate interpretation) rather than omit a possibly relevant variable (whose exclusion can bias all regression estimates).

Examples of Specification Error: Exclusion and Inclusion of Variables

When in doubt, include potentially irrelevant variables (they can only


confuse interpretation) rather than possibly omitting a relevant variable
(which can bias all regression estimates).
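The contrast can be seen in a small simulation sketch (numpy/statsmodels assumed; synthetic data, not an example from the text): omitting a relevant, correlated predictor biases the remaining coefficient, while adding an irrelevant predictor leaves the estimates essentially unchanged.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)          # x2 is correlated with x1
x3 = rng.normal(size=n)                     # irrelevant variable
y  = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def fit(*cols):
    X = sm.add_constant(np.column_stack(cols))
    return sm.OLS(y, X).fit().params

print(fit(x1, x2))        # correct specification: b1 ≈ 2.0, b2 ≈ 1.5
print(fit(x1))            # relevant x2 excluded: b1 is biased upward (≈ 2.9)
print(fit(x1, x2, x3))    # irrelevant x3 included: b1, b2 still ≈ 2.0 and 1.5, b3 ≈ 0
```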


STAGE 2: RESEARCH DESIGN OF A


MULTIPLE REGRESSION ANALYSIS
❑ Sample Size
❑ Creating Additional Variables


Sample Size Considerations

Statistical Power
• Simple regression can be effective with a sample size of 20.
• Maintaining power at .80 in multiple regression requires a minimum sample of 50
and preferably 100 observations for most research situations.

Generalizability
• The minimum ratio of observations to variables is 5 to 1, but the preferred ratio is
15 or 20 to 1, and this should increase when stepwise estimation is used.
• Maximizing the degrees of freedom improves generalizability and addresses both
model parsimony and sample size concerns.

Variable Transformations To Represent Unique Elements of the
Dependence Relationship

Nonmetric variables
• can only be included in a regression analysis by creating dummy variables.
• Dummy variables can only be interpreted in relation to their reference category.
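A minimal sketch of dummy-variable coding (pandas assumed; the variable names are hypothetical) — each dummy is interpreted relative to the omitted reference category:

```python
import pandas as pd

df = pd.DataFrame({"region": ["North", "South", "West", "North", "West"],
                   "sales":  [10, 12, 9, 11, 8]})

# drop_first=True omits one category ("North"), which becomes the reference category;
# each dummy coefficient is then interpreted relative to "North"
dummies = pd.get_dummies(df, columns=["region"], drop_first=True)
print(dummies)
```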

Representing Nonlinear Relationships


• Adding an additional polynomial term represents another inflection point in the
curvilinear relationship.
• Quadratic and cubic polynomials are generally sufficient to represent most curvilinear
relationships.
• Assessing the significance of a polynomial or interaction term is accomplished by evaluating
incremental R2, not the significance of individual coefficients, due to high multicollinearity.
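A minimal sketch of testing a quadratic term through its incremental R2 rather than its individual coefficient (numpy/statsmodels/scipy assumed; simulated data):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 200)
y = 1 + 0.5 * x + 0.8 * x**2 + rng.normal(size=200)   # truly curvilinear

base = sm.OLS(y, sm.add_constant(x)).fit()                           # linear term only
poly = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()  # adds quadratic term

# F test of the incremental R2 contributed by the quadratic term
delta_r2 = poly.rsquared - base.rsquared
f_stat = (delta_r2 / 1) / ((1 - poly.rsquared) / poly.df_resid)
p_value = stats.f.sf(f_stat, 1, poly.df_resid)
print(delta_r2, f_stat, p_value)
# poly.compare_f_test(base) gives the same test directly
```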


Representing Interaction or Moderator Effects


Moderator effect
• When the moderator variable, a second independent variable, changes the form
of the relationship between another independent variable and the dependent
variable.
• Has long been a primary topic of interest because it addresses the fundamental question of when an effect occurs.
• Also known as an interaction effect and is similar to the interaction term found in
analysis of variance and multivariate analysis of variance (see Chapter 6).
• The moderator term is a compound variable formed by multiplying X1 by the
moderator X2, which is entered into the regression equation.
• The coefficient of the interaction term (i.e., the moderator effect) indicates the
unit change in the effect of X1 as X2 changes.
• The coefficients of the two independent variables now represent the effects when
the other independent variable is zero.
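A minimal sketch of adding a moderator (interaction) term to the variate (numpy/statsmodels assumed; simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                      # moderator
y  = 1 + 0.5 * x1 + 0.3 * x2 + 0.7 * x1 * x2 + rng.normal(size=n)

# the compound (moderator) term is simply x1 * x2 added to the regression equation
X = sm.add_constant(np.column_stack([x1, x2, x1 * x2]))
fit = sm.OLS(y, X).fit()
print(fit.params)   # b3 estimates how the effect of x1 changes per unit change in x2;
                    # b1 and b2 are now the effects when the other variable equals zero
```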

Adding A Mediation Effect

Mediation
• Occurs when the effect of an independent variable may “work through” an
intervening variable (the mediating variable) to predict the dependent variable.
• In this situation the independent variable may have a direct effect on the dependent measure as well as an indirect effect through the mediating variable to the dependent variable.
• Although most commonly associated with ANOVA and MANOVA models (see Chapter 6 for an extended discussion), mediation can play an important role in defining the roles of potential independent variables.
Designation of Mediation Effects
• Designation of a mediation effect is a conceptual decision by the researcher as it
has little or no impact on the effects of other independent variables.


STAGE 3: ASSUMPTIONS IN
MULTIPLE REGRESSION ANALYSIS
❑ Assessing Individual Variables Versus the Variate
❑ Methods of Diagnosis
❑ Linearity of the Phenomenon
❑ Constant Variance of the Error Term
❑ Normality of the Error Term Distribution
❑ Independence of the Error Terms


Primary Assumptions of Multiple Regression


Four Primary Assumptions
• Linearity of the phenomenon measured.
• Homoscedasticity – Constant variance of the error terms.
• Normality of the error term distribution.
• Independence of the error terms.
Assessing Individual Variables Versus the Variate
• Testing assumptions must be done:
• for each dependent and independent variable.
• for the variate as well.
Methods of Diagnosis
• Principal diagnostic measure is the standardized or studentized residual.
• Graphical analyses (i.e., null plot/residual plot, partial regression plots, and normal
probability plots) are the most widely used diagnostic methods.

Assessing Linearity and Homoscedasticity


Linearity
• Critical issue in representing the “true” relationship since the concept of
correlation, the measure of association underlying regression analysis, is based on
a linear relationship.
• Examined through residual plots and comparison to null plot
• Corrective actions available are:
• Transforming the data values.
• Including the nonlinear relationships in the regression model (e.g., polynomials).
• Specialized methods such as nonlinear regression.
Homoscedasticity – constant variance of the error term
• Diagnosis with graphical plots or simple statistical tests
• Remedies include:
• Variable transformation, weighted least squares or heteroscedasticity-consistent standard
errors (HCSE).
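As one illustration of the "simple statistical tests" and the HCSE remedy mentioned above, the sketch below runs a Breusch-Pagan test and refits with heteroscedasticity-consistent (HC3) standard errors (statsmodels assumed; simulated data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 400)
y = 2 + 0.5 * x + rng.normal(scale=x, size=400)   # error variance grows with x

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Breusch-Pagan test: a small p-value signals heteroscedasticity
lm_stat, lm_p, f_stat, f_p = het_breuschpagan(ols.resid, X)
print(lm_p)

# Remedy: heteroscedasticity-consistent standard errors (HCSE)
robust = sm.OLS(y, X).fit(cov_type="HC3")
print(ols.bse, robust.bse)   # compare conventional vs. robust standard errors
```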

Assessing Normality and Independence of the Error Terms

Normality
• Applies to error terms/residuals, but remedies are to the variables themselves.
• Graphical diagnostic – normal probability plot.
• Regression generally considered robust to violations of normality when sample
size exceeds 200.

Independence of Error Terms


• Predicted values, and thus error terms, should not be correlated with any variable not included in the analysis.
• Graphical test – residuals plotted versus offending variables.
• Two basic types of offending variables:
• Time series data.
• Clustered data – common example is students within classroom.

STAGE 4: ESTIMATING THE REGRESSION MODEL AND


ASSESSING OVERALL MODEL FIT
❑ Managing the Variate
❑ Variable Specification
❑ Variable Selection
❑ Testing the Regression Variate for Meeting the Regression Assumptions
❑ Examining the Statistical Significance of Our Model
❑ Understanding Influential Observations


Managing The Variate: Variable Specification and Selection


Variable Specification

Two options
• Use variables in their original form
• Allows for use of direct measures of the variables of interest.
• As number of variables increases, interpretability may become problematic.

• Employ some form of dimensional reduction


• Most common approach to address multicollinearity among the independent variables.
• Can be either software controlled or user controlled.
• Software controlled – the software performs the dimensional reduction itself and then proceeds with the analysis (e.g., principal components regression).
• User controlled – the researcher performs some form of dimensional reduction (e.g., exploratory factor analysis) and forms composites which are then substituted for the original variables in the analysis.


Variable Selection Approaches – User Controlled

User-controlled
• Confirmatory (Simultaneous)
• the only method to allow direct testing of a pre-specified model.
• also the most complex from the perspectives of specification error, model parsimony and
achieving maximum predictive accuracy.

• Combinatorial (All-Possible-Subsets)
• provides control by allowing the researcher to review the entire set of roughly equivalent
models in terms of predictive accuracy.


Variable Selection Approaches – Software Controlled


Software-controlled
• Sequential Search Methods:
• while maximizing predictive accuracy, represents a completely “automated” approach to
model estimation, leaving the researcher almost no control over the final model specification.
• Forward Inclusion & Backward Elimination.
• Stepwise (variables may be removed after having been included in the regression equation).
• Caveats – impact of multicollinearity, loss of researcher control and increased alpha level.
• Constrained
• A variant of sequential methods whereby variables' regression weights are constrained to maximize parsimony in the final model results.
• Ridge
• LASSO

Choosing Between User- and Software-Controlled Approaches


• No single method is “Best” and the prudent strategy is to use a combination of
approaches to capitalize on the strengths of each.

Assess the Statistical Significance of the Overall Model

Components of Model Fit


• Total Sum of Squares (SST)
• total amount of variation that exists to be explained by the independent variables.
• SST = SSE + SSR.
• Sum of Squared Errors (SSE)
• the variance in the dependent variable not accounted for by the regression model = residual.
• The objective is to obtain the smallest possible sum of squared errors as a measure of
prediction accuracy.
• Sum of Squares Regression (SSR)
• the amount of improvement in explanation of the dependent variable attributable to the
independent variables.
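A minimal sketch computing SST, SSE and SSR from a fitted line and verifying that SST = SSE + SSR (numpy assumed; simulated data):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 3 + 2 * x + rng.normal(size=50)

# simple least squares fit
b1 = np.cov(x, y, bias=True)[0, 1] / np.var(x)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation to be explained
sse = np.sum((y - y_hat) ** 2)          # variation not accounted for (residual)
ssr = np.sum((y_hat - y.mean()) ** 2)   # variation explained by the regression
print(sst, sse + ssr)                   # SST = SSE + SSR (up to rounding)
```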


Measures of Fit for Individual Observations

Least Squares Regression Line and Individual Observations

[Figure: least squares regression line with an individual observation's total deviation from the mean of Y (Ȳ) partitioned into the deviation explained by the regression and the deviation not explained by the regression.]


Measures of Overall Model Fit

F statistic – statistical significance of overall model


• Significance means that it is unlikely your sample will produce a large R2 when the
population R2 is actually zero.
• A common rule of thumb requires a probability of less than .05 for statistical significance.
R2 (Coefficient of Determination) – strength of overall variate relationship
• Represents the percent of variation (i.e., goodness of fit) in the dependent variable associated with ("explained" by) all of the independent variables considered together.
• R2 ranges from 0 to 1.0, with large R2 indicating the linear relationship works well.
• Statistical significance does not ensure practical significance, which is based on the
meaningfulness of the results. Example, is explaining 4 percent of the variation
worth the cost of collecting and analyzing the data?
Adjusted R2
• based on the number of independent variables relative to the sample size.
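The sketch below computes R2, adjusted R2 and the overall F statistic directly from the sums of squares using the standard formulas (numpy/scipy assumed; the helper name overall_fit is hypothetical):

```python
import numpy as np
from scipy import stats

def overall_fit(y, y_hat, k):
    """R2, adjusted R2, F statistic and p-value for a model with k independent variables."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)          # variation not explained (residual)
    sst = np.sum((y - np.mean(y)) ** 2)     # total variation to be explained
    ssr = sst - sse                         # variation explained by the regression
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    f = (ssr / k) / (sse / (n - k - 1))
    p = stats.f.sf(f, k, n - k - 1)         # probability of this F if the population R2 is zero
    return r2, adj_r2, f, p

# usage with any fitted model's predictions, e.g.:
# r2, adj_r2, f, p = overall_fit(y, model.predict(X), k=2)
```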

Significance Tests of Regression Coefficients

Statistical Inference test for each estimated coefficient


• Establishing a confidence interval
• Specify desired alpha level (typically .05).
• Compute standard error – expected sampling error of the coefficient, similar to a standard
deviation of an individual variable.
• Confidence interval – the number of standard errors based on the alpha level (e.g., 1.96 for a .05 alpha) times the value of the standard error [e.g., .05 level: b ± 1.96 × standard error].
• Applying the confidence interval
• Statistical significance established if confidence interval does not include zero.
• Sample size has direct influence on standard error – increased sample size reduces standard
error, thus making statistical significance more likely.
• Practical significance must also be assessed . . .
• Always ensure practical significance when using large sample sizes, as the model results and
regression coefficients could be deemed irrelevant even when statistically significant due just to the
statistical power arising from large sample sizes.
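A minimal sketch of building the confidence interval for each coefficient and checking whether it includes zero (statsmodels assumed; simulated data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = 1 + 0.8 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=100)
results = sm.OLS(y, sm.add_constant(X)).fit()

z = 1.96                                   # number of standard errors for a .05 alpha level
lower = results.params - z * results.bse   # bse = expected sampling error of each coefficient
upper = results.params + z * results.bse
print(np.column_stack([lower, upper]))
print((lower > 0) | (upper < 0))           # True where the interval excludes zero -> significant
# results.conf_int(alpha=.05) gives the exact t-based interval for comparison
```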

Types of Influential Observations

Influential observations include all observations that
• lie outside the general patterns of the data set, or
• have a disproportionate effect on the regression results.

Three basic types based upon the nature of their impact on the results:
• Outliers are observations that have large residual values and can be identified
only with respect to a specific regression model.
• Leverage points are observations that are distinct from the remaining
observations based on their independent variable values.
• Influential observations are the broadest category, including all observations that
have a disproportionate effect on the regression results. Influential observations
potentially include outliers and leverage points but may include other
observations as well.

Impacts of Influential Observations

Reinforcing (Figures 5.1a,b)


• Reinforce (5.1a) or even strongly
define (5.1b) the relationships
Conflicting (Figures 5.1c, e, f)
• have an effect that is contrary to
the general pattern of the
remaining data but still have
small residuals
Shifting (Figure 5.1d)
• affect all of the results in a similar
manner


Identifying Influential Observations

Step 1: Examining Residuals and Partial Regression Plots


• Residuals – defined by a) cases used to calculate and b) type of standardization
• Partial Regression Plots – depict relationship of variable controlling for other variables
Step 2: Identifying Leverage Points
• Leverage points – substantially different on one or more independent variables
• Diagnostic measures – Hat matrix and Mahalanobis distance (D2)
Step 3: Single-case Diagnostics
• Empirical measures of each case’s influence on:
• Individual coefficients – SDFBETA
• Overall results – Cook’s distance, COVRATIO and SDFFIT
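A minimal sketch of these diagnostics using statsmodels' influence measures (an assumption — the slides are software-agnostic; the standardized dfbetas/dffits below correspond roughly to SDFBETA/SDFFIT):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
X = sm.add_constant(rng.normal(size=(60, 2)))
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=60)
y[0] += 6                                            # plant one outlier

influence = sm.OLS(y, X).fit().get_influence()

studentized = influence.resid_studentized_external   # Step 1: studentized residuals
leverage    = influence.hat_matrix_diag              # Step 2: leverage (hat) values
dfbetas     = influence.dfbetas                      # Step 3: influence on individual coefficients
cooks_d, _  = influence.cooks_distance               # Step 3: influence on overall results
dffits, _   = influence.dffits
cov_ratio   = influence.cov_ratio

print(np.argmax(np.abs(studentized)), np.argmax(cooks_d))  # both flag the planted outlier
```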


Step 4: Selecting Influential Observations


Observations classified into four
groups based on combination
of residuals and leverage
A. No issues – fit well and no
extreme values
B. High leverage, but not outlier
– very different on IVs, but still
predicted well by model
C. Outliers, But Acceptable
Leverage – high residual, but
no extreme values on IV
D. Outliers and High Leverage –
poor prediction and quite
different on IV

Corrective Actions for Influentials

Four conditions reflecting the potential remedy for an influential observation


1. An error in observations or data entry
• remedy by correcting the data or deleting the case.
2. A valid but exceptional observation that is explainable by an extraordinary situation
• remedy by deletion of the case unless variables reflecting the extraordinary situation are
included in the regression equation.
3. An exceptional observation with no likely explanation
• presents a special problem because there is no reason for deleting the case, but its inclusion
cannot be justified either, suggesting analyses with and without the observations to make a
complete assessment.
4. An ordinary observation in its individual characteristics but exceptional in its combination of characteristics
• indicates modifications to the conceptual basis of the regression model and should be retained.


STAGE 5: INTERPRETING THE REGRESSION VARIATE

❑ Using the Regression Coefficients


❑ Assessing Multicollinearity
❑ Relative Importance of Independent Variables


Using the Regression Coefficients


Key Functions of Regression Coefficients
• Prediction
• Estimation – minimize the residuals and produce expected value for each observation.
• Forecasting – allow for predicted value for any set of values for independent variables.
• Explanation
• Interpretation with Regression Coefficients – primary measure of the relative impact and
importance of the independent variables in their relationship with the dependent variable
• Comparison between independent variables problematic if on different scales.
• Standardizing the Regression Coefficients: Beta Coefficients – converts variables to a common
scale and variability, the most common being a mean of zero (0.0) and standard deviation of
one (1.0).
• Eliminate the problem of dealing with different units of measurement and thus reflect the relative
impact on the dependent variable of a change in one standard deviation in either variable.
• Caveats in use as measure of importance:
• guide to the relative importance of individual independent variables only when collinearity is minimal.
• interpreted only in the context of the other variables in the equation.
• Reflect change in the dependent measure for a one standard deviation change in the independent variable.
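A minimal sketch contrasting raw regression weights with beta (standardized) coefficients (numpy/statsmodels assumed; simulated data measured on very different scales):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
income_dollars = rng.normal(50_000, 10_000, 200)   # measured in dollars
family_size    = rng.normal(3, 1, 200)             # measured in persons
y = 2 + 0.0001 * income_dollars + 0.9 * family_size + rng.normal(size=200)

raw = sm.OLS(y, sm.add_constant(np.column_stack([income_dollars, family_size]))).fit()

def z(v):  # standardize to mean 0 and standard deviation 1
    return (v - v.mean()) / v.std()

beta = sm.OLS(z(y), np.column_stack([z(income_dollars), z(family_size)])).fit()

print(raw.params)    # raw weights depend on the measurement units
print(beta.params)   # beta weights: change in SDs of Y per one-SD change in each X
```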

Assessing Multicollinearity

Multicollinearity
• relationship between two (collinearity) or more (multicollinearity) independent
variables. Multicollinearity occurs when any single independent variable is highly
correlated with a set of other independent variables.

Steps in Assessing and Addressing Multicollinearity


1. Understand new measures of correlation which incorporate multicollinearity.
2. Assess the degree of multicollinearity.
3. Determine its impact on the results.
4. Apply the necessary remedies if needed.


Measures of Correlation Incorporating Multicollinearity

Three Measures of Correlation


• Bivariate or zero-order correlation
• Association between two variables,
not accounting for the variation
shared with any other variables.
• Appears in the correlation matrix.
Measures reflecting unique explanation
• Semi-partial or part correlation
• Unique predictive effect.
• Partial correlation
• Incremental predictive effect.

Insights from each form of correlation will be seen in later sections.

Identifying Multicollinearity
Variance Inflation Factor (VIF) – measures how much the variance of the regression
coefficients is inflated by multicollinearity problems. The square root of the VIF is the
expected increase in the standard error of the coefficients.
• A VIF of 1 (the minimum possible value) indicates no correlation between the independent measures.
• As the VIF increases, it indicates a higher degree of association among the predictor variables.
• VIF values close to 1 are generally not enough to cause problems.
• However, a VIF value of 5 is generally thought to be the maximum acceptable; anything higher would indicate a problem with multicollinearity.
Tolerance – the amount of variance of an independent variable that is not explained by the
other independent variables (i.e., an independent variable is considered a dependent
variable, predicted by all the other independent variables).
• Small values for tolerance indicate problems of multicollinearity.
• The minimum cutoff value for tolerance is typically .20: values less than .20 signal a multicollinearity problem.

VIF and Tolerance are inversely related: VIF = 1 / Tolerance
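A minimal sketch computing VIF and tolerance for each predictor (statsmodels' variance_inflation_factor assumed; simulated collinear data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(8)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=200)   # highly collinear with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for i in range(1, X.shape[1]):               # skip the constant column
    vif = variance_inflation_factor(X, i)
    print(f"X{i}: VIF = {vif:.2f}, tolerance = {1 / vif:.3f}")
```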



Effects of Multicollinearity

All impacts arise from the shared variance among variables, which cannot be attributed to any single variable
Impacts on Estimation
• Decrease in explained variance – as multicollinearity increases, unique explanatory
effects of variables decline, thus overall decline in predicted variation (R2).
• Singularity – if multicollinearity reaches 1.0 (perfect collinearity), model estimation is precluded.
• Increases in standard error – as shown by VIF, multicollinearity increases standard
errors and makes it more difficult to achieve statistical significance.
• Reversal of signs of Coefficients – signs can “reverse” from bivariate relationships.
Impacts on Explanation
• Since coefficients only represent unique explanation, multicollinearity can obscure the total effect of a variable, which requires newer measures of relative importance.

How Much Multicollinearity is Acceptable And Remedies


Bivariate correlations
• Values of .70 or higher may result in problems and lower values may be problematic
if they are higher than the correlations with the dependent variable.
Tolerance or VIF
• Tolerance values of .20 or less, corresponding to a VIF of 5 or more, almost always indicate problems with multicollinearity.
• VIF values of even 3 to 5 may result in interpretation or estimation problems,
particularly when the relationships with the dependent variable are weaker.
Remedies for Multicollinearity
• Delete collinear variable(s).
• Apply dimensional reduction, such as composites from exploratory factor analysis.
• Specific estimation techniques – Bayesian or principal components regression.
• Do nothing – particularly if used solely for prediction, but still risky.

Relative Importance of Independent Variables

Represent the overall impact of the independent variables


• Accounting for both shared and unique variance explained.
• Measures that are comparable across all the independent variables.
All measures provide some insights, best to use combination
Direct Measures (from regression results)
• Bivariate correlations – "starting point" representing the total relationship, but do not separate unique versus shared effects.
• Squared semi-partial correlation – percentage of the variance in the dependent variable that is unique to the independent variable.
• Regression weights – only unique relationship, potentially impacted by
multicollinearity.
• Beta (standardized) weights – regression weights on standardized scale, but still
impacted by multicollinearity.
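A minimal sketch contrasting two of these direct measures — zero-order correlations and squared semi-partial correlations computed as the drop in R2 when a variable is removed (numpy/statsmodels assumed; simulated collinear data):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 500
x1 = rng.normal(size=n)
x2 = 0.7 * x1 + rng.normal(size=n)            # shares variance with x1
x3 = rng.normal(size=n)
y  = 1 + 0.6 * x1 + 0.4 * x2 + 0.5 * x3 + rng.normal(size=n)

Xs = {"x1": x1, "x2": x2, "x3": x3}

def r2(cols):
    X = sm.add_constant(np.column_stack(cols))
    return sm.OLS(y, X).fit().rsquared

full_r2 = r2(list(Xs.values()))
for name, x in Xs.items():
    zero_order = np.corrcoef(x, y)[0, 1]        # total (shared + unique) relationship
    others = [v for k, v in Xs.items() if k != name]
    sq_semipartial = full_r2 - r2(others)       # unique contribution to R2
    print(f"{name}: r = {zero_order:.3f}, unique R2 = {sq_semipartial:.3f}")
```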

Additional Measures of Relative Importance


Provide additional insights into variable impact in presence of multicollinearity
• All possible subsets regression – foundation for several measures discussed below.
• Structure coefficients – bivariate correlations with predicted value.
• Direct measure of contribution to predicted value.
• Do not distinguish between unique and shared variance.
• Commonality analysis – divides impact into unique and shared components
• Based on all possible subsets regression and all unique combinations of variables.
• Negative effects may indicate suppression effects.
• Dominance analysis – average squared semi-partial correlation across all possible
subsets regression models.
• Two forms of dominance between variables, based on whether one variable always has the greater squared semi-partial correlation, no matter what other variables are in the model.
• Relative Weights – sum to R2, but do not distinguish between unique and shared
variance.

STAGE 6: VALIDATION OF THE RESULTS

❑ Additional or Split Samples


❑ Calculating the PRESS Statistic
❑ Comparing Regression Models
❑ Forecasting with the Model


Validation of the Regression Model

Ensure that the model represents the general population (generalizability)


and is appropriate for the situations in which it will be used (transferability).
• Additional or Split Samples
• Comparison of results to ensure comparability of results across differing samples.
• Calculating the PRESS (Prediction Sum of Squares) statistic
• Employs a jackknife procedure to calculate a measure of predictive fit.
• Used to calculate P2 (coefficient of prediction), which is a measure of expected predictive accuracy.
• Comparing Regression Models
• Most common standard – R2, but it always increases as variables are added.
• Alternatives – Akaike Information Criterion and Bayesian Information Criterion.
• The model with the best predictive power has the lowest values on these measures.
• Forecasting
• Must always ensure comparability of new data to dataset used in estimation.
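A minimal sketch of the PRESS statistic using the leave-one-out (jackknife) identity PRESS = sum of (e_i / (1 - h_ii))^2, with 1 - PRESS/SST shown as one common form of a coefficient of prediction (numpy/statsmodels assumed; the text's exact P2 formula may differ):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(10)
X = sm.add_constant(rng.normal(size=(80, 3)))
y = X @ np.array([1.0, 0.5, -0.4, 0.3]) + rng.normal(size=80)

fit = sm.OLS(y, X).fit()
h = fit.get_influence().hat_matrix_diag

# leave-one-out (jackknife) prediction errors without refitting the model n times
press = np.sum((fit.resid / (1 - h)) ** 2)
sst = np.sum((y - y.mean()) ** 2)
print(press, 1 - press / sst)   # PRESS and a predictive-R2 style coefficient of prediction
```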

EXTENDING MULTIPLE REGRESSION

❑ Multilevel Models
❑ Panel Models


Multilevel Models

Unified framework for addressing many of the statistical issues which occur
naturally when hierarchical/nested data structures are present

Background
• Context – any external factor outside the unit of analysis that:
• impacts the outcome of multiple individuals.
• creates differences between separate contexts and fosters dependencies within a single context.
• Hierarchical data structure – observations which have a natural nesting effect created by contexts, with Level-1 observations nested within contexts represented at Level-2.
• Multilevel model (MLM) – extension of regression analysis that allows for the
incorporation of both individual (Level-1) and contextual (Level-2) effects with the
appropriate statistical treatment.


Basic Concepts and Issues


Matching Measurement Properties to Level
• Ensure that you avoid the ecological fallacy (using characteristics of a higher level to represent characteristics of a lower level) and the atomistic fallacy (assuming group-level relationships equate to individual-level relationships).
Intraclass Correlation (ICC)
• Degree of dependence among individuals within a higher-level grouping.
• Demonstrates that the individuals in a context are not independent, thus violating
this assumption of regression.
Fixed Versus Random Effects
• Fixed effects – regression coefficient made as a point estimate with no variation.
• Random effects – best estimate of the variability or distribution of effects across a
set of groups/contexts.
Sample size by level – small group sizes are acceptable as long as the number of groups is large.
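A minimal sketch of the ICC estimated from an intercept-only random-intercept model, as the share of total variance lying between groups (statsmodels MixedLM assumed, which the slides do not prescribe; simulated students-within-classrooms data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
groups = np.repeat(np.arange(40), 25)               # 40 classrooms, 25 students each
group_effect = rng.normal(0, 1.0, 40)[groups]       # context (Level-2) effect
df = pd.DataFrame({"group": groups,
                   "y": 5 + group_effect + rng.normal(0, 2.0, groups.size)})

null = smf.mixedlm("y ~ 1", df, groups=df["group"]).fit()  # intercept-only, random intercept
between = null.cov_re.iloc[0, 0]                    # Level-2 (between-group) variance
within = null.scale                                 # Level-1 (residual) variance
print("ICC =", between / (between + within))        # degree of dependence within groups
```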

Five stage Modeling Strategy for MLM


1. Sufficient Variation At Level 2
• Ensure enough variation between groups to justify including the level in the analysis.
2. Level-1 Model with Level-2 Effects
• Basic regression equation for Level-1 is specified as fixed effects and the Level-2
effects (intercepts) are added.
3. Introduce Level-2 Independent Variables
• Level-2 characteristics are introduced into the Level-2 equations to establish their
relationships with the Level-1 parameters.
4. Test for Random Coefficients of Level-1 Variables
• Do the intercepts or Level-1 coefficients vary across groups?
5. Add Cross-level Interactions to Explain Variations in Coefficients
• identify which Level-2 characteristics are related to the Level-1 variables that had
random coefficients.

Panel Models

Cross-sectional analysis of longitudinal or time-series data

Similarity to MLM
• Unit of observation (e.g., individual, class, firm) becomes a group (Level-2) with
multiple observations (Level-1).
• Accommodates serial correlation inherent in longitudinal data.

Benefits of Unified Framework


• By using a fixed effects estimate for an effect, the omitted variable problem
(endogeneity) is accounted for.
• Ability to employ a full range of independent variables also allows for the testing
of more complex models than alternative methods.
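A minimal sketch of a fixed-effects (within/LSDV) panel estimate obtained by adding unit dummies to a pooled regression (numpy/pandas/statsmodels assumed; simulated firm-year data — dedicated panel estimators would normally be used instead):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
firms, years = 30, 8
firm = np.repeat(np.arange(firms), years)
firm_effect = rng.normal(0, 2, firms)[firm]          # unobserved, time-invariant firm effect
x = 0.5 * firm_effect + rng.normal(size=firm.size)   # x is correlated with the firm effect
y = 1 + 0.7 * x + firm_effect + rng.normal(size=firm.size)
df = pd.DataFrame({"firm": firm, "x": x, "y": y})

pooled = smf.ols("y ~ x", df).fit()                  # ignores the within-firm dependence
fixed  = smf.ols("y ~ x + C(firm)", df).fit()        # firm dummies = fixed effects (LSDV)

print(pooled.params["x"])   # biased by the omitted firm effect (endogeneity)
print(fixed.params["x"])    # ≈ 0.7 once the firm effects are controlled
```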

Basic Issues in Panel Models

Four Types of Variables


• Differ between units of analysis, but don’t change over time (e.g., gender).
• Change over time, but are the same for all units of analysis at any given time
period (e.g., national economic indicators).
• Vary over both time and units (e.g., income).
• Vary over time in a predictable pattern (e.g., any measure of age or tenure).

Selecting Between Fixed Versus Random Effects


• Trade-off between controlling for endogeneity with fixed effects and the greater statistical efficiency, but potentially biased results, of random effects.


Basic Issues in Panel Models

Types of Models
• Basic model – simple pooled regression, which disregards the interdependencies
among observations within a unit of analysis.
• Unit-specific results (similar to the random effects in the multilevel models) where
intercepts, coefficients or both vary by unit.
• Unique model – time dependent effects model, where the intercept and
coefficients may vary over time as well.

Adding Time
• Panel models also provide for estimating time-variant effects just as was possible
for unit-specific effects.
• Requires at least enough time periods for a basic relationship to be estimated (five
or more).

ILLUSTRATION OF A REGRESSION ANALYSIS


Stages One, Two and Three

Stage 1: Objectives
• Predict customer satisfaction based on customers' perceptions of HBAT's performance and identify the factors that lead to increased satisfaction.
Stage 2: Research Design
• Thirteen independent variables (X6 to X18).
• Meets minimum ratio of observations per variable – 7:1 with adequate power.
Stage 3: Assumptions
• Linearity – graphical analysis did not reveal nonlinear relationships.
• Homoscedasticity – only two variables (X6 and X17) had minimal violations.
• Normality – six variables indicated violations, thus requiring further analysis.


Stepwise Results

R2 – .791
Standard error of the estimate – .559

Five significant variables:


X9 – Complaint Resolution
X6 – Product Quality
X12 – Salesforce Image
X7 – E-Commerce
X11 – Product Line


Influential Plot
• Eight observations qualify as outliers, but still have acceptable leverage.
• Four observations have high leverage, but are well predicted by the model.
• No observations are outliers with high leverage.

Measures of Variable Importance

Most direct measures of variable importance point to X12 because it has the largest unique impact of all of the independent variables.

X9 also becomes relatively important when shared impact is considered.

EVALUATING ALTERNATIVE REGRESSION MODELS

❑ Confirmatory Regression Model


❑ Use of Summated Scales as Remedies for Multicollinearity
❑ Including a Nonmetric Independent Variable


Comparison of 4 Models

Combination of variable specification (original variables versus composite


measures) and variable selection (simultaneous and stepwise) options


Including a Nonmetric Independent Variable: X3

• X3 (Firm size) is binary nonmetric variable.


• Positive value of the coefficient indicates that large firms, given their
characteristics on the other five independent variables in the equation,
still have a customer satisfaction level that is about a quarter point higher
(.271) on the 10-point customer satisfaction question.


Learning Checkpoints

1. When should multiple regression be used?


2. Why should multiple regression be used?
3. What level of statistical significance and R2 would justify use of
multiple regression?
4. How do you use regression coefficients?
5. What are the options for interpreting the relative importance of
the significant variables?
6. How does variable specification (using original variables versus composites from exploratory factor analysis) impact the regression results for prediction and explanation?
