Lecture 6
Collinearity
- Collinearity, or multicollinearity, occurs when one explanatory variable is, or
nearly is, a linear combination of the other explanatory variables.
- Intuitively, with collinear data it is useful to think of explanatory variables as
highly correlated with one another.
- If an explanatory variable is collinear, then the question arises as to whether it
is redundant, that is, whether the variable provides little additional information
beyond the information in the other explanatory variables.
- The issues are: Is collinearity important? If so, how does it affect our model fit, and how do we detect it? To address the first question, consider a somewhat pathological example:
  i        1    2    3    4
  y_i     23   83   63  103
  x_{i1}   2    8    6   10
  x_{i2}   6    9    8   10
Correlation coefficient matrix
        y    x1   x2
  y     1
  x1    1    1
  x2    1    1    1

Every pairwise correlation equals one: here y = 3 + 10 x1 and x2 = 5 + 0.5 x1, so each variable is an exact linear function of the others.
Collinearity Facts
• Collinearity does not preclude us from getting good fits nor from making
predictions of new observations. Note that in the prior example, we got
perfect fits.
• Remove one of the correlated variables: If you have a set of predictor vari-
ables that are highly correlated, you can choose to keep only one of them
in the model. This approach ensures that the remaining variables are not
redundant and reduces the problem of collinearity. The selection can be
based on domain knowledge, statistical significance, or the strength of the
relationship with the outcome variable.
• Collect more data: Increasing the sample size can sometimes help alleviate
collinearity issues. With a larger dataset, the estimates of the regression
coefficients can become more stable and reliable, even in the presence of
collinearity.
Before applying these strategies, it’s crucial to identify and quantify the degree
of collinearity. Common diagnostics include calculating correlation coefficients,
variance inflation factors (VIF), and examining the eigenvalues of the correlation
matrix.
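As a rough sketch of how these diagnostics might be computed in R (the data frame `dat` and the variable names y, x1, x2, x3 are illustrative assumptions, and vif() comes from the car package):

library(car)                                   # provides vif()

fit <- lm(y ~ x1 + x2 + x3, data = dat)

cor(dat[, c("x1", "x2", "x3")])                # pairwise correlations among predictors
vif(fit)                                       # variance inflation factors
eigen(cor(dat[, c("x1", "x2", "x3")]))$values  # eigenvalues of the predictor correlation
                                               # matrix; values near zero signal
                                               # near-linear dependence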
• To capture more complex relationships among several variables, we intro-
duce the variance inflation factor (VIF).
• We interpret $R_j = \sqrt{R_j^2}$ as the multiple correlation coefficient between $x_j$ and linear combinations of the other $x$'s, where $R_j^2$ is the coefficient of determination from the regression of $x_j$ on the other explanatory variables.
• The variance inflation factor (VIF) for the $j$th predictor is defined as
$$VIF_j = \frac{1}{1 - R_j^2}, \qquad j = 1, 2, \ldots, k.$$
• A larger $R_j^2$ results in a larger $VIF_j$; this means greater collinearity between $x_j$ and the other $x$'s.
• Now $R_j^2$ alone is enough to capture the linear relationship of interest. However, we use $VIF_j$ instead of $R_j^2$ as our measure of collinearity because of the algebraic relationship
$$se(b_j) = s \, \frac{\sqrt{VIF_j}}{s_{x_j}\sqrt{n-1}}.$$
Here, $se(b_j)$ and $s$ are the standard error of $b_j$ and the residual standard deviation from a full fit of $y$ on $x_1, \ldots, x_k$. Further,
$$s_{x_j} = \sqrt{(n-1)^{-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$$
is the sample standard deviation of the $j$th variable $x_j$.
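As a sketch (not from the lecture), $VIF_j$ can also be computed directly from its definition by regressing $x_j$ on the remaining predictors; the data frame `dat` and variable names below are again illustrative:

R1sq <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared  # R_1^2 from regressing x1 on the others
VIF1 <- 1 / (1 - R1sq)                                   # VIF_1 = 1 / (1 - R_1^2)
VIF1

The R output below applies these ideas to the Liquidity data, regressing VOLUME on PRICE, SHARE, and VALUE.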
Call:
lm(formula = VOLUME ~ PRICE + SHARE + VALUE, data = Liquidity)

Residuals:
    Min      1Q  Median      3Q     Max
-20.708  -4.179  -1.091   3.108  28.230

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.90916    1.60140   4.939 2.59e-06 ***
PRICE        -0.02224    0.03503  -0.635   0.5267
SHARE         0.05372    0.01034   5.194 8.63e-07 ***
VALUE         0.31271    0.16150   1.936   0.0552 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.722 on 119 degrees of freedom
Multiple R-squared:  0.6101,    Adjusted R-squared:  0.6003
F-statistic: 62.07 on 3 and 119 DF,  p-value: < 2.2e-16

> vif(Model)
   PRICE    SHARE    VALUE
1.512535 3.827365 4.685602
Because each VIF statistic is less than 10, there is little reason to suspect
severe collinearity.
This is interesting because you may recall that there is a perfect relation-
ship among PRICE, SHARE, and VALUE in that we defined the market
value to be VALUE = PRICE × SHARE. However, the relationship is
multiplicative, and hence is nonlinear. Because the variables are not lin-
early related, it is valid to enter all three into the regression model. From
a financial perspective, the variable VALUE is important because it mea-
sures the worth of a firm. From a statistical perspective, the variable
VALUE quantifies the interaction between PRICE and SHARE (interac-
tion variables were introduced in Section 3.5.3).
In many applications, however, severe collinearity does arise among variables. Because the data tend to be observational rather than experimental in nature, there is little that the analyst can do to avoid this situation.
Note: When severe collinearity exists, often the only option is to remove one
or more variables from the regression equation.
Example 2
For a linear regression model with two explanatory variables, $X_1$ and $X_2$, you are given:
Determine which of the explanatory variables will exceed the threshold of the
variance inflation factor for determining excessive collinearity.
Solution: In class
Suppressor Variables
• As we have seen, severe collinearity can seriously inflate standard errors
of regression coefficients.
• Because we rely on these standard errors for judging the usefulness of ex-
planatory variables, our model selection procedures and inferences may be
deficient in the presence of severe collinearity.
• Despite these drawbacks, mild collinearity in a dataset should not be
viewed as a deficiency of the dataset; it is simply an attribute of the
available explanatory variables.
Example 3
Given below is the correlation matrix and scatter plot matrix of hypothetical
dataset of 50 observations with a response variable (y) and two explanatory
variables (x1 and x2 )
(Figure: scatterplot matrix of the response and two explanatory variables for the suppressor variable example.)
Here, we see that the two explanatory variables are highly correlated. Now
recall, for regression with one explanatory variable, that the correlation coeffi-
cient squared is the coefficient of determination. Thus,
• for a regression of $y$ on $x_1$, the coefficient of determination is $(0.188)^2 = 3.5\%$.
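As a quick side check (with simulated data, not the lecture's dataset), the claim that the squared correlation equals the coefficient of determination in one-variable regression can be verified in R:

set.seed(1)
x1 <- rnorm(50)
y  <- 0.2 * x1 + rnorm(50)      # weakly related, echoing a small correlation

cor(y, x1)^2                    # squared correlation coefficient
summary(lm(y ~ x1))$r.squared   # coefficient of determination; identical value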
Selection Criteria
Goodness of Fit
• Goodness of fit is a measure that evaluates how well an observed data set
fits a particular statistical model or hypothesis.
• It assesses the extent to which the model accurately represents the under-
lying patterns and variability in the data.
• In other words, it measures how well the model fits the observed data.
• Specifically, we interpret the fitted value $\hat{y}_i$ to be the best model approximation of the $i$th observation and compare it to the actual value $y_i$.
• We have quantified this through the size of the typical error ($s$), the coefficient of determination ($R^2$), and an adjusted version ($R_a^2$).
A further selection criterion is the $C_p$ statistic,
$$C_p = \frac{(\text{Error SS})_p}{s_{full}^2} - n + 2p,$$
where $(\text{Error SS})_p$ is the error sum of squares of a candidate model with $p$ regression coefficients and $s_{full}^2$ is the residual variance estimate from the model containing all candidate variables,
with smaller values of $C_p$ being more desirable.
Note: For most datasets, these criteria recommend the same model, so an analyst can report any or all three statistics. However, for some applications, they lead to different recommended models. In this case, the analyst needs to rely more heavily on non-data-driven criteria for model selection (which are always important in any regression application).
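A minimal sketch of computing $C_p$ in R for one candidate model, assuming a hypothetical data frame `dat`, a full model with all candidate predictors, and a reduced candidate model (names are illustrative):

full    <- lm(y ~ x1 + x2 + x3, data = dat)   # model with all candidate predictors
reduced <- lm(y ~ x1, data = dat)             # candidate model under consideration

n      <- nrow(dat)
p      <- length(coef(reduced))               # coefficients in the candidate model (including intercept)
s2full <- summary(full)$sigma^2               # s_full^2, residual variance of the full model
errSS  <- sum(resid(reduced)^2)               # (Error SS)_p for the candidate model

Cp <- errSS / s2full - n + 2 * p              # smaller values are preferred
Cp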
Example 4
An actuary relates insurance sales ($y$) to eight predictors $x_1, x_2, \ldots, x_8$. She fits two linear regression models to 27 observations. Model A contains all eight predictor variables, while Model B contains only the first three predictor variables. The ANOVA tables follow:
Model A
  SOURCE       SS        df   MS
  Regression   115,175    8   14,397
  Error         76,893   18    4,272
  Total        192,068   26

Model B
  SOURCE       SS        df   MS
  Regression    84,344    3   28,115
  Error        107,724   23    4,684
  Total        192,068   26
Solution: In class
Model Validation
Model validation is the process of confirming that our proposed model is appro-
priate, especially in light of the purposes of the investigation.
Idea
The ideal situation is to have available two sets of data, one for model develop-
ment and one for model validation.
In this way, the data used to validate the model is unaffected by the procedures
used to formulate the model.
Unfortunately, rarely will two sets of data be available to the investigator. However, we can implement the validation process by splitting the dataset into two subsamples. We call these the model development and validation subsamples, respectively. They are also known as training and testing samples, respectively. The procedure is:
1. Split the sample of $n = n_1 + n_2$ observations into a model development subsample of size $n_1$ and a validation subsample of size $n_2$.
2. Fit a candidate model to the model development subsample.
3. Using the model created in step (2) and the explanatory variables from the validation subsample, predict the dependent variables in the validation subsample, $\hat{y}_i$, for $i = n_1 + 1, \ldots, n_1 + n_2$. (To get these predictions, you may need to transform the dependent variables back to the original scale.)
4. Assess the proximity of the predictions to the held-out data. One measure
is the sum of squared prediction errors:
$$SSPE = \sum_{i=n_1+1}^{n_1+n_2} (y_i - \hat{y}_i)^2$$
Repeat steps (2) through (4) for each candidate model. Choose the model with
the smallest SSPE.
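A sketch of steps (1) through (4) in R for a single candidate model, again with a hypothetical data frame `dat` and an illustrative 50/50 split:

set.seed(1)
n    <- nrow(dat)
dev  <- sample(seq_len(n), size = floor(n / 2))   # model development (training) rows
val  <- setdiff(seq_len(n), dev)                  # validation (testing) rows

fit  <- lm(y ~ x1 + x2 + x3, data = dat[dev, ])   # fit on the development subsample
yhat <- predict(fit, newdata = dat[val, ])        # predict the held-out responses

SSPE <- sum((dat$y[val] - yhat)^2)                # sum of squared prediction errors
SSPE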
Criticism of SSPE
1. First, it is clear that it takes a considerable amount of time and effort to
calculate this statistic for each of several candidate models. However, as
with many statistical techniques, this is merely a matter of having spe-
cialized statistical software available to perform the steps described earlier.
3. Third, and perhaps most important, the choice of the relative subset sizes, $n_1$ and $n_2$, is not clear. Various researchers recommend different
proportions for the allocation. For datasets with 500 or more observations,
use 50% of the sample for out-of-sample validation. Hastie, Tibshirani,
and Friedman (2001) remark that a typical split is 50% for development
and/or training, 25% for validation, and the remaining 25% for a third
stage for further validation that they call testing.
Cross-Validation
Cross-validation is the technique of model validation that splits the data into
two disjoint sets.
• We just discussed out-of-sample validation where the data was split ran-
domly into two subsets both containing a sizable percentage of data.
• An alternative is leave-one-out cross-validation, which underlies the PRESS (predicted residual sum of squares) statistic:
1. From the full sample, omit the $i$th observation and fit the model to the remaining $n - 1$ observations.
2. Use the regression coefficients computed in step one and the explanatory variables for the $i$th observation to compute the predicted response, $\hat{y}_{(i)}$.
3. Repeat steps one and two for each $i = 1, 2, \ldots, n$ and summarize with
$$PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2.$$
Each leave-one-out fit and prediction is similar to the calculation of the SSPE statistic with $n_1 = n - 1$ and $n_2 = 1$.
Advantages
• The PRESS statistic is less computationally intensive than SSPE.
Disadvantage
PRESS does not enjoy the appearance of independence between the estimation
and prediction aspects, unlike SSPE.
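A sketch of PRESS in R for the hypothetical model used in the earlier sketches. A single fit suffices because of the leave-one-out identity $e_{(i)} = e_i / (1 - h_{ii})$, where $h_{ii}$ is the $i$th leverage; this identity is what keeps PRESS computationally cheap:

fit   <- lm(y ~ x1 + x2 + x3, data = dat)             # hypothetical model, as before
PRESS <- sum((resid(fit) / (1 - hatvalues(fit)))^2)   # leave-one-out prediction errors, squared and summed
PRESS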
Extra Information
Heteroscedasticity
• When studying the linear regression model, we assume that the random error terms across the $n$ observations possess the same variance, i.e., $Var(\varepsilon_i) = \sigma^2$ for all $i = 1, 2, \ldots, n$, for some unknown parameter $\sigma^2$ (the homoscedasticity assumption implies the same amount of variability for all $n$ observations about their expected values). In practice, however, it is often the case that the amount of uncertainty varies on a case-by-case basis.
– Unbiasedness: It turns out that the least squares estimators are still unbiased even in the presence of heteroscedasticity. This follows from $E(b) = (X'X)^{-1}X'E(y) = (X'X)^{-1}X'X\beta = \beta$, which does not depend on the form of $Var(\varepsilon_i)$.
Detecting Heteroscedasticity
Before developing a strategy for dealing with heteroscedasticity, it is imperative
to assess, or detect its presence. or extent.
• Graphical method: Residual plots. One of the simplest ways to de-
tect heteroscedasticity is to plot the residuals against the predicted values
or the predictors. If the spread of the residuals increases or decreases
systematically as the predicted values change, it suggests the presence of
heteroscedasticity. Look for funnel-shaped patterns or increasing variabil-
ity as the predicted values increase.
• More formal test: Breusch-Pagan test. The Breusch-Pagan test
is a statistical test specifically designed to detect heteroscedasticity. It
involves regressing the squared residuals on the independent variables and
then performing a hypothesis test on the coefficients. If the p-value is
significant, it suggests the presence of heteroscedasticity. However, this
test assumes that the errors follow a normal distribution.
To illustrate, let us consider a test due to Breusch and Pagan (1980). Specifically, this test examines the alternative hypothesis
$$H_a: Var(y_i) = \sigma^2 + z_i'\gamma, \qquad i = 1, 2, \ldots, n,$$
where $\sigma^2$ is a baseline variance value, $z_i$ is a known vector of variables for the $i$th observation, and $\gamma$ is the corresponding vector of parameters; the null hypothesis of homoscedasticity corresponds to $\gamma = 0$.
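A sketch of both detection approaches in R for a hypothetical fitted model; bptest() from the lmtest package is one common implementation of the Breusch-Pagan test:

library(lmtest)                       # provides bptest()

fit <- lm(y ~ x1 + x2 + x3, data = dat)

# Graphical check: residuals versus fitted values (look for a funnel shape)
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Formal check: Breusch-Pagan test; a small p-value suggests heteroscedasticity
bptest(fit)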
Solutions to Heteroscedasticity
Weighted least squares
- The least squares estimators are less useful for data sets with severe heteroscedasticity.
- Suppose the variances differ by observation, say $Var(\varepsilon_i) = \sigma^2 / w_i$ for known positive weights $w_i$. Multiplying the regression equation by $\sqrt{w_i}$, we then have
$$y_i^* = y_i \sqrt{w_i} = (\beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \varepsilon_i)\sqrt{w_i},$$
where $\varepsilon_i^* = \varepsilon_i \sqrt{w_i}$ has homoscedastic variance $\sigma^2$:
$$Var(\varepsilon_i^*) = (\sqrt{w_i})^2 \left(\frac{\sigma^2}{w_i}\right) = \sigma^2.$$
- Thus, with these rescaled variables, all inference can proceed as before.
- The estimates of the $\beta_j$'s obtained from this weighted model are called the weighted least squares (WLS) estimates. In matrix algebra, the vector of WLS estimators can be expressed as
$$b_{WLS} = (X'WX)^{-1}X'Wy,$$
where
$$W = \begin{pmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{pmatrix}$$
is a diagonal matrix holding the $n$ weights, and $X$ is the design matrix of the original, unscaled model. If $W$ is the identity matrix, then the WLS estimator reduces to the ordinary least squares estimator $b = (X'X)^{-1}X'y$.
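A sketch of weighted least squares in R, assuming the weights are available in a vector w of n positive values (a hypothetical quantity here); the weights argument of lm() carries out the rescaling automatically, and the matrix form gives the same estimates:

fit_wls <- lm(y ~ x1 + x2 + x3, data = dat, weights = w)   # WLS via lm()
summary(fit_wls)

# Equivalent matrix form: b_WLS = (X'WX)^{-1} X'W y
X <- model.matrix(fit_wls)            # design matrix of the original, unscaled model
W <- diag(w)                          # diagonal matrix of the weights
solve(t(X) %*% W %*% X, t(X) %*% W %*% dat$y)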
Transformations
- Another approach that handles severe heteroscedasticity is to transform the dependent variable, typically with a logarithmic transformation of the form $y^* = \ln y$ or a square root transformation $y^* = \sqrt{y}$, which serve to shrink the spread-out data and restore homoscedasticity.
- Note: If some of the response values are negative, making the logarithmic and square root transformations not directly applicable, then it is common to add a sufficiently large constant to each $y_i$ to make all response values positive.
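A sketch in R of refitting the hypothetical model on a transformed response, shifting by a constant first if any responses are non-positive:

shift <- if (min(dat$y) <= 0) abs(min(dat$y)) + 1 else 0    # constant added if any y_i <= 0

fit_log  <- lm(log(y + shift)  ~ x1 + x2 + x3, data = dat)  # y* = ln(y + c)
fit_sqrt <- lm(sqrt(y + shift) ~ x1 + x2 + x3, data = dat)  # y* = sqrt(y + c)

# Re-examine the residual plot of a transformed fit for a more even spread
plot(fitted(fit_log), resid(fit_log), xlab = "Fitted values", ylab = "Residuals")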
Limitations:
- They are often monotonic transformations, so they will not help with variability patterns that are non-monotonic.