Lecture 6


University of Windsor, Winter 2025

Actuarial Regression and Time Series


ACSC-4200 / ACSC-8200

Dr. Poonam S. Malakar

Collinearity
- Collinearity, or multicollinearity, occurs when one explanatory variable is, or nearly is, a linear combination of the other explanatory variables.
- Intuitively, with collinear data it is useful to think of the explanatory variables as highly correlated with one another.
- If an explanatory variable is collinear, the question arises as to whether it is redundant, that is, whether the variable provides little additional information beyond the information in the other explanatory variables.
- The key issues are: Is collinearity important? If so, how does it affect our model fit, and how do we detect it? To address the first question, consider a somewhat pathological example.

Example 1: Perfectly correlated explanatory variables

You are asked to fit the model E(y) = β0 + β1 x1 + β2 x2 to a dataset. The dataset under consideration is:

i       1    2    3    4
yi     23   83   63  103
xi1     2    8    6   10
xi2     6    9    8   10

Actually x2 = 5 + x1 /2 (perfect relationship between x1 and x2)

- To detect collinearity, begin with a matrix of correlation coefficients of the explanatory variables.
- This matrix is simple to create, easy to interpret, and quickly captures linear relationships between pairs of variables.
- A scatter plot matrix provides a visual reinforcement of the summary statistics in the correlation matrix.

Correlation coefficient matrix

        y    x1   x2
y       1
x1      1    1
x2      1    1    1

Every pairwise correlation equals 1 because y, x1, and x2 are exact linear functions of one another in this dataset.

If we run the model y = β0 + β1 x1 + β2 x2 + ε on these data, we get a perfect fit.
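As a concrete illustration, the following is a minimal R sketch of this example (the data frame name dat is an assumption, not part of the original notes). With perfectly collinear explanatory variables, R flags the fit as rank deficient and reports NA for the aliased coefficient, yet the fitted values still reproduce y exactly.

# Hedged sketch of Example 1
dat <- data.frame(y  = c(23, 83, 63, 103),
                  x1 = c(2, 8, 6, 10),
                  x2 = c(6, 9, 8, 10))   # note x2 = 5 + x1/2 exactly
fit <- lm(y ~ x1 + x2, data = dat)
summary(fit)
# One coefficient is reported as NA ("1 not defined because of singularities"),
# and the residuals are all zero because y = 3 + 10*x1 for these four observations.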

Collinearity Facts
• Collinearity does not preclude us from getting good fits or from making predictions of new observations. Note that in the prior example, we got perfect fits.

• Estimates of error variances and, therefore, tests of model adequacy are still reliable.

• In cases of serious collinearity, standard errors of individual regression coefficients are greater than in cases where, other things being equal, serious collinearity does not exist. With large standard errors, individual regression coefficients may not be meaningful. Further, because a large standard error means that the corresponding t-ratio is small, it is difficult to detect the importance of a variable.

Options for handling Collinearity


Here are a few strategies to handle collinearity:

• Remove one of the correlated variables: If you have a set of predictor vari-
ables that are highly correlated, you can choose to keep only one of them
in the model. This approach ensures that the remaining variables are not
redundant and reduces the problem of collinearity. The selection can be
based on domain knowledge, statistical significance, or the strength of the
relationship with the outcome variable.

• Combine correlated variables: Instead of removing correlated variables, you can create a single predictor variable by combining them. For example, if you have two variables measuring similar concepts, you can calculate their average or create an index that captures the common underlying construct. This way, you retain the information while reducing collinearity.

• Regularization techniques: Regularization methods like ridge regression and lasso regression can be effective in handling collinearity. These techniques add a penalty term to the regression model, which shrinks the regression coefficients and reduces the impact of collinear predictors. Ridge regression, in particular, can be useful when you want to retain all the predictors in the model while reducing their collinearity-related issues (a brief sketch appears after this list).

• Principal Component Analysis (PCA): PCA is a dimensionality reduction technique that transforms a set of correlated predictors into a smaller set of uncorrelated variables called principal components. By using the principal components as predictors in the regression model, you can effectively handle collinearity. However, it's important to note that the interpretability of the coefficients may be lost because the principal components are combinations of the original predictors.

• Collect more data: Increasing the sample size can sometimes help alleviate
collinearity issues. With a larger dataset, the estimates of the regression
coefficients can become more stable and reliable, even in the presence of
collinearity.
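As promised in the regularization bullet above, here is a minimal R sketch of ridge regression on simulated collinear data. It assumes the glmnet package is installed; the variable names and the simulated relationship are illustrative only, not part of the original notes.

# Hedged sketch: ridge regression on simulated collinear data
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)        # x2 is nearly a linear combination of x1
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

library(glmnet)                        # assumed installed
X     <- cbind(x1, x2)                 # predictor matrix (glmnet takes a matrix, not a formula)
cvfit <- cv.glmnet(X, y, alpha = 0)    # alpha = 0 selects the ridge penalty
coef(cvfit, s = "lambda.min")          # shrunken coefficients at the cross-validated penalty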

Before applying these strategies, it’s crucial to identify and quantify the degree
of collinearity. Common diagnostics include calculating correlation coefficients,
variance inflation factors (VIF), and examining the eigenvalues of the correlation
matrix.
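A minimal R sketch of these diagnostics on the same kind of simulated data (the car package, assumed installed, supplies vif()):

# Hedged sketch: correlation matrix, VIFs, and eigenvalues of the correlation matrix
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)          # highly collinear pair
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)

cor(cbind(y, x1, x2))                    # pairwise correlations

library(car)                             # assumed installed; provides vif()
vif(lm(y ~ x1 + x2))                     # large VIFs signal collinearity

eigen(cor(cbind(x1, x2)))$values         # eigenvalues near zero signal near-linear dependence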

Variance Inflation factor


• Correlation and scatterplot matrices capture only relationships between
pairs of variables.

• To capture more complex relationships among several variables, we intro-
duce the variance inflation factor (VIF).

• To define a VIF, suppose that the set of explanatory variables is labeled x1, x2, . . . , xk. Now, run the regression using xj as the response and the other x's (x1, . . . , xj−1, xj+1, . . . , xk) as the explanatory variables. Denote the coefficient of determination from this regression by Rj².

• We interpret Rj = √(Rj²) as the multiple correlation coefficient between xj and linear combinations of the other x's.

• The variance inflation factor (VIF) for the jth predictor is defined as

      VIFj = 1 / (1 − Rj²),   for j = 1, 2, . . . , k

  (illustrated in the sketch after this list).

• A larger Rj² results in a larger VIFj; this means greater collinearity between xj and the other x's.

• Rj² alone is enough to capture the linear relationship of interest. However, we use VIFj instead of Rj² as our measure of collinearity because of the algebraic relationship

      se(bj) = s · √(VIFj) / ( sxj · √(n − 1) ).

  Here, se(bj) is the standard error of bj and s is the residual standard deviation from a full fit of y on x1, . . . , xk. Further, sxj = √( Σi (xij − x̄j)² / (n − 1) ) is the sample standard deviation of the jth variable xj.

• As a commonly used rule of thumb, we say that severe collinearity is present if there exists a VIF value greater than 10 (or, equivalently, any Rj² > 0.9).
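To see the definition in action, here is a minimal R sketch on simulated data (names are illustrative); it computes Rj² from the auxiliary regression and checks that 1/(1 − Rj²) matches car::vif().

# Hedged sketch: VIF via the auxiliary regression
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)      # x1 and x2 are correlated
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

R2_1  <- summary(lm(x1 ~ x2 + x3))$r.squared   # auxiliary regression of x1 on the other x's
VIF_1 <- 1 / (1 - R2_1)

library(car)                                    # assumed installed
c(by_hand = VIF_1, packaged = vif(lm(y ~ x1 + x2 + x3))["x1"])   # the two values agree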

As an example, consider a regression of VOLUME on PRICE, SHARE, and VALUE (data: the Liquidity dataset).

> Liquidity <- read.csv("/Users/poonammalakar/Desktop/ACSC 4200/Lecture notes/Data/Liquidity.csv",
+                       header = TRUE, stringsAsFactors = FALSE)
> Model <- lm(VOLUME ~ PRICE + SHARE + VALUE, data = Liquidity)
> summary(Model)

Call:
lm(formula = VOLUME ~ PRICE + SHARE + VALUE, data = Liquidity)

Residuals:
    Min      1Q  Median      3Q     Max
-20.708  -4.179  -1.091   3.108  28.230

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.90916    1.60140   4.939 2.59e-06 ***
PRICE        -0.02224    0.03503  -0.635   0.5267
SHARE         0.05372    0.01034   5.194 8.63e-07 ***
VALUE         0.31271    0.16150   1.936   0.0552 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.722 on 119 degrees of freedom
Multiple R-squared:  0.6101,    Adjusted R-squared:  0.6003
F-statistic: 62.07 on 3 and 119 DF,  p-value: < 2.2e-16

> library(car)    # vif() is assumed to come from the car package
> vif(Model)
   PRICE    SHARE    VALUE
1.512535 3.827365 4.685602

Because each VIF statistic is less than 10, there is little reason to suspect
severe collinearity.

This is interesting because you may recall that there is a perfect relation-
ship among PRICE, SHARE, and VALUE in that we defined the market
value to be VALUE = PRICE × SHARE. However, the relationship is
multiplicative, and hence is nonlinear. Because the variables are not lin-
early related, it is valid to enter all three into the regression model. From
a financial perspective, the variable VALUE is important because it mea-
sures the worth of a firm. From a statistical perspective, the variable
VALUE quantifies the interaction between PRICE and SHARE (interac-
tion variables were introduced in Section 3.5.3).

What can we do in the presence of collinearity


• Recode the variables by “centering” - that is, subtract the mean and
divide by the standard deviation. For example, create a new variable
x∗ = (x − x̄)/s

• Ignore the collinearity in the analysis but comment on it in the interpretation. This is probably the most common approach. It is a fact of life that, when dealing with business and economic data, collinearity tends to exist among variables. Because the data are observational rather than experimental in nature, there is little that the analyst can do to avoid this situation.

• Replace one or more variables by auxiliary variables or transformed versions.

• Remove one or more variables. Removing a variable is easy; deciding which one to remove is hard.


– Use interpretation. Which variable(s) do you feel most comfortable
with?

– Use automatic variable selection procedures to suggest a model.

Note: When severe collinearity exists, often the only option is to remove one
or more variables from the regression equation.

Example 2
For a linear regression model with two explanatory variables, X1 and X2, you are given:

i. Rj² is the coefficient of determination obtained from regressing the jth explanatory variable on the other explanatory variable.

ii. R1² = 0.95

Determine which of the explanatory variables will exceed the threshold of the
variance inflation factor for determining excessive collinearity.

Solution: In class

Suppressor Variables
• As we have seen, severe collinearity can seriously inflate standard errors
of regression coefficients.

• Because we rely on these standard errors for judging the usefulness of ex-
planatory variables, our model selection procedures and inferences may be
deficient in the presence of severe collinearity.

• Despite these drawbacks, mild collinearity in a dataset should not be
viewed as a deficiency of the dataset; it is simply an attribute of the
available explanatory variables.

• Even if one explanatory variable is nearly a linear combination of the others, that does not necessarily mean that the information it provides is redundant.

• To illustrate, we now consider a suppressor variable, an explanatory variable that increases the importance of other explanatory variables when included in the model.

Suppressors are variables that, when added to a regression model, change the original relationship between x (a predictor) and y (the outcome) by making it stronger, weaker, or no longer significant, or even by reversing the direction of the relationship (i.e., changing a positive relationship into a negative one).

Example 3
Given below are the correlation matrix and scatterplot matrix of a hypothetical dataset of 50 observations with a response variable (y) and two explanatory variables (x1 and x2).

[Figure: correlation matrix for the suppressor example.]

[Figure: scatterplot matrix of the response and two explanatory variables for the suppressor variable example.]

Here, we see that the two explanatory variables are highly correlated. Now
recall, for regression with one explanatory variable, that the correlation coeffi-
cient squared is the coefficient of determination. Thus,
• for a regression of y on x1 , the coefficient of determination is (0.188)2 = 3.5%.

• for a regression of y on x2, the coefficient of determination is (−0.022)² ≈ 0.05%.

• for a regression of y on x1 and x2, the coefficient of determination turns out to be a surprisingly high 80.7%.

The interpretation is that individually, both x1 and x2 have little impact on y. However, when taken jointly, the two explanatory variables have a significant effect on y. Although the correlation matrix shows that x1 and x2 are strongly linearly related, this relationship does not mean that x1 and x2 provide the same information. In fact, in this example the two variables complement one another.
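The pattern can be reproduced with a hypothetical construction in R (the numbers below will not match the figures above exactly; the design is an assumption chosen to mimic the effect): make x1 and x2 highly correlated and let y depend mainly on their difference.

# Hedged sketch: a suppressor-style example with simulated data
set.seed(3)
n  <- 50
x1 <- rnorm(n)
x2 <- 0.97 * x1 + rnorm(n, sd = sqrt(1 - 0.97^2))   # x1 and x2 highly correlated
y  <- x1 - x2 + rnorm(n, sd = 0.1)                   # y depends on the difference

summary(lm(y ~ x1))$r.squared        # small R-squared
summary(lm(y ~ x2))$r.squared        # small R-squared
summary(lm(y ~ x1 + x2))$r.squared   # jointly, the R-squared is much larger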

Selection Criteria
Goodness of Fit
• Goodness of fit is a measure that evaluates how well an observed data set
fits a particular statistical model or hypothesis.

• It assesses the extent to which the model accurately represents the under-
lying patterns and variability in the data.

• In other words, it measures how well the model fits the observed data.

• Specifically, we interpret the fitted value ŷi to be the best model approximation of the ith observation and compare it to the actual value yi.

• In linear regression, we examine the difference through the residual e = y − ŷ; small residuals imply a good model fit.

• We have quantified this through the size of the typical error (s), the coefficient of determination (R²), and an adjusted version (Ra²).

• For nonlinear models, we will need additional measures, and it is helpful to introduce these measures in this simpler linear case.

  – One such measure is Akaike's information criterion (AIC). For linear regression, it is defined as

        AIC = n·ln(s²) + n·ln(2π) + n + 3 + k.

    For model comparison, the smaller the AIC, the better the fit.

  – Later, we will introduce another measure, the Bayes information criterion (BIC), which gives a larger weight to the penalty for complexity. For model comparison, the smaller the BIC, the better the fit.

  – A third goodness-of-fit measure used in linear regression is the Cp statistic. To define this statistic, assume that we have available k explanatory variables x1, . . . , xk and run a regression to get s_full² as the mean square error. Now, suppose that we are considering using only p − 1 explanatory variables, so that there are p regression coefficients for the model under consideration. With these p − 1 explanatory variables, we run a regression to get the error sum of squares, (ErrorSS)p. Thus, we are in a position to define

        Cp = (ErrorSS)p / s_full² − n + 2p.

    As a selection criterion, we choose the model with a "small" Cp coefficient, where small is taken to be relative to p. In general, models with smaller values of Cp are more desirable. (A short numerical sketch follows the note below.)

Note: For most datasets, the three criteria recommend the same model, so an analyst can report any or all of the three statistics. However, for some applications, they lead to different recommended models. In this case, the analyst needs to rely more heavily on non-data-driven criteria for model selection (which are always important in any regression application).
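As a quick numerical illustration, the following R sketch (simulated data, illustrative names) evaluates the AIC formula from these notes for a full model and the Cp statistic for a reduced model. Note that R's built-in AIC() uses a slightly different convention, so only relative comparisons of such measures matter.

# Hedged sketch: the notes' AIC formula and Mallows' Cp on simulated data
set.seed(4)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

full    <- lm(y ~ x1 + x2 + x3)           # k = 3 explanatory variables
reduced <- lm(y ~ x1 + x2)                # p = 3 regression coefficients

k         <- 3
s2        <- summary(full)$sigma^2        # s^2 from the full fit
AIC_notes <- n * log(s2) + n * log(2 * pi) + n + 3 + k   # AIC = n ln(s^2) + n ln(2*pi) + n + 3 + k

s2_full   <- summary(full)$sigma^2        # mean square error of the full model
ErrorSS_p <- sum(residuals(reduced)^2)    # error sum of squares of the reduced model
p         <- length(coef(reduced))        # number of regression coefficients in the reduced model
Cp        <- ErrorSS_p / s2_full - n + 2 * p   # Cp = (ErrorSS)_p / s_full^2 - n + 2p

c(AIC_notes = AIC_notes, Cp = Cp)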

Example 4
An actuary relates insurance sales (y) to eight predictors x1, x2, . . . , x8. She fits two linear regression models to 27 observations. Model A contains all eight predictor variables, while Model B contains only the first three predictor variables. The ANOVA tables follow:

Model A
SOURCE        SS        df     MS
Regression    115,175    8     14,397
Error          76,893   18      4,272
Total         192,068   26

Model B
SOURCE        SS        df     MS
Regression     84,344    3     28,115
Error         107,724   23      4,684
Total         192,068   26

Using Frees' formula, calculate Mallows' Cp statistic for the three-predictor model.

Solution: In class

Model Validation
Model validation is the process of confirming that our proposed model is appro-
priate, especially in light of the purposes of the investigation.

Out of sample validation

Out-of-sample validation is a response to the data snooping identified in drawback 1 of stepwise regression.

Idea
The ideal situation is to have available two sets of data, one for model develop-
ment and one for model validation.

• We initially develop one, or several, models on a first dataset. The models developed from the first set of data are called our candidate models.

• Then, the relative performance of the candidate models could be measured on a second set of data.

In this way, the data used to validate the model is unaffected by the procedures
used to formulate the model.
Unfortunately, rarely will two sets of data be available to the investigator. How-
ever, we can implement the validation process by splitting the dataset into two
subsamples. We call these the model development and validation subsamples,
respectively. They are also known as training and testing samples, respectively.

Out of Sample Validation Procedure


1. Begin with a sample size of n and divide it into two subsamples, the
model development and validation subsamples. Let n1 and n2 denote
the size of each subsample. In cross-sectional regression, do this split
using a random sampling mechanism. Use the notation i = 1, . . . , n1
to represent observations from the model development subsample and
i = n1 + 1, . . . , n1 + n2 = n for the observations from the validation
subsample.

2. Using the model development subsample, fit a candidate model to the data set i = 1, . . . , n1.

3. Using the model created in step (2) and the explanatory variables from the validation subsample, predict the dependent variables in the validation subsample, ŷi, where i = n1 + 1, . . . , n1 + n2. (To get these predictions, you may need to transform the dependent variables back to the original scale.)

4. Assess the proximity of the predictions to the held-out data. One measure is the sum of squared prediction errors:

      SSPE = Σ_{i = n1+1}^{n1+n2} (yi − ŷi)²

Repeat steps (2) through (4) for each candidate model. Choose the model with
the smallest SSPE.
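A minimal R sketch of this procedure on simulated data (the names and the 50/50 split are illustrative assumptions):

# Hedged sketch: out-of-sample validation and SSPE for two candidate models
set.seed(5)
n   <- 200
x1  <- rnorm(n); x2 <- rnorm(n)
y   <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

dev_id <- sample(seq_len(n), size = n / 2)   # random split (cross-sectional data)
dev    <- dat[dev_id, ]                      # model development subsample
val    <- dat[-dev_id, ]                     # validation subsample

candA <- lm(y ~ x1 + x2, data = dev)         # candidate model A
candB <- lm(y ~ x1, data = dev)              # candidate model B

sspe <- function(model, newdata) sum((newdata$y - predict(model, newdata = newdata))^2)
c(A = sspe(candA, val), B = sspe(candB, val))   # choose the candidate with the smaller SSPE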

Criticism of SSPE
1. First, it is clear that it takes a considerable amount of time and effort to
calculate this statistic for each of several candidate models. However, as
with many statistical techniques, this is merely a matter of having spe-
cialized statistical software available to perform the steps described earlier.

2. Second, because the statistic itself is based on a random subset of the sample, its value will vary from analyst to analyst.

3. Third, and perhaps most important, is that the choice of the relative sub-
set sizes, n1 and n2 , is not clear. Various researchers recommend different
proportions for the allocation. For datasets with 500 or more observations,
use 50% of the sample for out-of-sample validation. Hastie, Tibshirani,
and Friedman (2001) remark that a typical split is 50% for development
and/or training, 25% for validation, and the remaining 25% for a third
stage for further validation that they call testing.

Cross-Validation
Cross-validation is the technique of model validation that splits the data into
two disjoint sets.

• We just discussed out-of-sample validation where the data was split ran-
domly into two subsets both containing a sizable percentage of data.

• Another popular method is leave-one-out cross-validation, where the validation sample consists of a single observation and the development sample is based on the remainder of the dataset. Especially for small sample sizes, an attractive leave-one-out cross-validation statistic is PRESS, the predicted residual sum of squares.

To define the statistic, consider the following procedure, in which we suppose that a candidate model is available.

PRESS Validation Procedure


1. From the full sample, omit the ith point and use the remaining n − 1 ob-
servations to compute regression coefficients.

2. Use the regression coefficients computed in step (1) and the explanatory variables for the ith observation to compute the predicted response, ŷ(i). This part of the procedure is similar to the calculation of the SSPE statistic with n1 = n − 1 and n2 = 1.

3. Now, repeat (1) and (2) for i = 1, 2, . . . , n. Summarizing, define

      PRESS = Σ_{i=1}^{n} (yi − ŷ(i))² = Σ_{i=1}^{n} ( ei / (1 − hii) )².

As with SSPE, this statistic is calculated for each of several competing models. Under this criterion, we choose the model with the smallest PRESS.
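Because of the identity above, PRESS can be computed from a single fit in R; a minimal sketch on simulated data (names are illustrative):

# Hedged sketch: PRESS from one fit via the leverages (hat values)
set.seed(6)
x   <- rnorm(40)
y   <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press   # computed for each candidate model; smaller is better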

Advantages
• The PRESS statistic is less computationally intensive than SSPE.

• This procedure is more efficient, an especially important consideration when the sample size is small (say, fewer than 50 observations).

Disadvantage
PRESS does not enjoy the appearance of independence between the estimation
and prediction aspects, unlike SSPE.

Extra Information

Heteroscedasticity
• When studying the linear regression model, we assume that the random error terms across the n observations possess the same variance, i.e., Var(εi) = σ² for all i = 1, 2, . . . , n, for some unknown parameter σ². (The homoscedasticity assumption implies the same amount of variability of all n observations around their expected values.) In practice, however, it is often the case that the amount of uncertainty varies on a case-by-case basis.

• Heteroscedasticity refers to a situation in regression analysis where the variability of the errors (residuals) is not constant across the range of the independent variables. In other words, the spread of the residuals tends to vary systematically with the predicted values of the dependent variable.

Implications of heteroscedasticity for the LSEs

We saw in previous lectures that the least squares estimators are the best linear unbiased estimators (BLUE). This optimality hinges upon the validity of the model assumptions, including the constancy of the variance of the error term. When the constant-variance assumption is at stake, so is the optimality of the least squares estimators.

– Unbiasedness: It turns out that the LSEs are still unbiased even in the presence of heteroscedasticity. This follows from

      E(β̂) = (X′X)⁻¹ X′E(Y) = (X′X)⁻¹ X′(Xβ) = β.

  Nothing about variances is used here.

– Efficiency: When the responses are heteroscedastic, the LSEs will no longer possess the lowest variance among the class of unbiased estimators. This motivates the use of alternative, more precise estimators.

Thus, detecting heteroscedasticity is important because its presence violates one of the assumptions of linear regression, namely homoscedasticity (constant variance of the errors). The presence of heteroscedasticity can lead to invalid hypothesis tests and inaccurate confidence intervals.

Detecting Heteroscedasticity
Before developing a strategy for dealing with heteroscedasticity, it is imperative to assess, or detect, its presence and extent.
• Graphical method: Residual plots. One of the simplest ways to de-
tect heteroscedasticity is to plot the residuals against the predicted values
or the predictors. If the spread of the residuals increases or decreases
systematically as the predicted values change, it suggests the presence of
heteroscedasticity. Look for funnel-shaped patterns or increasing variabil-
ity as the predicted values increase.

• More formal test: Breusch-Pagan test. The Breusch-Pagan test
is a statistical test specifically designed to detect heteroscedasticity. It
involves regressing the squared residuals on the independent variables and
then performing a hypothesis test on the coefficients. If the p-value is
significant, it suggests the presence of heteroscedasticity. However, this
test assumes that the errors follow a normal distribution.
To illustrate, let us consider a test due to Breusch and Pagan (1980). Specif-
ically, this test examines the alternative hypothesis

      Ha : Var(yi) = σ² + zi′γ,   i = 1, 2, . . . , n,

where σ² is a baseline variance value, zi is a known vector of variables, and γ is a p-dimensional vector of parameters.

Thus, the null hypothesis is H0 : γ = 0, which is equivalent to homoscedasticity, Var(yi) = σ².

Breusch-Pagan testing procedure

1. Fit a regression model and calculate the model residuals, ei.

2. Calculate the squared standardized residuals, ei*² = ei²/s².

3. Fit a regression model of e*² on z and calculate the resulting regression sum of squares.

4. The test statistic is LM = (RegressSSz)/2, where RegressSSz is the regression sum of squares from the model fit in step (3) (not the regression SS of the original model). Under H0, the test statistic follows a chi-square distribution with p degrees of freedom, denoted χ²p, where p is the dimension of γ.

5. At the α significance level, we reject H0 in favour of Ha when the test statistic exceeds the α upper quantile of the χ²p distribution.
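A minimal R sketch of this testing procedure on simulated data (taking z = x is an assumption made for illustration); the plot line also implements the graphical residual check described earlier.

# Hedged sketch: the Breusch-Pagan procedure, coded step by step
set.seed(7)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)      # error spread grows with x (heteroscedastic)

fit <- lm(y ~ x)
plot(fitted(fit), residuals(fit))            # graphical check: funnel-shaped spread

e2star <- residuals(fit)^2 / summary(fit)$sigma^2   # squared standardized residuals
aux    <- lm(e2star ~ x)                     # step (3): regress e*^2 on z (here z = x)
RegSSz <- sum((fitted(aux) - mean(e2star))^2)        # regression sum of squares of that fit
LM     <- RegSSz / 2                         # step (4): test statistic
pval   <- 1 - pchisq(LM, df = 1)             # p = dim(gamma) = 1 here
c(LM = LM, p.value = pval)
# The lmtest package's bptest(fit) implements a closely related packaged version.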

Solutions to Heteroscedasticity
Weighted least squares
- The least squares estimators are less useful for data sets with severe heteroscedasticity.

- One strategy is to use a variation of least squares estimation by weighting observations.

- Specifically, suppose it is known that the random errors satisfy E(εi) = 0 but that their variances vary with i such that

      Var(εi) = σ²/wi

  for some known weights w1, w2, . . . , wn, so that the variability is inversely proportional to the known weight wi.

- For example, if the unit of analysis i represents a geographical entity such as a state, you might use the number of people in the state as the weight, or if i represents a firm, you might use firm assets as the weighting variable.

- Larger values of wi indicate a more precise response variable through smaller variability.

- Such a model with heteroscedastic errors can be readily converted into a model with homoscedastic ones by multiplying all variables by √wi.

- If we define yi* = yi × √wi and xij* = xij × √wi, then we have

      yi* = yi × √wi = (β0 xi0 + β1 xi1 + · · · + βk xik + εi) × √wi
          = β0 xi0* + β1 xi1* + · · · + βk xik* + εi*,

  where εi* = εi × √wi has homoscedastic variance σ²:

      Var(εi*) = (√wi)² (σ²/wi) = σ².

- Thus, with these rescaled variables, all inference can proceed as earlier.

- The estimates of the βj's obtained from this weighted model are called the weighted least squares (WLS) estimates. In matrix algebra, the vector of WLS estimators can be expressed as

      bWLS = (X′WX)⁻¹ X′WY,

where

      W = diag(w1, w2, . . . , wn)

is a diagonal matrix hosting the n weights, and X is the design matrix of the original, unscaled model. If W is the identity matrix, then

      bWLS = (X′X)⁻¹ X′Y,

which is the usual least squares estimator.
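A minimal R sketch of weighted least squares on simulated data (the weights and names are illustrative); it fits with lm(..., weights = w) and checks the matrix formula above.

# Hedged sketch: weighted least squares two ways
set.seed(8)
n <- 100
x <- runif(n, 1, 10)
w <- runif(n, 0.5, 3)                        # known weights, so Var(eps_i) = sigma^2 / w_i
y <- 1 + 2 * x + rnorm(n, sd = 1 / sqrt(w))

fit_wls <- lm(y ~ x, weights = w)            # lm() minimizes the weighted sum of squares
coef(fit_wls)

X <- cbind(1, x)                             # design matrix of the original, unscaled model
W <- diag(w)                                 # diagonal weight matrix
solve(t(X) %*% W %*% X, t(X) %*% W %*% y)    # bWLS = (X'WX)^{-1} X'WY; matches coef(fit_wls)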

Transformations
- Another approach that handles severe heteroscedasticity is to transform the dependent variable, typically with a logarithmic transformation of the form y* = ln y or a square-root transformation y* = √y, which serve to shrink the spread-out data and restore homoscedasticity.

- Note: If some of the response values are negative, making the logarithmic and square-root transformations not directly applicable, then it is common to add a sufficiently large constant to each yi to make all response values positive.

For percentage data with negative percentages, the transformation 100·ln(1 + y/100) can be used.
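A tiny R sketch of the transformations just described (the numbers are illustrative only):

# Hedged sketch: variance-stabilizing transformations of the response
y <- c(12, 45, 3, 150, 60)            # illustrative positive responses
y_log  <- log(y)                      # logarithmic transformation (requires y > 0)
y_sqrt <- sqrt(y)                     # square-root transformation (requires y >= 0)

pct    <- c(-8, 2, 15, -1, 30)        # percentage data that may be negative
pct_tr <- 100 * log(1 + pct / 100)    # the transformation suggested for percentages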

Limitations:

- These are often monotonic transformations, so they will not help with variability patterns that are non-monotonic.

- If the data is reasonably symmetric but heteroscedastic, a transformation can mitigate the heteroscedasticity but will undesirably skew the distribution, destroying the symmetry.

