Regression Analysis 2022


Correlation & Regression Analysis

Advanced Data Analysis, AY 2022-23

01 November 2022

Variance
Variance of a single variable represents the average amount that the data vary from the
mean.

$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$

The mean of the sample is represented by x̄, xi is the data point in question, and n is the number of observations.
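As a quick check, this formula can be computed by hand in R and compared against the built-in var() function (a minimal sketch using a made-up sample):

x <- c(2, 4, 4, 4, 5, 5, 7, 9)          # made-up sample
sum((x - mean(x))^2) / (length(x) - 1)  # variance from the formula

## [1] 4.571429

var(x)  # built-in estimate agrees

## [1] 4.571429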

Covariance
If there were a relationship between two variables, then as one variable deviates from its mean, the other variable should deviate from its mean in the same way or in the directly opposite way.
Calculating the covariance is a good way to assess whether two variables are related to
each other.

$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

A positive covariance indicates that as one variable deviates from the mean, the other
variable deviates in the same direction.
On the other hand, a negative covariance indicates that as one variable deviates from the
mean (e.g., increases), the other deviates from the mean in the opposite direction (e.g.,
decreases).
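The covariance formula can likewise be computed by hand and checked against R's built-in cov() (a minimal sketch with made-up data):

x <- c(1, 3, 5, 7, 9)
y <- c(2, 5, 4, 8, 10)
sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # covariance from the formula

## [1] 9.5

cov(x, y)  # built-in estimate agrees

## [1] 9.5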

R Code
We will use the advertising dataset (Advertising.csv).

data <- read.csv("Advertising.csv")


data$X <- NULL
cov(data$TV, data$sales)

## [1] 350.3902

Standardization & Correlation Coefficient
Covariance depends upon the scales of measurement used. So, covariance is not a
standardized measure.
This dependence on the scale of measurement is a problem because it means that we
cannot compare covariances in an objective way – so, we cannot say whether a covariance
is particularly large or small relative to another data set unless both data sets were
measured in the same units.
The standardized covariance is known as a correlation coefficient and is defined by

$$ r = \frac{\mathrm{cov}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} $$

The coefficient in this equation is known as the Pearson product-moment correlation coefficient or Pearson correlation coefficient.
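As a quick illustration with the advertising data loaded earlier, standardizing the covariance by hand reproduces cor():

cov(data$TV, data$sales) / (sd(data$TV) * sd(data$sales))

## [1] 0.7822244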

Interpretation
A coefficient of +1 indicates that the two variables are perfectly positively correlated: as one variable increases, the other increases by a proportionate amount.
Conversely, a coefficient of −1 indicates a perfect negative relationship: if one variable increases, the other decreases by a proportionate amount.
A coefficient of zero indicates no linear relationship at all: changes in one variable are not linearly associated with changes in the other.
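A tiny made-up illustration: any exact linear relationship yields a coefficient of exactly +1 or −1.

x <- 1:10
cor(x, 3 * x + 2)   # perfect positive linear relationship

## [1] 1

cor(x, -3 * x + 2)  # perfect negative linear relationship

## [1] -1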

R Code
cor(data$TV, data$sales)

## [1] 0.7822244

cor.matrix <- cor(data)


round(cor.matrix,2)

## TV radio newspaper sales
## TV 1.00 0.05 0.06 0.78
## radio 0.05 1.00 0.35 0.58
## newspaper 0.06 0.35 1.00 0.23
## sales 0.78 0.58 0.23 1.00

print(cor.matrix)

## TV radio newspaper sales
## TV 1.00000000 0.05480866 0.05664787 0.7822244
## radio 0.05480866 1.00000000 0.35410375 0.5762226
## newspaper 0.05664787 0.35410375 1.00000000 0.2282990
## sales 0.78222442 0.57622257 0.22829903 1.0000000
Correlation vs. Causality
Considerable caution must be taken when interpreting correlation coefficients because
they give no indication of the direction of causality.
So, in our example, although we can conclude that as expenditure on TV advertisements increases, sales increase, we cannot say that watching adverts on TV causes you to buy that product.

In any correlation, causality between two variables cannot be assumed because there
may be other measured or unmeasured variables affecting the results.
Even if we could ignore the third-variable problem described above, and we could
assume that the two correlated variables were the only important ones, the
correlation coefficient doesn’t indicate in which direction causality operates.

Coefficient of Determination (R²)
The correlation coefficient squared (known as the coefficient of determination, R²) is a measure of the amount of variability in one variable that is shared by the other.

cor.matrix^2

## TV radio newspaper sales
## TV 1.000000000 0.00300399 0.003208982 0.61187505
## radio 0.003003990 1.00000000 0.125389466 0.33203246
## newspaper 0.003208982 0.12538947 1.000000000 0.05212045
## sales 0.611875051 0.33203246 0.052120445 1.00000000

If we convert this value into a percentage (multiply by 100) we can say that TV expenditure shares 61.2% of the variability in sales.
So, although TV expenditure was highly correlated with sales, it can account for only 61.2% of the variation in sales. To put this value into perspective, this leaves 38.8% of the variability still to be accounted for by other variables.

Pearson Product-moment Correlation Coefficient
Assumptions

1 Level of Measurement: The two variables should be measured at the interval or ratio level.
2 Linear Relationship: There should exist a linear relationship between the two
variables.
3 Normality: Both variables should be roughly normally distributed.
4 Related Pairs: Each observation in the dataset should have a pair of values.
5 No Outliers: There should be no extreme outliers in the dataset.

Pearson’s Correlation Coefficient Test
H0 : The population correlation coefficient (ρ) is not significantly different from 0
HA : The population correlation coefficient (ρ) is significantly different from 0
t test statistic:

$$ t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} $$

The p-value (significance level) of the correlation can then be determined:

Degrees of freedom: df = n − 2, where n is the number of observations in the X and Y variables.
If the p-value is < 5%, then the correlation between X and Y is significant.
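As a sketch, the t statistic and its p-value can be computed directly from r and n for the advertising data; the result should agree with cor.test() on the next slide.

r.xy <- cor(data$TV, data$sales)
n.obs <- nrow(data)
t.stat <- r.xy * sqrt(n.obs - 2) / sqrt(1 - r.xy^2)
t.stat  # approximately 17.668, matching cor.test()
2 * pt(abs(t.stat), df = n.obs - 2, lower.tail = FALSE)  # two-sided p-value, far below 0.05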

R Code

cor.test(data$TV, data$sales, method = "pearson")

##
## Pearson's product-moment correlation
##
## data: data$TV and data$sales
## t = 17.668, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7218201 0.8308014
## sample estimates:
## cor
## 0.7822244

Report: There was a significant relationship between expenditure on TV advertisements and sales, r = .78, p (two-tailed) < .001.

Checking Normality Assumption
H0 : Variable follows a normal distribution
HA : Variable follows a distribution significantly different from a normal distribution
A p-value greater than α (say, 0.05) indicates no significant departure from normality.
shapiro.test(data$TV)

##
## Shapiro-Wilk normality test
##
## data: data$TV
## W = 0.94951, p-value = 1.693e-06
shapiro.test(data$sales)

##
## Shapiro-Wilk normality test
##
## data: data$sales
## W = 0.97603, p-value = 0.001683

We may conclude that the distributions of both TV and sales differ significantly from a normal distribution.
Spearman’s correlation coefficient
Spearman’s correlation coefficient, ρ, is a non-parametric statistic and so can be used
when the data have violated parametric assumptions such as non-normally distributed
data.
Spearman’s test works by first ranking the data, and then applying Pearson’s equation to
those ranks.
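This equivalence is easy to verify by hand: applying Pearson's cor() to the ranks reproduces the Spearman coefficient reported on the next slide.

cor(rank(data$TV), rank(data$sales))  # Pearson's r on the ranks

## [1] 0.8006144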

R Code

cor(data$TV, data$sales, method = "spearman")

## [1] 0.8006144

cor.test(data$TV, data$sales, exact = FALSE, method = "spearman")

##
## Spearman's rank correlation rho
##
## data: data$TV and data$sales
## S = 265841, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8006144

Kendall’s Tau (τ )
Kendall’s tau, τ , is another non-parametric correlation and it should be used rather than
Spearman’s coefficient when you have a small data set with a large number of tied ranks.
This means that if you rank all of the scores and many scores have the same rank, then
Kendall’s τ should be used.
Although Spearman’s statistic is the more popular of the two coefficients, there is much
to suggest that Kendall’s statistic is actually a better estimate of the correlation in the
population.

R Code

cor(data$TV, data$sales, method = "kendall")

## [1] 0.6219464

cor.test(data$TV, data$sales, method = "kendall")

##
## Kendall's rank correlation tau
##
## data: data$TV and data$sales
## z = 13.041, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6219464

Biserial and point-biserial correlations
These correlation coefficients are used when one of the two variables is dichotomous (i.e., it is categorical with only two categories) and the other variable is continuous (roughly normally distributed).
Point-biserial correlation coefficient (rpb ) is used when one variable is a discrete dichotomy
(e.g., gender).
The biserial correlation coefficient (rb ) is used when one variable is a continuous
dichotomy (e.g., passing or failing an exam).

R Code - Point Biserial Correlation

salary <- c(10.3, 18.4, 12.2, 14.6, 10.0, 22.3, 25.8, 15.1, 19.3, 18.4, 10.9)
gender <- c("M", "M", "F", "F", "M", "F", "F", "F", "F", "M", "M")
gender.fac <- factor(gender)
gender.fac

## [1] M M F F M F F F F M M
## Levels: F M

gender.num <- as.numeric(factor(gender))


gender.num

## [1] 2 2 1 1 2 1 1 1 1 2 2

R Code - Point Biserial Correlation

cor.test(salary,gender.num)

##
## Pearson's product-moment correlation
##
## data: salary and gender.num
## t = -1.5725, df = 9, p-value = 0.1503
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8323297 0.1879700
## sample estimates:
## cor
## -0.4642536

At the 5% level of significance, salary and gender do not have a significant correlation (rpb = −0.464, p > 0.05).

R Code - Biserial Correlation

spending <- c(28.8, 15.3, 19.4, 5.4, 18.6, 24.2,
              26.9, 21.3, 30.8, 14.6, 26.6, 25.8)
salary <- c(37.6, 63.3, 55.7, 56.4, 27.1, 51.9,
37.7, 59.5, 52.3, 29.5, 53.7, 32.7)
sal_group <- ifelse(salary > mean(salary), "High", "Low")
sal_group

## [1] "Low" "High" "High" "High" "Low" "High" "Low" "High" "High" "Low"
## [11] "High" "Low"

library(polycor)
polyserial(spending, sal_group)

## [1] 0.2252382

Polychoric Correlation
Polychoric correlation is a technique for estimating the correlation between two observed ordinal
variables.
As an illustration, take 20 customers who have used two products and have them rate the overall likeability of each product on a scale of 1 to 5, where:
1 denotes “dislike a lot,” 2 denotes “dislike a little,” 3 denotes “neither like nor dislike,” 4 denotes “like it a little,” and 5 denotes “like it a lot.”

library(polycor)
pr1 <- c(5, 5, 4, 2, 2, 3, 3, 5, 2, 5, 3, 5, 4, 5, 4, 4, 5, 4, 5, 5)
pr2 <- c(2, 2, 3, 4, 5, 3, 4, 2, 1, 4, 1, 5, 2, 2, 5, 5, 1, 5, 1, 2)
polychor(pr1, pr2)

## [1] -0.225602

Cramer’s V
Cramer’s V (sometimes referred to as Cramer’s ϕ and denoted as ϕc ) is a measure of association
between two nominal variables, giving a value between 0 and +1 (inclusive).

library(rcompanion)
x<- c(3, 1, 1, 3, 2, 1, 3, 2, 2, 1, 1, 3, 1, 3, 3, 2, 3, 3, 3, 3)
y<- c(2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1)
mat <- table(x,y)
colnames(mat) <- c("No", "Yes") # Has credit card
rownames(mat) <- c("Faculty", "Student", "Staff")
mat

## y
## x No Yes
## Faculty 3 3
## Student 3 1
## Staff 5 5

cramerV(mat)

## Cramer V
## 0.201
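Cramer's V can also be computed by hand from the chi-squared statistic as V = √(χ²/(n(k − 1))), where k is the smaller of the number of rows and columns (a sketch using the same table):

chi2 <- suppressWarnings(chisq.test(mat)$statistic)  # warning suppressed: small expected counts
k <- min(nrow(mat), ncol(mat))
sqrt(chi2 / (sum(mat) * (k - 1)))  # approximately 0.201, matching cramerV()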

Partial Correlation
A correlation between two variables in which the effects of other variables are held constant is
known as a partial correlation.

ex <- read.table("exam_Anxiety.dat", header=TRUE)


head(ex)

## Code Revise Exam Anxiety Gender
## 1 1 4 40 86.298 Male
## 2 2 11 65 88.716 Female
## 3 3 27 80 70.178 Male
## 4 4 53 80 61.312 Male
## 5 5 4 40 89.522 Male
## 6 6 22 70 60.506 Female

cor(ex[, c("Exam", "Anxiety", "Revise")])

## Exam Anxiety Revise
## Exam 1.0000000 -0.4409934 0.3967207
## Anxiety -0.4409934 1.0000000 -0.7092493
## Revise 0.3967207 -0.7092493 1.0000000

Partial Correlation
Exam performance was negatively related to exam anxiety.
Exam performance was positively related to revision time.
Revision time itself was negatively related to exam anxiety.

This scenario is complex, but given that we know that revision time is related to both exam
anxiety and exam performance, then if we want a pure measure of the relationship between exam
anxiety and exam performance we need to take account of the influence of revision time.

Partial Correlation

cor(ex[, c("Exam", "Anxiety", "Revise")])^2*100

## Exam Anxiety Revise
## Exam 100.00000 19.44752 15.73873
## Anxiety 19.44752 100.00000 50.30345
## Revise 15.73873 50.30345 100.00000

Exam anxiety accounts for 19.4% of the variance in exam performance, revision time accounts for 15.7% of the variance in exam performance, and revision time accounts for 50.3% of the variance in exam anxiety.
If revision time accounts for half of the variance in exam anxiety, then it seems feasible that at
least some of the 19.4% of variance in exam performance that is accounted for by anxiety is the
same variance that is accounted for by revision time.

As such, some of the variance in exam performance explained by exam anxiety is not unique and
can be accounted for by revision time.

R Code
library(ppcor)
pc<-pcor(ex[, c("Exam", "Anxiety", "Revise")])
pc

## $estimate
## Exam Anxiety Revise
## Exam 1.0000000 -0.2466658 0.1326783
## Anxiety -0.2466658 1.0000000 -0.6485301
## Revise 0.1326783 -0.6485301 1.0000000
##
## $p.value
## Exam Anxiety Revise
## Exam 0.00000000 1.244581e-02 1.837308e-01
## Anxiety 0.01244581 0.000000e+00 1.708019e-13
## Revise 0.18373076 1.708019e-13 0.000000e+00
##
## $statistic
## Exam Anxiety Revise
## Exam 0.000000 -2.545307 1.338617
## Anxiety -2.545307 0.000000 -8.519961
## Revise 1.338617 -8.519961 0.000000
##
## $n
## [1] 103
R Code

print(pc$estimate^2*100)

## Exam Anxiety Revise
## Exam 100.000000 6.084403 1.760352
## Anxiety 6.084403 100.000000 42.059126
## Revise 1.760352 42.059126 100.000000

Interpretation
The partial correlation between exam performance and exam anxiety is -.247, which is
considerably less than the correlation when the effect of revision time is not controlled for
(r = −.441).
This correlation is still statistically significant (its p-value is .012, which is still below .05).
In terms of variance, the value of R² for the partial correlation is .06, which means that exam anxiety can now account for only 6% of the variance in exam performance. When the effects of revision time were not controlled for, exam anxiety shared 19.4% of the variation in exam scores, so the inclusion of revision time has severely diminished the amount of variation in exam scores shared by anxiety.
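The same partial correlation can be reproduced by hand from the three pairwise correlations using r_xy.z = (r_xy − r_xz r_yz) / √((1 − r_xz²)(1 − r_yz²)):

r <- cor(ex[, c("Exam", "Anxiety", "Revise")])
(r["Exam", "Anxiety"] - r["Exam", "Revise"] * r["Anxiety", "Revise"]) /
  sqrt((1 - r["Exam", "Revise"]^2) * (1 - r["Anxiety", "Revise"]^2))

## [1] -0.2466658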

Visualizing Correlations - Scatterplot
library("ggpubr")
ggscatter(data, x = "TV", y = "sales",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "TV Expenditures", ylab = "sales")

[Figure: scatterplot of sales against TV expenditures with fitted regression line and confidence band; annotation: R = 0.78, p < 2.2e−16]
Visualizing Correlations - Scatterplot Matrix

library("PerformanceAnalytics")
chart.Correlation(data, histogram=TRUE, pch=19)

[Figure: scatterplot matrix of TV, radio, newspaper, and sales, with histograms on the diagonal, scatterplots below, and pairwise correlations above: TV–sales 0.78***, radio–sales 0.58***, newspaper–sales 0.23**, radio–newspaper 0.35***, TV–radio 0.055, TV–newspaper 0.057]
Visualizing Correlations - Correlogram

library(corrplot)
corrplot(cor.matrix, is.corr = FALSE, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)

[Figure: correlogram (upper triangle) of TV, radio, newspaper, and sales, with variables ordered by hierarchical clustering]
Regression Framework
The regression framework can be characterized in the following way:

We have one target variable (also response or dependent variable), y , that we are
interested in understanding or modeling, such as sales of a particular product, sale
price of a home etc.
We have a set of p predictor or independent variables, x1 , x2 , . . . , xp that we think
might be useful in predicting or modeling the target variable (the price of the
product, the competitor’s price, and so on).

Typically, a regression analysis is used for one (or more) of three purposes

1 modeling the relationship between x and y;
2 prediction of the target variable (forecasting); and
3 testing of hypotheses.

The Linear Regression Model
The data consist of n sets of observations {x1, x2, . . . , xp, y}, which represent a random sample from a larger population. It is assumed that these observations satisfy a linear relationship,

$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi} + \epsilon_i $$

where the coefficients β are unknown parameters, and the ϵi are random error terms.
The special case with p = 1 corresponds to the simple regression model.
A primary goal of a regression analysis is to estimate this relationship, or equivalently, to estimate the unknown parameters β.
The standard approach is least squares regression.

Fitted Value and Residual
Using least squares techniques, the unknown coefficients β are chosen by minimizing the sum of squared errors:

$$ \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}) \right]^2 $$

For any choice of estimated parameters β̂, the estimated expected response value given the observed predictor values equals

$$ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \hat{\beta}_2 x_{2i} + \cdots + \hat{\beta}_p x_{pi} $$

and is called the fitted value.
The difference between the observed value yi and the fitted value ŷi is called the residual.
The ordinary least squares (OLS) estimates minimize the sum of squares of the residuals.

Estimating β using Least Squares
Define the following matrix and vectors:

$$ \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} $$

The regression model is then

$$ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}. $$

Using multivariate calculus it can be shown that

$$ (\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}. $$

Estimating β using Least Squares
Therefore, the least squares estimates satisfy

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}. $$

The fitted values are then

$$ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \equiv \mathbf{H}\mathbf{y}, $$

where H = X(X′X)⁻¹X′ is the so-called “hat” matrix (since it takes y to ŷ). The residuals e = y − ŷ thus satisfy

$$ \mathbf{e} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{I} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}')\mathbf{y} = (\mathbf{I} - \mathbf{H})\mathbf{y}. $$

R Code
We will use the advertising dataset (Advertising.csv).

data <- read.csv("Advertising.csv")


data$X <- NULL
X <- cbind(1,as.matrix(data[,-4]))
y <- as.matrix(data$sales)
beta.hat <- solve((t(X)%*%X))%*%t(X)%*%y
print(beta.hat)

## [,1]
## 2.938889369
## TV 0.045764645
## radio 0.188530017
## newspaper -0.001037493

R Code

hat <- X%*%solve(t(X)%*%X)%*%t(X); dim(hat) # Hat matrix

## [1] 200 200

fitted <- hat%*%y; dim(fitted) # Fitted values

## [1] 200 1

residual <- (diag(nrow(y)) - hat)%*%y; summary(residual) # Residuals

## V1
## Min. :-8.8277
## 1st Qu.:-0.8908
## Median : 0.2418
## Mean : 0.0000
## 3rd Qu.: 1.1893
## Max. : 2.8292
Assumptions

1 The expected value of the errors is zero (E (ϵi ) = 0 for all i).
2 The variance of the errors is constant (V (ϵi ) = σ 2 for all i). This assumption of
constant variance is called homoscedasticity, and its violation (nonconstant
variance) is called heteroscedasticity.
3 The errors are uncorrelated with each other. This violation most often occurs in data that are ordered in time (time series data), where errors that are near each other in time are often similar to each other (such time-related correlation is called autocorrelation).
4 The errors are normally distributed.

Interpreting Regression Coefficients
β̂0 : The estimated expected value of the target variable when the predictors all equal zero.
The estimated coefficient for the jth predictor (j = 1, . . . , p) is interpreted in the following
way.
β̂j : The estimated expected change in the target variable associated with a one unit
change in the j-th predicting variable, holding all else in the model fixed.

Sum of Squares
Variability in the target variable, termed the corrected sum of squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$.
The variability left over after doing the regression, termed the residual sum of squares (RSS): $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Variability accounted for by doing the regression, termed the regression sum of squares: $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$.
The least squares estimates possess an important property:
Corrected sum of squares = Residual sum of squares + Regression sum of squares

R Code

css <- sum((data$sales - mean(data$sales))^2)
rss <- sum(residual^2)
regss <- sum((fitted - mean(data$sales))^2)

print(css)

## [1] 5417.149

print(rss+regss)

## [1] 5417.149

Measuring the Strength of Regression Relationship
A measure of the strength of the regression relationship is the ratio of the variation explained by the model to the total variation, i.e.,

$$ \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \equiv \frac{\text{Regression SS}}{\text{Corrected SS}} \equiv 1 - \frac{\text{Residual SS}}{\text{Corrected SS}} $$

This is the coefficient of determination, R². It can be shown that R² is the square of the correlation between the actual and fitted y.

regss/css

## [1] 0.8972106

cor(data$sales, fitted)^2

## [,1]
## [1,] 0.8972106

Adjusted R²
It can be shown that R² is biased upwards as an estimate of the population proportion of variability accounted for by the regression. The adjusted R² corrects this bias, and equals

$$ R_a^2 = R^2 - \frac{p}{n - p - 1} (1 - R^2) $$

Unless p is large relative to n − p − 1 (that is, unless the number of predictors is large relative to the sample size), R² and Ra² will be close to each other, and the choice of which to use is a minor concern.
Ra² provides an explicit tradeoff between the strength of the fit (the first term, with larger R² corresponding to a stronger fit and larger Ra²) and the complexity of the model (the second term, with larger p corresponding to more complexity and smaller Ra²).

R Code

n <- nrow(y)
p <- ncol(X) - 1 # One less due to column of ones
r2 <- regss/css
adj_r2 <- r2 - (p/(n-p-1))*(1-r2)
print(adj_r2)

## [1] 0.8956373

Variance of the Errors σ²
An unbiased estimate is provided by the residual mean square,

$$ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1} $$

Recall that the model assumes that the errors are normally distributed with standard deviation σ. This means that, roughly speaking, 95% of the time an observed y value falls within ±2σ of the expected response

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p. $$

E(y) can be estimated for any given set of x values using

$$ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_p x_p, $$

while the square root of the residual mean square, termed the residual standard error, provides an estimate of σ that can be used in constructing the rough prediction interval ŷ ± 2σ̂.

R Code

rms <- rss/(n-p-1)
rse <- sqrt(rms)
cat("Residual standard error:", rse, "on", n-p-1, "DF")

## Residual standard error: 1.68551 on 196 DF

x0 <- c(1, 125, 45, 55) # New observation

# predicted sales value for new observation
y.pred <- sum(beta.hat*x0)
c(y.pred-2*rse, y.pred+2*rse) # Rough 95% prediction interval (+/- 2 sigma-hat)

## [1] 13.71524 20.45728

Hypothesis Tests for β
There are two types of hypothesis tests related to the regression coefficients of immediate
interest.

1 Do any of the predictors provide predictive power for the target variable?

This is a test of the overall significance of the regression.

2 Given the other variables in the model, does a particular predictor provide additional
predictive power?

Test of the Overall Significance of the Regression

H0 : β1 = · · · = βp = 0
versus
HA : βj ≠ 0 for some j = 1, . . . , p.
The test of these hypotheses is the F-test,

$$ F = \frac{\text{Regression MS}}{\text{Residual MS}} = \frac{\text{Regression SS}/p}{\text{Residual SS}/(n - p - 1)} $$

This is referenced against a null F-distribution on (p, n − p − 1) degrees of freedom.

R Code

f.stat <- (regss/p)/(rss/(n-p-1))


f.df <- c(p, n-p-1)
fp.val <- pf(f.stat, p,n-p-1, lower.tail = FALSE)
cat("F-statistic:", round(f.stat,2),
"on DF", p, "and", n-p-1, "\n",
"p-value", fp.val )

## F-statistic: 570.27 on DF 3 and 196
## p-value 1.575227e-96

Test of the Significance of an Individual Coefficient

H0 : βj = 0, j = 1, . . . , p,
versus
HA : βj ≠ 0.
This is tested using a t-test,

$$ t_j = \frac{\hat{\beta}_j}{\widehat{\text{s.e.}}(\hat{\beta}_j)} $$

which is compared to a t-distribution on n − p − 1 degrees of freedom. The values of $\widehat{\text{s.e.}}(\hat{\beta}_j)$ are obtained as the square roots of the diagonal elements of V̂(β̂) = (X′X)⁻¹σ̂², where σ̂² is the residual mean square.
A t-test for the intercept can also be constructed, although this does not refer to a hypothesis about a predictor, but rather about whether the expected target is equal to a specified value if all of the predictors equal zero.

R Code

var.beta <- diag(solve(t(X)%*%X)*rms)


se.beta <- sqrt(var.beta)
tx <- beta.hat/se.beta
tp.val <- 2 * pt(abs(tx), df = n-p-1, lower.tail=FALSE)
cbind(tx, se.beta, tp.val)

## se.beta
## 9.4222884 0.311908236 1.267295e-17
## TV 32.8086244 0.001394897 1.509960e-81
## radio 21.8934961 0.008611234 1.505339e-54
## newspaper -0.1767146 0.005871010 8.599151e-01

Confidence Intervals for β
A confidence interval provides an alternative way of summarizing the degree of precision in the estimate of a regression parameter.
A 100 × (1 − α)% confidence interval for βj has the form

$$ \hat{\beta}_j \pm t_{\alpha/2,\, n-p-1}\, \widehat{\text{s.e.}}(\hat{\beta}_j) $$

where t_{α/2, n−p−1} is the appropriate critical value at two-sided level α for a t-distribution on n − p − 1 degrees of freedom.

R Code

alpha <- .05


t.cr <- qt(p=alpha/2, df=n-p-1, lower.tail=FALSE)
lb <- beta.hat - t.cr*se.beta
ub <- beta.hat + t.cr*se.beta
cbind(lb,ub)

## [,1] [,2]
## 2.32376228 3.55401646
## TV 0.04301371 0.04851558
## radio 0.17154745 0.20551259
## newspaper -0.01261595 0.01054097

Constructing Prediction Intervals
This interval provides guidance as to how precise ŷ0 is as a prediction of y for some particular specified value x0, where ŷ0 is determined by substituting the values x0 into the estimated regression equation.
The prediction interval is then

$$ \hat{y}_0 \pm t_{\alpha/2,\, n-p-1}\, \widehat{\text{s.e.}}(\hat{y}_0^P) $$

where

$$ \widehat{\text{s.e.}}(\hat{y}_0^P) = \sqrt{1 + x_0'(\mathbf{X}'\mathbf{X})^{-1} x_0}\; \hat{\sigma}. $$

R Code

# Here we construct prediction interval for y-hat w.r.t x0


se.pred <- sqrt(1+x0%*%solve(t(X)%*%X)%*%x0)*rse
print(se.pred)

## [,1]
## [1,] 1.701056

# 95% prediction interval


c(y.pred-t.cr*se.pred, y.pred+t.cr*se.pred)

## [1] 13.73154 20.44098

Confidence Interval for a Fitted Value
The prediction interval is used to provide an interval estimate for a prediction of y for one
member of the population with a particular value of x0 .
The confidence interval is used to provide an interval estimate for the true expected value
of y for all members of the population with a particular value of x0 .
The confidence interval for a fitted value is then

$$ \hat{y}_0 \pm t_{\alpha/2,\, n-p-1}\, \widehat{\text{s.e.}}(\hat{y}_0^F) $$

where

$$ \widehat{\text{s.e.}}(\hat{y}_0^F) = \sqrt{x_0'(\mathbf{X}'\mathbf{X})^{-1} x_0}\; \hat{\sigma}. $$

The confidence interval for a fitted value will always be narrower than the prediction interval (due to the absence of the extra σ̂² term in the equation).

R Code

# Here we construct a confidence interval for the fitted value at x0
se.fit <- sqrt(x0%*%solve(t(X)%*%X)%*%x0)*rse
print(se.fit)

## [,1]
## [1,] 0.2294476

# 95% confidence interval
c(y.pred-t.cr*se.fit, y.pred+t.cr*se.fit)

## [1] 16.63376 17.53876

Fitting a Linear Regression Model in R
lm.fit <- lm(sales ~ ., data = data)
summary(lm.fit)

##
## Call:
## lm(formula = sales ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16

Related R Functions

# To return coefficients from model object


coef(lm.fit)
# To return fitted values from model object
fitted(lm.fit)
# To return residuals from model object
residuals(lm.fit)
# To return studentized (standardized) residuals from model object
rstandard(lm.fit)
# To return confidence intervals from model object
confint(lm.fit)

# To predict y from new x


new_x <- data.frame("TV" = c(125,270.5),
"radio" = c(45,27.4),
"newspaper" = c(55,8.3))
predict(lm.fit, new_x, interval = "prediction")

## fit lwr upr
## 1 17.08626 13.73154 20.44098
## 2 20.47534 17.11241 23.83826

Diagnostic Plots

1 Residuals vs fitted
2 Normal Q-Q
3 Scale-location
4 Cook’s distance
5 Residuals vs. leverage
6 Cook’s distance vs leverage

R function used:

library(ggfortify)
autoplot(lm.fit, which = 1:6, ncol = 3, label.size = 3)

Diagnostics Plots
[Figure: six diagnostic plots for lm.fit (Residuals vs Fitted, Normal Q−Q, Scale−Location, Cook's distance, Residuals vs Leverage, Cook's dist vs Leverage); observations 6, 76, 131, and 179 are flagged in several panels]
Residuals vs Fitted Plot

A flat, horizontal zero (dashed) line and no distinctive pattern in the plot suggest that the assumption of a linear relationship is reasonable.
If the residuals are close to 0 for small fitted values and more spread out for large fitted values (fanning effect), or spread out for small fitted values and close to 0 for large fitted values (funneling effect), or vary in some complex fashion, this suggests non-constant error variance.
If no single residual stands apart from the basic random pattern of residuals, this suggests that there are no outliers.

Standardized Residuals & Leverages
Standardized residuals for observations i = 1, 2, . . . , n can be expressed as

$$ r_i = \frac{e_i}{s_e \sqrt{1 - h_i}}, $$

where the hi are the diagonal elements of the H matrix, also known as leverages, and it can be shown that, when n is large,

$$ r_i \sim N(0, 1). $$

Large values of hi indicate extreme values in X, which may influence the regression. Note that leverages depend only on X.
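Both quantities are available directly from the fitted model object (a quick sketch using lm.fit from above):

h <- hatvalues(lm.fit)      # leverages: diagonal elements of the hat matrix
r.std <- rstandard(lm.fit)  # standardized (internally studentized) residuals
head(sort(h, decreasing = TRUE), 3)  # observations with the most extreme x-values
sum(abs(r.std) > 2)                  # residuals flagged as potential outliers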

Normal Q-Q Plot
We can check the normality assumption of the errors using a Q-Q plot. Ideally all the points should fall approximately along the 45-degree reference (dashed) line. The greater the departure from this reference line, the greater the chance that the errors are non-normal.

Scale-location Plot
From the residuals vs fitted plot, it is difficult to identify an outlier (the residuals are not standardized).
In the scale-location plot, points would be flagged as outliers if their standardized residuals are greater than +2 or smaller than −2.
To meet the constant error variance assumption, the average magnitude of the standardized residuals (blue line) should be horizontal and the points should be randomly placed around the blue line.

Influential Points & Cook’s Distance
Observations that have high leverage and a large residual are known as influential points.
A common measure of influence is Cook's Distance, which is defined as

$$ D_i = \frac{1}{p}\, r_i^2\, \frac{h_i}{1 - h_i}. $$

A Cook's Distance is often considered large if

$$ D_i > \frac{4}{n}. $$
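A quick sketch of the 4/n rule applied to the fitted model:

d <- cooks.distance(lm.fit)  # Cook's distance for every observation
which(d > 4 / nrow(data))    # observations flagged as potentially influential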

Residuals vs. Leverage Plot
The spread of standardized residuals shouldn't change as a function of leverage; this suggests homoscedasticity.
Not all outliers are influential in linear regression analysis. Even though data may contain extreme values, those values might not be influential in determining the regression line. In this plot, points with high leverage may be influential.

Assessing Normality Using Shapiro-Wilk Test

# Shapiro-Wilk test of normality


shapiro.test(lm.fit$residuals)

##
## Shapiro-Wilk normality test
##
## data: lm.fit$residuals
## W = 0.91767, p-value = 3.939e-09

Assessing Homoscedasticity Using Breusch-Pagan
Test
H0 : Equal/constant variances
HA : Unequal/non-constant variances

# Breusch-Pagan test
car::ncvTest(lm.fit)

## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.355982, Df = 1, p = 0.020651

Assessing Independence Using Durbin-Watson Test
H0 : There is no correlation among the residuals.
HA : The residuals are autocorrelated.

# Durbin-Watson test
car::durbinWatsonTest(lm.fit)

## lag Autocorrelation D-W Statistic p-value
## 1 -0.04687792 2.083648 0.482
## Alternative hypothesis: rho != 0

Wage Dataset
The Current Population Survey (CPS) is used to supplement census information between
census years.
These data consist of a random sample of 534 persons from the CPS, with information on
wages and other characteristics of the workers.
We wish to determine

1 whether wages are related to these characteristics and
2 whether there is a gender gap in wages.

Wage Dataset - Variables

wage: Wage (in dollars per hour).
education: Number of years of education.
experience: Number of years of potential work experience.
age: Age in years.
ethnicity: Factor with levels “cauc”, “hispanic”, “other”.
region: Factor. Does the individual live in the South?
gender: Factor indicating gender.
occupation: Factor with levels “worker” (tradesperson or assembly line worker),
“technical” (technical or professional worker), “services” (service worker), “office”
(office and clerical worker), “sales” (sales worker), “management” (management and
administration).
sector: Factor with levels “manufacturing” (manufacturing or mining),
“construction”, “other”.
union: Factor. Does the individual work on a union job?
married: Factor. Is the individual married?

Wage Data

library(AER)
data("CPS1985")
glimpse(CPS1985)

## Rows: 534
## Columns: 11
## $ wage <dbl> 5.10, 4.95, 6.67, 4.00, 7.50, 13.07, 4.45, 19.47, 13.28, 8.~
## $ education <dbl> 8, 9, 12, 12, 12, 13, 10, 12, 16, 12, 12, 12, 8, 9, 9, 12, ~
## $ experience <dbl> 21, 42, 1, 4, 17, 9, 27, 9, 11, 9, 17, 19, 27, 30, 29, 37, ~
## $ age <dbl> 35, 57, 19, 22, 35, 28, 43, 27, 33, 27, 35, 37, 41, 45, 44,~
## $ ethnicity <fct> hispanic, cauc, cauc, cauc, cauc, cauc, cauc, cauc, cauc, c~
## $ region <fct> other, other, other, other, other, other, south, other, oth~
## $ gender <fct> female, female, male, male, male, male, male, male, male, m~
## $ occupation <fct> worker, worker, worker, worker, worker, worker, worker, wor~
## $ sector <fct> manufacturing, manufacturing, manufacturing, other, other, ~
## $ union <fct> no, no, no, no, no, yes, no, no, no, no, yes, yes, no, yes,~
## $ married <fct> yes, yes, no, no, yes, no, no, no, yes, no, yes, no, yes, n~

Wage Data - Variable Summary
summary(CPS1985)

## wage education experience age
## Min. : 1.000 Min. : 2.00 Min. : 0.00 Min. :18.00
## 1st Qu.: 5.250 1st Qu.:12.00 1st Qu.: 8.00 1st Qu.:28.00
## Median : 7.780 Median :12.00 Median :15.00 Median :35.00
## Mean : 9.024 Mean :13.02 Mean :17.82 Mean :36.83
## 3rd Qu.:11.250 3rd Qu.:15.00 3rd Qu.:26.00 3rd Qu.:44.00
## Max. :44.500 Max. :18.00 Max. :55.00 Max. :64.00
## ethnicity region gender occupation sector
## cauc :440 south:156 male :289 worker :156 manufacturing: 99
## hispanic: 27 other:378 female:245 technical :105 construction : 24
## other : 67 services : 83 other :411
## office : 97
## sales : 38
## management: 55
## union married
## no :438 no :184
## yes: 96 yes:350
##
##
##
##
Predictors with Only Two Levels
If a qualitative predictor (also known as a factor) has only two levels, or possible values, then incorporating it into a regression model is very simple. We simply create an indicator or dummy variable that takes on two possible numerical values. For example, based on gender, we can create a new variable that takes the form

$$ x_i = \begin{cases} 0, & \text{if male} \\ 1, & \text{if female} \end{cases} $$

and use this variable as a predictor in the regression equation. This results in the model

$$ \log(\text{wage})_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \epsilon_i, & \text{if male} \\ \beta_0 + \beta_1 + \epsilon_i, & \text{if female} \end{cases} $$

Now β0 can be interpreted as the average log wage for males, and β1 as the average difference in log wages between females and males.
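A minimal sketch of this coding: lm() builds the indicator automatically (with the first factor level, male, as the reference), and coding it by hand reproduces Model 1 on the next slide.

head(model.matrix(~ gender, data = CPS1985), 3)           # automatic dummy coding
CPS1985$female <- as.numeric(CPS1985$gender == "female")  # hand-coded indicator
coef(lm(log(wage) ~ female, data = CPS1985))              # same estimates as log(wage) ~ gender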

Model 1
mod1 <- lm(log(wage) ~ gender, data = CPS1985)
summary(mod1)

##
## Call:
## lm(formula = log(wage) ~ gender, data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16529 -0.37589 0.00662 0.36855 1.86145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.16529 0.03032 71.411 < 2e-16 ***
## genderfemale -0.23125 0.04477 -5.166 3.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5155 on 532 degrees of freedom
## Multiple R-squared: 0.04776, Adjusted R-squared: 0.04597
## F-statistic: 26.69 on 1 and 532 DF, p-value: 3.39e-07

Qualitative Predictors with More than Two Levels
When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the ethnicity variable, we create two dummy variables.

levels(CPS1985$ethnicity)

## [1] "cauc" "hispanic" "other"


$$ x_{i1} = \begin{cases} 1, & \text{if hispanic} \\ 0, & \text{if not hispanic} \end{cases} \qquad x_{i2} = \begin{cases} 1, & \text{if other} \\ 0, & \text{if caucasian or hispanic (not other)} \end{cases} $$

Qualitative Predictors with More than Two Levels
We will use these variables as predictors in the regression equation. This results in the model

$$ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \epsilon_i, & \text{if caucasian} \\ \beta_0 + \beta_1 + \epsilon_i, & \text{if hispanic} \\ \beta_0 + \beta_2 + \epsilon_i, & \text{if other} \end{cases} $$
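A sketch of the coding R generates automatically: "cauc", the first factor level, is the baseline absorbed into the intercept.

unique(model.matrix(~ ethnicity, data = CPS1985))  # the three distinct coding rows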

Model 2
mod2 <- lm(log(wage) ~ ethnicity, data = CPS1985)
summary(mod2)

##
## Call:
## lm(formula = log(wage) ~ ethnicity, data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08810 -0.38335 -0.00865 0.36526 1.70739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.08810 0.02499 83.549 < 2e-16 ***
## ethnicityhispanic -0.26955 0.10394 -2.593 0.00977 **
## ethnicityother -0.12177 0.06875 -1.771 0.07710 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5242 on 531 degrees of freedom
## Multiple R-squared: 0.0169, Adjusted R-squared: 0.0132
## F-statistic: 4.565 on 2 and 531 DF, p-value: 0.01083
Model 3
mod3 <- lm(log(wage) ~ education+experience+age+gender, data = CPS1985)
summary(mod3)

##
## Call:
## lm(formula = log(wage) ~ education + experience + age + gender,
## data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15564 -0.30705 0.00479 0.30833 1.99432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.15357 0.69387 1.663 0.097 .
## education 0.17746 0.11371 1.561 0.119
## experience 0.09234 0.11375 0.812 0.417
## age -0.07961 0.11365 -0.700 0.484
## genderfemale -0.25736 0.03948 -6.519 1.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4525 on 529 degrees of freedom
## Multiple R-squared: 0.2703, Adjusted R-squared: 0.2648
## F-statistic: 48.99 on 4 and 529 DF, p-value: < 2.2e-16

Checking Collinearity
library(GGally)
ggpairs(CPS1985 %>% dplyr::select(education,experience,age,gender))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: ggpairs scatterplot matrix of education, experience, age, and gender; notable correlations: experience–age 0.978***, education–experience −0.353***, education–age −0.150***]
Variance Inflation Factor (VIF) and Tolerance
It is a measure of how much the variance of the estimated regression coefficient β̂k is “inflated” by the existence of correlation among the predictor variables in the model.
A VIF of 1 means that there is no correlation between the k-th predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all.
The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
Tolerance is the percent of variance in the predictor that cannot be accounted for by the other predictors.

Variance Inflation Factor (VIF) and Tolerance

Regress the k-th predictor on the rest of the predictors in the model and compute R²k. Then

$$ \text{VIF}_k = \frac{1}{1 - R_k^2} = \frac{1}{\text{Tolerance}} $$

library(mctest)
vif(mod3)

## education experience age gender
## 230.224482 5162.059937 4623.490407 1.009274

1/vif(mod3) #tolerance

## education experience age gender
## 0.0043435867 0.0001937211 0.0002162868 0.9908111528
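The VIF for a single predictor can be reproduced by hand from the auxiliary regression described above (a sketch for experience; the result should match vif(mod3)):

r2.k <- summary(lm(experience ~ education + age + gender, data = CPS1985))$r.squared
1 / (1 - r2.k)  # approximately 5162, matching the VIF for experience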

Farrar - Glauber Test (Overall Collinearity Check)

omcdiag(mod3)

##
## Call:
## omcdiag(mod = mod3)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.0002 1
## Farrar Chi-Square: 4553.6699 1
## Red Indicator: 0.4311 0
## Sum of Lambda Inverse: 10016.7841 1
## Theil's Method: 2.1935 1
## Condition Number: 556.6117 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test

Farrar - Glauber Test (Overall Collinearity Check)
The standardized determinant |X′X| is found to be 0.0002, which is very small.
Farrar - Glauber Test:

$$ H_0 : R^2_{x_1 x_2 \ldots x_k} = 0 \quad \text{vs.} \quad H_A : R^2_{x_1 x_2 \ldots x_k} \neq 0 $$

The χ² test statistic is 4553.6699 and is highly significant, thereby implying the presence of multicollinearity in the model specification.
This motivates us to go to the next step of the Farrar - Glauber test (the F-test) for the location of the multicollinearity.

Farrar - Glauber Test (Individual Collinearity Check)
imcdiag(mod3)

##
## Call:
## imcdiag(mod = mod3)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein
## education 230.2245 0.0043 40496.3252 60859.1001 0.0659 222.0259 1
## experience 5162.0599 0.0002 911787.2555 1370261.4132 0.0139 4978.2320 1
## age 4623.4904 0.0002 816639.9719 1227271.2031 0.0147 4458.8417 1
## genderfemale 1.0093 0.9908 1.6384 2.4623 0.9954 0.9733 0
## IND1 IND2
## education 0.0000 1.3256
## experience 0.0000 1.3311
## age 0.0000 1.3311
## genderfemale 0.0056 0.0122
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## education , experience , age , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.2703
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================

Farrar - Glauber Test (Individual Collinearity Check)
The above output shows that education, experience, and age exhibit multicollinearity. Also, the VIF values are very high for these variables. Finally, let's examine the pattern of multicollinearity and conduct t-tests for the correlation coefficients.

education<-CPS1985[,2]
experience<-CPS1985[,3]
age<-CPS1985[,4]
gender<-CPS1985[,7]
x<- cbind(education,experience,age,gender)

library(ppcor)
pcor(x)

t-test for correlation coefficients
## $estimate
## education experience age gender
## education 1.00000000 -0.99777529 0.99751429 0.05316235
## experience -0.99777529 1.00000000 0.99988864 0.05233923
## age 0.99751429 0.99988864 1.00000000 -0.05113531
## gender 0.05316235 0.05233923 -0.05113531 1.00000000
##
## $p.value
## education experience age gender
## education 0.0000000 0.000000 0.0000000 0.2208851
## experience 0.0000000 0.000000 0.0000000 0.2281270
## age 0.0000000 0.000000 0.0000000 0.2390205
## gender 0.2208851 0.228127 0.2390205 0.0000000
##
## $statistic
## education experience age gender
## education 0.000000 -344.556354 325.901918 1.225622
## experience -344.556354 0.000000 1542.494566 1.206593
## age 325.901918 1542.494566 0.000000 -1.178765
## gender 1.225622 1.206593 -1.178765 0.000000
##
## $n
## [1] 534
##
## $gp
## [1] 2
##
## $method
## [1] "pearson"

Findings & Remedial Measures
As expected, the high partial correlation between ‘age’ and ‘experience’ is found to be highly statistically significant. The same is true for ‘education – experience’ and ‘education – age’.
As a remedial measure, we can build a model by excluding ‘experience’, estimate the
model and go for further diagnosis for the presence of multicollinearity.

mod4 <- lm(log(wage) ~ education+age+gender, data = CPS1985)


vif(mod4)

## education age gender
## 1.023228 1.029679 1.006509

Model 4: Result

Table 1: Model 4 (Excluding experience variable)

Dependent variable:
log(wage)
education 0.085∗∗∗ (0.008)
age 0.013∗∗∗ (0.002)
genderfemale −0.256∗∗∗ (0.039)
Constant 0.600∗∗∗ (0.126)
Observations 534
R2 0.269
Adjusted R2 0.265
Residual Std. Error 0.452 (df = 530)
F Statistic 65.141∗∗∗ (df = 3; 530)
Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01

