Regression Analysis 2022
01 November 2022
Advanced Data Analysis, AY 2022-23 Correlation & Regression Analysis 01 November 2022 1 / 92
Variance
Variance of a single variable represents the average amount that the data vary from the
mean.
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} $$
The sample mean is represented by x̄, xi is the data point in question, and n is the number of observations.
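The formula can be checked numerically. A minimal sketch with made-up data (the vector x is an assumption, not from the slides):

```r
# Sample variance: average squared deviation from the mean, divided by n - 1
x <- c(4, 7, 9, 10, 15)
n <- length(x)
s2_manual <- sum((x - mean(x))^2) / (n - 1)
s2_manual
var(x)  # R's built-in sample variance; should agree with s2_manual
```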
Covariance
If there were a relationship between these two variables, then as one variable deviates
from its mean, the other variable should deviate from its mean in the same or the directly
opposite way.
Calculating the covariance is a good way to assess whether two variables are related to
each other.
$$ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} $$
A positive covariance indicates that as one variable deviates from the mean, the other
variable deviates in the same direction.
On the other hand, a negative covariance indicates that as one variable deviates from the
mean (e.g., increases), the other deviates from the mean in the opposite direction (e.g.,
decreases).
R Code
We will use the advertising dataset (Advertising.csv).
## [1] 350.3902
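The code for this output did not survive extraction; it was presumably a call such as `cov(data$TV, data$sales)` on the advertising data. A self-contained sketch with toy vectors (x and y are assumptions) shows the same computation:

```r
# Manual covariance: sum of cross-deviations divided by n - 1
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)
manual
cov(x, y)  # built-in covariance; should match `manual`
```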
Standardization & Correlation Coefficient
Covariance depends upon the scales of measurement used. So, covariance is not a
standardized measure.
This dependence on the scale of measurement is a problem because it means that we
cannot compare covariances in an objective way – so, we cannot say whether a covariance
is particularly large or small relative to another data set unless both data sets were
measured in the same units.
The standardized covariance is known as a correlation coefficient and is defined by
$$ r = \frac{\operatorname{cov}(x, y)}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} $$
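Dividing the covariance by the product of the standard deviations removes the dependence on units. A short sketch with toy vectors (assumed, not from the slides):

```r
# Correlation as standardized covariance
x <- c(1, 3, 5, 7, 9)
y <- c(2, 3, 7, 8, 10)
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_manual
cor(x, y)  # built-in Pearson correlation; should match r_manual
```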
Interpretation
A coefficient of +1 indicates that the two variables are perfectly positively correlated: as one variable increases, the other increases by a proportionate amount.
Conversely, a coefficient of -1 indicates a perfect negative relationship: if one variable
increases, the other decreases by a proportionate amount.
A coefficient of zero indicates no linear relationship at all: changes in one variable are not linearly associated with changes in the other.
R Code
cor(data$TV, data$sales)
## [1] 0.7822244
cor.matrix <- cor(data)  # pairwise correlations among all variables
print(cor.matrix)
In any correlation, causality between two variables cannot be assumed because there
may be other measured or unmeasured variables affecting the results.
Even if we could ignore the third-variable problem described above, and we could
assume that the two correlated variables were the only important ones, the
correlation coefficient doesn’t indicate in which direction causality operates.
Coefficient of Determination (R²)
The correlation coefficient squared (known as the coefficient of determination, R²) is a
measure of the amount of variability in one variable that is shared by the other.
cor.matrix^2
If we convert this value into a percentage (multiply by 100), we can say that TV
expenditure shares 61.2% of the variability in sales.
So, although TV expenditure was highly correlated with sales, it can account for
only 61.2% of the variation in sales. To put this value into perspective, this leaves
38.8% of the variability still to be accounted for by other variables.
Pearson Product-moment Correlation Coefficient
Assumptions
Pearson’s Correlation Coefficient Test
H0 : The population correlation coefficient (ρ) equals 0
HA : The population correlation coefficient (ρ) is different from 0
t test statistic:
$$ t = \frac{r \sqrt{n-2}}{\sqrt{1-r^2}} $$
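Plugging in the TV and sales values from the cor.test output on the next slide (r = 0.7822, n = 200) reproduces the reported statistic:

```r
# t statistic and two-sided p-value for Pearson's r
r <- 0.7822244
n <- 200
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat                               # approximately 17.67, matching cor.test
2 * pt(-abs(t_stat), df = n - 2)     # two-sided p-value on n - 2 df
```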
R Code
##
## Pearson's product-moment correlation
##
## data: data$TV and data$sales
## t = 17.668, df = 198, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7218201 0.8308014
## sample estimates:
## cor
## 0.7822244
Checking Normality Assumption
H0 : Variable follows a normal distribution
H1 : Variable follows a distribution significantly different from normal distribution
A p-value greater than α (say, 0.05) indicates no significant departure from a normal distribution.
shapiro.test(data$TV)
##
## Shapiro-Wilk normality test
##
## data: data$TV
## W = 0.94951, p-value = 1.693e-06
shapiro.test(data$sales)
##
## Shapiro-Wilk normality test
##
## data: data$sales
## W = 0.97603, p-value = 0.001683
We may conclude that both TV and sales deviate significantly from a normal distribution.
Spearman’s correlation coefficient
Spearman’s correlation coefficient, ρ, is a non-parametric statistic, so it can be used
when the data violate parametric assumptions, such as when the data are non-normally
distributed.
Spearman’s test works by first ranking the data, and then applying Pearson’s equation to
those ranks.
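The rank-then-Pearson procedure can be verified directly. A sketch with toy vectors (assumed data, no tied ranks):

```r
# Spearman's rho equals Pearson's r applied to the ranks
x <- c(10, 20, 35, 40, 55)
y <- c(3, 1, 4, 5, 9)
cor(rank(x), rank(y))             # Pearson's r on the ranks
cor(x, y, method = "spearman")    # built-in Spearman; should match exactly
```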
R Code
## [1] 0.8006144
##
## Spearman's rank correlation rho
##
## data: data$TV and data$sales
## S = 265841, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8006144
Kendall’s Tau (τ )
Kendall’s tau, τ , is another non-parametric correlation and it should be used rather than
Spearman’s coefficient when you have a small data set with a large number of tied ranks.
This means that if you rank all of the scores and many scores have the same rank, then
Kendall’s τ should be used.
Although Spearman’s statistic is the more popular of the two coefficients, there is much
to suggest that Kendall’s statistic is actually a better estimate of the correlation in the
population.
R Code
## [1] 0.6219464
##
## Kendall's rank correlation tau
##
## data: data$TV and data$sales
## z = 13.041, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6219464
Biserial and point-biserial correlations
These correlation coefficients are used when one of the two variables is dichotomous (i.e.,
it is categorical with only two categories) and the other variable is continuous (roughly
normally distributed).
Point-biserial correlation coefficient (rpb ) is used when one variable is a discrete dichotomy
(e.g., gender).
The biserial correlation coefficient (rb ) is used when one variable is a continuous
dichotomy (e.g., passing or failing an exam).
R Code - Point Biserial Correlation
salary <- c(10.3, 18.4, 12.2, 14.6, 10.0, 22.3, 25.8, 15.1, 19.3, 18.4, 10.9)
gender <- c("M", "M", "F", "F", "M", "F", "F", "F", "F", "M", "M")
gender.fac <- factor(gender)
gender.fac
## [1] M M F F M F F F F M M
## Levels: F M
gender.num <- as.numeric(gender.fac)
gender.num
## [1] 2 2 1 1 2 1 1 1 1 2 2
R Code - Point Biserial Correlation
cor.test(salary,gender.num)
##
## Pearson's product-moment correlation
##
## data: salary and gender.num
## t = -1.5725, df = 9, p-value = 0.1503
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8323297 0.1879700
## sample estimates:
## cor
## -0.4642536
At the 5% level of significance, salary and gender do not have a significant correlation (rpb = −0.464,
p > 0.05).
R Code - Biserial Correlation
## [1] "Low" "High" "High" "High" "Low" "High" "Low" "High" "High" "Low"
## [11] "High" "Low"
library(polycor)
polyserial(spending, sal_group)
## [1] 0.2252382
Polychoric Correlation
Polychoric correlation is a technique for estimating the correlation between two observed ordinal
variables.
Take 20 customers who have used the product and have them rate their overall likeability on a
scale of 1 to 5, where:
1 denotes “dislike a lot,” 2 denotes “dislike a little,” 3 denotes “neither like nor dislike,” 4 denotes
“like it a little,” and 5 denotes “like it a lot.”
library(polycor)
pr1 <- c(5, 5, 4, 2, 2, 3, 3, 5, 2, 5, 3, 5, 4, 5, 4, 4, 5, 4, 5, 5)
pr2 <- c(2, 2, 3, 4, 5, 3, 4, 2, 1, 4, 1, 5, 2, 2, 5, 5, 1, 5, 1, 2)
polychor(pr1, pr2)
## [1] -0.225602
Cramer’s V
Cramer’s V (sometimes referred to as Cramer’s ϕ and denoted as ϕc ) is a measure of association
between two nominal variables, giving a value between 0 and +1 (inclusive).
library(rcompanion)
x<- c(3, 1, 1, 3, 2, 1, 3, 2, 2, 1, 1, 3, 1, 3, 3, 2, 3, 3, 3, 3)
y<- c(2, 2, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 2, 1, 2, 1, 1, 1, 2, 1)
mat <- table(x,y)
colnames(mat) <- c("No", "Yes") # Having credit card
rownames(mat) <- c("Faculty", "Student", "Staff")
mat
## y
## x No Yes
## Faculty 3 3
## Student 3 1
## Staff 5 5
cramerV(mat)
## Cramer V
## 0.201
Partial Correlation
A correlation between two variables in which the effects of other variables are held constant is
known as a partial correlation.
Partial Correlation
Exam performance was negatively related to exam anxiety.
Exam performance was positively related to revision time.
Revision time itself was negatively related to exam anxiety.
This scenario is complex, but given that we know that revision time is related to both exam
anxiety and exam performance, then if we want a pure measure of the relationship between exam
anxiety and exam performance we need to take account of the influence of revision time.
Partial Correlation
Exam anxiety accounts for 19.4% of the variance in exam performance, revision time
accounts for 15.7% of the variance in exam performance, and revision time accounts for
50.2% of the variance in exam anxiety.
If revision time accounts for half of the variance in exam anxiety, then it seems feasible that at
least some of the 19.4% of variance in exam performance that is accounted for by anxiety is the
same variance that is accounted for by revision time.
As such, some of the variance in exam performance explained by exam anxiety is not unique and
can be accounted for by revision time.
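This adjustment can be computed from the three pairwise correlations. The sketch below uses the correlations implied by the shared variances quoted above (signs are assumptions taken from the earlier slides):

```r
# Partial correlation of exam performance and anxiety, controlling for revision time
r_xy <- -0.441   # performance-anxiety (19.4% shared variance)
r_yz <-  0.397   # performance-revision time (15.7% shared variance)
r_xz <- -0.709   # anxiety-revision time (50.2% shared variance)

# First-order partial correlation formula
r_xy.z <- (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
r_xy.z   # roughly -0.25, in line with the pcor() estimate of -0.247
```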
R Code
library(ppcor)
pc<-pcor(ex[, c("Exam", "Anxiety", "Revise")])
pc
## $estimate
## Exam Anxiety Revise
## Exam 1.0000000 -0.2466658 0.1326783
## Anxiety -0.2466658 1.0000000 -0.6485301
## Revise 0.1326783 -0.6485301 1.0000000
##
## $p.value
## Exam Anxiety Revise
## Exam 0.00000000 1.244581e-02 1.837308e-01
## Anxiety 0.01244581 0.000000e+00 1.708019e-13
## Revise 0.18373076 1.708019e-13 0.000000e+00
##
## $statistic
## Exam Anxiety Revise
## Exam 0.000000 -2.545307 1.338617
## Anxiety -2.545307 0.000000 -8.519961
## Revise 1.338617 -8.519961 0.000000
##
## $n
R Code
print(pc$estimate^2*100)
Interpretation
The partial correlation between exam performance and exam anxiety is -.247, which is
considerably less than the correlation when the effect of revision time is not controlled for
(r = −.441).
This correlation is still statistically significant (its p-value is .012, which is still below .05).
In terms of variance, the value of R² for the partial correlation is .06, which means that
exam anxiety can now account for only 6% of the variance in exam performance. When the
effects of revision time were not controlled for, exam anxiety shared 19.4% of the variation
in exam scores, and so the inclusion of revision time has severely diminished the amount of
variation in exam scores shared by anxiety.
Visualizing Correlations - Scatterplot
library("ggpubr")
ggscatter(data, x = "TV", y = "sales",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "TV Expenditures", ylab = "sales")
[Scatterplot of sales against TV expenditures with fitted regression line, 95% confidence band, and Pearson correlation annotation]
Visualizing Correlations - Scatterplot Matrix
library("PerformanceAnalytics")
chart.Correlation(data, histogram=TRUE, pch=19)
[Scatterplot matrix of TV, radio, newspaper, and sales: histograms on the diagonal, pairwise scatterplots below, and pairwise correlations above (TV–sales: 0.78***, radio–sales: 0.58***, newspaper–sales: 0.23**, radio–newspaper: 0.35***, TV–radio: 0.055, TV–newspaper: 0.057)]
Visualizing Correlations - Correlogram
library(corrplot)
corrplot(cor.matrix, is.corr = FALSE, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
[Correlogram (upper triangle, hierarchically clustered) of the correlation matrix for TV, radio, newspaper, and sales]
Regression Framework
The regression framework can be characterized in the following way:
We have one target variable (also response or dependent variable), y , that we are
interested in understanding or modeling, such as sales of a particular product, sale
price of a home etc.
We have a set of p predictor or independent variables, x1 , x2 , . . . , xp that we think
might be useful in predicting or modeling the target variable (the price of the
product, the competitor’s price, and so on).
Typically, a regression analysis is used for one (or more) of three purposes
The Linear Regression Model
The data consist of n sets of observations {x1 , x2 , . . . , xp , y }, which represent a random
sample from a larger population. It is assumed that these observations satisfy a linear
relationship,
$$ y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i, $$
where the coefficients β are unknown parameters, and the ϵi are random error terms.
The special case with p = 1 corresponds to the simple regression model.
A primary goal of a regression analysis is to estimate this relationship, or equivalently, to
estimate the unknown parameters β.
The standard approach is least squares regression.
Fitted Value and Residual
Using least squares techniques, the unknown coefficients β are chosen by minimizing the
sum of squared errors:
$$ \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_p x_{pi}) \right]^2 $$
For any choice of estimated parameters β̂, the estimated expected response value given
the observed predictor values equals
$$ \hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{1i} + \cdots + \hat\beta_p x_{pi}. $$
Estimating β using Least Squares
Let X be the n × (p + 1) design matrix (with a leading column of ones), y the n × 1
response vector, and β̂ the vector of estimated coefficients. The least squares estimates
satisfy the normal equations
(X′ X)β̂ = X′ y.
Estimating β using Least Squares
Therefore, the least squares estimates satisfy
β̂ = (X′ X)−1 X′ y.
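The closed-form solution can be checked against lm(). A self-contained sketch with simulated data (an assumption, not the advertising data):

```r
# Least squares via the normal equations, compared with lm()
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 0.5 * x1 - 0.3 * x2 + rnorm(n)

X <- cbind(1, x1, x2)                      # design matrix with a column of ones
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  # solve (X'X) beta = X'y
beta_hat
coef(lm(y ~ x1 + x2))                      # should match beta_hat
```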
R Code
We will use the advertising dataset (Advertising.csv).
## [,1]
## 2.938889369
## TV 0.045764645
## radio 0.188530017
## newspaper -0.001037493
R Code
## [1] 200 1
## V1
## Min. :-8.8277
## 1st Qu.:-0.8908
## Median : 0.2418
## Mean : 0.0000
## 3rd Qu.: 1.1893
## Max. : 2.8292
Assumptions
1 The expected value of the errors is zero (E (ϵi ) = 0 for all i).
2 The variance of the errors is constant (V (ϵi ) = σ 2 for all i). This assumption of
constant variance is called homoscedasticity, and its violation (nonconstant
variance) is called heteroscedasticity.
3 The errors are uncorrelated with each other. This violation most often occurs in data
that are ordered in time (time series data), where errors that are near each other in
time are often similar to each other (such time-related correlation is called
autocorrelation).
4 The errors are normally distributed.
Interpreting Regression Coefficients
β̂0 : The estimated expected value of the target variable when the predictors all equal zero.
The estimated coefficient for the jth predictor (j = 1, . . . , p) is interpreted in the following
way.
β̂j : The estimated expected change in the target variable associated with a one unit
change in the j-th predicting variable, holding all else in the model fixed.
Sum of Squares
Variability in the target variable, termed the corrected sum of squares: $\sum_{i=1}^{n} (y_i - \bar{y})^2$.
The variability left over after doing the regression, termed the residual sum of squares (RSS): $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
Variability accounted for by doing the regression, termed the regression sum of squares: $\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$.
The least squares estimates possess an important property:
Corrected sum of squares = Residual sum of squares + Regression sum of squares
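This identity can be verified on any fitted model. A self-contained sketch with simulated data (an assumption, not the advertising data):

```r
# Checking Corrected SS = Residual SS + Regression SS
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

css   <- sum((y - mean(y))^2)            # corrected (total) sum of squares
rss   <- sum(resid(fit)^2)               # residual sum of squares
regss <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares
all.equal(css, rss + regss)              # TRUE up to floating point
```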
R Code
print(css)
## [1] 5417.149
print(rss+regss)
## [1] 5417.149
Measuring the Strength of Regression Relationship
A measure of the strength of the regression relationship can be the ratio of variation
explained by the model to the total variation, i.e.,
$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \equiv \frac{\text{Regression SS}}{\text{Corrected SS}} \equiv 1 - \frac{\text{Residual SS}}{\text{Corrected SS}} $$
regss/css
## [1] 0.8972106
cor(data$sales, fitted)^2
## [,1]
## [1,] 0.8972106
Adjusted R²
It can be shown that R² is biased upwards as an estimate of the population proportion of
variability accounted for by the regression. The adjusted R² corrects this bias, and equals
$$ R_a^2 = R^2 - \frac{p}{n-p-1}\,(1 - R^2) $$
Unless p is large relative to n − p − 1 (that is, unless the number of predictors is large
relative to the sample size), R² and Ra² will be close to each other, and the choice of
which to use is a minor concern.
Ra² provides an explicit tradeoff between the strength of the fit (the first term, with larger
R² corresponding to stronger fit and larger Ra²) and the complexity of the model (the
second term, with larger p corresponding to more complexity and smaller Ra²).
R Code
n <- nrow(y)
p <- ncol(X) - 1 # One less due to column of ones
r2 <- regss/css
adj_r2 <- r2 - (p/(n-p-1))*(1-r2)
print(adj_r2)
## [1] 0.8956373
Variance of the Errors σ²
An unbiased estimate is provided by the residual mean square,
$$ \hat\sigma^2 = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-p-1} $$
Recall that the model assumes that the errors are normally distributed with standard
deviation σ. This means that, roughly speaking, 95% of the time an observed y value falls
within ±2σ of the expected response
E (y ) = β0 + β1 x1 + β2 x2 + . . . + βp xp .
The square root of the residual mean square, termed the residual standard error of
the estimate, provides an estimate of σ that can be used in constructing this rough
prediction interval ±2σ̂.
R Code
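The code for this slide did not survive extraction. A stand-in sketch using the built-in mtcars data (an assumption, not the advertising data):

```r
# Residual mean square and residual standard error by hand
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- 2  # number of predictors

sigma2_hat <- sum(resid(fit)^2) / (n - p - 1)  # unbiased estimate of sigma^2
sqrt(sigma2_hat)      # residual standard error
summary(fit)$sigma    # should match
```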
Hypothesis Tests for β
There are two types of hypothesis tests related to the regression coefficients of immediate
interest.
1 Do any of the predictors provide predictive power for the target variable?
2 Given the other variables in the model, does a particular predictor provide additional
predictive power?
Test of the Overall Significance of the Regression
H0 : β1 = . . . = βp = 0
versus
HA : Some βj ̸= 0, j = 1, . . . , p.
The test of these hypotheses is the F -test,
$$ F = \frac{\text{Regression MS}}{\text{Residual MS}} = \frac{\text{Regression SS}/p}{\text{Residual SS}/(n-p-1)} $$
R Code
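The code for this slide did not survive extraction. A stand-in sketch of the overall F test using the built-in mtcars data (an assumption, not the advertising data):

```r
# Overall F statistic computed from the sums of squares
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- 2  # number of predictors

regss <- sum((fitted(fit) - mean(mtcars$mpg))^2)
rss   <- sum(resid(fit)^2)
F_stat <- (regss / p) / (rss / (n - p - 1))
F_stat                                        # matches summary(fit)$fstatistic
pf(F_stat, p, n - p - 1, lower.tail = FALSE)  # overall p-value
```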
Test of the Significance of an Individual Coefficient
H0 : βj = 0, j = 1, . . . , p,
versus
HA : βj ̸= 0.
This is tested using a t-test,
$$ t_j = \frac{\hat\beta_j}{\widehat{\text{s.e.}}(\hat\beta_j)} $$
which is compared to a t-distribution on n − p − 1 degrees of freedom. The values of
ŝ.e.(β̂j) are obtained as the square roots of the diagonal elements of V̂ (β̂) = (X′ X)−1 σ̂ 2 ,
where σ̂ 2 is the residual mean square.
A t-test for the intercept also can be constructed, although this does not refer to a
hypothesis about a predictor, but rather about whether the expected target is equal to a
specified value if all of the predictors equal zero.
R Code
##                t.value     se.beta      p.value
## 9.4222884 0.311908236 1.267295e-17
## TV 32.8086244 0.001394897 1.509960e-81
## radio 21.8934961 0.008611234 1.505339e-54
## newspaper -0.1767146 0.005871010 8.599151e-01
Confidence Intervals for β
A confidence interval provides an alternative way of summarizing the degree of precision in
the estimate of a regression parameter.
A 100 × (1 − α)% confidence interval for βj has the form
$$ \hat\beta_j \pm t_{\alpha/2,\, n-p-1}\; \widehat{\text{s.e.}}(\hat\beta_j) $$
where tα/2,n−p−1 is the appropriate critical value at two-sided level α for a t-distribution
on n − p − 1 degrees of freedom.
R Code
## [,1] [,2]
## 2.32376228 3.55401646
## TV 0.04301371 0.04851558
## radio 0.17154745 0.20551259
## newspaper -0.01261595 0.01054097
Constructing Prediction Intervals
This interval provides guidance as to how precise ŷ0 is as a prediction of y for some
particular specified value x0 , where ŷ0 is determined by substituting the values x0 into the
estimated regression equation.
The prediction interval is then
$$ \hat{y}_0 \pm t_{\alpha/2,\, n-p-1}\; \widehat{\text{s.e.}}(\hat{y}_0^P) $$
where
$$ \widehat{\text{s.e.}}(\hat{y}_0^P) = \left[\sqrt{1 + \mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\right] \hat\sigma. $$
R Code
## [,1]
## [1,] 1.701056
Confidence Interval for a Fitted Value
The prediction interval is used to provide an interval estimate for a prediction of y for one
member of the population with a particular value of x0 .
The confidence interval is used to provide an interval estimate for the true expected value
of y for all members of the population with a particular value of x0 .
The confidence interval for a fitted value is then
$$ \hat{y}_0 \pm t_{\alpha/2,\, n-p-1}\; \widehat{\text{s.e.}}(\hat{y}_0^F) $$
where
$$ \widehat{\text{s.e.}}(\hat{y}_0^F) = \left[\sqrt{\mathbf{x}_0'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_0}\right] \hat\sigma. $$
The confidence interval for a fitted value will always be narrower than the prediction
interval (due to the absence of the extra σ̂ 2 term in the equation).
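Both intervals are available directly from predict(). A stand-in sketch using the built-in mtcars data (an assumption, not the advertising data):

```r
# Confidence vs. prediction intervals from a fitted model
fit <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = 3, hp = 150)

predict(fit, new, interval = "confidence")  # interval for the mean response
predict(fit, new, interval = "prediction")  # wider interval for a new observation
```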
R Code
## [,1]
## [1,] 0.2294476
Fitting a Linear Regression Model in R
lm.fit <- lm(sales ~ ., data = data)
summary(lm.fit)
##
## Call:
## lm(formula = sales ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.8277 -0.8908 0.2418 1.1893 2.8292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.938889 0.311908 9.422 <2e-16 ***
## TV 0.045765 0.001395 32.809 <2e-16 ***
## radio 0.188530 0.008611 21.893 <2e-16 ***
## newspaper -0.001037 0.005871 -0.177 0.86
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared: 0.8972, Adjusted R-squared: 0.8956
## F-statistic: 570.3 on 3 and 196 DF, p-value: < 2.2e-16
Related R Functions
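The list of functions on this slide did not survive extraction. A sketch of common extractor functions for a fitted lm object (mtcars as a stand-in dataset, an assumption):

```r
# Common extractors for a fitted lm object
fit <- lm(mpg ~ wt + hp, data = mtcars)
coef(fit)          # estimated coefficients
confint(fit)       # 95% confidence intervals for the coefficients
head(fitted(fit))  # fitted values
head(resid(fit))   # residuals
anova(fit)         # sequential ANOVA table
```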
Diagnostic Plots
1 Residuals vs fitted
2 Normal Q-Q
3 Scale-location
4 Cook’s distance
5 Residuals vs. leverage
6 Cook’s distance vs leverage
R function used:
library(ggfortify)
autoplot(lm.fit, which = 1:6, ncol = 3, label.size = 3)
Diagnostics Plots
[Diagnostic plots: Residuals vs Fitted, Normal Q–Q, Scale–Location, Cook's distance, Residuals vs Leverage, and Cook's distance vs Leverage; observations 6, 76, 131, and 179 are flagged]
Residuals vs Fitted Plot
A flat, horizontal zero (dashed) line and no distinctive pattern in the plot suggest
that the assumption of a linear relationship is reasonable.
If the residuals are close to 0 for small fitted values and more spread out for large
fitted values (a fanning effect), or spread out for small fitted values and close to 0
for large fitted values (a funneling effect), or vary in some more complex fashion,
this suggests non-constant error variance.
If no single residual stands apart from the basic random pattern of residuals, this
suggests that there are no outliers.
Standardized Residuals & Leverages
Standardized residuals for each observation i = 1, 2, . . . , n can be expressed as
$$ r_i = \frac{e_i}{s_e \sqrt{1 - h_i}}, $$
where the hi are the diagonal elements of the hat matrix H, also known as leverages, and
it can be shown that, when n is large,
$$ r_i \sim N(0, 1). $$
Large values of hi indicate extreme values in X, which may influence the regression. Note
that leverages depend only on X.
Normal Q-Q Plot
We can check the normality assumption of the errors using a Q-Q plot. Ideally, all the
points should fall approximately along the 45-degree reference (dashed) line. The greater
the departure from this reference line, the greater the chance that the errors are non-normal.
Scale-location Plot
From the residuals vs fitted plot, it is difficult to identify an outlier (residuals are not
standardized).
In the scale-location plot, points are flagged as outliers if their standardized residuals are
greater than +2 or smaller than −2.
To meet the constant error variance assumption, the average magnitude of the
standardized residuals (blue line) should be horizontal, and the points should be randomly
scattered around the blue line.
Influential Points & Cook’s Distance
Observations that have high leverage and large residual are known as influential points.
A common measure of influence is Cook’s Distance, which is defined as
$$ D_i = \frac{1}{p}\, r_i^2\, \frac{h_i}{1 - h_i}. $$
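This formula can be checked against R's built-in cooks.distance(), which divides by the total number of model parameters including the intercept; the sketch below uses mtcars as a stand-in dataset (an assumption):

```r
# Cook's distance from standardized residuals and leverages
fit <- lm(mpg ~ wt + hp, data = mtcars)
r_std <- rstandard(fit)      # standardized residuals
h     <- hatvalues(fit)      # leverages (diagonal of the hat matrix)
k     <- length(coef(fit))   # parameters including the intercept (here 3)

D_manual <- r_std^2 * h / (k * (1 - h))
max(abs(D_manual - cooks.distance(fit)))  # essentially zero
```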
Residuals vs. Leverage Plot
If the spread of standardized residuals does not change as a function of leverage, this
suggests homoscedasticity.
Not all outliers are influential in linear regression analysis. Even if an observation has
extreme values, it might not influence the fitted regression line. In this graph, points
with both high leverage and large residuals are potentially influential.
Assessing Normality Using Shapiro-Wilk Test
##
## Shapiro-Wilk normality test
##
## data: lm.fit$residuals
## W = 0.91767, p-value = 3.939e-09
Since the p-value is far below 0.05, the residuals deviate significantly from a normal distribution.
Assessing Homoscedasticity Using Breusch-Pagan
Test
H0 : Equal/constant variances
HA : Unequal/non-constant variances
# Breusch-Pagan test
car::ncvTest(lm.fit)
Assessing Independence Using Durbin-Watson Test
H0 : There is no correlation among the residuals.
HA : The residuals are autocorrelated.
# Durbin-Watson test
car::durbinWatsonTest(lm.fit)
Wage Dataset
The Current Population Survey (CPS) is used to supplement census information between
census years.
These data consist of a random sample of 534 persons from the CPS, with information on
wages and other characteristics of the workers.
We wish to determine
Wage Dataset - Variables
Wage Data
library(AER)
data("CPS1985")
glimpse(CPS1985)
## Rows: 534
## Columns: 11
## $ wage <dbl> 5.10, 4.95, 6.67, 4.00, 7.50, 13.07, 4.45, 19.47, 13.28, 8.~
## $ education <dbl> 8, 9, 12, 12, 12, 13, 10, 12, 16, 12, 12, 12, 8, 9, 9, 12, ~
## $ experience <dbl> 21, 42, 1, 4, 17, 9, 27, 9, 11, 9, 17, 19, 27, 30, 29, 37, ~
## $ age <dbl> 35, 57, 19, 22, 35, 28, 43, 27, 33, 27, 35, 37, 41, 45, 44,~
## $ ethnicity <fct> hispanic, cauc, cauc, cauc, cauc, cauc, cauc, cauc, cauc, c~
## $ region <fct> other, other, other, other, other, other, south, other, oth~
## $ gender <fct> female, female, male, male, male, male, male, male, male, m~
## $ occupation <fct> worker, worker, worker, worker, worker, worker, worker, wor~
## $ sector <fct> manufacturing, manufacturing, manufacturing, other, other, ~
## $ union <fct> no, no, no, no, no, yes, no, no, no, no, yes, yes, no, yes,~
## $ married <fct> yes, yes, no, no, yes, no, no, no, yes, no, yes, no, yes, n~
Wage Data - Variable Summary
summary(CPS1985)
We create a dummy variable for gender,

xi =
    0, if male
    1, if female

and use this variable as a predictor in the regression equation. This results in the model

log(wage)i = β0 + β1 xi + ϵi =
    β0 + ϵi ,        if male
    β0 + β1 + ϵi ,   if female

Now β0 can be interpreted as the average log wage for males, and β1 as the average
difference in log wages between females and males.
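This interpretation can be verified numerically. A minimal Python sketch (the slides use R; the log wages below are simulated, hypothetical values, not the CPS data) shows that OLS with a 0/1 dummy reproduces the group means exactly:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical log wages: dummy x = 0 for male, 1 for female
log_wage_male = rng.normal(2.17, 0.5, 300)
log_wage_female = rng.normal(1.93, 0.5, 234)
y = np.concatenate([log_wage_male, log_wage_female])
x = np.concatenate([np.zeros(300), np.ones(234)])

# OLS: intercept b0 = male group mean, slope b1 = female - male difference
X = np.column_stack([np.ones_like(x), x])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)
```

Because the model is saturated in the dummy, the fitted values are exactly the two group means, matching the interpretation of β0 and β1 above.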
Model 1
mod1 <- lm(log(wage) ~ gender, data = CPS1985)
summary(mod1)
##
## Call:
## lm(formula = log(wage) ~ gender, data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.16529 -0.37589 0.00662 0.36855 1.86145
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.16529 0.03032 71.411 < 2e-16 ***
## genderfemale -0.23125 0.04477 -5.166 3.39e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5155 on 532 degrees of freedom
## Multiple R-squared: 0.04776, Adjusted R-squared: 0.04597
## F-statistic: 26.69 on 1 and 532 DF, p-value: 3.39e-07
Qualitative Predictors with More than Two Levels
When a qualitative predictor has more than two levels, a single dummy variable cannot
represent all possible values. In this situation, we can create additional dummy variables.
For example, for the ethnicity variable, we create two dummy variables.
levels(CPS1985$ethnicity)
xi1 =
    1, if hispanic
    0, if not hispanic

xi2 =
    1, if other
    0, if caucasian or hispanic (not other)
Qualitative Predictors with More than Two Levels
We will use these variables as predictors in the regression equation. This results in the
model
yi = β0 + β1 xi1 + β2 xi2 + ϵi =
    β0 + ϵi ,        if caucasian
    β0 + β1 + ϵi ,   if hispanic
    β0 + β2 + ϵi ,   if other
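The same logic extends to three levels: with "caucasian" as the baseline, the fitted values are again the group means. A minimal Python sketch (simulated, hypothetical data, not the CPS sample):

```python
import numpy as np

rng = np.random.default_rng(7)
eth = rng.choice(["cauc", "hispanic", "other"], size=534)
y = rng.normal(2.1, 0.5, size=534)  # simulated log wages

# Baseline level "cauc": one dummy per remaining level
x1 = (eth == "hispanic").astype(float)
x2 = (eth == "other").astype(float)
X = np.column_stack([np.ones(len(y)), x1, x2])
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here b0 equals the baseline group mean, while b0 + b1 and b0 + b2 recover the hispanic and other group means, mirroring the piecewise model above.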
Model 2
mod2 <- lm(log(wage) ~ ethnicity, data = CPS1985)
summary(mod2)
##
## Call:
## lm(formula = log(wage) ~ ethnicity, data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.08810 -0.38335 -0.00865 0.36526 1.70739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.08810 0.02499 83.549 < 2e-16 ***
## ethnicityhispanic -0.26955 0.10394 -2.593 0.00977 **
## ethnicityother -0.12177 0.06875 -1.771 0.07710 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5242 on 531 degrees of freedom
## Multiple R-squared: 0.0169, Adjusted R-squared: 0.0132
## F-statistic: 4.565 on 2 and 531 DF, p-value: 0.01083
Model 3
mod3 <- lm(log(wage) ~ education+experience+age+gender, data = CPS1985)
summary(mod3)
##
## Call:
## lm(formula = log(wage) ~ education + experience + age + gender,
## data = CPS1985)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.15564 -0.30705 0.00479 0.30833 1.99432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.15357 0.69387 1.663 0.097 .
## education 0.17746 0.11371 1.561 0.119
## experience 0.09234 0.11375 0.812 0.417
## age -0.07961 0.11365 -0.700 0.484
## genderfemale -0.25736 0.03948 -6.519 1.66e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4525 on 529 degrees of freedom
## Multiple R-squared: 0.2703, Adjusted R-squared: 0.2648
## F-statistic: 48.99 on 4 and 529 DF, p-value: < 2.2e-16
Checking Collinearity
library(GGally)
ggpairs(CPS1985 %>% dplyr::select(education,experience,age,gender))
[ggpairs scatterplot matrix of education, experience, age, and gender.
Pairwise correlations: education-experience −0.353***, education-age −0.150***,
experience-age 0.978***.]
Variance Inflation Factor (VIF) and Tolerance
The variance inflation factor (VIF) measures how much the variance of the estimated
regression coefficient βk is “inflated” by correlation among the predictor variables in
the model.
A VIF of 1 means that there is no correlation between the k-th predictor and the
remaining predictor variables, and hence the variance of βk is not inflated at all.
The general rule of thumb is that VIFs exceeding 4 warrant further investigation, while
VIFs exceeding 10 are signs of serious multicollinearity requiring correction.
Tolerance, the reciprocal of the VIF, is the proportion of variance in a predictor that
cannot be accounted for by the other predictors.
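The definition is directly computable: VIFk = 1/(1 − Rk²), where Rk² comes from regressing the k-th predictor on the remaining ones. A minimal Python sketch (simulated predictors built to mimic the near-exact age ≈ education + experience + 6 relationship in the CPS data; not the actual sample):

```python
import numpy as np

def vif(X, k):
    """VIF_k = 1 / (1 - R^2_k), where R^2_k is from regressing
    column k of X on the remaining columns plus an intercept."""
    target = X[:, k]
    others = np.delete(X, k, axis=1)
    Z = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(Z, target, rcond=None)
    resid = target - Z @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(3)
n = 534
education = rng.normal(13, 2.6, n)
age = rng.integers(18, 65, n).astype(float)
# experience is (almost) an exact linear combination of age and education
experience = age - education - 6 + rng.normal(0, 0.1, n)
X = np.column_stack([education, experience, age])
```

Here vif(X, 1) is enormous and the tolerance 1/VIF is near zero, flagging experience as collinear, which is the same pattern the diagnostics below reveal in the real data.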
Variance Inflation Factor (VIF) and Tolerance
library(mctest)    # provides omcdiag() and imcdiag() used below
car::vif(mod3)
1/car::vif(mod3)   # tolerance = 1/VIF
Farrar - Glauber Test (Overall Collinearity Check)
omcdiag(mod3)
##
## Call:
## omcdiag(mod = mod3)
##
##
## Overall Multicollinearity Diagnostics
##
## MC Results detection
## Determinant |X'X|: 0.0002 1
## Farrar Chi-Square: 4553.6699 1
## Red Indicator: 0.4311 0
## Sum of Lambda Inverse: 10016.7841 1
## Theil's Method: 2.1935 1
## Condition Number: 556.6117 1
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
Farrar - Glauber Test (Overall Collinearity Check)
The standardized determinant, |X′X|, is found to be 0.0002, which is very small.
Farrar - Glauber Test
The χ2 test statistic is 4553.6699 and is highly significant, implying the presence of
multicollinearity in the model specification.
This motivates the next step of the Farrar-Glauber test (the F-test) for locating the
multicollinearity.
Farrar - Glauber Test (Individual Collinearity Check)
imcdiag(mod3)
##
## Call:
## imcdiag(mod = mod3)
##
##
## All Individual Multicollinearity Diagnostics Result
##
## VIF TOL Wi Fi Leamer CVIF Klein
## education 230.2245 0.0043 40496.3252 60859.1001 0.0659 222.0259 1
## experience 5162.0599 0.0002 911787.2555 1370261.4132 0.0139 4978.2320 1
## age 4623.4904 0.0002 816639.9719 1227271.2031 0.0147 4458.8417 1
## genderfemale 1.0093 0.9908 1.6384 2.4623 0.9954 0.9733 0
## IND1 IND2
## education 0.0000 1.3256
## experience 0.0000 1.3311
## age 0.0000 1.3311
## genderfemale 0.0056 0.0122
##
## 1 --> COLLINEARITY is detected by the test
## 0 --> COLLINEARITY is not detected by the test
##
## education , experience , age , coefficient(s) are non-significant may be due to multicollinearity
##
## R-square of y on all x: 0.2703
##
## * use method argument to check which regressors may be the reason of collinearity
## ===================================
Farrar - Glauber Test (Individual Collinearity Check)
The above output shows that education, experience, and age suffer from multicollinearity;
the VIF values for these variables are very high. Finally, let us examine the pattern of
multicollinearity by conducting t-tests on the partial correlation coefficients.
education <- CPS1985[, 2]
experience <- CPS1985[, 3]
age <- CPS1985[, 4]
gender <- CPS1985[, 7]  # factor; cbind() coerces it to numeric codes
x <- cbind(education, experience, age, gender)
library(ppcor)
pcor(x)
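pcor() residualizes each pair of variables on the remaining ones and correlates the residuals. The idea in a minimal Python sketch (controlling for education only, with simulated data mimicking the age = education + experience + 6 relationship; not the CPS sample):

```python
import numpy as np

def partial_corr(a, b, controls):
    """Correlate the residuals of a and b after regressing each on
    the control variable(s), with an intercept."""
    Z = np.column_stack([np.ones(len(a)), controls])
    ra = a - Z @ np.linalg.lstsq(Z, a, rcond=None)[0]
    rb = b - Z @ np.linalg.lstsq(Z, b, rcond=None)[0]
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

rng = np.random.default_rng(9)
n = 534
education = rng.normal(13, 2.6, n)
experience = rng.normal(17, 12, n)
# age is (almost) determined by education and experience
age = education + experience + 6 + rng.normal(0, 0.5, n)
```

The partial correlation of age and experience given education stays near 1 under this construction, mirroring the 0.9999 entry in the pcor() output on the next slide.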
t-test for correlation coefficients
## $estimate
## education experience age gender
## education 1.00000000 -0.99777529 0.99751429 0.05316235
## experience -0.99777529 1.00000000 0.99988864 0.05233923
## age 0.99751429 0.99988864 1.00000000 -0.05113531
## gender 0.05316235 0.05233923 -0.05113531 1.00000000
##
## $p.value
## education experience age gender
## education 0.0000000 0.000000 0.0000000 0.2208851
## experience 0.0000000 0.000000 0.0000000 0.2281270
## age 0.0000000 0.000000 0.0000000 0.2390205
## gender 0.2208851 0.228127 0.2390205 0.0000000
##
## $statistic
## education experience age gender
## education 0.000000 -344.556354 325.901918 1.225622
## experience -344.556354 0.000000 1542.494566 1.206593
## age 325.901918 1542.494566 0.000000 -1.178765
## gender 1.225622 1.206593 -1.178765 0.000000
##
## $n
## [1] 534
##
## $gp
## [1] 2
##
## $method
## [1] "pearson"
Findings & Remedial Measures
As expected, the high partial correlation between ‘age’ and ‘experience’ is found to be
highly statistically significant. The same holds for ‘education – experience’ and
‘education – age’.
As a remedial measure, we can build a model excluding ‘experience’, estimate it, and run
further diagnostics for the presence of multicollinearity.
Model 4: Result
Dependent variable: log(wage)

                      Estimate (Std. Error)
education             0.085∗∗∗ (0.008)
age                   0.013∗∗∗ (0.002)
genderfemale         −0.256∗∗∗ (0.039)
Constant              0.600∗∗∗ (0.126)

Observations          534
R2                    0.269
Adjusted R2           0.265
Residual Std. Error   0.452 (df = 530)
F Statistic           65.141∗∗∗ (df = 3; 530)

Note: ∗p<0.1; ∗∗p<0.05; ∗∗∗p<0.01