10 - APM 1205 Linear Model
This document discusses various techniques for diagnosing statistical regression models, including evaluating model fit, analyzing residuals, and checking assumptions. It outlines steps to check for nonlinearity in data, constant variance, and non-multicollinearity when evaluating model fit. It then describes residual analysis to check if residuals are normally distributed, have constant variance, and independent error terms. Various plots and tests are provided to diagnose these assumptions. The document also discusses detecting influential outliers and leverage points that can impact the model.


Model Diagnostics

Importance of Model Diagnostics

• Evaluate the goodness of fit internally
• Check whether the model assumptions hold:

i. Linearity of the data
ii. Homoscedasticity (constant variance)
iii. Absence of multicollinearity
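
The R snippets on the later slides assume a fitted lm object called model. A minimal sketch of such a setup, using the built-in mtcars data purely as a hypothetical example (the dataset and variables are not from the original slides), is:

# Illustrative only: fit a linear model so the later diagnostic commands have a `model` object
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Base R then produces the standard diagnostic plots referenced on the following slides:
# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 3 = Scale-Location, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))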
Residual Analysis
• Residuals are differences between the one-step-predicted
output from the model and the measured output from the
validation data set. Thus, residuals represent the portion of
the validation data not explained by the model.
• Residual analysis consists of two tests: the whiteness test
and the independence test.

$\hat{\varepsilon} = y - X\hat{\beta} = y - \hat{y}$
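
As a hedged illustration of this identity, continuing the hypothetical mtcars model introduced above, the residuals returned by lm can be reproduced by hand:

# Residuals reported by the fitted model
e_hat <- residuals(model)

# The same quantity computed directly as y - X * beta_hat
X <- model.matrix(model)          # design matrix
beta_hat <- coef(model)           # least squares estimates
e_manual <- mtcars$mpg - as.vector(X %*% beta_hat)

all.equal(unname(e_hat), e_manual)   # TRUE up to numerical tolerance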
Residual Analysis
Things to look into:
1. Normality of the Residuals
2. Homogeneity of the Residual Variance
3. Independence of the Residual Error Terms
Residual Analysis
Normality
Test: Residuals vs. Fitted Plot [plot function: plot(model, 1)]
• For the normality assumption to hold, the residuals should spread randomly around 0 and form a horizontal band.
• If the red trend line is approximately flat and close to zero, then one can assume that the residuals are normally distributed.
Residual Analysis
Normality
Test: Q-Q Plot [plot function: plot(model, 2)]
• If the points in the Q-Q plot fall along the diagonal reference line, you can assume that the residuals follow a normal distribution.
Residual Analysis
Normality
Test: Histogram [hist(model$residuals, prob = TRUE)]
• To not violate the normality assumption, the histogram should be centered around zero and should show a bell-shaped curve.
Residual Analysis
Normality
Test: Histogram with normal curve

x <- model$residuals                           # residuals from the fitted model
x2 <- seq(min(x), max(x), length = 40)         # grid over the range of the residuals
fun <- dnorm(x2, mean = mean(x), sd = sd(x))   # normal density with matching mean and sd
hist(x, prob = TRUE, col = "white", ylim = c(0, max(fun)),
     main = "Histogram with normal curve")
lines(x2, fun, col = 2, lwd = 2)               # overlay the normal curve
Residual Analysis
Normality
Test: Boxplot [boxplot(model$residuals)]
• If most of the data lie inside the box and the whiskers are roughly equal in length, we assume normality of the residuals.
Residual Analysis
Normality
Test: Kolmogorov-Smirnov and Shapiro-Wilk tests

ks.test(model$residuals, "pnorm")
shapiro.test(model$residuals)

• If the p-value is below 0.05, we reject the null hypothesis and conclude that the residuals do not follow a normal distribution.
Residual Analysis
Constancy of Variance (Homoscedasticity)

• Homoscedasticity is the situation in which the variance of the residuals of a regression model is the same across all values of the predicted variable.
Residual Analysis
Heteroscedasticity
Test: Q-Q Plot and Scale-Location Plot [plot functions: plot(model, 2) and plot(model, 3)]
• If the variability neither increases nor decreases with the fitted values, we can assume that the regression model does not violate the homoscedasticity assumption.
Residual Analysis
Heteroscedasticity
Test:
Breusch–Pagan Test
This test checks whether the variance of the residuals depends on the
value of the independent variable
lmtest::bptest(model)

If the test statistic (BP) is small and the p-value is not significant (i.e.,
>0.05), we do not reject the null hypothesis. Therefore, we assume
that the residuals are homoscedastic.
Residual Analysis
Heteroscedasticity
Test: White Test
This test checks whether the variance of the residuals depends on the values of the independent variables.
skedastic::white_lm(model)

• Because the p-value is not significant (i.e., >0.05), we do not reject the null hypothesis. Hence, we assume that the residuals are homoscedastic.
Residual Analysis
Autocorrelation of Error Terms

Recall
• Autocorrelation, also known as serial correlation, refers to the degree of correlation of the same variable between two successive time intervals.
• It is found mostly in time series data.
• If the errors themselves are autocorrelated, this will often be reflected in the regression residuals also being autocorrelated.
Residual Analysis
Autocorrelation of Error Terms

Durbin-Watson Test
• Rejecting the null hypothesis means there is a presence of autocorrelation.
• This can be interpreted to mean that historical (time series) values have an influence on the current error terms.
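
The slides give no code for this test; a minimal sketch, assuming the lmtest package is installed and reusing the hypothetical model from earlier, is:

# Durbin-Watson test for first-order autocorrelation of the error terms
# H0: no autocorrelation; a small p-value suggests the residuals are autocorrelated
library(lmtest)
dwtest(model)

# car::durbinWatsonTest(model) is an alternative that also reports a bootstrap p-value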
Outliers Detection
Outlier
• An outlier is an observation for which the residual is large in magnitude compared to other observations in the data set.
• Outliers are abnormal values in a dataset that do not follow the regular distribution and have the potential to significantly distort any regression model.
• These points are especially important because they can have a strong influence on the least squares line.
Why should we be concerned about it?
• We want a fit that is not overwhelmingly influenced by only one or a few observations.
• A large deviation from the straight line means that minimizing the sum of squares focuses on that point, allowing less contribution from the more "normal" points.
• We target a stable set of parameter estimates. As much as possible we want to factor out all forces that could exert undue influence on the model.
Types of Outliers
(1) Response/Dependent Variable Outlier
• An observation with a large standardized residual, far from the rest of the responses.
• Observations with standardized residuals that are 2 or 3 standard deviations away from 0 are considered outliers (see the R sketch after this list).
• These can cause the model to fail in its predictions.
(2) Regression/Independent Variable Outlier
• An observation that has an unusual value of the dependent variable, conditional on its value of the independent variable.
• A regression outlier will have a large residual but will not necessarily affect the regression slope coefficient.
• Measured in terms of the leverage (weight, contribution through the design matrix) of the point within the configuration of the design matrix X.
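
As a hedged sketch (not part of the original slides), both kinds of outliers can be screened for in base R on the hypothetical model used earlier:

# (1) Response outliers: standardized residuals more than 2 (or 3) SDs from 0
std_res <- rstandard(model)
which(abs(std_res) > 2)

# (2) Potential X-outliers: leverages above the common 2p/n rule of thumb
h <- hatvalues(model)
p <- length(coef(model))   # number of estimated parameters
n <- nobs(model)
which(h > 2 * p / n)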
Leverage points
Leverage. Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
An observation with an unusual x value, i.e., one that is far from the center of the x values, has leverage on the regression line.
Leverage points
A good leverage point
• is a point that is unusually large or small among the X values but is not a regression outlier. That is, the point is relatively removed from the bulk of the observations but reasonably close to the line around which most of the points are centered.
• A good leverage point has limited effect on giving a distorted view of how the majority of points are associated.
• Good leverage points improve the precision of the regression coefficients.
A bad leverage point
• is a point situated far from the regression line around which the bulk of the points are centered. Said another way, a bad leverage point is a regression outlier whose X value is an outlier among the X values as well (it is relatively far removed from the regression line).
• A bad leverage point can grossly distort the estimate of the slope of the regression line if an estimator with a small breakdown point is used. Bad leverage points reduce the precision of the regression coefficients.
Influential
High leverage points that actually
influence the slope of the
regression line are called
influential points.
Influential vs High Leverage
Observation
A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
Outliers and high leverage data points
have the potential to be influential, but
we generally have to investigate
further to determine whether or not
they are actually influential.
Effect of Outlier
• High leverage observations should also be examined against the general fit of the model.
• High leverage observations may hide outliers. This is because a high leverage point pulls the fit toward itself, yielding a small residual, so it does not look like an outlier.
• Outliers may mask other outliers. This is because the presence of several outliers can inflate the standard deviation of the residuals, so the 3 SD band could be wide enough to hide them.
Hat Matrix
• The hat matrix is a matrix used in regression analysis and analysis of variance.
• It is defined as the matrix that converts values of the observed variable into the estimates obtained with the least squares method.
• Leverages from the hat matrix measure potential influence.
• If $\hat{y}$ is the vector of estimates calculated from the least squares parameters and $y$ is the vector of observations of the dependent variable, then $\hat{y}$ is given by $y$ multiplied by $H$; that is, $H$ converts $y$ into $\hat{y}$ ($\hat{y} = Hy$).
Hat Matrix

$H = X (X'X)^{-1} X'$

• All observations in $y$ affect every predicted value $\hat{y}_i$.
• $h_{ii}$, the i-th diagonal element of $H$, is the leverage of the i-th observation on its own fit.
• $h_{ii}$ lies in the range [0, 1]: values near 1 indicate high leverage, values near 0 low leverage.

# Leverages directly from the fitted model
lm.influence(model)$hat

# Manual construction (X must be the design matrix, not the fitted values)
X <- model.matrix(model)
H <- X %*% solve(t(X) %*% X) %*% t(X)
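
As a quick check, and still assuming the hypothetical model defined earlier, the diagonal of the manually built H should match the leverages that R reports:

# hatvalues() returns the diagonal of the hat matrix of a fitted lm object
all.equal(unname(diag(H)), unname(hatvalues(model)))   # TRUE up to numerical tolerance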
Detection
Cook's Distance

• Based on the change in the estimated coefficients $\hat{\beta}$ (and hence in the fitted values) when the i-th observation is removed from the dataset.

cooksd <- cooks.distance(model)

# plot Cook's distance
plot(cooksd, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
# add cutoff line at 4 times the mean Cook's distance
abline(h = 4 * mean(cooksd, na.rm = TRUE), col = "red")
# label observations above the cutoff
text(x = 1:length(cooksd) + 1, y = cooksd,
     labels = ifelse(cooksd > 4 * mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red")
Detection
What to do with outliers?
• They should not simply be dropped, deleted, or down-weighted, because not all of them are bad.
• They can still be a source of additional information.
Detection
What if an outlier is correctly identified?
• Do not drop it without any justification
• Take corrective action:
• Correct the data
• Delete/down-weight
• Transformation
• Consider a different model
• Re-design the experiment/survey
• Collect more data
Multicollinearity
Multicollinearity
• The problem of multicollinearity exists when the joint association of the independent variables affects the modelling process.
• Pairwise correlation of independent variables will not necessarily lead to multicollinearity.
• Absence of pairwise correlation of independent variables will not necessarily indicate absence of multicollinearity.
• Joint correlation of the independent variables will not be a problem if it is weak enough not to affect the modelling process.
Multicollinearity
Recall independence in the linear model:

• Independence means that there is no linear relation between the different variables.
• If $v_1, \dots, v_p$ are n-dimensional vectors and there exist constants $c_1, \dots, c_p$, not all zero, such that $c_1 v_1 + \dots + c_p v_p = 0$, then the vectors are linearly dependent; if no such constants exist, they are linearly independent.
• For the least squares estimator to exist, linear dependence among the columns of X must not exist.
Multicollinearity
Recall partial correlation:
• Partial correlation measures the strength of a relationship between two variables while controlling for the effect of one or more other variables.
• Formally, the partial correlation between $X$ and $Y$ given a set of n controlling variables $Z = \{Z_1, \dots, Z_n\}$, written $\rho_{XY \cdot Z}$, is the correlation between the residuals $e_X$ and $e_Y$ resulting from the linear regression of $X$ with $Z$ and of $Y$ with $Z$.

• If an independent variable is not partially correlated with the other independent variables, $\hat{\beta}$ remains a good least squares estimator; partial correlation among the independent variables signals the presence of multicollinearity.
Multicollinearity
Primary Sources of the Issue

a. The data collection method employed
b. Constraints in the model or in the population
c. Model specification
d. An overdefined (overparameterized) model
Multicollinearity
Effects of Multicollinearity

a. The estimated coefficients become very sensitive to small changes in the model. This is due to the changing interaction between the correlated independent variables.
b. Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model.
Multicollinearity
Detection and Analysis

a. Signs of the coefficients are reversed
b. Correlation matrix (high correlation coefficients)
c. High Variance Inflation Factors (VIFs)
Multicollinearity
Variance Inflation Factors (VIF)

• VIF measures how much the variance of an independent variable's estimated coefficient is influenced, or inflated, by its interaction/correlation with the other independent variables. It indicates which variable(s) are affected by multicollinearity.
• $\mathrm{VIF}_i$ is the i-th diagonal element of the inverse of the correlation matrix of the independent variables.
• A VIF of 1.5 indicates multicollinearity within the independent variables; it can be interpreted as the variance of that coefficient being 50% higher than what could be expected if there were no multicollinearity between the independent variables.
• A large VIF (commonly, greater than 10) indicates severe variance inflation for the parameter estimate associated with that variable.
Multicollinearity
Variance Inflation Factors (VIF)

$\mathrm{VIF}_i = \dfrac{1}{1 - R_i^2}$

where $R_i^2$ is the coefficient of determination obtained when regressing $X_i$ on the other p-1 independent variables; $1 - R_i^2$ is the tolerance. An R sketch is shown below.
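
As a hedged illustration, assuming the car package is installed and reusing the hypothetical mtcars model from earlier, the VIFs can be obtained directly or reproduced from the formula above:

# VIFs for all independent variables
library(car)
vif(model)

# Manual computation for one predictor, e.g. wt regressed on the remaining predictor(s)
r2_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared
1 / (1 - r2_wt)   # VIF for wt; should match car::vif(model)["wt"]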
Multicollinearity
Potential Solutions

• Remove some of the highly correlated independent variables.
• Linearly combine the independent variables, such as adding them together.
• Compounding
i. Representation
ii. Indexing
