10 - APM 1205 Linear Model
This document discusses various techniques for diagnosing statistical regression models, including evaluating model fit, analyzing residuals, and checking assumptions. It outlines steps to check for nonlinearity in data, constant variance, and non-multicollinearity when evaluating model fit. It then describes residual analysis to check if residuals are normally distributed, have constant variance, and independent error terms. Various plots and tests are provided to diagnose these assumptions. The document also discusses detecting influential outliers and leverage points that can impact the model.


Model Diagnostics

Importance of Model Diagnostics

• Evaluate the goodness of fit internally
• Check whether the model assumptions hold:

i. Linearity of the data
ii. Homoscedasticity (constant variance)
iii. Absence of multicollinearity
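
The R snippets on the later slides assume a fitted lm object called model. A minimal sketch of such a setup, using the built-in mtcars data purely as a hypothetical example (the dataset and variables are not from the original slides), is:

# Illustrative only: fit a linear model so the later diagnostic commands have a `model` object
data(mtcars)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Base R then produces the standard diagnostic plots referenced on the following slides:
# 1 = Residuals vs Fitted, 2 = Normal Q-Q, 3 = Scale-Location, 5 = Residuals vs Leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))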
Residual Analysis
• Residuals are differences between the one-step-predicted
output from the model and the measured output from the
validation data set. Thus, residuals represent the portion of
the validation data not explained by the model.
• Residual analysis consists of two tests: the whiteness test
and the independence test.

$\hat{\varepsilon} = y - X\hat{\beta} = y - \hat{y}$
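
As a hedged illustration of this identity, continuing the hypothetical mtcars model introduced above, the residuals returned by lm can be reproduced by hand:

# Residuals reported by the fitted model
e_hat <- residuals(model)

# The same quantity computed directly as y - X * beta_hat
X <- model.matrix(model)          # design matrix
beta_hat <- coef(model)           # least squares estimates
e_manual <- mtcars$mpg - as.vector(X %*% beta_hat)

all.equal(unname(e_hat), e_manual)   # TRUE up to numerical tolerance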
Residual Analysis
Things to look into:
1. Normality of the Residuals
2. Homogeneity of the Residual Variance
3. Independence of the Residual Error Terms
Residual Analysis
Normality
Test: Residuals vs. Fitted Plot [plot function: plot(model, 1)]
• For the normality assumption to hold, the residuals should spread randomly around 0 and form a horizontal band.
• If the red trend line is approximately flat and close to zero, then one can assume that the residuals are normally distributed.
Residual Analysis
Normality
Test: Q-Q Plot [plot function: plot(model, 2)]
• If the points in the Q-Q plot fall along the diagonal reference line, you can assume that the residuals follow a normal distribution.
Residual Analysis
Normality
Test: Histogram [hist(model$residuals, prob = TRUE)]
• To not violate the normality assumption, the histogram should be centered around zero and should show a bell-shaped curve.
Residual Analysis
Normality
Test: Histogram with normal curve

x <- model$residuals                           # residuals from the fitted model
x2 <- seq(min(x), max(x), length = 40)         # grid over the range of the residuals
fun <- dnorm(x2, mean = mean(x), sd = sd(x))   # normal density with matching mean and sd
hist(x, prob = TRUE, col = "white", ylim = c(0, max(fun)),
     main = "Histogram with normal curve")
lines(x2, fun, col = 2, lwd = 2)               # overlay the normal curve
Residual Analysis
Normality
Test: Boxplot [boxplot(model$residuals)]
• If most of the data lie inside the box and the whiskers are roughly equal in length, we assume normality of the residuals.
Residual Analysis
Normality
Test: Kolmogorov-Smirnov and Shapiro-Wilk tests

ks.test(model$residuals, "pnorm")
shapiro.test(model$residuals)

• If the p-value is below 0.05, we reject the null hypothesis and conclude that the residuals do not follow a normal distribution.
Residual Analysis
Constancy of Variance (Homoscedasticity)

• Homoscedasticity is the situation in which the variance of the residuals of a regression model is the same across all values of the predicted variable.
Residual Analysis
Heteroscedasticity
Test: Q-Q Plot and Scale-Location Plot [plot functions: plot(model, 2) and plot(model, 3)]
• If the variability neither increases nor decreases with the fitted values, we can assume that the regression model does not violate the homoscedasticity assumption.
Residual Analysis
Heteroscedasticity
Test:
Breusch–Pagan Test
This test checks whether the variance of the residuals depends on the
value of the independent variable
lmtest::bptest(model)

If the test statistic (BP) is small and the p-value is not significant (i.e.,
>0.05), we do not reject the null hypothesis. Therefore, we assume
that the residuals are homoscedastic.
Residual Analysis
Heteroscedasticity
Test: White Test
This test checks whether the variance of the residuals depends on the values of the independent variables.
skedastic::white_lm(model)

• Because the p-value is not significant (i.e., >0.05), we do not reject the null hypothesis. Hence, we assume that the residuals are homoscedastic.
Residual Analysis
Autocorrelation of Error Terms

Recall
• Autocorrelation, also known as serial correlation, refers to the degree of correlation of the same variable between two successive time intervals.
• It is found mostly in time series data.
• If the errors themselves are autocorrelated, this will often be reflected in the regression residuals also being autocorrelated.
Residual Analysis
Autocorrelation of Error Terms

Durbin-Watson Test
• Rejecting the null hypothesis means there is a presence of autocorrelation.
• This can be interpreted to mean that historical (time series) values have an influence on the current error terms.
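
The slides give no code for this test; a minimal sketch, assuming the lmtest package is installed and reusing the hypothetical model from earlier, is:

# Durbin-Watson test for first-order autocorrelation of the error terms
# H0: no autocorrelation; a small p-value suggests the residuals are autocorrelated
library(lmtest)
dwtest(model)

# car::durbinWatsonTest(model) is an alternative that also reports a bootstrap p-value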
Outliers Detection
Outlier
• An outlier is an observation for which the residual is large in magnitude compared to other observations in the data set.
• Outliers are abnormal values in a dataset that do not follow the regular distribution and have the potential to significantly distort any regression model.
• These points are especially important because they can have a strong influence on the least squares line.
Why should we be concerned about it?
• We want a fit that is not overwhelmingly influenced by only one or a few observations.
• A large deviation from the straight line means that minimizing the sum of squares focuses on that point, allowing less contribution from the more "normal" points.
• We target a stable set of parameter estimates. As much as possible we want to factor out all forces that could exert undue influence on the model.
Types of Outliers
(1) Response/Dependent Variable Outlier
• An observation with a large standardized residual, far from the rest of the responses.
• Observations with standardized residuals that are 2 or 3 standard deviations away from 0 are considered outliers (see the R sketch after this list).
• These can cause the model to fail in its predictions.
(2) Regression/Independent Variable Outlier
• An observation that has an unusual value of the dependent variable, conditional on its value of the independent variable.
• A regression outlier will have a large residual but will not necessarily affect the regression slope coefficient.
• Measured in terms of the leverage (weight, contribution through the design matrix) of the point within the configuration of the design matrix X.
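
As a hedged sketch (not part of the original slides), both kinds of outliers can be screened for in base R on the hypothetical model used earlier:

# (1) Response outliers: standardized residuals more than 2 (or 3) SDs from 0
std_res <- rstandard(model)
which(abs(std_res) > 2)

# (2) Potential X-outliers: leverages above the common 2p/n rule of thumb
h <- hatvalues(model)
p <- length(coef(model))   # number of estimated parameters
n <- nobs(model)
which(h > 2 * p / n)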
Leverage points
Leverage. Points that fall horizontally away from the center of the cloud tend to pull harder on the line, so we call them points with high leverage.
An observation with an unusual x value, i.e., one that is far from the center of the x values, has leverage on the regression line.
Leverage points
A good leverage point
• is a point that is unusually large or small among the X values but is not a regression outlier. That is, the point is relatively removed from the bulk of the observations but reasonably close to the line around which most of the points are centered.
• A good leverage point has limited effect on giving a distorted view of how the majority of points are associated.
• Good leverage points improve the precision of the regression coefficients.
A bad leverage point
• is a point situated far from the regression line around which the bulk of the points are centered. Said another way, a bad leverage point is a regression outlier whose X value is an outlier among the X values as well (it is relatively far removed from the regression line).
• A bad leverage point can grossly distort the estimate of the slope of the regression line if an estimator with a small breakdown point is used. Bad leverage points reduce the precision of the regression coefficients.
Influential
High leverage points that actually
influence the slope of the
regression line are called
influential points.
Influential vs High Leverage
Observation
A data point is influential if it unduly influences any part of a regression analysis, such as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
Outliers and high leverage data points
have the potential to be influential, but
we generally have to investigate
further to determine whether or not
they are actually influential.
Effect of Outlier
• High leverage observations should also be examined against the general fit of the model.
• High leverage observations may hide outliers. This is because a high leverage point pulls the fit toward itself, yielding a small residual, so it does not look like an outlier.
• Outliers may mask other outliers. This is because the presence of several outliers can inflate the standard deviation of the residuals, so the 3 SD band could be wide enough to hide them.
Hat Matrix
• The hat matrix is a matrix used in regression analysis and analysis of variance.
• It is defined as the matrix that converts values of the observed variable into the estimates obtained with the least squares method.
• Leverages from the hat matrix measure potential influence.
• If $\hat{y}$ is the vector of estimates calculated from the least squares parameters and $y$ is the vector of observations of the dependent variable, then $\hat{y}$ is given by $y$ multiplied by $H$; that is, $H$ converts $y$ into $\hat{y}$ ($\hat{y} = Hy$).
Hat Matrix

$H = X (X'X)^{-1} X'$

• All observations in $y$ affect every predicted value $\hat{y}_i$.
• $h_{ii}$, the i-th diagonal element of $H$, is the leverage of the i-th observation on its own fit.
• $h_{ii}$ lies in the range [0, 1]: values near 1 indicate high leverage, values near 0 low leverage.

# Leverages directly from the fitted model
lm.influence(model)$hat

# Manual construction (X must be the design matrix, not the fitted values)
X <- model.matrix(model)
H <- X %*% solve(t(X) %*% X) %*% t(X)
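
As a quick check, and still assuming the hypothetical model defined earlier, the diagonal of the manually built H should match the leverages that R reports:

# hatvalues() returns the diagonal of the hat matrix of a fitted lm object
all.equal(unname(diag(H)), unname(hatvalues(model)))   # TRUE up to numerical tolerance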
Detection
Cook's Distance

• Based on the change in the estimated coefficients $\hat{\beta}$ (and hence in the fitted values) when the i-th observation is removed from the dataset.

cooksd <- cooks.distance(model)

# plot Cook's distance
plot(cooksd, pch = "*", cex = 2, main = "Influential Obs by Cooks distance")
# add cutoff line at 4 times the mean Cook's distance
abline(h = 4 * mean(cooksd, na.rm = TRUE), col = "red")
# label observations above the cutoff
text(x = 1:length(cooksd) + 1, y = cooksd,
     labels = ifelse(cooksd > 4 * mean(cooksd, na.rm = TRUE), names(cooksd), ""),
     col = "red")
Detection
What to do with outliers?
• They should not simply be dropped, deleted, or down-weighted, because not all of them are bad.
• They can still be a source of additional information.
Detection
What if an outlier is correctly identified?
• Do not drop it without any justification
• Take corrective action:
• Correct the data
• Delete/down-weight
• Transformation
• Consider a different model
• Re-design the experiment/survey
• Collect more data
Multicollinearity
Multicollinearity
• The problem of multicollinearity exists when the joint association of the independent variables affects the modelling process.
• Pairwise correlation of independent variables will not necessarily lead to multicollinearity.
• Absence of pairwise correlation of independent variables will not necessarily indicate absence of multicollinearity.
• Joint correlation of the independent variables will not be a problem if it is weak enough not to affect the modelling process.
Multicollinearity
Recall independence in the linear model:

• Independence means that there is no linear relation between the different variables.
• If $v_1, \dots, v_p$ are n-dimensional vectors and there exist constants $c_1, \dots, c_p$, not all zero, such that $c_1 v_1 + \dots + c_p v_p = 0$, then the vectors are linearly dependent; if no such constants exist, they are linearly independent.
• For the least squares estimator to exist, linear dependence among the columns of X must not exist.
Multicollinearity
Recall partial correlation:
• Partial correlation measures the strength of a relationship between two variables while controlling for the effect of one or more other variables.
• Formally, the partial correlation between $X$ and $Y$ given a set of n controlling variables $Z = \{Z_1, \dots, Z_n\}$, written $\rho_{XY \cdot Z}$, is the correlation between the residuals $e_X$ and $e_Y$ resulting from the linear regression of $X$ with $Z$ and of $Y$ with $Z$.

• If an independent variable is not partially correlated with the other independent variables, $\hat{\beta}$ remains a good least squares estimator; partial correlation among the independent variables signals the presence of multicollinearity.
Multicollinearity
Primary Sources of the Issue

a. The data collection method employed
b. Constraints in the model or in the population
c. Model specification
d. An overdefined (overparameterized) model
Multicollinearity
Effects of Multicollinearity

a. The estimated coefficients become very sensitive to small changes in the model. This is due to the changing interaction between the correlated independent variables.
b. Multicollinearity reduces the precision of the estimated coefficients, which weakens the statistical power of your regression model.
Multicollinearity
Detection and Analysis

a. Signs of the coefficients are reversed
b. Correlation matrix (high correlation coefficients)
c. High Variance Inflation Factors (VIFs)
Multicollinearity
Variance Inflation Factors (VIF)

• VIF measures how much the variance of an independent variable's estimated coefficient is influenced, or inflated, by its interaction/correlation with the other independent variables. It indicates which variable(s) are affected by multicollinearity.
• $\mathrm{VIF}_i$ is the i-th diagonal element of the inverse of the correlation matrix of the independent variables.
• A VIF of 1.5 indicates multicollinearity within the independent variables; it can be interpreted as the variance of that coefficient being 50% higher than what could be expected if there were no multicollinearity between the independent variables.
• A large VIF (commonly, greater than 10) indicates severe variance inflation for the parameter estimate associated with that variable.
Multicollinearity
Variance Inflation Factors (VIF)

$\mathrm{VIF}_i = \dfrac{1}{1 - R_i^2}$

where $R_i^2$ is the coefficient of determination obtained when regressing $X_i$ on the other p-1 independent variables; $1 - R_i^2$ is the tolerance. An R sketch is shown below.
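
As a hedged illustration, assuming the car package is installed and reusing the hypothetical mtcars model from earlier, the VIFs can be obtained directly or reproduced from the formula above:

# VIFs for all independent variables
library(car)
vif(model)

# Manual computation for one predictor, e.g. wt regressed on the remaining predictor(s)
r2_wt <- summary(lm(wt ~ hp, data = mtcars))$r.squared
1 / (1 - r2_wt)   # VIF for wt; should match car::vif(model)["wt"]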
Multicollinearity
Potential Solutions

• Remove some of the highly correlated independent variables.
• Linearly combine the independent variables, such as adding them together.
• Compounding
i. Representation
ii. Indexing
