W7 - Assumptions
Introduction
All the tests we have done so far (weeks 2-4), such as the one-sample t-test, the two-sample
t-test and the linear regression t-test, are so-called parametric tests. When doing research, you
should check some assumptions before performing these tests (week 7). If the assumptions are not
met, you should use a non-parametric test instead (week 8). Below is a summary of the
assumptions; after that we go through them in detail using the assignments.
When doing regression analysis [lm(y ~ x, …) in RStudio] we should check 4 main assumptions.
The assignments focus on just 2 of them, so we will only cover the first two here.
1. Normality (of the residuals):
H0: Normal distribution vs HA: Not normal
If p-value < 0.05 we reject H0: NO normal distribution
If p-value > 0.05 we do not reject H0: YES, we treat the distribution as normal
2. Equal variance:
H0: Equal variance vs HA: Not equal variance
If p-value < 0.05 we reject H0: NO equal variance
If p-value > 0.05 we do not reject H0: YES, we treat the variance as equal
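The decision rule in the summary above can be sketched in a few lines of R. This is only an illustration: the data are simulated (not the assignment data), and decide() is a hypothetical helper written for this handout, not a function from any package.

```r
# Simulated data (assumption: this is NOT the assignment data)
set.seed(1)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)   # errors drawn from a normal distribution
model <- lm(y ~ x)

# Hypothetical helper: turns a p-value into the decision from the summary
decide <- function(p_value, alpha = 0.05) {
  if (p_value < alpha) "reject H0" else "do not reject H0"
}

# Shapiro-Wilk test on the residuals (H0: normal distribution)
p_normal <- shapiro.test(resid(model))$p.value
decide(p_normal)
```

Note that "do not reject H0" is not proof that the residuals are normal; it only means the test found no strong evidence against normality.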
1. Normality
- Histograms: A “good” histogram should be unimodal (only one peak) and bell shaped. The
histogram on the right is right skewed (the tail is on the right).
- Normal QQ plots: The first plot shows a mostly normal distribution, as the dots (residuals)
follow the straight line except for low values. The second plot, however, clearly shows
strong deviations from the straight line, suggesting a strong violation of the normal
distribution.
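To see what "good" and "bad" versions of these plots look like, one option is to simulate residuals yourself. The sketch below uses made-up data: one model with normal errors and one with right-skewed (exponential) errors. All variable names are illustrative assumptions.

```r
# Simulated data (assumption: for illustration only)
set.seed(2)
x <- runif(200)
y_normal <- 1 + 2 * x + rnorm(200)   # normal errors
y_skewed <- 1 + 2 * x + rexp(200)    # right-skewed errors

res_normal <- resid(lm(y_normal ~ x))
res_skewed <- resid(lm(y_skewed ~ x))

# Top row: histograms, bottom row: normal QQ plots
par(mfrow = c(2, 2))
hist(res_normal, main = "Normal errors")
hist(res_skewed, main = "Right-skewed errors")
qqnorm(res_normal, main = "QQ plot: normal errors")
qqline(res_normal)                   # reference line the dots should follow
qqnorm(res_skewed, main = "QQ plot: right-skewed errors")
qqline(res_skewed)
par(mfrow = c(1, 1))                 # reset the plotting layout
```

With skewed errors, the histogram should show a long right tail and the dots in the QQ plot should bend away from the line at the high values.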
2. Equal variance:
- Residual plots: The residuals (dots) should be spread evenly around the blue line along its
whole length; you should not see any pattern.
o 1st plot: good, there is equal variance in the residuals.
o 2nd and 3rd plots: not good, the equal variance assumption is violated.
- Boxplots: The spread of the residuals should be similar between the groups. In the first
example it is fine; in the second it is not.
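The two kinds of plots described above can be reproduced with simulated data. In this sketch (all names and numbers are assumptions chosen for illustration), the second model has errors whose spread grows with x, and the third group in the boxplot example has a larger spread than the other two.

```r
# Simulated data (assumption: for illustration only)
set.seed(3)
x <- runif(200, 1, 10)
y_equal   <- 1 + 2 * x + rnorm(200, sd = 1)        # constant spread
y_unequal <- 1 + 2 * x + rnorm(200, sd = 0.5 * x)  # spread grows with x
m_equal   <- lm(y_equal ~ x)
m_unequal <- lm(y_unequal ~ x)

# Residuals against fitted values, with a horizontal reference line at 0
par(mfrow = c(1, 2))
plot(fitted(m_equal), resid(m_equal), main = "Equal variance")
abline(h = 0, col = "blue")
plot(fitted(m_unequal), resid(m_unequal), main = "Unequal variance")
abline(h = 0, col = "blue")
par(mfrow = c(1, 1))

# Grouped predictor: compare the spread of the residuals per group
group <- factor(rep(c("a", "b", "c"), each = 50))
score <- rnorm(150, mean = 5, sd = rep(c(1, 1, 3), each = 50))
m_group <- lm(score ~ group)
boxplot(resid(m_group) ~ group, main = "Residuals per group")
```

In the "unequal variance" plot the dots should fan out to the right (a funnel shape), and in the boxplot group c should have a visibly taller box and longer whiskers.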
Example with Assignment 560:
Context: Suppose you are interested in the relationship between crime and punishment: you expect
that crime is strongly related to the level of punishment, and that other factors do not play a role
(the severity of the crime is positively associated with the level of punishment).
1st: Creating the model: model1 <- data560 %>% lm(punish ~ crime, data = .)
- For normality:
o hist(data560$res) (res must hold the model residuals, e.g. data560$res <- resid(model1))
o plot(model1,2)
- For equal variance:
o plot(model1,1)
o plot(model1,3)
Interpretation: The normal QQ plot shows quite a lot of deviations from the straight dotted line,
which indicates a violation of the normal distribution. From the 2nd plot the equal variance seems
mostly fine. However, the 3rd plot shows that the equal variance assumption is not met.
4th: Checking the assumptions statistically
Shapiro-Wilk normality test
data: data560$res
W = 0.96967, p-value = 2.87e-06 -> p-value < 0.05, so we reject H0: no normal distribution
studentized Breusch-Pagan test
data: model1
BP = 7.1701, df = 1, p-value = 0.007413 -> p-value < 0.05, so we reject H0: no equal variance
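The two tests above can be run along the following lines. This is a sketch under assumptions: the data frame sim is simulated as a stand-in for data560 (which is not reproduced here), and bptest() is taken from the lmtest package, which is presumably what produced the Breusch-Pagan output above and needs to be installed.

```r
library(lmtest)  # provides bptest(); assumed to be installed

# Simulated stand-in for the assignment data (assumption: NOT data560)
set.seed(560)
sim <- data.frame(crime = runif(150, 1, 10))
sim$punish <- 2 + 1.5 * sim$crime + rnorm(150, sd = 0.4 * sim$crime)

model_sim <- lm(punish ~ crime, data = sim)
sim$res <- resid(model_sim)   # store the residuals for the normality check

# Normality: Shapiro-Wilk test on the residuals (H0: normal distribution)
shapiro.test(sim$res)

# Equal variance: studentized Breusch-Pagan test on the model (H0: equal variance)
bp <- bptest(model_sim)
bp
```

Read both p-values with the decision rule from the summary: below 0.05 means the assumption is rejected.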
Example with Assignment 561:
Context: Suppose we study students, measure their reading abilities (on a scale from 1 to 10) at t1,
and do an experiment with a control group and two different treatments assumed to affect their
‘ability to read’. You first want to understand differences in reading ability at t1. In the first part of
the assignment, you suspect that both income of parents and whether parents read at home to
their children contribute to the reading abilities of the children.
Output:
Shapiro-Wilk normality test
data: data561$res1
W = 0.93781, p-value = 0.004326
Hypothesis:
Output:
studentized Breusch-Pagan test
data: model1
BP = 23.053, df = 2, p-value = 9.865e-06
Hypothesis + Interpretation:
Context: In order to improve reading abilities, a school starts experimenting with different teaching
methods. The school randomly assigns children to three groups. One group (the control group) gets
an extra hour of class reading by the teacher. The two other groups are approached in a more
personalized way: one gets a reading app in which famous actors read children's stories, the other
gets a volunteer reading to them. Suppose you want to study the effect of one of the three
conditions (one of which is a control group).
Output:
Shapiro-Wilk normality test
data: data561$res2
W = 0.90059, p-value = 0.0001382
Hypothesis + Interpretation:
Output:
Levene's Test for Homogeneity of Variance (center = median)
      Df F value   Pr(>F)
group  2  8.1055 0.000798 ***
      57
Hypothesis:
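When the predictor is a grouping variable, as in this experiment, equal variance is checked with Levene's test. The sketch below assumes the car package is installed and uses simulated data (three groups of 20 children, matching the 2 and 57 degrees of freedom in the output above). The group labels and scores are invented for illustration and are not data561.

```r
library(car)  # provides leveneTest(); assumed to be installed

# Simulated stand-in for the experiment (assumption: NOT data561)
set.seed(561)
sim <- data.frame(
  group = factor(rep(c("control", "app", "volunteer"), each = 20)),
  score = rnorm(60, mean = 6, sd = rep(c(0.5, 1, 2), each = 20))
)

# Levene's test (H0: equal variance across the groups); center = median is the default
lev <- leveneTest(score ~ group, data = sim)
lev
```

The first row of the table gives the test for the grouping variable; its Pr(>F) value is the p-value to compare with 0.05.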