
W7 – Assumptions

Introduction

All the tests we have been doing so far (weeks 2-4), the one-sample t-test, the two-sample t-test, the t-tests in linear regression, etc., are so-called parametric tests. When doing research, if you want to perform these tests you should first check some assumptions (week 7). If the assumptions are not met, you should use a non-parametric test instead (week 8). Below is a summary of the assumptions; afterwards we go into detail using the assignments.

When doing regression analysis [lm(y ~ x, … ) in RStudio] we should check 4 main assumptions. Given the assignments, you only focus on 2 of them, so we will concentrate on the first two:

- 1st : Normal distribution of the residuals

- 2nd : Equal variance of the residuals, also called homoscedasticity (a violation is heteroscedasticity)
- 3rd : Linearity and 4th : Independence

These assumptions can be checked in three ways:

- Statistically : using tests and analyzing their p-values

- Graphically : using plots
- Using descriptive statistics (variance, standard deviation, mean, median, etc.)

I – Checking the assumptions Statistically:

1. Normality:
H0: Normal distribution vs HA: Not normal
If the p-value < 0.05 we reject H0: NO normal distribution
If the p-value > 0.05 we do not reject H0: YES normal distribution (normality is plausible)

Example and Interpretation:


Shapiro-Wilk normality test
data: data560$res2
W = 0.96348, p-value = 7.321e-07 <- P-Value <0.05, not normal distribution
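A minimal sketch of how this output is produced (assuming data560 is already loaded and res2 contains the residuals of a fitted model, as in the example above):

# Shapiro-Wilk test on the residuals (res2 is assumed to exist in data560)
shapiro.test(data560$res2)
# H0: normal distribution; here the p-value < 0.05, so we reject normality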

2. Equal variance:
H0: Equal variance vs HA: Not equal variance
If the p-value < 0.05 we reject H0: NO equal variance
If the p-value > 0.05 we do not reject H0: YES equal variance (equal variance is plausible)

Example and Interpretation:


studentized Breusch-Pagan test
data: model1
BP = 6.053, df = 3, p-value = 0.154 -> P-Value > 0.05, yes equal variance

Levene's Test for Homogeneity of Variance (center = median)
       Df  F value   Pr(>F)
group   2  12.1055  0.00798   -> p-value < 0.05, no equal variance
       42
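A minimal sketch of how these two tests are called in R (model1 refers to the Breusch-Pagan example above; model_name is a placeholder for a model with a categorical independent variable; bptest() comes from the lmtest package and leveneTest() from the car package):

library(lmtest)   # for bptest()
library(car)      # for leveneTest()

bptest(model1)           # Breusch-Pagan: when the independent variable is scale
leveneTest(model_name)   # Levene: when the independent variable is categorical (groups)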
II – Checking the assumptions Graphically

1. Normality

- Histograms: A “good” histogram should be unimodal (only one peak) and bell-shaped. In the right-hand example the histogram is right-skewed (the tail is on the right).

- Normal QQ plots: The first plot mostly shows a normal distribution, as the dots (residuals) follow the straight line except for low values. The second plot, however, clearly shows strong deviations from the straight line, suggesting a strong violation of normality.
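A minimal sketch of these normality plots (model_name and res are placeholder names for the fitted model and its residuals):

hist(res)            # histogram of the residuals: should be unimodal and bell-shaped
plot(model_name, 2)  # normal QQ plot: the dots should follow the straight line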
2. Equal variance:

- Residual plots : The residuals (dots) should be spread equally around the (blue) smoothing line; you shouldn’t see any pattern.
o 1st plot: Good, there is equal variance in the residuals.
o 2nd and 3rd plots: Not good, there is a violation of equal variance.

- Boxplots: The spread of the residuals should be similar between the groups; in the first example it is fine, in the second it is not!
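A minimal sketch of these equal-variance plots (model_name is a placeholder for the fitted model; res and group are placeholder names for the residuals and the grouping variable):

plot(model_name, 1)    # residuals vs fitted: even spread around the line, no pattern
plot(model_name, 3)    # scale-location: the spread should stay roughly constant
boxplot(res ~ group)   # spread of the residuals per group should be similar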
Example with Assignment 560:

Context: Suppose you are interested in the relationship between crime and punishment: you expect that crime is strongly related to the level of punishment, and that other factors do not play a role (the severity of the crime is positively associated with the level of punishment).

- Y = Punishment (dependent variable)


- X = Crime (independent variable)

Steps to check the assumptions in R:

- 1st : Make your model using lm


- 2nd : Add residuals and predicted values
- 3rd : Graphically check the assumptions
- 4th : Check them also using the statistical tests

The R code is given below, together with the interpretation of each output.

1st : Creating the model: model1 <- data560 %>% lm(punish ~ crime, data = .)

2nd : Add residuals and predicted values to the data frame (so they can be used below):

- data560$res <- model1$residuals

- data560$pred <- model1$fitted.values

3rd : Assumptions graphically:

- For normality:
o hist(data560$res)
o plot(model1,2)
- For equal variance:
o plot(model1,1)
o plot(model1,3)

Outputs (respectively): plot(model1,2) ; plot(model1,1) ; plot(model1,3)
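A minimal sketch to reproduce these three diagnostic plots in one window (assuming model1 is the model fitted above):

par(mfrow = c(1, 3))   # three plots side by side
plot(model1, 2)        # normal QQ plot (normality)
plot(model1, 1)        # residuals vs fitted (equal variance)
plot(model1, 3)        # scale-location (equal variance)
par(mfrow = c(1, 1))   # reset the plotting layout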

Interpretation: From the normal QQ plot we can see quite a lot of deviations from the straight dotted line; this is a violation of the normal distribution. From the 2nd plot it seems that the equal-variance assumption is mostly fine. However, from the 3rd plot we can see that equal variance is not met.
4th : Assumptions statistically

- For normality: shapiro.test(data560$res)

- For equal variance: bptest(model1)
- *Remember to load the libraries first (lmtest for bptest). Note that I chose Breusch-Pagan because my independent variable, crime, is a scale variable.
Shapiro-Wilk normality test

data: data560$res
W = 0.96967, p-value = 2.87e-06 -> P-Value <0.05 so not normal distribution

studentized Breusch-Pagan test

data: model1
BP = 7.1701, df = 1, p-value = 0.007413 -> P-value <0.05 so not equal variance

Steps in more detail: what did we do in R?

1st: Make your model with lm

- At this point you can already:


o Check R squared for quality of the model
o You can check the significance of the variables

2nd : Add residuals and predicted values

- Why? -> In order to check the assumptions


o Equal variance
o Normality

3rd : Graphically check the assumptions

- For equal variance: 2 options :


o 1st : use ggplot formula (See assignment)
 Residuals (Y) against predicted (X)
 Residuals (Y) against each X(independent variable)
o 2nd : use the plot() function: the “easy method”
 plot(model_name,1) : for equal variance
 plot(model_name,2) : for normal distribution
- For normality:
o 1st : plot(model_name,2)
o 2nd : hist(residuals)

4th : Test assumptions with statistics

- For normal distribution: use the Shapiro-Wilk test

- For equal variance, it depends: Levene or Breusch-Pagan?
o Breusch-Pagan if the independent variable is scale.
o Levene if the independent variable is categorical.
- If at least one independent variable is scale we choose Breusch-Pagan (even if the other independent variable is categorical)
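Putting the four steps together, a minimal end-to-end sketch for the Assignment 560 model (names follow the example above; the lmtest package is assumed to be installed for bptest()):

library(lmtest)

# 1st: fit the model
model1 <- lm(punish ~ crime, data = data560)
summary(model1)   # R squared and significance of the variables

# 2nd: add residuals and predicted values to the data
data560$res  <- model1$residuals
data560$pred <- model1$fitted.values

# 3rd: check the assumptions graphically
hist(data560$res)   # normality: histogram of the residuals
plot(model1, 2)     # normality: normal QQ plot
plot(model1, 1)     # equal variance: residuals vs fitted
plot(model1, 3)     # equal variance: scale-location

# 4th: check the assumptions statistically
shapiro.test(data560$res)   # H0: normal distribution
bptest(model1)              # H0: equal variance (Breusch-Pagan, scale independent variable)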

Assignment 561 : First part (from question 1 to question 8)

Context: Suppose we study students, measure their reading abilities (on a scale from 1 to 10) at t1,
and do an experiment with a control group and two different treatments assumed to affect their
‘ability to read’. You first want to understand differences in reading ability at t1. In the first part of
the assignment, you suspect that both income of parents and whether parents read at home to
their children contribute to the reading abilities of the children.

- Model: Y(Reading) = b0 + b1(income) + b2(parents_read_home), where:

o Income is a scale variable
o Read at home is a dummy variable
(A sketch of the corresponding R code is given below.)
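A minimal sketch of this model in R (the column names reading, income and parents_read_home are assumptions, since the actual names come from the assignment data; bptest() is from the lmtest package):

library(lmtest)   # for bptest()

# Fit the model for the first part (hypothetical column names)
model1 <- lm(reading ~ income + parents_read_home, data = data561)

# Store the residuals and check the assumptions
data561$res1 <- model1$residuals
shapiro.test(data561$res1)   # normality of the residuals
bptest(model1)               # equal variance (Breusch-Pagan, because income is scale)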

Checking assumptions Graphically: See Assignment Answers

Checking assumptions Statistically:

Codes for Normality: shapiro.test(data561$res1)

Output:
Shapiro-Wilk normality test
data: data561$res1
W = 0.93781, p-value = 0.004326

Hypothesis:

- H0: Normal distribution


- HA: Not normal distribution
- Interpretation: P-Value <0.05 so we reject normality

Codes for Equal variance: bptest(model1)

-> Remember: we use Breusch-Pagan because income is a scale variable

Output:
studentized Breusch-Pagan test
data: model1
BP = 23.053, df = 2, p-value = 9.865e-06

Hypothesis + Interpretation:

- H0: Equal variance


- H1: Not equal variance
- P-Value <0.05 so we reject equal variance
Assignment 561 : Second part: The experiment (from question 9 to question 14)

Context: In order to improve reading abilities, a school starts experimenting with different teaching methods. The school randomly assigns children to three groups. One group (the control group) gets an extra hour of class reading by the teacher. The two other groups are approached in a more personalized way: one gets a reading app in which famous actors read children's stories, the other gets a volunteer reading to them. Suppose you want to study the effect of the three conditions (one of which is the control group).

- Model: Y(Reading) = b0 + b1(teaching_method), where:

o Teaching method is a categorical variable
(A sketch of the corresponding R code is given below.)
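A minimal sketch of this model in R (reading and teaching_method are assumed column names; teaching_method should be a factor so that leveneTest() from the car package compares the three groups):

library(car)   # for leveneTest()

# Fit the model for the second part (hypothetical column names)
data561$teaching_method <- as.factor(data561$teaching_method)
model2 <- lm(reading ~ teaching_method, data = data561)

# Store the residuals and check the assumptions
data561$res2 <- model2$residuals
shapiro.test(data561$res2)   # normality of the residuals
leveneTest(model2)           # equal variance across the three groups (Levene)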

Checking assumptions Graphically: See Assignment Answers

Checking assumptions Statistically:

Codes for Normality: shapiro.test(data561$res2)

Output:
Shapiro-Wilk normality test
data: data561$res2
W = 0.90059, p-value = 0.0001382

Hypothesis + Interpretation:

- H0: Normal distribution


- HA: Not normal distribution.
- Interpretation: P-Value <0.05 so we reject normality

Codes for Equal variance: leveneTest(model2)

-> Remember: we use Levene because teaching method is a nominal variable with multiple groups

Output:
Levene's Test for Homogeneity of Variance (center = median)
       Df  F value    Pr(>F)
group   2   8.1055  0.000798 ***
       57

Hypothesis:

- H0: Equal variance


- H1: Not equal variance
- P-Value <0.05 so we reject equal variance
Summary Assumptions

Normality

- Graphs:
o Histogram of the residuals -> hist(data_name$res)
o Normal QQ plot -> plot(model_name,2)
- Test: Shapiro-Wilk
o H0: There is normal distribution vs HA: There is not
- Descriptives:
o Skewness should be between -0.5 and +0.5 for normally distributed data
o Kurtosis should be between -3 and +3 for normally distributed data

Equal variance

- Graphs: residual analysis (residuals against predicted values, and against each independent variable)
o ggplot() formula
o plot(model_name,1) and plot(model_name,3)
- Test: Levene or Breusch-Pagan
o H0: There is equal variance vs HA: There is not
o Levene if my independent variable is categorical (if I have multiple groups)
o Breusch-Pagan if at least 1 independent variable is scale
- Descriptives: standard deviations
o Rule of thumb: divide the largest standard deviation by the smallest
o If this ratio is < 2: equal variance assumed; if > 2: not equal variance
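A minimal sketch of the descriptive checks (skewness() and kurtosis() here come from the e1071 package, which is one option among several; res and group are placeholder column names):

library(e1071)   # provides skewness() and kurtosis()

# Normality: skewness and kurtosis of the residuals
skewness(data560$res)   # roughly between -0.5 and +0.5 suggests normality
kurtosis(data560$res)   # roughly between -3 and +3 suggests normality

# Equal variance: rule of thumb on the group standard deviations
sds <- tapply(data560$res, data560$group, sd)   # SD of the residuals per group
max(sds) / min(sds)                             # < 2: equal variance assumed; > 2: not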
