Class 13
Nonparametric Statistics
I. Parametric or nonparametric test?
II. One-sample and paired data (analog of one-sample t-test)
III. Two independent samples (analog of two-sample t-test)
IV. Comparing more than two groups (analog of one-way ANOVA)
I. Parametric or Nonparametric Test?
Background
• We have discussed several statistical tests, including
the one-sample (or paired) and two-sample t-tests, and
ANOVA, that make assumptions about the distribution
of the data (i.e. normally distributed data).
• These methods are called parametric because they are
based on distributions that are defined by parameters.
• For example, a normal distribution is defined by two
parameters: the mean μ and the standard deviation σ.
If the assumptions needed to use these distributions are
violated then two general approaches can be used:
1. Transform the data (y and/or x variables) so that the
distribution appears more normal. Log and square-root
transformations are common. The transformed data are then
analyzed using parametric methods that assume normality.
2. Use nonparametric methods. Nonparametric methods
make fewer and more generic assumptions about the
distribution of the data.
In this lecture, we will discuss some commonly used
nonparametric tests.
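As a rough illustration of the first approach, the R sketch below builds a deliberately right-skewed, made-up sample (the vector x is hypothetical, not course data) and compares the Shapiro-Wilk p-value before and after a log transformation.

```r
# Hypothetical right-skewed data: evenly spaced quantiles of a log-normal
# distribution (an assumption for illustration, not course data).
x <- qlnorm(ppoints(50), meanlog = 0, sdlog = 1)

# Shapiro-Wilk test of normality on the raw scale and on the log scale
p_raw <- shapiro.test(x)$p.value
p_log <- shapiro.test(log(x))$p.value

# A small p-value suggests a departure from normality
c(raw = p_raw, log = p_log)
```

Here the raw values look clearly non-normal while the log-transformed values do not, so a parametric method on the log scale would be one reasonable choice.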
NOTE: The Shapiro-Wilk (S-W) test can be overly sensitive with a
large sample and underpowered with a small sample. It may
therefore reject the null hypothesis of normality when a
parametric test is still appropriate, or fail to reject it when a
nonparametric test is needed.
Overview
             Parametric          Nonparametric
One sample   One-sample t-test   1. Wilcoxon’s signed rank test
II. One-sample and Paired Data
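hist(diff)          # histogram of diff (the differences)
qqnorm(diff)        # normal Q-Q plot of diff
qqline(diff)        # reference line for the Q-Q plot
shapiro.test(diff)  # Shapiro-Wilk test of normality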
Diet Example
The main reason to use a nonparametric test here is the
small sample size (n = 10).
In R
• One sample
wilcox.test(x, mu = 0)
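A minimal sketch of the signed rank test in R, assuming small made-up before/after measurements (the vectors before and after are hypothetical). It shows that the paired form gives the same result as the one-sample form applied to the differences.

```r
# Hypothetical paired measurements on 10 subjects (illustration only)
before <- c(82, 75, 90, 68, 77, 85, 93, 71, 88, 79)
after  <- c(80, 76, 85, 64, 70, 82, 84, 77, 78, 68)

# One-sample form: test whether the differences are centered at 0
d <- after - before
res_one <- wilcox.test(d, mu = 0)

# Paired form: same test, written with both samples
res_paired <- wilcox.test(after, before, paired = TRUE)

# V is the sum of the ranks of the positive differences
res_one$statistic
res_one$p.value
```

Because the differences have no ties and no zeros, R can compute an exact p-value for this sample.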
III. Two Independent Samples
Note: to interpret the Wilcoxon rank sum test as a test of medians,
the two population distributions from which the groups are sampled
must have the same shape.
Observed data:
A: 1 7 11 16
B: 8 10 12 15
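The observed data above can be ranked by hand or in R. The sketch below ranks the pooled values, sums the ranks in each group, and then runs the rank sum test with wilcox.test. The variable names are illustrative.

```r
# Observed data from the slide
A <- c(1, 7, 11, 16)
B <- c(8, 10, 12, 15)

# Rank all 8 observations together, then sum the ranks within each group
ranks <- rank(c(A, B))
rank_sum_A <- sum(ranks[1:4])   # ranks 1, 2, 5, 8
rank_sum_B <- sum(ranks[5:8])   # ranks 3, 4, 6, 7

# Wilcoxon rank sum test; R reports W = rank_sum_A - n_A * (n_A + 1) / 2
res <- wilcox.test(A, B)
res$statistic
res$p.value
```

The rank sums are 16 for A and 20 for B, so R reports W = 16 - 10 = 6.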
IV. Comparing More than Two Groups
Pearson correlation:
Null hypothesis (Ho) in words: There is no linear association between x and y (ρ=0).
Alternative hypothesis (Ha) in words: There is a linear association between x and y (ρ≠0).
Spearman:
Conclusion: Since the p-value of 0.0013 is less than α=0.05, we reject the null
hypothesis and conclude that there is significant evidence of a strong positive
monotone association between x and y (r = 0.916, p-value = 0.0013).
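For reference, a sketch of how a Spearman correlation and its p-value can be obtained in R. The x and y values below are made up for illustration and are not the data behind the conclusion above.

```r
# Hypothetical data: y increases with x, but not linearly
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 1.8, 4.5, 3.9, 9.7, 8.2, 30.4, 21.0)

# Spearman correlation test (rank-based, measures monotone association)
sp <- cor.test(x, y, method = "spearman")
rho <- unname(sp$estimate)

# Spearman's coefficient equals Pearson's coefficient computed on the ranks
rho_from_ranks <- cor(rank(x), rank(y))

rho
sp$p.value
```

Changing method to "pearson" in the same call gives the test of linear association.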
ANOVA in R
• Formula for one-way ANOVA: y ~ group
> fit <- aov(days~treatment, data=ds)
> summary(fit)
Conclusion: Since the p-value of ?? is greater than α=0.05, we fail
to reject Ho and conclude that there is no significant difference in
the mean number of days to heal across the treatment groups.
(Replace the response and the groups with their actual meaning in
your case.)
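The aov call above can be tried on a small made-up data set. The sketch below assumes a hypothetical data frame ds with columns days and treatment, and also runs the Kruskal-Wallis test (kruskal.test in R), the usual rank-based analog of one-way ANOVA. In this invented example the groups are well separated, so both tests reject Ho.

```r
# Hypothetical data: days to heal in three treatment groups (illustration only)
ds <- data.frame(
  days = c(5, 7, 6, 8, 9, 11, 10, 12, 14, 13, 15, 16),
  treatment = factor(rep(c("A", "B", "C"), each = 4))
)

# Parametric: one-way ANOVA
fit <- aov(days ~ treatment, data = ds)
p_anova <- summary(fit)[[1]][["Pr(>F)"]][1]

# Nonparametric analog: Kruskal-Wallis rank sum test
kw <- kruskal.test(days ~ treatment, data = ds)

p_anova
kw$statistic
kw$p.value
```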
Example
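ds <- read.csv("fev.csv")                  # load the FEV data set
dim(ds)                                    # number of rows and columns
head(ds)                                   # first few rows
summary(ds$age)                            # distribution of age
ds$smoking <- factor(ds$smoking)           # treat smoking as categorical
res <- lm(fev ~ age + smoking, data = ds)  # multivariable linear regression
summary(res)                               # coefficients, t-tests, R-squared
confint(res)                               # 95% CIs for the coefficients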
I. Multivariable Linear Regression
DF for t-statistics
Each t-statistic follows a t-distribution with n-p-1 degrees of freedom,
where n is the number of observations and p is the number of predictors.
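The degrees of freedom can be checked directly on a fitted model. This sketch uses simulated data with hypothetical names (d, x1, x2, y), not the FEV data.

```r
# Simulated data for illustration: n = 30 observations, p = 2 predictors
set.seed(1)
n <- 30
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = d)

# Number of predictors: coefficients minus the intercept
p <- length(coef(fit)) - 1

# Residual degrees of freedom used for the t-tests
df.residual(fit)
n - p - 1
```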
Prediction
• We can predict FEV from a combination of age and
smoking status.
Is it appropriate to predict FEV for a 30-year-old nonsmoker?
No, this is not reasonable. The study was conducted among
individuals aged 3–19 years, so the fitted model should not be
extrapolated beyond that age range. A 30-year-old is older than the
study's maximum age, so a prediction for this person would not be
valid.
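R will return a prediction for any age without warning, so the range check has to be done by the analyst. The sketch below uses simulated data with ages 3 to 19 (the data frame and coefficients are invented for illustration, not the real FEV data) and checks whether a new person falls inside the observed age range before predicting.

```r
# Simulated stand-in for the FEV data (hypothetical values, ages 3 to 19)
set.seed(42)
n <- 100
ds <- data.frame(
  age = sample(3:19, n, replace = TRUE),
  smoking = factor(sample(c(0, 1), n, replace = TRUE), levels = c(0, 1))
)
ds$fev <- 0.5 + 0.2 * ds$age - 0.1 * (ds$smoking == 1) + rnorm(n, sd = 0.3)

res <- lm(fev ~ age + smoking, data = ds)

# A 10-year-old nonsmoker (inside the range) and a 30-year-old (outside)
new_people <- data.frame(
  age = c(10, 30),
  smoking = factor(c(0, 0), levels = c(0, 1))
)

# Flag whether each age lies within the ages observed in the study
in_range <- new_people$age >= min(ds$age) & new_people$age <= max(ds$age)

# predict() returns a value for both rows; only the in-range one is defensible
pred <- predict(res, newdata = new_people)
data.frame(new_people, in_range, pred)
```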
4. Logistic regression
(1) How do you write R code to get multiple logistic regression results?
(2) Based on the odds ratio (OR) and the 95% CI of the OR, interpret the association
between x and y. Knowing how to interpret an OR is important!
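One possible answer to question (1), sketched on simulated data. The variable names (ca, age, trt) and the simulated effects are assumptions for illustration. The confidence intervals here are Wald intervals from confint.default, which is one of several ways to get a CI for an OR.

```r
# Simulated data (hypothetical): binary outcome ca, predictors age and trt
set.seed(7)
n <- 200
d <- data.frame(
  age = runif(n, 1, 10),
  trt = rbinom(n, 1, 0.5)
)
d$ca <- rbinom(n, 1, plogis(-0.5 - 0.07 * d$age + 0.5 * d$trt))

# Multiple logistic regression
fit <- glm(ca ~ age + trt, family = binomial, data = d)
summary(fit)

# Odds ratios and Wald 95% confidence intervals (exponentiated coefficients)
or <- exp(coef(fit))
ci <- exp(confint.default(fit))
cbind(OR = or, ci)
```

An OR whose 95% CI includes 1 indicates no statistically significant association at α=0.05.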
IV. Multivariable Logistic Regression
Interpretation of results?
• Or we can say: the odds ratio is 0.93. Each one-year increase in age is
associated with a 7% decrease in the odds of having a coronary
abnormality, after adjusting for treatment group. However, this
association is not statistically significant.
What is the odds ratio for developing a coronary abnormality (CA)
associated with a 5-year increase in age?
OR = e^(-0.07 × 5) = e^(-0.35) ≈ 0.70
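The same arithmetic in R, using the age coefficient of -0.07 from the slide (the variable names are illustrative):

```r
# Log-odds coefficient for age, taken from the slide
beta_age <- -0.07

# Odds ratio for a 1-year and for a 5-year increase in age
or_1yr <- exp(beta_age)
or_5yr <- exp(beta_age * 5)

round(c(one_year = or_1yr, five_years = or_5yr), 2)
```

Note that the 5-year OR equals the 1-year OR raised to the fifth power, not five times the 1-year OR.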
II. Effect Modification and Interaction
Interaction
• Interaction means that the effect of a predictor x1
on the outcome differs according to the level of
another predictor x2
HERS Example
• Does the effect of hormone therapy (predictor) on
LDL (outcome) cholesterol differ according to baseline
statin use?
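A question like this is usually examined by adding an interaction term to the regression model. The sketch below uses simulated data with hypothetical variable names (ldl, ht, statin) and invented effect sizes. It is not the HERS data or the course's actual model.

```r
# Simulated data (hypothetical): ht = hormone therapy (0/1), statin = statin use (0/1)
set.seed(3)
n <- 400
d <- data.frame(
  ht = rbinom(n, 1, 0.5),
  statin = rbinom(n, 1, 0.4)
)
d$ldl <- 145 - 15 * d$ht - 20 * d$statin + 8 * d$ht * d$statin + rnorm(n, sd = 10)

# ht * statin expands to ht + statin + ht:statin (main effects plus interaction)
fit <- lm(ldl ~ ht * statin, data = d)
b <- coef(fit)

# Estimated effect of hormone therapy within each statin stratum
effect_no_statin <- unname(b["ht"])
effect_statin <- unname(b["ht"] + b["ht:statin"])

summary(fit)$coefficients
c(no_statin = effect_no_statin, statin = effect_statin)
```

The test of the ht:statin coefficient addresses whether the effect of hormone therapy differs by statin use.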