Week 7 and Week 8

Uploaded by

g-sk5103tmp05
R PROGRAMMING

R
• To install R: go to http://cran.r-project.org/
• Documentation on R: http://cran.r-project.org/doc/manuals
• Based on the S language
• Rcmdr package
  • Combines R with an easy-to-use drop-down menu
  • Provides the lines of code that correspond to the analyses carried out
  • Install once with install.packages("Rcmdr"), then launch with library(Rcmdr)
Import Data
• To import data from the file myfile.csv and place it in an object dat:
dat <- read.table("myfile.csv", sep=",", header=TRUE, dec=".")
summary(dat)

• The commands from Weeks 5 and 6, as well as those in S+ Hands-On Instructions 2, can also be used in R.
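The import call above can be tried without any external file. The sketch below writes a tiny, made-up CSV to a temporary file and reads it back with the same arguments; the column names and values are hypothetical and only serve to illustrate sep, header and dec.

```r
# Hypothetical example data: the columns and values are illustrative only.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,weight,sex",
             "1,2.5,Female",
             "2,3.1,Male",
             "3,2.8,Female"), tmp)

# Same call pattern as on the slide: comma separator, header row, "." as decimal mark.
dat <- read.table(tmp, sep = ",", header = TRUE, dec = ".")

str(dat)      # column types as inferred by R
summary(dat)  # per-column summaries
unlink(tmp)   # remove the temporary file
```

For a file that uses ";" as separator and "," as decimal mark, the same call would take sep=";" and dec=",".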
Two-Sample t-Test
• Case Study 1: The file Octopus.csv contains the weights of fifteen male and thirteen female adult octopuses fished off the coast of Mauritania. It is of interest to compare the weights of the male and female octopuses.
octopus <- read.table("Octopus.csv", header=TRUE, sep=";")
summary(octopus)
boxplot(Weight ~ Sex, ylab="Weight", xlab="Sex", data=octopus)
tapply(octopus[,"Weight"], octopus[,"Sex"], mean, na.rm=TRUE)
Two-Sample t-Test (cont.)

# Normality check for the males (repeat for the females)
select.males <- octopus[,"Sex"]=="Male"
qqnorm(octopus[select.males, "Weight"])
qqline(octopus[select.males, "Weight"], col="grey")
shapiro.test(octopus[select.males, "Weight"])

# Test of equality of variances
var.test(Weight ~ Sex, conf.level=0.95, data=octopus)

# Welch t-test (variances not assumed equal)
t.test(Weight ~ Sex, alternative="two.sided", conf.level=0.95, var.equal=FALSE, data=octopus)

• Non-parametric tests: Wilcoxon (wilcox.test) and Kruskal-Wallis (kruskal.test)
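Since Octopus.csv may not be at hand, the workflow above can be rehearsed on simulated data. This is a minimal sketch: the group means and standard deviations are invented for illustration and are not the values in the case study; only the group sizes (15 and 13) come from the slide.

```r
set.seed(1)  # reproducible simulated data (NOT the Octopus.csv values)
sim <- data.frame(
  Weight = c(rnorm(15, mean = 2700, sd = 1200),   # hypothetical male weights
             rnorm(13, mean = 1400, sd = 500)),   # hypothetical female weights
  Sex    = factor(rep(c("Male", "Female"), times = c(15, 13)))
)

# Normality check within each group (Shapiro-Wilk p-values)
by_sex_shapiro <- tapply(sim$Weight, sim$Sex, function(w) shapiro.test(w)$p.value)

# Equality of variances (F test), then Welch t-test (var.equal = FALSE)
vt <- var.test(Weight ~ Sex, data = sim)
tt <- t.test(Weight ~ Sex, data = sim, var.equal = FALSE)

# Rank-based alternative when normality is doubtful
wt <- wilcox.test(Weight ~ Sex, data = sim)

c(shapiro = by_sex_shapiro, var.test = vt$p.value,
  t.test = tt$p.value, wilcox = wt$p.value)
```

With var.equal=FALSE the degrees of freedom reported by t.test are generally not an integer, which is a quick way to confirm that the Welch version was used.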
Power of A Test
• Case Study 2: Researchers at INRA want to detect a potential difference in the protein levels of the milk produced by two genetically different types of dairy cows. In a previous study, the standard deviation of protein levels in the milk from a herd of Normandy cows was found to be 1.7 g/kg of milk. Use the classical 𝛼 = 0.05 threshold. The aim is to have a power of 1 − 𝛽 = 80%, that is, an 80% chance of detecting a difference of 𝛿 = 1 g/kg of milk between the mean protein levels of the two populations.
Power of A Test (cont.)

# Number of individuals required per group for a power of 80%
power.t.test(delta=1, sd=1.7, sig.level=0.05, power=0.8)

# Power of the test with twenty individuals per group
power.t.test(n=20, delta=1, sd=1.7, sig.level=0.05)$power

# Difference detectable at 80% with twenty individuals per group
power.t.test(n=20, sd=1.7, sig.level=0.05, power=0.8)$delta
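As a sanity check on the first call, the per-group sample size can be approximated by the usual normal-approximation formula n ≈ 2 (z_{1-α/2} + z_{1-β})² σ² / δ². The sketch below compares that approximation (about 45.4 for these inputs) with the value returned by power.t.test, which works with the noncentral t distribution and is therefore slightly larger. The variable names are illustrative.

```r
alpha <- 0.05   # significance level
power <- 0.80   # target power, 1 - beta
sigma <- 1.7    # standard deviation (g/kg)
delta <- 1      # difference to detect (g/kg)

# Normal-approximation sample size per group for a two-sided two-sample test
n_approx <- 2 * (qnorm(1 - alpha / 2) + qnorm(power))^2 * sigma^2 / delta^2

# Value from power.t.test, as on the slide
n_exact <- power.t.test(delta = delta, sd = sigma,
                        sig.level = alpha, power = power)$n

c(approx = n_approx, exact = n_exact, per_group = ceiling(n_exact))
```

In practice the result is rounded up to the next whole animal per group.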
Power of A Test (cont.)
• Case Study 2 (cont.): Now consider three genetic populations. Again, assume that the residual standard deviation of the protein level is 1.7 g/kg of milk. The aim is to detect (at the 5% threshold, with a power of 80%) whether there really is a difference between the three populations if the true mean protein levels are 28, 30 and 31 g/kg, respectively.
x <- 1.7^2                # within-group (residual) variance
y <- var(c(28,30,31))     # variance of the three group means
power.anova.test(groups=3, between.var=y, within.var=x, power=0.80)
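The call above solves for the group size. A small sketch of the reverse check, under the same assumptions as the slide: round the returned n up to a whole number and ask power.anova.test what power that design actually achieves. The variable names are illustrative.

```r
within_var  <- 1.7^2                 # residual variance
between_var <- var(c(28, 30, 31))    # variance of the assumed group means (= 7/3)

# Solve for the number of individuals per group at 80% power
res <- power.anova.test(groups = 3, between.var = between_var,
                        within.var = within_var, power = 0.80)
n_per_group <- ceiling(res$n)

# Power actually achieved with the rounded-up group size
achieved <- power.anova.test(groups = 3, n = n_per_group,
                             between.var = between_var,
                             within.var = within_var)$power

c(n_solved = res$n, n_per_group = n_per_group, achieved_power = achieved)
```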
Multiple Linear Regression
• Case Study 3: It is of interest to analyse the relationship between the maximum daily ozone level (in μg/m³) and the temperature at different times of day, the cloud cover at different times of day, the projection of the wind on the East-West axis at different times of day, and the maximum ozone concentration of the previous day. The data were collected during the summer of 2001 and the sample size is 112.
Multiple Linear Regression (cont.)
# Reading the data
ozone <- read.table("http://www.agrocampus-ouest.fr/math/RforStat/ozone.txt", header=TRUE)
ozone.m <- ozone[,1:11]
names(ozone.m) # Try colnames(ozone.m)

# Summarising the variables
summary(ozone.m)
pairs(ozone.m)

# Estimating the parameters
reg.mul <- lm(maxO3~., data=ozone.m)
summary(reg.mul)
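If the ozone file cannot be downloaded, the same steps can be practised on R's built-in airquality data set. This is a substitute data set, not the one from the case study, so the variables and numbers differ; the sketch only mirrors the sequence of calls.

```r
# Substitute data: built-in New York air quality measurements, complete cases only
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Summarising the variables
summary(aq)

# Estimating the parameters: Ozone explained by all other columns
fit_all <- lm(Ozone ~ ., data = aq)
summary(fit_all)$coefficients   # estimates, standard errors, t values, p-values
summary(fit_all)$r.squared      # share of variance explained
```

The formula Ozone ~ . has the same meaning as maxO3 ~ . on the slide: every remaining column of the data frame enters as an explanatory variable.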
Multiple Linear Regression (cont.)
[Figure: scatterplot matrix produced by pairs(ozone.m) for maxO3, T9, T12, T15, Ne9, Ne12, Ne15, Wx9, Wx12, Wx15 and maxO3y]
Multiple Linear Regression (cont.)
Call:
lm(formula = maxO3 ~ ., data = ozone.m)

Residuals:
    Min      1Q  Median      3Q     Max
-53.566  -8.727  -0.403   7.599  39.458

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.24442   13.47190   0.909   0.3656
T9          -0.01901    1.12515  -0.017   0.9866
T12          2.22115    1.43294   1.550   0.1243
T15          0.55853    1.14464   0.488   0.6266
Ne9         -2.18909    0.93824  -2.333   0.0216 *
Ne12        -0.42102    1.36766  -0.308   0.7588
Ne15         0.18373    1.00279   0.183   0.8550
Wx9          0.94791    0.91228   1.039   0.3013
Wx12         0.03120    1.05523   0.030   0.9765
Wx15         0.41859    0.91568   0.457   0.6486
maxO3y       0.35198    0.06289   5.597 1.88e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.36 on 101 degrees of freedom
Multiple R-squared: 0.7638, Adjusted R-squared: 0.7405
F-statistic: 32.67 on 10 and 101 DF, p-value: < 2.2e-16
Multiple Linear Regression (cont.)

# Choosing the variables
library(leaps)
choice <- regsubsets(maxO3~., data=ozone.m, nbest=1, nvmax=11)
plot(choice, scale="bic")
summary(choice)$which[which.min(summary(choice)$bic),]
final.reg <- lm(maxO3~T12+Ne9+Wx9+maxO3y, data=ozone.m)
summary(final.reg)
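The slide uses regsubsets from the leaps package, which searches over subsets of variables. Where leaps is not installed, a related but different technique is available in base R: backward stepwise elimination with step(), using the penalty k = log(n) so that the criterion is BIC. The sketch below applies it to the built-in airquality data as a stand-in; it is not an exhaustive search and need not select the same model as regsubsets.

```r
# Substitute data: built-in airquality, complete cases only
aq <- na.omit(airquality[, c("Ozone", "Solar.R", "Wind", "Temp", "Month", "Day")])
n  <- nrow(aq)

full <- lm(Ozone ~ ., data = aq)

# Backward elimination; k = log(n) turns the AIC penalty into the BIC penalty
sel <- step(full, direction = "backward", k = log(n), trace = 0)

formula(sel)                              # variables retained
c(BIC_full = BIC(full), BIC_selected = BIC(sel))
```

As on the slide, the retained model is then refitted and inspected with summary().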
Multiple Linear Regression (cont.)

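[Figure: output of plot(choice, scale="bic"). Each row is a candidate model, ordered by BIC from about -140 (best) to -97 (worst); the columns mark which of (Intercept), T9, T12, T15, Ne9, Ne12, Ne15, Wx9, Wx12, Wx15 and maxO3y are included]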
Multiple Linear Regression (cont.)
Call:
lm(formula = maxO3 ~ T12 + Ne9 + Wx9 + maxO3y, data = ozone.m)

Residuals:
    Min      1Q  Median      3Q     Max
-52.396  -8.377  -1.086   7.951  40.933

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.63131   11.00088   1.148 0.253443
T12          2.76409    0.47450   5.825 6.07e-08 ***
Ne9         -2.51540    0.67585  -3.722 0.000317 ***
Wx9          1.29286    0.60218   2.147 0.034055 *
maxO3y       0.35483    0.05789   6.130 1.50e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14 on 107 degrees of freedom
Multiple R-squared: 0.7622, Adjusted R-squared: 0.7533
F-statistic: 85.75 on 4 and 107 DF, p-value: < 2.2e-16
Multiple Linear Regression (cont.)

# Conducting a residual analysis (studentized residuals)
res.m <- rstudent(final.reg)
plot(res.m, pch=15, cex=.5, ylab="Residuals")
abline(h=c(-2,0,2), lty=c(2,1,2))

# Predicting a new value (with a 95% prediction interval)
xnew <- matrix(c(19,8,2.05,70), nrow=1)
colnames(xnew) <- c("T12","Ne9","Wx9","maxO3y")
xnew <- as.data.frame(xnew)
predict(final.reg, xnew, interval="prediction")
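To see what predict(..., interval="prediction") computes, the interval can be rebuilt by hand as fit ± t × sqrt(s² + x0' V x0), where s is the residual standard error and V the estimated covariance matrix of the coefficients. The sketch below does this on the built-in airquality data; the model and the new observation (Wind = 10, Temp = 80) are hypothetical stand-ins for the ozone example.

```r
# Substitute data and model (not the ozone case study)
aq  <- na.omit(airquality[, c("Ozone", "Wind", "Temp")])
fit <- lm(Ozone ~ Wind + Temp, data = aq)

# Hypothetical new observation
xnew <- data.frame(Wind = 10, Temp = 80)
p <- predict(fit, xnew, interval = "prediction", level = 0.95)

# Same interval by hand
x0      <- c(1, 10, 80)                          # intercept, Wind, Temp
s       <- summary(fit)$sigma                    # residual standard error
se_pred <- sqrt(s^2 + drop(t(x0) %*% vcov(fit) %*% x0))
tq      <- qt(0.975, df.residual(fit))
centre  <- sum(x0 * coef(fit))
manual  <- c(fit = centre, lwr = centre - tq * se_pred, upr = centre + tq * se_pred)

rbind(predict = p[1, ], manual = manual)
```

The s² term is what makes a prediction interval wider than a confidence interval for the mean response (interval="confidence").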
Multiple Linear Regression (cont.)

[Figure: studentized residuals of final.reg against observation index (0 to 100), with horizontal reference lines at -2, 0 and 2]

       fit      lwr      upr
1 72.51437 43.80638 101.2224
One-Way ANOVA
• A statistical method to model the relationship between a qualitative explanatory variable (factor) A with I categories (levels) and a quantitative response variable Y.
• The main objective of ANOVA is to compare the empirical means of Y across the I levels of A.
• The model:
$y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, \dots, I, \quad j = 1, \dots, n_i$
or $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
where $n_i$ is the sample size of category $i$, $y_{ij}$ is observed value $j$ in subpopulation $i$, $\varepsilon_{ij}$ is the residual of the model, $\mu$ is the overall mean and $\alpha_i$ is the specific effect of category $i$.
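The second form of the model has I + 1 parameters for only I group means, so one constraint is needed to make it identifiable. The two constraints most often met in R can be written as follows (R's default, contr.treatment, corresponds to the first; contr.sum corresponds to the second):

```latex
\mu_i = \mu + \alpha_i, \qquad i = 1, \dots, I

% Treatment constraint (first level as reference)
\alpha_1 = 0 \;\Longrightarrow\; \mu = \mu_1, \qquad \alpha_i = \mu_i - \mu_1

% Sum constraint
\sum_{i=1}^{I} \alpha_i = 0 \;\Longrightarrow\; \mu = \frac{1}{I} \sum_{i=1}^{I} \mu_i, \qquad \alpha_i = \mu_i - \mu
```

Under the sum constraint, $\mu$ is the unweighted mean of the group means, which differs from the mean of all observations when the $n_i$ are unequal.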
One-Way ANOVA (cont.)
• Hypotheses of the test:
$H_0: \forall i, \ \alpha_i = 0$
$H_1: \exists i, \ \alpha_i \neq 0$

• A residual analysis is essential to check both the fit of individual observations (outliers) and the overall fit of the model.
One-Way ANOVA (cont.)
• Case Study 4: Analyse the relationship between the maximum daily ozone concentration (in μg/m³) and the wind direction classed by sector (North, South, East, West). The variable wind has $I = 4$ levels. There are 112 observations, collected during the summer of 2001 in Rennes (France).
One-Way ANOVA (cont.)
# Reading the data
# stringsAsFactors=TRUE reads wind as a factor (no longer the default since R 4.0)
ozone <- read.table("http://www.agrocampus-ouest.fr/math/RforStat/ozone.txt", header=TRUE, stringsAsFactors=TRUE)
summary(ozone[,c("maxO3","wind")])

# Representing the data
plot(maxO3~wind, data=ozone, pch=15, cex=.5)

# Analysing the significance of the factor
reg.aov1 <- lm(maxO3~wind, data=ozone)
anova(reg.aov1)
[Figure: maxO3 (axis from 40 to 160) plotted by wind sector (East, North, South, West)]
Analysis of Variance Table

Response: maxO3
           Df Sum Sq Mean Sq F value  Pr(>F)
wind        3   7586 2528.69  3.3881 0.02074 *
Residuals 108  80606  746.35
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
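The F value and p-value in the table follow directly from the sums of squares and degrees of freedom. A short sketch that recomputes them from the rounded figures printed above (so the last digits may differ slightly from R's own output):

```r
# Figures copied from the ANOVA table above (rounded sums of squares)
ss_wind <- 7586;  df_wind <- 3     # between-sector variation
ss_res  <- 80606; df_res  <- 108   # residual variation

ms_wind <- ss_wind / df_wind       # mean square for the factor
ms_res  <- ss_res / df_res         # residual mean square

F_value <- ms_wind / ms_res
p_value <- pf(F_value, df_wind, df_res, lower.tail = FALSE)

c(F = F_value, p = p_value)
```

Since the p-value is below 0.05, the hypothesis that all wind sectors share the same mean ozone level is rejected at the 5% threshold.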
One-Way ANOVA (cont.)
# Conducting a residual analysis
res.aov1 <- rstudent(reg.aov1)
library(lattice)
# Panel function: points plus reference lines at -2, 0 and 2
mypanel <- function(...){
  panel.xyplot(...)
  panel.abline(h=c(-2,0,2), lty=c(3,2,3), ...)
}
trellis.par.set(list(fontsize=list(point=5, text=8)))
xyplot(res.aov1~I(1:112)|wind, data=ozone, pch=20, ylim=c(-3,3), panel=mypanel, ylab="Residuals", xlab="")
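The reference lines at ±2 are used because, if the model holds, roughly 95% of studentized residuals are expected to fall between them. A minimal sketch of the same check without graphics, on the built-in PlantGrowth data (a stand-in for the ozone data, with a three-level factor group):

```r
# Substitute data: plant weights under a control and two treatments
fit <- lm(weight ~ group, data = PlantGrowth)
res <- rstudent(fit)                     # studentized residuals

# Observations outside the +/- 2 band, overall and per level of the factor
flagged <- which(abs(res) > 2)
outside_by_group <- table(PlantGrowth$group, abs(res) > 2)

flagged
outside_by_group
mean(abs(res) > 2)                       # share outside the band
```

A group with many flagged points, or a clearly larger spread than the others, would cast doubt on the constant-variance assumption.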
One-Way ANOVA (cont.)
[Figure: studentized residuals against observation index (0 to 100) in four panels, one per wind sector (East, North, South, West)]
One-Way ANOVA (cont.)
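# Interpreting the coefficients
summary(reg.aov1)                                  # default: first level (East) as reference
summary(lm(maxO3~C(wind,base=2), data=ozone))      # second level (North) as reference
summary(lm(maxO3~C(wind,sum), data=ozone))         # sum-to-zero constraint

# Same sum-to-zero fit, obtained by changing the contrasts for the whole session
options(contrasts = c("contr.sum", "contr.sum"))
summary(lm(maxO3~wind, data=ozone))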
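How the coefficients relate to the group means under each coding can be checked numerically. The sketch below uses the built-in PlantGrowth data (factor group with levels ctrl, trt1, trt2) as a stand-in for the ozone data and avoids options(), so the session's contrast settings are left untouched.

```r
means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)  # empirical group means

# Treatment contrasts (R's default): intercept = mean of the first level
fit_trt <- lm(weight ~ group, data = PlantGrowth)

# Sum contrasts: intercept = unweighted mean of the group means
fit_sum <- lm(weight ~ C(group, sum), data = PlantGrowth)

# Effect of the last level is not printed; it is minus the sum of the others
alpha_last <- -sum(coef(fit_sum)[-1])

rbind(treatment = coef(fit_trt), sum = coef(fit_sum))
c(mean_ctrl = unname(means["ctrl"]), mean_of_means = mean(means),
  alpha_trt2 = alpha_last)
```

Both codings give the same fitted values and the same F test; only the interpretation of the individual coefficients changes, which is why the residuals and the F-statistic are identical across the outputs that follow.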
One-Way ANOVA (cont.)
Call:
lm(formula = maxO3 ~ wind, data = ozone)

Residuals:
    Min      1Q  Median      3Q     Max
-60.600 -16.807  -7.365  11.478  81.300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  105.600      8.639  12.223   <2e-16 ***
windNorth    -19.471      9.935  -1.960   0.0526 .
windSouth     -3.076     10.496  -0.293   0.7700
windWest     -20.900      9.464  -2.208   0.0293 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.32 on 108 degrees of freedom
Multiple R-squared: 0.08602, Adjusted R-squared: 0.06063
F-statistic: 3.388 on 3 and 108 DF, p-value: 0.02074
One-Way ANOVA (cont.)
Call:
lm(formula = maxO3 ~ C(wind, base = 2), data = ozone)

Residuals:
    Min      1Q  Median      3Q     Max
-60.600 -16.807  -7.365  11.478  81.300

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)          86.129      4.907  17.553   <2e-16 ***
C(wind, base = 2)1   19.471      9.935   1.960   0.0526 .
C(wind, base = 2)3   16.395      7.721   2.123   0.0360 *
C(wind, base = 2)4   -1.429      6.245  -0.229   0.8194
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.32 on 108 degrees of freedom
Multiple R-squared: 0.08602, Adjusted R-squared: 0.06063
F-statistic: 3.388 on 3 and 108 DF, p-value: 0.02074
One-Way ANOVA (cont.)
Call:
lm(formula = maxO3 ~ C(wind, sum), data = ozone)

Residuals:
    Min      1Q  Median      3Q     Max
-60.600 -16.807  -7.365  11.478  81.300

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     94.738      3.053  31.027   <2e-16 ***
C(wind, sum)1   10.862      6.829   1.590   0.1147
C(wind, sum)2   -8.609      4.622  -1.863   0.0652 .
C(wind, sum)3    7.786      5.205   1.496   0.1376
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.32 on 108 degrees of freedom
Multiple R-squared: 0.08602, Adjusted R-squared: 0.06063
F-statistic: 3.388 on 3 and 108 DF, p-value: 0.02074
One-Way ANOVA (cont.)
Call:
lm(formula = maxO3 ~ wind, data = ozone)

Residuals:
    Min      1Q  Median      3Q     Max
-60.600 -16.807  -7.365  11.478  81.300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   94.738      3.053  31.027   <2e-16 ***
wind1         10.862      6.829   1.590   0.1147
wind2         -8.609      4.622  -1.863   0.0652 .
wind3          7.786      5.205   1.496   0.1376
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 27.32 on 108 degrees of freedom
Multiple R-squared: 0.08602, Adjusted R-squared: 0.06063
F-statistic: 3.388 on 3 and 108 DF, p-value: 0.02074
