Categorical Data Courses
Mariangela Sciandra
Section 1: Dealing with categorical data
Basic R data types
Categorical data: factor vs character data type
Categorical data can be defined in R both as character and as factor.
A character object is a sequence of characters (letters, numbers, etc.) representing labels;
A factor is the data type designed for categorical data. Factors are also called categories or enumerated types.
Think of a factor as a set of category names. Factors are qualitative classifications of objects. Categories do not imply order: a black snake is different from a brown snake, but it is neither larger nor smaller.
The set of values that the elements of a factor can take is called its levels.
Examples of categorical data are:
• species
• color of flowers
Note that when a character variable enters a model, R converts it to a factor, with a warning like:
warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'character_x' converted to a factor
How to Create a Factor in R: the factor() function
More arguments:
• labels:
– allows you to define the labels for the levels
Example:
data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)
fdata
levels(fdata)
fdata1 = factor(data,labels=c("I","II","III"))
fdata1
levels(fdata) = c("I","II","III")
datanew<-unclass(fdata)
datanew
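A related pitfall worth a small sketch (not in the original notes, but a fact about factors): converting a factor whose labels look numeric back to numbers must go through as.character(), otherwise you get the internal level codes:
f <- factor(c(10,20,20,10))
as.numeric(f) # level codes: 1 2 2 1
as.numeric(as.character(f)) # original values: 10 20 20 10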
Ordered factors
In order to create ordered factors in R, the argument
ordered has to be specified.
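A minimal sketch (the variable and its levels are made up):
grades <- factor(c("low","high","medium","low"),
levels=c("low","medium","high"), ordered=TRUE)
grades
grades < "high" # comparisons are meaningful for ordered factors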
The gl() function
gl(n, k, length, labels) generates factors by specifying the pattern of their levels, where n is the number of levels, k is the number of replications of each level, length is the desired length of the result, and labels is an optional vector of names for the levels.
Examples
degree<-gl(6,2,12)
degree
degree<-gl(6,2,12, labels=c("ness", "elem", "media",
"super","laurea", "speci"))
degree
gender<-gl(n=2,k=2,length=15,labels=c("M","F"))
gr<-gl(n=4,k=1,length=16,
labels=c("R.Flood","R.Ctrl","E.Flood","E.Ctrl"))
gr
Making numeric data categorical: the cut() function
The cut() function has two basic arguments: x, a numeric vector, and breaks, a vector of breakpoints;
It is a common mistake to believe that the outer breakpoints can be omitted; the result for a value outside all intervals is set to NA. For example:
data(iris)
iris$Sepal.Length
SepLeng.categ<-cut(iris$Sepal.Length, breaks=c(4,5,6,7))
SepLeng.categ
The intervals are left-open, right-closed by default. That is, they include the breakpoint at the right end of each interval.
data(women)
wfact = cut(women$weight,3)
table(wfact)
If we want integer codes instead of interval labels for the new classes, then we can write
wfact = cut(women$weight,3,labels=FALSE)
table(wfact)
The lowest breakpoint is not included unless you set
include.lowest=TRUE, making the first interval closed at
both ends.
wfact = cut(women$weight,3,include.lowest=TRUE)
table(wfact)
Moreover, if you want your intervals to be right-open you need to specify right=FALSE.
wfact = cut(women$weight,3,right=FALSE)
table(wfact)
Examples
age<-c(2,15,31,26,12,32,34,70,81,5,3)
age.cat<-cut(age, breaks=c(0,18,45,90),
labels=c("young","adult","old"))
age.cat
library(ISwR)
data(juul)
age<-subset(juul,age>=10 & age<=16)$age
range(age)
agegr<-cut(age,seq(10,16,2),right=FALSE,
include.lowest=TRUE)
length(age)
table(agegr)
agegr2<-cut(age,seq(10,16,2),right=FALSE)
table(agegr2)
It is sometimes desirable to split data into groups of roughly equal size.
q<-quantile(age, probs=c(0,0.25,0.50,0.75,1))
q
ageQ<-cut(age,q,include.lowest=TRUE)
table(ageQ)
The level names resulting from cut() turn out rather ugly at times. Fortunately they are easily changed.
levels(ageQ)<-c("1st","2nd","3rd","4th")
levels(agegr)<-c("10-11","12-13","14-15")
How to create an array: the array() function
In order to extract an element we use the same syntax used for matrices and vectors, with the only difference that three indices now have to be specified:
Titanic
dim(Titanic)
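A small sketch of building a 3-dimensional array with array() and extracting one element (made-up numbers); note that Titanic actually has four dimensions, so four indices are needed there:
A <- array(1:24, dim=c(2,3,4)) # 2 rows, 3 columns, 4 layers
A[1,2,3] # element in row 1, column 2 of the third layer
Titanic["1st","Male","Adult","No"]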
Scalar product in R: the crossprod() function
M<-matrix(,nrow=3,ncol=4)
M[]<-1:12
G<-matrix(,nrow=3,ncol=4)
G[]<-25:36
crossprod(M,G)
t(M)%*%G
while
crossprod(M)
t(M)%*%M
The Hadamard product of matrices
M*G
The Kronecker product: kron(), kronecker()
Let A = [aij] and B = [bij] be two matrices of arbitrary sizes (m × n) and (p × q), respectively.
Then the Kronecker product of these two matrices is the (mp × nq) block matrix defined as:

A ⊗ B = | a11 B . . . a1n B |
        | a21 B . . . a2n B |
        |  ...   ...   ...  |
        | am1 B . . . amn B |
library(fBasics)
K<-kron(M,G)
dim(K)
M<-matrix(c(1,2,3,1),2,2,byrow=TRUE)
G<-matrix(c(0,3,2,1),2,2,byrow=TRUE)
kron(M,G)
kronecker(M,G)
Section 2: Creating and manipulating two-way frequency tables
I × J contingency tables: joint distribution
color.eyes<-c("B","D","B","D","D","D","B","B","D","D","D")
color.hair<- c("Dark","Dark","Dark","Blond","Blond",
"Blond","Blond","Dark","Dark","Dark","Dark")
table(color.eyes,color.hair)
Some examples
color.eyes<-c("B","D","B","D","D","D","B","B","D","D","D")
color.hair<- c("Dark","Dark","Dark","Blond","Blond",
"Blond","Blond","Dark","Dark","Dark","Dark")
gender<-gl(n=2,k=1,length=11,labels=c("M","F"))
city<-gl(n=3,k=1,length=11,labels=c("PA","CT","AG"))
Bivariate distributions
table(color.eyes,city)#2x3
table(color.eyes,color.hair) #2x2
table(color.eyes,gender) #2x2
table(color.hair,gender) #2x2
table(color.hair,city) #2x3
table(gender,city) #2x3
Trivariate and higher-order distributions
table(color.eyes,color.hair,city) #2x2x3
table(color.eyes,color.hair,city,gender)
The way to create a contingency table changes if we
are not dealing with individual observations but we only
know the joint distribution; in this case a contingency
table is obtained properly defining a matrix:
Pauling (1971) example:
Table.P<-matrix(c(31,17,109,122),2,2)
dimnames(Table.P)<-list(c("Placebo","Ascorbic Acid"),
c("Yes","No"))
Table.P
Tonsil example
Tons<-matrix(c(53,19,829,497),2,2)
dimnames(Tons)<- list(c("Enlarged", "Not enlarged"),
c("Carrier","Not carrier"))
Tons
Suicidal tendencies example
Suicide<-matrix(c(26,39,39,20,27,27,195,93,34)
,3,3,byrow=TRUE)
dimnames(Suicide)<- list(c("Attempted","Contemplated",
"Nothing"),c("Healthy","moderately depressed",
"severely depressed"))
Suicide
Death penalty and race characteristics example
Agresti<-array(c(19,0,132,9,11,6,52,97),c(2,2,2))
dimnames(Agresti)<- list(Victimrace=c("White","Black"),
DeathPenalty=c("Yes","NO"),
Defendantrace=c("White", "Black"))
Agresti
Marginal and conditional distributions
Examples:
Suicide[1,]
Tons[,1]
Table.P[1,]
margin.table(Suicide,1)
margin.table(Table.P,2)
The margin.table() function works also with arrays:
margin.table(Agresti,2)
margin.table(Table.P)
margin.table(Tons)
margin.table(Suicide)
margin.table(Agresti)
Relative frequency tables: the prop.table() function
When using prop.table() on a multidimensional table, it is necessary to specify which marginal sums you want to use to calculate the proportions.
To use the row sums, specify 1; to use the column sums, specify 2; and so on.
If the second argument is not specified, then the relative frequencies are the ratios between the frequency in each cell and the total number of observations N.
prop.table(Suicide)
• Row percentages
prop.table(Suicide,1)
• Column percentages
prop.table(Suicide,2)
Let X(n,k) be the data matrix of k variables observed on n statistical units. When one or more of the k variables are categorical, they can be represented using a different coding.
X<-matrix(c(1,1,0,1,0,0,0,1,0,1),5,2)
Z<-matrix(c(1,0,0,0,0,0,0,1,0,1,0,1,0,1,0),5,3)
X
Z
crossprod(X,Z)
Note that crossprod(X,Z) = t(X)%*%Z, which here returns the contingency table of the two coded variables.
Comparison between FDC and cross coding
Using the same example, compare the two coding methods:
regioni<-c("N","C","C","S","N")
cdc(regioni)
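The cdc() function is provided with the course material and is not listed in these notes; a minimal sketch of a complete-disjunctive-coding function with the same interface (the names argument is an assumption) could be:
cdc <- function(x, names=NULL){
  x <- factor(x)
  out <- model.matrix(~ x - 1) # one 0/1 indicator column per level
  colnames(out) <- if(is.null(names)) levels(x) else names
  out
}
cdc(regioni)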
Example and exercises
regions <-factor(c(rep("N",12608),rep("C",5206),
rep("S",7192)))
# employed: a factor of the same length with levels "D" and "O" (assumed available)
table(employed, regions)
X1 <- cdc(employed, names=c("D","O"))
X2 <- cdc(regions)
Exercises:
Cross coding function by Marco Ventimiglia
cross<-function(a,b){
  if(length(a)!=length(b))
    stop("the two vectors MUST have the same length!")
  a<-factor(a)
  b<-factor(b)
  d<-paste(a,b)
  lev.a<-levels(a)
  lev.b<-levels(b)
  nla<-length(lev.a)
  nlb<-length(lev.b)
  cat.n<-matrix(,nrow=nla,ncol=nlb)
  for(i in 1:nla)
    for(j in 1:nlb)
      cat.n[i,j]<-paste(lev.a[i],lev.b[j])
  names<-c(cat.n)
  n.obs<-length(a)
  n.col<-nla*nlb
  out<-matrix(nrow=n.obs,ncol=n.col)
  for(k in 1:n.obs)
    for(l in 1:n.col)
      out[k,l]<-ifelse(d[k]==names[l],1,0)
  colnames(out)<-names
  return(out)
}
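A quick usage sketch with made-up vectors:
sex <- c("M","F","M","F")
area <- c("N","S","N","N")
cross(sex, area) # one 0/1 column for each of the 2 x 2 = 4 combined categories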
Visualizing data with many dimensions
When the number of variables to analyse increases, the use of the array() function creates some problems (above all in visualizing the data).
An alternative method consists in representing the data in lexicographical order: the data are reorganized in a matrix whose rows are all the possible combinations of the categories of the variables.
The as.data.frame() function applied to an array object allows us to do that.
For example, look at the array HairEyeColor available in R:
is.array(HairEyeColor)
To print the results more attractively we could use the ftable() function, which flattens the table:
ftable(HairEyeColor)
A different output is obtained if we use the as.data.frame() function, because now a data frame is created with one row for each combination of the levels of the factors.
HEC<-as.data.frame(HairEyeColor)
The last column contains the frequencies and is created automatically, while the other columns take their names and levels from the original array.
The opposite operation, i.e. creating an array from a table organized in lexicographical order, can be implemented using the xtabs() function, whose arguments are the original data frame and a "formula" in which the variables of interest are specified.
HEC<-as.data.frame(HairEyeColor)
xtabs(Freq~Hair+Eye+Sex,data=HEC)
library(vcd)
data(Arthritis)
art <- xtabs(~ Treatment + Improved, data = Arthritis)
art
The expand.grid() function
schizo.counts<-c(90,12,78,13,1,6,19,13,50)
schizo.table<-cbind(expand.grid(list(
Origin=c("Biogenic","Environmental","Combination"),
School=c("Eclectic","Medical","Psychoanalytic"))),
count=schizo.counts)
religion.counts<-c(178,570,138,138,648,252,108,442,252)
Tab<-cbind(expand.grid(list(Highest.Degree=c("<HS",
"HS or JH","Bachelor or Grad"),Religious.Beliefs=
c("Fund", "Mod", "Lib"))),count=religion.counts)
Tab
Section 4: Simulating sampling schemes for 2 × 2
tables
Two-way contingency tables can be the result of different generating processes, and hence of different sampling schemes according to the constraints specified. So the interest lies in identifying the table-generating process, where the cells become random variables and the probability distribution of the table depends on the sampling scheme.
1. Poisson Sampling:
• in a Poisson sampling scheme counts are observed over a fixed time interval or spatial domain;
• the probability of the event is approximately proportional to the length of time, for small intervals of time;
• for small intervals of time the probability of more than one occurrence is negligible compared to the probability of one event;
• the numbers of events in non-overlapping time intervals are independent.
In this scheme none of the marginal totals (or the grand total) is known!
Then the four cells, as well as the grand total N, are random variables.
How to simulate a Poisson sampling scheme
in R
data1<-rpois(4,lambda=5)
table.pois1<-matrix(data1,2,2)
data2<-rpois(4,lambda=1)
table.pois2<-matrix(data2,2,2)
table.pois1
table.pois2
2. Multinomial Sampling:
Consider now collecting data on a predetermined number of individuals and classifying them according to two binary variables (e.g. treatment and response).
This scheme is related to the Poisson scheme, except that the grand total N is a fixed quantity.
As in the Poisson scheme, each subject sampled falls into one of the four cells of the table.
The numbers in the cells follow a multinomial distribution.
How to simulate a Multinomial sampling
scheme in R
N=30
data<-rmultinom(1,size=N,prob=c(.25,.25,.25,.25))
table.mult<-matrix(data,2,2)
x<-rmultinom(n=1,size=30,prob=c(.125,.125,.375,.375))
tabella.multinom<-matrix(x,2,2,byrow=TRUE)
tabella.multinom
margin.table(tabella.multinom)
3. Product Multinomial Sampling:
here one set of marginal totals (below, the column totals) is fixed, and each column is an independent multinomial sample.
n.1<-20
n.2<-10
col1<-rmultinom(n=1,size=n.1,prob=c(.5,.5))
col2<-rmultinom(n=1,size=n.2,prob=c(.5,.5))
tabella.prodmultinom<-cbind(col1,col2)
tabella.prodmultinom
margin.table(tabella.prodmultinom,2)
4. Hypergeometric sampling:
all the row and column totals are fixed!
The best-known example of this type of process is Fisher's "Lady Tasting Tea" example, which we will discuss when we talk about exact tests.
In a 2 × 2 table, the resulting sampling distribution is hypergeometric.
Note that even when both the row and column margins are fixed, this does not imply that all cell frequencies are fixed! (You may create a small 2 × 2 with row totals (5, 3) and column totals (4, 4) and see for yourself what different tables are possible.)
If all totals are fixed, just one cell is random, because once you know one of the joint frequencies the remaining values in the table are uniquely determined.
This cell is a realization of a hypergeometric distribution: it is a random variable with bounded support, as the cell frequency cannot be greater than the frequencies of the respective row and column margins.
How to simulate a hypergeometric sampling
scheme in R
rhyper(nn,m,n,k)
where
• nn = is the number of random numbers
• m = max(n1. ,n.1 )
• n = n.. - max(n1. ,n.1 )
• k = min(n1. ,n.1 )
Example
rhyper(1,50,20,15)
[1] 12
n11<-rhyper(1,15,15,10)
n12<-10-n11
n21<-15-n11
n22<-15-n12
X<-matrix(c(n11,n12,n21,n22),2,2,byrow=TRUE)
dhyper(n11,15,15,10)
or
dhyper(n11,max((n11+n12),(n11+n21)),
(n11+n12+n21+n22)-max((n11+n12),
(n11+n21)),min((n11+n12),(n11+n21)))
Types of study designs: real examples
Inference for a single proportion
The function prop.test() will carry out tests of hypotheses and produce confidence intervals in problems involving one or several proportions. In the example concerning opinion on abortion, there were 424 "yes" responses out of 950 subjects. Here is one way to use prop.test() to analyze these data:
prop.test(424,950)
Note that by default prop.test() applies a continuity correction; it can be switched off with correct=F:
prop.test(phs,correct=F)
You can also save the output of the test and manipulate it in various ways:
phs.test$estimate[1]/phs.test$estimate[2]
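The objects phs and phs.test are not defined in these notes; a plausible construction, assuming the well-known Physicians' Health Study table (189 myocardial infarctions out of 11034 on placebo, 104 out of 11037 on aspirin), is:
phs <- matrix(c(189,10845,104,10933), nrow=2, byrow=TRUE,
dimnames=list(c("placebo","aspirin"), c("MI","no MI")))
phs.test <- prop.test(phs)
phs.test$estimate[1]/phs.test$estimate[2] # ratio of the two proportions (relative risk)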
Relative risk and Odds ratio
phs.test$estimate
odds <- phs.test$estimate/(1-phs.test$estimate)
odds[1]/odds[2]
# as cross-prod ratio
(phs[1,1]*phs[2,2])/(phs[2,1]*phs[1,2])
It is easy to write a quick-and-dirty function to do these calculations for a 2 × 2 table.
odds.ratio <- function(x, pad.zeros=FALSE, conf.level=0.95) {
  if (pad.zeros) {
    if (any(x==0)) x <- x + 0.5
  }
  theta <- x[1,1] * x[2,2] / ( x[2,1] * x[1,2] )
  ASE <- sqrt(sum(1/x))
  CI <- exp(log(theta) + c(-1,1) * qnorm(0.5*(1+conf.level)) * ASE)
  list(estimator=theta,
       ASE=ASE,
       conf.interval=CI,
       conf.level=conf.level)
}
odds.ratio(phs)
Relationship between RR and OR
Let R_E be the risk (incidence) of the exposed and let R_NE be the incidence of the non-exposed; then

OR = [R_E/(1 − R_E)] / [R_NE/(1 − R_NE)]
   = (R_E/R_NE) * [(1 − R_NE)/(1 − R_E)]
   = RR * (1 − R_NE)/(1 − R_E)
The OR approximates the RR well when the disease incidence is low, because in this case R_NE and R_E are very small and the factor (1 − R_NE)/(1 − R_E) is very close to 1.
Examples:
library(vcd) # for oddsratio()
A<-matrix(c(10,50,2,50),2,2,byrow=TRUE,
dimnames = list(c("Exposed", "No-exposed"),
c("Sick", "Healthy")))
RE<-10/60
RNE<-2/52
RRA<-RE/RNE
ORA<-oddsratio(A,log=FALSE)
c(RRA,ORA)
B<-matrix(c(10,10000,2,10000),2,2,byrow=TRUE,
dimnames = list(c("Exposed", "No-exposed"),
c("Sick", "Healthy")))
RE<-10/10010
RNE<-2/10002
RRB<-RE/RNE
ORB<-oddsratio(B,log=FALSE)
c(RRB,ORB)
Contingencies and Chi-Squared Tests of Independence
The chisq.test() function will compute Pearson's chi-squared test statistic (X²) and the corresponding P-value.
For the Job-satisfaction example:
jobsatis <- c(2,4,13,3, 2,6,22,4, 0,1,15,8, 0,3,13,8)
jobsatis <- matrix(jobsatis,byrow=TRUE,nrow=4)
dimnames(jobsatis) <- list(
Income=c("<5","5-15","15-25",">25"),
Satisfac=c("VD","LS","MS","VS"))
jobsatis
chisq.test(jobsatis)
In case you are worried about the chi-squared approximation to the sampling distribution of the statistic, you can use simulation to compute an approximate P-value (or use an exact test). The argument B (default 2000) controls how many simulated tables are used to compute this value. It is interesting to do it a few times to see how stable the simulated P-value is (does it change much from run to run?).
chisq.test(jobsatis,simulate.p.value=TRUE,B=10000)
In this case the simulated P-values agree closely with the chi-squared approximation, suggesting that the approximation is good in this example.
Example:
gender<-c("M","F")
aptitude<-c("A","B","C")
data<-matrix(c(35,22,40,27,44,51),2,3,
dimnames=list(gender,aptitude))
tab<-as.table(data)
chi<-summary(tab)$statistic
chi
Exercise 1
In a retrospective study aimed at determining whether coffee consumption affects the risk of myocardial infarction, data were collected on patients living in a small town. Coffee consumption was classified as low (<= 3 cups a day) or high (> 3 cups a day). The results are shown in the table.

Coffee consumption   Infarction YES   Infarction NO    TOT
High                       78              1322       1400
Low                        48              1352       1400
TOT                       126              2674       2800

Evaluate the relative risk and the odds ratio (to assess whether high coffee consumption increases the risk of myocardial infarction compared to low consumption).
We calculate separately the risk for those with low (LW) and high (HG) coffee consumption:
RLW=48/1400
RHG=78/1400
RR=RHG/RLW
RR
[1] 1.625
Odds ratio:
coffe<-matrix(c(78,48,1322,1352),2,2)
oddsratio(coffe,log=FALSE)
[1] 1.661876
The odds ratio amplifies the measure of association with respect to the relative risk: it is further from 1 than the relative risk.
StudioA<-matrix(c(5,1,495,499),byrow=FALSE,2,2)
StudioB<-matrix(c(100,30,300,570),byrow=FALSE,2,2)
RRA<-(5/500)/(1/500)
RRB<-(100/400)/(30/600)
library(vcd)
OA<-oddsratio(StudioA,log=FALSE)
OB<-oddsratio(StudioB,log=FALSE)
              Sterile   E. coli
Liver Tumor      19        8
No Tumor         30        5
3. Each person in a sample of 276 healthy adult volunteers was asked about the variety of social networks that they were in (e.g., relationships with parents, close neighbors, workmates, etc.). They were then given nasal drops containing a rhinovirus and were quarantined for 5 days. The numbers of subjects who developed colds were recorded in the table below.
Generalized Linear Models
Six possible families can be chosen: gaussian, binomial, poisson, Gamma, inverse.gaussian and quasi.
The contrasts R uses for factors by default can be inspected with:
options('contrasts')
$contrasts
unordered ordered
"contr.treatment" "contr.poly"
You can change these contrasts using the same options()
function, like this:
options(contrasts=c("contr.sum","contr.poly"))
Glm residuals in R
There are many different kinds of residuals for GLMs.
If you were coming to this from an OLS perspective, you would expect the residual to be yi − ŷi. In the GLM framework, you are tempted to replace ŷi with μ̂i. However, that is not correct, because μ̂i is a prediction of the mean, while yi is an observed value.
Think of a logit model, where the "predicted probability" is μ̂i. Is the residual equal to the difference between the observed 1 (or 0) and μ̂i?
Several alternative residuals have been recommended. Notice that if you fit a model in R and then want the residuals, you can specify 5 kinds of residuals.
help(residuals.glm)
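The model below needs data to run; a minimal simulated setup (x and y are assumptions, not data from the notes):
set.seed(1)
x <- rnorm(30) # made-up covariate
y <- rpois(30, exp(1 + 0.3*x)) # made-up Poisson counts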
mod<-glm(y~x,family=poisson)
residuals(mod,"response")
y-fitted(mod)
residual(mod,type="pearson")
(y-fitted(mod))/sqrt(fitted(mod))
residuals(mod,type="deviance")
hat.mu<-exp(mod$linear.predictors)
poisson.dev <- function (y, mu)
sqrt(2*(y*log(ifelse(y == 0, 1, y/mu))-(y - mu)))
poisson.dev(y,hat.mu) * ifelse(y > hat.mu,1,-1)
Example
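Before fitting the binomial model we need 0/1 responses; a small sketch reusing the covariate x from above (yb is an assumption):
yb <- rbinom(30, 1, plogis(-0.5 + x)) # made-up binary responses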
mod.bin<-glm(yb~x,family=binomial)
residuals(mod.bin,type="deviance")
hat.mu<-exp(mod.bin$linear.predictors) # note: these are the fitted odds exp(eta), not probabilities
logistic.dev <- function (y, mu)
sqrt(2*(log(1+mu)-ifelse(y==0,0,log(mu))))
logistic.dev(yb,hat.mu)*ifelse(yb==1 ,1,-1)
Example
file = "http://ww2.coastal.edu/kingw/statistics/
R-tutorials/text/gorilla.csv"
read.csv(file) -> gorilla
glm.out = glm(seen ~ W*C*CW,family=binomial(logit),
data=gorilla)
attach(gorilla)
hat.mu<-exp(glm.out$linear.predictors)
logistic.dev(seen,hat.mu) * ifelse(seen==1 ,1,-1)
residuals(glm.out,type="deviance")
residuals(mod,type="working")
(y - fitted(mod))/exp(mod$linear)
mod.bin2<-glm(formula = seen~W+C,
family = binomial(logit),data = gorilla)
head(residuals(mod.bin2,type="partial"))
head(cbind(residuals(mod.bin2,type="response")
+(coef(mod.bin2)[2]*W),
residuals(mod.bin2,type="response")
+(coef(mod.bin2)[3]*C)))
mod.bin2$residuals
residuals(mod.bin2)
• Its value is the log of the odds ratio for the occurrence of the event at x + 1 to the occurrence of the event at x.
• When β̂1 > 0 the curve π(x) has the shape of a logistic probability distribution function.
Parameters interpretation and contrast types comparison through examples

                    smoke
Low Birth weight      1      0   Total
1                    30     29      59
0                    44     86     130
Total                74    115     189

library(MASS)
attach(birthwt)
table.ese1<-table(low, smoke)
Let X = 0 or 1 be an independent dichotomous variable and Y = 0 or 1 a dependent dichotomous variable. The ratio of the odds for X = 1 to X = 0, called the odds ratio, is

OR = [π(1)/(1 − π(1))] / [π(0)/(1 − π(0))]

So:

OR̂ = [(30/74)/(44/74)] / [(29/115)/(86/115)] = 2.02
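The same estimate can be reproduced in R as a cross-product ratio:
(30*86)/(44*29) # 2.022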
Therefore,

P(Y = 1|X = 0) / [1 − P(Y = 1|X = 0)] = exp(β0)

P(Y = 1|X) = e^(β0+β1X) / (1 + e^(β0+β1X)) = 1 / (1 + e^(−(β0+β1X)))

so that, in particular,

π(1) = 1 / (1 + e^(−(β0+β1)))
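The coefficient table below presumably comes from fitting the logistic model to the grouped data; a sketch of a call consistent with the output (the exact call is not shown in these notes):
ese1.df <- as.data.frame(table.ese1)
fit.ese1 <- glm(low ~ smoke, weights=Freq, family=binomial, data=ese1.df)
summary(fit.ese1)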
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smoke1 0.7041 0.3196 2.203 0.0276 *
Null deviance: 234.67 on 3 degrees of freedom
Residual deviance: 229.80 on 2 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 5
Exercise: Gastroesophageal reflux disease (GERD)
GERD.data <- matrix(c(251,131,4,33), nrow=2)
colnames(GERD.data)<- c("NO", "YES")
rownames(GERD.data)<-c("stressNO", "stressYES")
table <-as.table(GERD.data)
dft <- as.data.frame(table)
dft
fit.treat<-glm(Var2~Var1,weights=Freq,data=dft,
family=binomial)
summary(fit.treat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1392 0.5040 -8.213 < 2e-16 ***
Var1stressYES 2.7605 0.5403 5.109 3.23e-07 ***
Null deviance: 250.23 on 3 degrees of freedom
Residual deviance: 205.86 on 2 degrees of freedom
AIC: 209.86
The estimated model is

P̂(Y = 1|X = x) = e^(−4.14+2.76x) / (1 + e^(−4.14+2.76x))

So subjects without stress will have P(Y = 1) equal to

P̂(Y = 1|X = 0) = e^(−4.14) / (1 + e^(−4.14)) = 0.016

while the probability of reflux for stressed subjects is

P̂(Y = 1|X = 1) = e^(−4.14+2.76) / (1 + e^(−4.14+2.76)) = 0.20
The estimated odds will be

Q̂1 = P̂(Y = 1|X = 1) / [1 − P̂(Y = 1|X = 1)] = e^(−4.14+2.76) = 0.252

for stressed subjects, and

Q̂2 = P̂(Y = 1|X = 0) / [1 − P̂(Y = 1|X = 0)] = e^(−4.14) = 0.0159

for non-stressed subjects.
fit.sum<-glm(Var2~C(Var1,sum),weights=Freq,data=dft,
family=binomial)
summary(fit.sum)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7589 0.2701 -10.213 < 2e-16 ***
C(Var1, sum)1 -1.3802 0.2701 -5.109 3.23e-07 ***
Null deviance: 250.23 on 3 degrees of freedom
Residual deviance: 205.86 on 2 degrees of freedom
AIC: 209.86
In this case, as the contrasts have to sum to zero, it follows:

logit(π0) = β0 + β1
logit(π1) = β0 − β1
LOR = logit(π1) − logit(π0) = β0 − β1 − β0 − β1 = −2β1
sum(coef(fit.sum))
coef(fit.treat)[1]
-2*(coef(fit.sum))[2]
coef(fit.treat)[2]
Log-linear models for 2 × 2 tables
afterlife<-matrix(c(435, 147,375,134),nrow=2,byrow=TRUE)
afterlife
dimnames(afterlife)<-list(c("Women","Men"),c("Yes","No"))
afterlife
names(dimnames(afterlife))<-c("Gender","Belief")
afterlife
• H0 : independence, λij = λ ∗ αi ∗ βj
• H1 : λij is arbitrary.
In R:
freq<-c(435,147,375,134)
gender<-gl(2,2)
belief<-gl(2,1,4)
data.afterlife<-data.frame(gender,belief,freq)
fit1.glm<-glm(freq~gender*belief,
family=poisson(link=log), data=data.afterlife)
summary(fit1.glm)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.99043 0.08248 60.506 <2e-16 ***
beliefYes 1.08491 0.09540 11.372 <2e-16 ***
genderMale -0.09259 0.11944 -0.775 0.438
beliefYes:genderMale -0.05583 0.13868 -0.403 0.687
anova(fit1.glm, test="Chisq")
Parameters interpretation in loglinear models
glm.treat<-glm(freq~(gender*belief), family=poisson,
data=data.afterlife)
glm.treat
logLik(glm.treat)
mu11<-exp(glm.treat$coeff[1])
#mu11<-exp(glm.treat$linear.pred[1])
mu12<-exp(glm.treat$coeff[1]+glm.treat$coeff[3])
#mu12<-exp(glm.treat$linear.pred[2])
mu21<-exp(glm.treat$coeff[1]+glm.treat$coeff[2])
#mu21<-exp(glm.treat$linear.pred[3])
mu22<-exp(glm.treat$coeff[1]+glm.treat$coeff[2]
+glm.treat$coeff[3]+glm.treat$coeff[4])
#mu22<-exp(glm.treat$linear.pred[4])
It will be coincident:
expected.treat<-c(mu11,mu12,mu21,mu22)
glm.treat$fitted
expected.treat
list(data.afterlife,matrix(expected.treat,2,2,byrow=TRUE))
theta1.treat<-log(glm.treat$fitted[1])
theta2.treat<-log(glm.treat$fitted[3]
/glm.treat$fitted[1])
theta3.treat<-log(glm.treat$fitted[2]
/glm.treat$fitted[1])
theta4.treat<-log((glm.treat$fitted[1]
*glm.treat$fitted[4])/
(glm.treat$fitted[2]*
glm.treat$fitted[3]))
theta.treat<-c(theta1.treat,theta2.treat,
theta3.treat,theta4.treat)
list(theta.treat,glm.treat$coeff)
c(oddsratio(afterlife,log=TRUE),theta4.treat)
• ANOVA parameterization
We can define the ANOVA contrasts as:
glm.sum<-glm(freq~gender*belief,
family=poisson(link=log),
contrasts=list(gender="contr.sum",
belief="contr.sum"))
#
glm.sum1<-glm(freq~C(gender,sum)*C(belief,sum)
, family=poisson)
glm.sum1
#
GENDER<- C(gender,sum)
BELIEF<- C(belief,sum)
glm.sum2<-glm(freq~GENDER*BELIEF,
family=poisson)
glm.sum2
• Equal likelihoods:
logLik(glm.treat)
logLik(glm.sum)
• Standard errors:
1. from saturated model
X<-model.matrix(~ gender * belief,
contrasts.arg = list(gender = "contr.sum",
belief = "contr.sum"))
t(X)%*%diag(freq)%*%X
solve(t(X)%*%diag(freq)%*%X)
sqrt(diag(solve(t(X)%*%diag(freq)%*%X)))
2. from the additive model
glm.sum<-glm(freq~gender+belief,
family=poisson(link=log))
glm.sum
X<-model.matrix(~gender + belief,
contrasts.arg = list(gender = "contr.sum",
belief = "contr.sum"))
t(X)%*%diag(fitted(glm.sum))%*%X
solve(t(X)%*%diag(fitted(glm.sum))%*%X)
sqrt(diag(solve(t(X)%*%diag(fitted(glm.sum))%*%X)))
Section 7: 2 × 2 tables simulation
Let’s simulate 2 × 2 contingency tables from log-linear
models where coefficients are user defined.
x1<-c(0,1,0,1)
x2<-c(0,0,1,1)
beta0<-1
beta1<-0.5
beta2<--0.2
beta3<-0
mu<-exp(beta0+beta1*x1+beta2*x2)
mu
y<-rpois(4, mu)
y
cbind(x1,x2,muIND=mu, yIND=y)
TAB<-matrix(y,2,2)
n=sum(TAB)
n
Looking at the relationships between estimated coefficients and expected frequencies, it becomes obvious that the total size of the table depends on β0, because it enters all four cells equally, regardless of the main effects and of the amount of association.
mu<-exp(3+.5*x1-.2*x2)
y<-rpois(4, mu)
cbind(x1,x2,muIND=mu, yIND=y)
TAB2<-matrix(y,2,2)
n=sum(TAB2)
n
mu<-exp(3+.5*x1-.2*x2+5*x1*x2)
y<-rpois(4, mu)
cbind(x1,x2,muASS=mu, yASS=y)
TAB2<-matrix(y,2,2)
TAB2
set.seed(20)
x1<-c(0,1)
eta<-1+2*x1
pigreco=exp(eta)/(1+exp(eta))
#plogis(eta)
Freq<-rbinom(2,c(100,100),prob=pigreco)
Freq2<-100-Freq
cbind(Freq,Freq2)
set.seed(20)
x1<-c(0,1)
eta<-1+0*x1
pigreco=exp(eta)/(1+exp(eta))
#plogis(eta)
Freq<-rbinom(2,c(100,100),prob=pigreco)
Freq2<-100-Freq
cbind(Freq,Freq2)
fit<-glm(cbind(Freq,Freq2)~x1, family = binomial(logit))
summary(fit)
Section 8: Fisher’s exact test
A brief theoretical review
When should Fisher's exact test be used?
To understand how Fisher's exact test works, it is essential to understand what a contingency table is and how it is used. In the simplest example, there are only two variables to be compared in a contingency table. Usually, these are categorical variables.
Fisher's exact test in R: fisher.test()
The fisher.test() function performs Fisher's exact test of the null hypothesis of independence of rows and columns in a contingency table with fixed marginals.
As first argument you can pass either a matrix or two vectors of the same length.
If x is a matrix, it is taken as a two-dimensional contingency table, and hence its entries should be nonnegative integers.
Another important argument is the type of alternative hypothesis. The alternative for a one-sided test is based on the odds ratio, so alternative="greater" is a test of the odds ratio being bigger than or (the hypothesized odds ratio, 1 by default). Two-sided tests are based on the probabilities of the tables, and take as "more extreme" all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
Examples:
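A minimal made-up example (the 2 × 2 counts are an assumption, not course data):
x <- matrix(c(3,1,1,3), 2, 2)
fisher.test(x)
fisher.test(x, alternative="greater")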
Now, an example of evaluation of αobs by “hand”
tabobs<-matrix(c(3,2,1,9),2,2,byrow=T)
p3<-dhyper(3,5,10,4)
tab1<-matrix(c(4,1,0,10),2,2,byrow=T)
tab2<-matrix(c(2,3,2,8),2,2,byrow=T)
tab3<-matrix(c(1,4,3,7),2,2,byrow=T)
tab4<-matrix(c(0,5,4,6),2,2,byrow=T)
p4<-dhyper(4,5,10,4)
p2<-dhyper(2,5,10,4)
p1<-dhyper(1,5,10,4)
p0<-dhyper(0,5,10,4)
plot(0:4,c(p0,p1,p2,p3,p4),type="h")
points(0:4,c(p0,p1,p2,p3,p4),pch=20)
The p-value is the same under the two-sided and the positive-association ("greater") alternatives:
p3+p4
fisher.test(tabobs,alt="greater")$p.value
fisher.test(tabobs,alt="two.sided")$p.value
For the "less" alternative, αobs = p3 + p2 + p1 + p0:
p3+p2+p1+p0
fisher.test(tabobs,alt="less")$p.value
Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level.
In order to demonstrate this, let us simulate tables from an independence model:
x1<-c(0,0,1,1)
x2<-c(0,1,0,1)
mu<-exp(5+.5*x1-.2*x2)
mu
fisher<-rep(NA,1000)
chisq<-rep(NA,1000)
for(i in 1:1000){
y<-rpois(4, mu)
tab<-matrix(y,nrow=2)
fisher[i]<-fisher.test(tab)$p
chisq[i]<-chisq.test(tab)$p.value
}
mean(fisher<=0.05) # actual size of the test
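For comparison, the empirical size of the chi-squared test computed in the same loop can be inspected as well:
mean(chisq<=0.05) # empirical size of the chi-squared test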
The p-value cumulative distribution function emphasizes this aspect:
plot(sort(fisher),(1:length(fisher))/length(fisher),
bty="l",type="s", xlab="p-value",ylab="probability")
abline(0,1,lty=2) # the theoretical reference c.d.f.
The cumulative distribution function of the observed p-values lies below the theoretical one (it is known that under the null hypothesis the p-value follows a uniform distribution). This means that the actual rejection rate is below the nominal significance level (0.05): the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Fisher’s exact test: exercises
• Exercise 1
The data record sex and whether or not the individual has perfect pitch for 99 conservatory-of-music students. Use Fisher's exact test of the null hypothesis that sex is independent of having perfect pitch. The data can be tabulated as follows.
• Exercise 2
The following table shows the results of a retrospective study comparing radiation therapy with surgery in treating cancer of the larynx. The response indicates whether the cancer was controlled for at least two years following treatment.
• Report and interpret the P-value for Fisher's exact test with H1 : Ψ > 1 and H1 : Ψ ≠ 1. Using the function fisher.test(), explain how the P-values were calculated.
Polytomous data
Nominal response variable: baseline-category logit models
Let Y be a nominal response variable with J categories. Logit models for nominal responses pair each response category with a baseline category. The choice of baseline category is arbitrary.
Given a vector x of explanatory variables,

πj(x) = P(Y = j|x),   with  π1(x) + . . . + πJ(x) = 1
If we have n independent observations based on these probabilities, the probability distribution for the numbers of outcomes of the J types is a multinomial with probabilities (π1(x), . . . , πJ(x)).
This model is basically just an extension of the binary logistic regression model. It gives a simultaneous representation of the odds of being in one category relative to being in another category, for all pairs of categories. Once the model specifies logits for J − 1 pairs of categories, the rest are redundant.
If the last category (J) is the baseline, the baseline-category logit model

log[πj(x)/πJ(x)] = αj + β'j x,   j = 1, . . . , J − 1

describes the effect of x on the J − 1 logits.
Notes
For any pair of categories a and b, the logit comparing them follows from the baseline logits:

log[πa(x)/πb(x)] = (αa + βa x) − (αb + βb x)
                 = (αa − αb) + (βa − βb)x
Alligator Food Choice Example
The data are taken from a study by the Florida Game and Fresh Water Fish Commission of factors influencing the primary food choice of alligators.
Primary food type has five categories: Fish, Invertebrate, Reptile, Bird and Other.
Explanatory variables are the Lake where the alligators were sampled and the Length of the alligator.
food<-factor(c("fish","invert","rep","bird","other"),
levels=c("fish","invert","rep", "bird","other"))
size<-factor(c("<2.3",">2.3"),levels=c(">2.3","<2.3"))
gender<-factor(c("m","f"),levels=c("m","f"))
lake<-factor(c("hancock","oklawaha","trafford","george"),
levels=c("george","hancock", "oklawaha","trafford"))
table.7.1<-expand.grid(food=food,size=size,
gender=gender,lake=lake)
temp<-c(7,1,0,0,5,4,0,0,1,2,16,3,2,2,3,3,0,1,2,3,2,2,0,0,1,
13,7,6,0,0,3,9,1,0,2,0,1,0,1,0,3,7,1,0,1,8,6,6,3,5,2,4,1,1,
4,0,1,0,0,0,13,10,0,2,2,9,0,0,1,2,3,9,1,0,1,8,1,0,0,1)
table.7.1<-structure(.Data=table.7.1[rep(1:nrow(table.7.1),
temp),], row.names=1:219)
We fit several models
library(nnet)
fitS<-multinom(food~lake*size*gender,data=table.7.1)
fit0<-multinom(food~1,data=table.7.1) # null
fit1<-multinom(food~gender,data=table.7.1) # G
fit2<-multinom(food~size,data=table.7.1) # S
fit3<-multinom(food~lake,data=table.7.1) # L
fit4<-multinom(food~size+lake,data=table.7.1) # L+S
fit5<-multinom(food~size+lake+gender,data=table.7.1) #L+S+G
The likelihood ratio test for each model:
deviance(fit1)-deviance(fitS)
deviance(fit2)-deviance(fitS)
deviance(fit3)-deviance(fitS)
deviance(fit4)-deviance(fitS)
deviance(fit5)-deviance(fitS)
deviance(fit0)-deviance(fitS)
Collapsing over gender:
deviance(fit1)-deviance(fitS)
deviance(fit2)-deviance(fitS)
deviance(fit3)-deviance(fitS)
deviance(fit0)-deviance(fitS)
According to the AIC the best model is fit3:
summary(fit3)
In this example the baseline category is the one that crosses "fish", "> 2.3" and "george".
Results:
Starting from these results we can evaluate all the redundant odds ratios, as in the sketch below.
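The fitted coefficients of fit3 give the logits of each food category versus the baseline "fish"; pairwise (log) odds ratios between any two response categories follow by differencing the corresponding rows. A quick sketch:
coef(fit3) # baseline-category logit coefficients
exp(coef(fit3)) # effects on the odds relative to the baseline category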
Ordinal response variables: Log-Linear Association
models
Ordinal response variables: 1. Linear-by-Linear (Uniform) association
With row scores ui and column scores vj, the linear-by-linear association model is log μij = λ + λi^X + λj^Y + θ ui vj.
• If θ = 0 independence holds.
Models that can fit tables of this type are the row effects and column effects models.
The row effects model R has the form

log μij = λ + λi^X + λj^Y + τi vj

For this class of models, for any pair of rows r < s and columns c < d, the log of the odds ratio formed from the 2 × 2 table of those rows and columns is

log[(μrc μsd)/(μrd μsc)] = (τs − τr)(vd − vc)
The column effects model C takes the form

log μij = λ + λi^X + λj^Y + ρj ui
A generalization of the row and column effects models that allows for both row and column effects in the local odds ratio is the row + column effects model (R+C)

log μij = λ + λi^X + λj^Y + τi vj + ρj ui

The local log odds ratio for unit-spaced row and column scores is

(τi+1 − τi) + (ρj+1 − ρj)

incorporating row effects and column effects.
L×L model Example
library(gnm)
library(vcdExtra)
data(Mental) #or in the same way
dati<-expand.grid(mental=c("well","mild",
"moderate","impaired"),ses=1:6)
dati$Freq=c(64,94,58,46,57,94,54,40,57,105,65,60,
72,141,77,94,36,97,54,78,21,71,54,71)
Display the frequency table
Mental.tab <- xtabs(Freq ~ mental+ses, data=Mental)
Fit Independence model
indep <- glm(Freq ~ mental+ses,family = poisson, data = Mental)
deviance(indep) #or
o<-glm(Freq~factor(mental)+factor(ses), family=poisson, data=dati)
deviance(o)
Or
linlin2<-glm(formula = Freq ~ factor(mental) + factor(ses) +
as.numeric(mental):as.numeric(ses),
family = poisson, data = dati)
Row effects model Example
# Rscore and Cscore are the numeric scores of the two ordinal factors
# (assumed, as in the vcdExtra examples for the Mental data)
Rscore <- as.numeric(Mental$mental)
Cscore <- as.numeric(Mental$ses)
roweff <- glm(Freq ~ mental + ses + mental:Cscore,
family = poisson, data = Mental)
Column effects model Example
coleff <- glm(Freq ~ mental + ses + Rscore:ses,
family = poisson, data = Mental)
Exercise: student perception of statistics class
assessment methods
Ordinal response variables: 1. Cumulative Logit Models
The logits of the first J − 1 cumulative probabilities are:

logit[P(Y ≤ j|x)] = log{ P(Y ≤ j|x) / [1 − P(Y ≤ j|x)] }
                  = log{ [π1(x) + π2(x) + . . . + πj(x)] / [πj+1(x) + . . . + πJ(x)] },   j = 1, . . . , J − 1
For simplicity, let’s consider only one predictor:
logit[P (Y ≤ j)] = αj + βx
Cheese-Tasting Example (McCullagh and Nelder, 1989)
• How many logit models?
(J − 1) * (K − 1), where J is the number of response categories and K is the number of regressors in the model;
The vglm() function
library(VGAM)
cheese <- read.table("cheese.dat.txt",
col.names=c("Cheese", "Response", "N"))
is.factor(cheese$Response)
cheese$Response<-factor(cheese$Response, ordered=T)
mod.sat<-vglm(Response~Cheese,cumulative,
weights=c(N+0.5),data=cheese)
mod.podds<-vglm(Response~Cheese,cumulative(parallel=TRUE),
weights=c(N+0.5),data=cheese)
summary(mod.sat)
summary(mod.podds)
matplot(t(mod.podds@predictors[seq(1,36,by=9),]),type="l",
ylab="Cumulative logits",main="Proportional odds model")
# Adding a legend would surely be useful!
matplot(t((exp(mod.podds@predictors)/(1+exp(mod.podds@predictors)))
[seq(1,36,by=9),]),type="l",ylab="Cumulative Probability Curves",
main="Proportional odds model")
In this case, a positive coefficient β means that increasing the value of X tends to lower the response categories (i.e. to produce greater dislike).
summary(mod.podds)
Call:
vglm(formula = Response ~ Cheese, family = cumulative(parallel = TRUE),
data = cheese, weights = c(N + 0.5))
Coefficients:
Estimate Std. Error z value
(Intercept):1 -4.84428 0.45697 -10.60089
(Intercept):2 -3.84779 0.37446 -10.27564
(Intercept):3 -2.86231 0.32751 -8.73959
(Intercept):4 -1.91322 0.29232 -6.54497
(Intercept):5 -0.73965 0.25589 -2.89044
(Intercept):6 0.10951 0.24755 0.44237
(Intercept):7 1.44853 0.28180 5.14020
(Intercept):8 2.89229 0.36928 7.83216
CheeseB 2.82260 0.38300 7.36978
CheeseC 1.44005 0.34794 4.13883
CheeseD -1.39122 0.35218 -3.95026
logit[P(Y ≤ 1)] = log[ P(Y ≤ 1) / P(Y > 1) ]
Ordinal response variables: 2. Adjacent-Category Logit models
Job Satisfaction Example
In order to fit an adjacent-category logit model in R we have to specify the acat family, indicating the link function applied to the ratios of the adjacent-category probabilities ("loge") and parallel=TRUE, a logical stating whether some terms in the formula are assumed to have equal coefficients.
table.7.8<-read.table("jobsat.txt", header=TRUE)
table.7.8$jobsatf<-ordered(table.7.8$jobsat,
labels=c("very diss","little sat","mod sat",
"very sat"))
table.7.8a<- data.frame(expand.grid(income=1:4,
gender=c(1,0)),unstack(table.7.8,freq~jobsatf))
library(VGAM)
fit.vglm<-vglm(cbind(very.diss,little.sat,
mod.sat,very.sat)~gender+income,
family= acat(link="loge",parallel=T,reverse=T),
data=table.7.8a)
summary(fit.vglm)
summary(fit.vglm)
Coefficients:
Estimate Std. Error z value
(Intercept):1 -0.550668 0.67945 -0.81046
(Intercept):2 -0.655007 0.52527 -1.24700
(Intercept):3 2.025934 0.57581 3.51842
gender 0.044694 0.31444 0.14214
income -0.388757 0.15465 -2.51372
Ordinal response variables: 3. Continuation-Ratio Logits
Example: Streptococcus and tonsil size
carrier<-c(1,0)
y1<-c(19,497)
y2<-c(29,560)
y3<-c(24,269)
tonsil<-cbind(carrier,y1,y2,y3)
tonsil<-as.data.frame(tonsil)
tonsil$carrier<-as.factor(tonsil$carrier)
library(VGAM)
fit.cratio<-vglm(cbind(y1,y2,y3)~carrier,
family=cratio(reverse=FALSE, parallel=TRUE),
data=tonsil)
summary(fit.cratio)
fitted(fit.cratio)
Given three categorical variables X, Y and Z, we examine the set of all possible hierarchical log-linear models on πijk, where

πijk = P[X = xi, Y = yj, Z = zk],    μijk = n πijk
Saturated model (XYZ)

log(πijk) = λ + λi^X + λj^Y + λk^Z + λij^XY + λik^XZ + λjk^YZ + λijk^XYZ

In this case all the information about cell ijk is given by the cell frequency nijk.

Parameter      # terms
λ              1
λi^X           I−1
λj^Y           J−1
λk^Z           K−1
λij^XY         (I−1)(J−1)
λik^XZ         (I−1)(K−1)
λjk^YZ         (J−1)(K−1)
λijk^XYZ       (I−1)(J−1)(K−1)
Total          IJK
No-second-order-interaction (homogeneous association) model
(XY,XZ,YZ)

log(πijk) = λ + λi^X + λj^Y + λk^Z + λij^XY + λik^XZ + λjk^YZ

For this model the maximum likelihood equations imply that

μ̂ij+ = nij+    μ̂i+k = ni+k    μ̂+jk = n+jk
Conditional independence model
(XY,XZ), (XY,YZ) or (XZ,YZ)
Three different conditional independence models can be defined; they can be derived from the saturated model by removing the second-order interaction and one of the first-order interactions. One of the possible forms (for the model (XY,YZ)) is

log(πijk) = λ + λi^X + λj^Y + λk^Z + λij^XY + λjk^YZ
Mutual independence model
(X,Y,Z)

log(πijk) = λ + λi^X + λj^Y + λk^Z
Simpson's paradox

Example 1
In fact, the estimated odds of the death penalty is 1.181 times higher for white defendants than for black defendants. This completely contradicts the claim put forward by the organization for the defense of black people!
ΦBC = 2.71
AB.C1<-matrix(c(19,132,11,52),2,2,byrow=T)
AB.C1
AB.C1<-as.table(AB.C1)
AB.C1
AB.C2<-matrix(c(0,6,9,97),2,2)
AB.C2
AB.C2<-as.table(AB.C2)
AB.C2
library(vcd)
summary(oddsratio(AB.C1,log=FALSE))
summary(oddsratio(AB.C2,log=FALSE))
summary(oddsratio(AB.C1))
summary(oddsratio(AB.C2))
levels(R.impu)<-c("BI","NE")
levels(R.vitt)<-c("BI","NE")
levels(Pena)<-c("SI","NO")
freq<-c(19,132,0,9,11,52,6,97)
Pena.morte<-data.frame(R.impu,R.vitt,Pena,freq)
xtabs(freq~Pena+R.impu+R.vitt, Pena)
Purely for teaching purposes, the syntax of the various association models for 2 × 2 × 2 tables is listed below, even though from the preliminary analyses carried out in terms of odds ratios it should already be clear which type of model one can focus on.
NOTE: if we think of the model written in terms of its generating classes, it will be particularly easy to construct the right syntax for the glm formula.
• Mutual independence model [A][B][C]
mod1.glm<-glm(freq~R.impu+R.vitt+Pena,
family=poisson)
mod1.glm
summary(mod1.glm)
• Joint independence model [B][AC]
mod2.glm<-glm(freq~Pena+R.impu*R.vitt,
family=poisson)
summary(mod2.glm)
• Conditional independence model [BC][AC]
mod3.glm<-glm(freq~Pena*R.vitt+R.impu*R.vitt,
family=poisson)
summary(mod3.glm)
• Homogeneous association model [AB][BC][AC]
mod4bis.glm<-glm(freq~Pena*R.vitt+R.impu*R.vitt+
Pena*R.impu,family=poisson)
or equivalently:
mod4.glm<-glm(freq~(Pena+R.vitt+R.impu)^2 ,
family=poisson)
summary(mod4.glm)
• Saturated model [ABC]
modsaturo.glm<-glm(freq~Pena*R.impu*R.vitt,
family=poisson)
modsaturo.glm<-glm(freq~(Pena+R.impu+R.vitt)^3,
family=poisson)
modsaturo.glm
summary(modsaturo.glm)
From the table we deduce that the worst models within each class (those in grey) are those that exclude the association between the defendant's race (A) and the victim's race (C). The best model with respect to G2 and parameter parsimony (AIC) is [BC][AC], i.e. the conditional independence model that we had already identified by studying the conditional odds ratios!
Example 2
The data record details about the Birth to Ten study (BTT), performed in the greater Johannesburg/Soweto metropolitan area of South Africa during 1990. In the study, all mothers of singleton births during a seven-week period between April and June, with permanent addresses in a defined area, were interviewed (a total of 4019 births). Five years later, 964 of these mothers were re-interviewed. If the mothers interviewed later are representative of the original population, the two groups should show similar characteristics. One of those characteristics is documented here: the proportion with and without medical aid.
There are eight observations on four variables.
Variables:
# the marginal and partial tables below are assumed to be built with xtabs()
# from the BTT data frame (variables Counts, Group, MedicalAid, Race)
AB<-xtabs(Counts~Group+MedicalAid)
AC<-xtabs(Counts~Group+Race)
BC<-xtabs(Counts~MedicalAid+Race)
AB.C1<-xtabs(Counts~Group+MedicalAid+Race)[,,1] # Black
AB.C2<-xtabs(Counts~Group+MedicalAid+Race)[,,2] # White
AC.B1<-xtabs(Counts~Group+MedicalAid+Race)[,1,] # No medical aid
AC.B2<-xtabs(Counts~Group+MedicalAid+Race)[,2,] # Yes medical aid
BC.A1<-xtabs(Counts~Group+MedicalAid+Race)[1,,] # Group 1
BC.A2<-xtabs(Counts~Group+MedicalAid+Race)[2,,] # Group 2
library(vcd)
phiAB<-oddsratio(AB) # -0.4713
phiBC<-oddsratio(BC) # 3.903125
phiAC<-oddsratio(AC) # -1.398151
phiAB.C1<-oddsratio(AB.C1) # 0.02837989
phiAB.C2<-oddsratio(AB.C2) # 0.05608947
phiAC.B1<-oddsratio(AC.B1) #-1.442175
phiAC.B2<-oddsratio(AC.B2) #-1.414465
phiBC.A1<-oddsratio(BC.A1) # 3.906292
phiBC.A2<-oddsratio(BC.A2) # 3.934002
c(mod2.glm$deviance,mod3.glm$deviance,
mod4.glm$deviance)
Section 12: Graphical displays for categorical data
Bar charts
This is the simplest and most immediate display for a categorical variable.
Bar charts consist of rectangles (or bars) of arbitrary but constant width, whose height is proportional to the quantity being displayed.
Normally a bar chart shows on the horizontal axis the labels (the categories) identifying the classes into which the "population" under study has been divided, while the vertical axis reports the (absolute or relative) frequency observed for each class.
A bar chart can be obtained in R by passing a categorical variable (of type factor) to the generic plot() function.
SepLeng.categ<-cut(iris$Sepal.Length, breaks=c(4,5,6,7))
plot(SepLeng.categ)
Note that by default the labels assigned to the categories are those declared as the levels of the variable. In this case they are of the form (a, b] (or [a, b) if the option right=FALSE is chosen as a parameter of cut).
If you want to assign a name to each category, to make the plot more readable, you can pass the labels parameter to the cut() function:
mie.etichette<-c("short","medium","long","very long")
SepLeng.categ<-cut(iris$Sepal.Length,
breaks=c(4,5,6,7,8), labels=mie.etichette)
plot(SepLeng.categ)
There is, however, a specific function that produces more complete and customizable bar charts: barplot().
par(mfrow=c(1,2))
barplot(VADeaths)
barplot(VADeaths,beside = TRUE)
Obviously, if the object to be displayed is a vector, it makes no sense to use the beside argument.
d<-c(11,58,65,32,42,55,2,18,70,26)
par(mfrow=c(1,2))
barplot(d,width=3)
barplot(d,width=c(1:length(d)))
par(mfrow=c(2,2))
barplot(VADeaths)
barplot(VADeaths,beside=TRUE)
barplot(VADeaths,beside =TRUE,space=c(1,3))
barplot(d,space=3)
library(vcd)
TAB.colori<-table(color.eyes,color.hair) # assumed: the two-way table from Section 2
mosaicplot(TAB.colori,col=TRUE)
Poissonness plot (Hoaglin, 1980): distplot()
Let x0, x1, . . . be the observed frequency distribution (counts), and let N = x0 + x1 + . . .
The Poisson probability mass function is

Pλ{X = k} = pλ(k) = e^(−λ) λ^k / k!    k = 0, 1, 2, . . .
library(vcd)
library(VGAM)
data(ruge)
ruge
distplot(ruge, type = "poisson")
data("Federalist")
sum(Federalist)
distplot(Federalist, type = "poisson")
Cohen-Friendly association plot: assocplot()
assocplot(TAB.colori)
Section 13: Worked exercises
Exercise 1
We model the probability of developing coronary heart disease (CHD) for each level of the explanatory variable (blood pressure BP).
BP<-factor(c("<117","117-126","127-136","137-146",
"147-156","157-166","167-186",">186"))
CHD<-c(3,17,12,16,12,8,16,8)
n<-c(156,252,284,271,139,85,99,43)
structure(cbind(CHD,n),dimnames=list(as.character(BP),
c("Heart Disease","Sample Size")))

logit(πi) = α + β xi

# scores: numeric scores for the BP categories; not defined in these notes, here
# assumed to be the category midpoints used in Agresti's classical analysis
scores<-c(111.5,121.5,131.5,141.5,151.5,161.5,176.5,191.5)
mod1<-glm(CHD/n ~ scores, family = binomial , weights =n)
summary (mod1)
The predicted probabilities for each blood pressure level can be extracted from the mod1 object using the predict() function and specifying the option type="response".
predict(mod1,type ="response")
A plot of the observed and predicted proportions is generated as follows:
p.hat<-predict(mod1, type ="response")
plot(scores,CHD/n, xlab ="Livello pressione sanguigna",
ylab="Proporzione")
lines (scores,p.hat)
obs<-cbind(yes=CHD,no=n-CHD)
obs
The expected frequencies
mod1.all<-cbind(yes=n*p.hat,no=n-n*p.hat)
mod1.all
The X² statistic is
x.sq<-sum((obs-mod1.all)^2/mod1.all)
x.sq
1-pchisq(x.sq,6)
The G² and X² statistics
G2<-mod1$dev
X2<-x.sq
1-pchisq(X2,6)
1-pchisq (G2,6)
LR=mod1$null.deviance-mod1$deviance
1-pchisq(LR,6)
Alcol<- factor(c("0","<1","1-2","3-5",">=6"),
levels=c ("0","<1","1-2","3-5",">=6"))
Alcol
malform<-c(48,38,5,1,1)
n<-c(17066,14464,788,126,37)
option(contrasts("contr.treatment","contr.poly"))
mod.sat<-glm(malform/n~Alcol,family=binomial,weights =n)
mod.sat
revAlcol<- factor(c("0","<1","1-2","3-5",">=6"),
levels=rev(c("0","<1","1-2","3-5" ,">=6")))
mod.sat.rev<-glm(malform/n~revAlcol,
family = binomial,weights =n)
mod.sat.rev
Note that the fitted values are the same for both parameterizations and are equal to the observed proportions, because both models are saturated, having as many parameters as observations.
cbind(logit=predict(mod.sat),fitted.prop=
predict(mod.sat,type ="response"),malform/n)
cbind(logit=predict(mod.sat.rev),fitted.prop=
predict(mod.sat.rev, type ="response"),malform/n)
plogis(predict(mod.sat.rev))
The X² and LRT statistics for the independence model:
# mod.ind is not defined above; it is assumed to be the intercept-only model
mod.ind<-glm(malform/n~1,family=binomial,weights=n)
sum(residuals(mod.ind,type="pearson")^2) # X-squared
mod.ind$deviance # LRT (G-squared)
1-pchisq(12.34671,4)
We reject the hypothesis of good fit.
scores<-c(0,0.5,1.5,4,7)
mod.score<-glm(malform/n~scores,family=binomial,weights=n)
summary (mod.score)
La statistica X 2
sum(residuals(mod.score,type="pearson")^2)
LR test
mod.score$null.deviance-mod.score$deviance
cbind(logit=predict(mod.score),fitted.prop=
predict(mod.score, type ="response"))
Exercise 3
(The vectors conteggi, ulcera, stato and aspirina, from the ulcer/aspirin data set used in the course, are assumed to be already loaded.)
mod<-glm(conteggi~ulcera*stato+aspirina,
family=poisson())
mod2<-glm(conteggi~ulcera*stato+aspirina*stato,
family=poisson())
The best way to establish the significance of the interaction is to compare the deviances of the two models:
anova(mod, mod2, test="Chisq")
The difference between the models is highly significant. We therefore conclude that aspirin can be considered a risk factor for ulcer. An alternative way to check the significance of the interaction term is to look at the Wald test displayed by the call:
summary(mod2)
however, given its low power, to establish the significance of any coefficient it is preferable to carry out the test on the deviances. To establish whether aspirin is associated differently with the two ulcer sites, we fit the model that also includes the interaction between the variables aspirina and ulcera:
mod3<-glm(conteggi~ulcera*stato*aspirina-
ulcera:stato:aspirina,family=poisson())
res.pear<-residuals(mod3,"pearson")
from which the X² statistic:
X2<-sum(res.pear^2)
1 - pchisq(6.48795, 1)