
An Introduction to Categorical Data Analysis Using R

Mariangela Sciandra

1
Section 1: Dealing with categorical data

2
Basic R data types

R supports a few basic data types:

• numeric: numbers, either floating point or integer

• integer: integer numbers (with sign)

• complex: a complex value in R is defined via the pure imaginary value i

• character: each element is a character string

• logical: binary variable with two possible values, represented by TRUE and FALSE.
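As a minimal sketch (the example values are ours, not from the slides), the class() function reports the basic type of an object:

class(3.14)     # "numeric"
class(2L)       # "integer"
class(1 + 2i)   # "complex"
class("snake")  # "character"
class(TRUE)     # "logical"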

3
Categorical data: factor vs character data type
Categorical data can be defined in R either as character
or as factor.
A character object is a sequence of characters (letters,
numbers, etc.) representing labels.
A factor is the most general data type for categorical
data. Factors are also called categories or enumerated types.
Think of a factor as a set of category names. Factors
are qualitative classifications of objects. Categories do
not imply order: a black snake is different from a brown
snake; it is neither larger nor smaller.
The set of values that the elements of a factor can take
is called its levels.
Examples of categorical data are:

1. a division of a population into males and females

2. the number of dots that appear on the face of a die

3. head or tail in flipping a coin

4. species

5. color of flowers

Categorical data may be presented in graphs. However,
the location of categories along the x or the y axes does
not imply order.
4
In terms of doing statistics, there's no difference in how
R treats factors and character vectors. In fact, it's often
easier to leave factor variables as character vectors.

If you do a regression or ANOVA with lm() using a
character vector as a categorical variable, you'll get the
normal model output but with the message:

Warning message:
In model.matrix.default(mt, mf, contrasts) :
variable 'character_x' converted to a factor

So the factor data type lets the user store categorical data
and treat the values as category labels or levels rather
than as plain characters.
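A minimal sketch of this behaviour (the data frame and variable names are ours for illustration; depending on the R version, the coercion happens silently or with the warning quoted above):

set.seed(1)
d <- data.frame(y = rnorm(6),
                character_x = rep(c("a", "b", "c"), 2),
                stringsAsFactors = FALSE)
fit <- lm(y ~ character_x, data = d)  # character_x is coerced to a factor
summary(fit)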

5
How to Create a Factor in R: the factor() function

The factor() function is used to create a factor.

The only required argument is a vector of values, which
will be returned as a vector of factor values.

Both numeric and character variables can be made into
factors, but a factor's levels will always be character
values.

factor(x, levels, labels = levels, exclude = NA,
       ordered = is.ordered(x))

6
More arguments:

• levels: determines the categories of the factor variable;
the default is the sorted list of all the distinct
values of the data vector.
It is useful when:
– we want to exclude some categories from the
analysis:

x<-factor(c(1,1,2,3,1),
labels=c("Set1","Set2","Set3"))
factor(x,levels=c("Set1","Set3"))

– the number of levels is more than those observed;
in order to specify the factor properly, the levels
argument has to be given:

size<-c("M", "S", "S", "S", "M", "M", "L")
size1<-factor(size,
levels=c("M", "S", "L", "XL"))
size1

• labels:
– allows defining the labels for the levels

7
Example:

data = c(1,2,2,3,1,2,3,3,1,2,3,3,1)
fdata = factor(data)
fdata
levels(fdata)

From this moment on it is meaningless to apply mathematical
operators to the elements of the fdata vector,
because they are labels!

In the same way we can define:

fdata1 = factor(data,labels=c("I","II","III"))
fdata1

or change the level labels using the levels() function:

levels(fdata) = c('I','II','III')

Conversely, to temporarily remove the effects of the
factor class, use the unclass() function:

datanew<-unclass(fdata)
datanew

8
Ordered factors

Sometimes we use the levels of a factor to indicate
order, but not necessarily magnitude.

For example, we can define the labels of presidential
candidates as implying order from the most popular (having
the most votes) to the least popular. Then, in the
U.S. elections, we might have a factor variable named
candidate, with 4 levels such that
Gore > Bush > Nader > Buchanan.

One candidate might have received 10 million votes and
another 1 vote. Ordinal data do not reveal this kind
of information!

For example, we generally agree that rabbits are faster
than turtles. We rarely know by how much.

9
In order to create ordered factors in R, the argument
ordered has to be specified.

It is a logical flag determining whether the levels should
be regarded as ordered (in the order given).

size<-c("M", "S", "S", "S", "M", "M", "L")
levels(size)
is.factor(size) #FALSE
is.character(size) #TRUE
size1<-factor(size)
levels(size1) #in alphabetical order
size2<-factor(size,levels=c("S", "M", "L", "XL"))
levels(size2)
size3<-factor(size,levels=c("S", "M", "L", "XL"),
ordered=TRUE)
levels(size3) #but
size2
size3 # now levels have a real order!
is.ordered(size2) #FALSE
is.ordered(size3) #TRUE

10
The gl() function

The gl() function is a different way to create factors in
R. It generates factors by specifying the pattern of their
levels.

gl(n, k, length = n*k, labels = 1:n, ordered = FALSE)

where

• n: an integer giving the number of levels;

• k: an integer giving the number of replications;

• length: an integer giving the length of the resulting vector;

• labels: an optional vector of labels for the resulting factor levels;

• ordered: a logical indicating whether the result should be ordered or not (the default is not ordered).

11
Examples

degree<-gl(6,2,12)
degree
degree<-gl(6,2,12, labels=c("ness", "elem", "media",
"super","laurea", "speci"))
degree
gender<-gl(n=2,k=2,length=15,labels=c("M","F"))

Now we want to create a factor variable identifying four
groups for 16 observations:

gr<-gl(n=4,k=1,length=16,
labels=c("R.Flood","R.Ctrl","E.Flood","E.Ctrl"))
gr

12
Making numeric data categorical: the cut()
function

Categorical variables can be obtained from numeric
variables by aggregating values.

For example, salaries could be placed into the broad
categories of 0-1 million, 1-5 million and over 5 million,
or age can be handled as categorical if we split it into
age classes.

In R it is possible to obtain a categorical variable x.f
from its numerical values x by using the cut() function.

The cut() function divides the range of x into intervals
(x_i, x_{i+1}] and codes the values in x according to the
interval they fall into.

The leftmost interval corresponds to level one, the next
leftmost to level two, and so on.

13
The cut() function has two basic arguments: x, a numeric
vector, and breaks, a vector of breakpoints.

The breaks argument describes how ranges of numbers
will be converted to factor values:

• if a single number is provided, the resulting factor will
be created by dividing the range of the variable into
that number of equal-length intervals;

• if a vector of values is provided, the values in the
vector are used as the breakpoints.

Note that if a vector of values is provided, the number
of levels of the resulting factor will be one less than the
number of values in the vector.
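A small sketch (the example data are ours) contrasting the two forms of breaks:

x <- c(1, 4, 7, 9)
cut(x, breaks = 3)            # one number: 3 equal-length intervals
cut(x, breaks = c(0, 5, 10))  # 3 breakpoints: 3 - 1 = 2 levels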

14
It is a common mistake to believe that the outer breakpoints
can be omitted, but the result for a value outside
all intervals is set to NA; for example:

data(iris)
iris$Sepal.Length
SepLeng.categ<-cut(iris$Sepal.Length, breaks=c(4,5,6,7))
SepLeng.categ

The intervals are left-open, right-closed by default; that
is, they include the breakpoint at the right end of each
interval.

data(women)
wfact = cut(women$weight,3)
table(wfact)

If we want plain integer codes as labels for the new
classes then we can write

wfact = cut(women$weight,3,labels=FALSE)
table(wfact)

The lowest breakpoint is not included unless you set
include.lowest=TRUE, making the first interval closed at
both ends.

wfact = cut(women$weight,3,include.lowest=TRUE)
table(wfact)

Moreover, if you want your intervals right-open you
need to specify right=FALSE.

wfact = cut(women$weight,3,right=FALSE)
table(wfact)
15
Examples

age<-c(2,15,31,26,12,32,34,70,81,5,3)
age.cat<-cut(age, breaks=c(0,18,45,90),
labels=c("young","adult","old"))
age.cat

library(ISwR)
data(juul)
age<-subset(juul,age>=10 & age<=16)$age
range(age)
agegr<-cut(age,seq(10,16,2),right=FALSE,
include.lowest=TRUE)
length(age)
table(agegr)
agegr2<-cut(age,seq(10,16,2),right=FALSE)
table(agegr2)

16
It is sometimes desirable to split data into roughly
equal-sized groups.

This can be achieved by using breakpoints computed by
quantile():

q<-quantile(age, probs=c(0,0.25,0.50,0.75,1))
q
ageQ<-cut(age,q,include.lowest=TRUE)
table(ageQ)

The level names resulting from cut() turn out rather ugly
at times. Fortunately they are easily changed:

levels(ageQ)<-c("1st","2nd","3rd","4th")
levels(agegr)<-c("10-11","12-13","14-15")

17
How to create an array: the array() function

An array can be considered as a multiply subscripted
collection of data entries, for example numeric ones.

R provides simple facilities for creating and handling
arrays, and in particular the special case of matrices.

You can create an array easily with the array() function,
where you give the data as the first argument and a
vector with the sizes of the dimensions as the second
argument.

The number of dimension sizes in that argument gives
you the number of dimensions.

For example, you make an array with three rows, four
columns and two "tables" as:

my.array <- array(1:24, dim=c(3,4,2))
my.array

In a k-dimensional array each element is identified
by a vector of indices with length equal to k.

The number of elements in the array is given by the
product of the dimensions; so for example the array just
created consists of 24 elements.

18
In order to extract an element we use the same syntax
used for matrices and vectors, with the only difference
that 3 indices now have to be specified:

my.array[3,1,2] extracts a single element
my.array[,1,2] extracts the first column of the second matrix
my.array[3,,1] extracts the third row of the first matrix

An example of a four-dimensional array (4 × 2 × 2 × 2):

Titanic
dim(Titanic)

19
Scalar product in R: the crossprod() function

Matrix and vector cross products, such as

t(x) %*% y

can be obtained in R as crossprod(x,y);

when only the first argument is specified, crossprod(x)
corresponds to X^T X.

Note that even if the output is a single value, its
class is matrix, so we can use the drop() function
to transform it into a scalar.

M<-matrix(,nrow=3,ncol=4)
M[]<-1:12
G<-matrix(,nrow=3,ncol=4)
G[]<-25:36
crossprod(M,G)

The dimension of the resulting matrix is 4 × 4

t(M)%*%G

while

crossprod(M)

will return a matrix corresponding to

t(M)%*%M
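A small sketch of drop() on the 1 × 1 matrix returned by crossprod() (the vectors are ours for illustration):

x <- 1:3
y <- 4:6
crossprod(x, y)        # a 1x1 matrix containing 32
drop(crossprod(x, y))  # 32 as a plain scalar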

20
The Hadamard product of matrices

Let A = [a_{ij}] and B = [b_{ij}] be two matrices with equal
dimension (m × n).

The Hadamard product is a new (m × n) matrix, C = A ∘ B,
where the generic element c_{ij} is obtained as

$$c_{ij} = a_{ij}\,b_{ij} \qquad i = 1,\dots,m \quad j = 1,\dots,n$$

In other words, the Hadamard product of two matrices
of the same size is calculated by simply multiplying
each element of the first matrix by the element in the
corresponding cell of the second matrix.

In R the Hadamard product is the result of

M*G
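As a quick check (a sketch reusing the M and G matrices from the crossprod() example above), each cell of M*G is the product of the corresponding cells:

M <- matrix(1:12, nrow = 3, ncol = 4)
G <- matrix(25:36, nrow = 3, ncol = 4)
H <- M * G                    # Hadamard product, still 3x4
H[2, 3] == M[2, 3] * G[2, 3]  # TRUE: elementwise multiplication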

21
The Kronecker product: kron(), kronecker()
Let A = [a_{ij}] and B = [b_{ij}] be two matrices of arbitrary
sizes (m × n) and (p × q), respectively.
Then the Kronecker product of these two matrices is a
(mp × nq) block matrix defined as:

$$A \otimes B = \begin{pmatrix} a_{11}B & \dots & a_{1n}B \\ a_{21}B & \dots & a_{2n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \dots & a_{mn}B \end{pmatrix}$$

In R the Kronecker product between two matrices
can be obtained by using the kronecker() function or
the kron() function from the fBasics library.

M<-matrix(,nrow=3,ncol=4)
M[]<-1:12
H<-matrix(,nrow=2,ncol=4)
H[]<-1:8
dim(M)
dim(H)

library(fBasics)
K<-kron(M,H)
dim(K)

M<-matrix(c(1,2,3,1),2,2,byrow=TRUE)
G<-matrix(c(0,3,2,1),2,2,byrow=TRUE)
kron(M,G)
kronecker(M,G)
22
Section 2: Creating and manipulating two ways
frequency tables

23
I × J contingency tables: joint distribution

R provides many methods for creating frequency and
contingency tables. Several are described in the following.

Frequency tables can be obtained using the table()
function.

color.eyes<-c("B","D","B","D","D","D","B","B","D","D","D")
color.hair<- c("Dark","Dark","Dark","Blond","Blond",
"Blond","Blond","Dark","Dark","Dark","Dark")
table(color.eyes,color.hair)

Bivariate or, more generally, multivariate distributions
can be obtained by augmenting the number of arguments
specified in table();

the resulting contingency tables will be arrays with
dimension depending on the number of factor arguments
and the number of their levels.

24
Some examples

color.eyes<-c("B","D","B","D","D","D","B","B","D","D","D")
color.hair<- c("Dark","Dark","Dark","Blond","Blond",
"Blond","Blond","Dark","Dark","Dark","Dark")
gender<-gl(n=2,k=1,length=11,labels=c("M","F"))
city<-gl(n=3,k=1,length=11,labels=c("PA","CT","AG"))

Bivariate distributions

table(color.eyes,city)#2x3
table(color.eyes,color.hair) #2x2
table(color.eyes,gender) #2x2
table(color.hair,gender) #2x2
table(color.hair,city) #2x3
table(gender,city) #2x3

Trivariate distributions

table(color.eyes,color.hair,city) #array 2x2x3
table(color.eyes,color.hair,gender) #array 2x2x2
table(color.eyes,city,gender) #array 2x3x2
table(color.hair,city,gender) #array 2x3x2

Multivariate distributions (array 2x2x3x2)

table(color.eyes,color.hair,city,gender)

25
The way to create a contingency table changes if we
are not dealing with individual observations but only
know the joint distribution; in this case a contingency
table is obtained by properly defining a matrix:

Pauling, 1971 example:

Table.P<-matrix(c(31,17,109,122),2,2)
dimnames(Table.P)<-list(c("Placebo","Ascorbic Acid"),
c("Yes","No"))
Table.P

Tonsil example:

Tons<-matrix(c(53,19,829,497),2,2)
dimnames(Tons)<- list(c("Enlarged", "Not enlarged"),
c("Carrier","Not carrier"))
Tons

Suicidal tendencies example:

Suicidio<-matrix(c(26,39,39,20,27,27,195,93,34)
,3,3,byrow=TRUE)
dimnames(Suicidio)<- list(c("Attempted","Contemplated",
"Nothing"),c("Healthy","moderately depressed",
"severely depressed"))
Suicidio

Death penalty and race characteristics example:

Agresti<-array(c(19,0,132,9,11,6,52,97),c(2,2,2))
dimnames(Agresti)<- list(Victimrace=c("White","Black"),
DeathPenalty=c("Yes","NO"),
Defendantrace=c("White", "Black"))
Agresti
26
Marginal and conditional distributions

A conditional distribution is the distribution of one
variable for just those cases that satisfy a condition on
another: conditional distributions are thus represented by
particular rows or columns of the table.

In order to obtain the marginal distributions, the
margin.table() function is very useful.

For a contingency table in array form, it computes the
sum of table entries for a given index.

If the second argument is 1 it returns the row marginals,
by summing values in the table along rows; if the second
argument is 2 the column marginals are returned, by
summing values in the table along columns.

If the second argument is not specified, all cell
frequencies are summed up and the output is the total
number of observations N.

27
Examples:

• Conditional distribution of the depression status for
subjects that attempted suicide:

Suicidio[1,]

• Conditional distribution of tonsil status for subjects
with streptococcus infections:

Tons[,1]

• Conditional distribution of patient status for
individuals under placebo:

Table.P[1,]

• marginal distribution of suicidal tendency:

margin.table(Suicidio,1)

• marginal distribution of health status:

margin.table(Table.P,2)

28
The margin.table() function works also with arrays:

• marginal distribution of death penalty:

margin.table(Agresti,2)

• the total number of observations is

margin.table(Table.P)
margin.table(Tons)
margin.table(Suicidio)
margin.table(Agresti)

29
Relative frequency tables: the prop.table()
function
When using prop.table() on a multidimensional table,
it's necessary to specify which marginal sums you want
to use to calculate the proportions.
To use the row sums, specify 1; to use the column sums,
specify 2; and so on.
If the second argument is not specified, the relative
frequencies are the result of the ratio between the
frequencies in each cell and the total number of
observations N!

• Joint relative frequencies

prop.table(Suicidio)

• Row percentages

prop.table(Suicidio,1)

• Column percentages

prop.table(Suicidio,2)

When the interest is in working with conditional relative
distributions, the second argument must be passed,
because it represents the margin we are
conditioning on!
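As a quick sanity check (a sketch using the Suicidio table defined earlier), conditional relative distributions sum to 1 along the margin being conditioned on:

rowSums(prop.table(Suicidio, 1))  # each row of row-percentages sums to 1
colSums(prop.table(Suicidio, 2))  # each column of column-percentages sums to 1
sum(prop.table(Suicidio))         # joint relative frequencies sum to 1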
30
Section 3: Coding categorical variables

31
Let X_{n,k} be the data matrix of k variables observed on n
statistical units; when one or more of the k variables are
categorical, these can be represented using a different
coding.

• Full Disjunctive Coding

A disjunctive table is a drill-down of a table defined
by n observations and q qualitative variables
V(1), V(2), . . . , V(q) into a table defined by n
observations and p indicators (or dummy variables),
where p is the sum of the numbers of categories of
the q variables: each variable V(j) is broken down
into a sub-table with q(j) columns, where column k
contains 1's for observations corresponding to the
k-th category and 0's for the other observations.

For example, the Full Disjunctive Coding of the vector

$$R = \begin{pmatrix} N \\ C \\ C \\ S \\ N \end{pmatrix}$$

will be (columns: North, Center, South)

$$R = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}$$
32
The use of the FDC can be convenient even if the variable
is dichotomous, in order to deal with all the factors
coded in the same way.

It can be shown that using this type of coding to
create a contingency table is very easy.

For example, suppose we observe two factors on n units,
where the first factor has two levels and the second
three.

The FDC of these two factors will return two matrices
X_{n,2} and Z_{n,3} with columns containing only 0 and 1.

It can be shown that the contingency table is the result
of the cross product of these two matrices:

$$X^T Z = C_{2,3}$$

Example

X<-matrix(c(1,1,0,1,0,0,0,1,0,1),5,2)
Z<-matrix(c(1,0,0,0,0,0,0,1,0,1,0,1,0,1,0),5,3)
X
Z
crossprod(X,Z)

The numbers in this matrix are the joint frequencies.

33
Comparison between FDC and cross coding
Using the same example, compare the two coding
methods:

1. Full Disjunctive Coding: we juxtapose the two FDC
matrices. The resulting matrix has n rows and a number
of columns given by the sum of the numbers of levels
of each factor; in the example above, an n × 5 matrix.

$$R = \begin{pmatrix} 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 \end{pmatrix}$$

2. Cross coding: the resulting matrix has as columns all
the possible combinations of the levels of the two
factors, so the number of columns is equal to
the product of the numbers of levels of each factor.
In the example the cross coding returns an n × 6
matrix.

$$R = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

Note that now each row contains exactly one 1: it is not
allowed to have more than one 1 on the same row!
34
Full Disjunctive Coding
Try to write an R function that returns the FDC of a
vector:

cdc <- function(x, names=NULL) {
  # accept plain vectors and factors (the examples below pass factors)
  if (!is.vector(x) && !is.factor(x))
    stop("x must be a vector or a factor")
  if (!is.factor(x))
    x <- factor(x)
  n <- length(levels(x))
  m <- NULL
  for (i in 1:n) {
    yes <- levels(x)[i]
    v <- as.numeric(x == yes)  # dummy column for the i-th level
    if (is.null(m)) {
      m <- v
    } else {
      m <- cbind(m, v)
    }
  }
  if (is.null(names)) {
    colnames(m) <- levels(x)
  } else {
    colnames(m) <- names
  }
  m
}

regioni<-c("N","C","C","S","N")
cdc(regioni)
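As a more compact alternative (a sketch, not from the original slides), base R's model.matrix() builds the same dummy matrix:

fdc <- function(x) {
  x <- factor(x)
  m <- model.matrix(~ x - 1)  # "- 1" removes the intercept: one dummy per level
  colnames(m) <- levels(x)
  m
}
fdc(regioni)  # same 0/1 matrix as cdc(regioni)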
35
Example and exercises

The ISTAT report on the Labour Force in the 2nd quarter
of 2009 includes, among other things, the seasonally
adjusted data on the distribution of employed people and
people seeking employment in northern, central and
southern Italy. The data can be entered in R as follows:

employed <- factor(c(rep(1,11996),rep(0,612),rep(1,4853),
rep(0,353), rep(1,6318),rep(0,874)))

regions <-factor(c(rep("N",12608),rep("C",5206),
rep("S",7192)))

table(employed, regions)
X1 <- cdc(employed, names=c("D","O"))
X2 <- cdc(regions)

The contingency table can be obtained as:

t(X1) %*% X2 #or
crossprod(X1,X2)

Exercises:

1. Build the complete disjunctive matrix for the
ascorbic acid / common cold example.

2. Try to write a function that returns the cross coding
of a 2 × 2 table.

36
Cross coding function by Marco Ventimiglia

cross<-function(a,b){
  if(length(a)!=length(b))
    stop('the two vectors MUST have the same length!')
  a<-factor(a)
  b<-factor(b)
  d<-paste(a,b)                      # observed level combinations
  lev.a<-levels(a)
  lev.b<-levels(b)
  nla<-length(lev.a)
  nlb<-length(lev.b)
  cat.n<-matrix(,nrow=nla,ncol=nlb)  # all possible combinations
  for(i in 1:nla)
    for(j in 1:nlb)
      cat.n[i,j]<-paste(lev.a[i],lev.b[j])
  names<-c(cat.n)
  n.obs<-length(a)
  n.col<-nla*nlb
  out<-matrix(nrow=n.obs,ncol=n.col)
  for(k in 1:n.obs)
    for(l in 1:n.col)
      out[k,l]<-ifelse(d[k]==names[l],1,0)
  colnames(out)<-names
  return(out)
}
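A hypothetical usage sketch (the second vector is invented for illustration):

a <- c("N", "C", "C", "S", "N")
b <- c("x", "x", "y", "y", "x")
cross(a, b)  # a 5 x 6 matrix with exactly one 1 per row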

37
Visualizing data with many dimensions
When the number of variables to analyse increases, the
use of the array() function creates some problems (above
all in visualizing them).
An alternative method consists in representing the data
in lexicographical order; it consists in reorganizing
the data into a matrix whose rows are all the possible
combinations of the categories of the variables.
The as.data.frame() function applied to an array object
allows us to do that.
For example, look at the array HairEyeColor available
in R:

is.array(HairEyeColor)

To print the results more attractively we could use the
ftable() function, which allows us to flatten the table:

ftable(HairEyeColor)

A different output is obtained if we use the
as.data.frame() function, because now a row is created
for each combination of the levels of the factors.

HEC<-as.data.frame(HairEyeColor)

The last column will contain the frequencies and is
created automatically, while the other columns take
their names and features from the original array.
38
The opposite operation, that is, creating an array from
a table organized in lexicographical order, can be
implemented using the xtabs() function, with arguments
the original data frame and a "formula" where the
variables of interest are specified.

HEC<-as.data.frame(HairEyeColor)
xtabs(Freq~Hair+Eye+Sex,data=HEC)

library(vcd)
data(Arthritis)
art <- xtabs(~ Treatment + Improved, data = Arthritis)
art
The expand.grid() function

It creates a data frame from all combinations of the
supplied vectors or factors.

schizo.counts<-c(90,12,78,13,1,6,19,13,50)
schizo.table<-cbind(expand.grid(list(
Origin=c("Biogenic","Environmental","Combination"),
School=c("Eclectic","Medical","Psychoanalytic"))),
count=schizo.counts)

Note that unlike as.data.frame(), which automatically
creates the frequency column, expand.grid() only returns
combinations of levels; frequencies can be introduced by
binding the count vector by column.

religion.counts<-c(178,570,138,138,648,252,108,442,252)
Tab<-cbind(expand.grid(list(Highest.Degree=c("<HS",
"HS or JH","Bachelor or Grad"),Religious.Beliefs=
c("Fund", "Mod", "Lib"))),count=religion.counts)
Tab

39
Section 4: Simulating sampling schemes for 2 × 2
tables

40
Two-way contingency tables can be the result of different
generating processes, and hence of different sampling
schemes according to the constraints specified. So the
interest will be in identifying the table-generating
process, where the cells become random variables and the
probability distribution of the table depends on the
sampling scheme.

Generating two-way tables of counts is similar to
generating one-way tables of counts, but with a higher
degree of complexity. The main generating probability
mechanisms are the Poisson, Binomial and Multinomial
models, but for two-way tables the marginals play a big
role.

We will discuss the following sampling schemes:

• Unrestricted sampling (Poisson)

• Sampling with fixed total sample size (Multinomial)

• Sampling with certain marginal totals fixed (Product-Multinomial, Hypergeometric)
41
1. Poisson Sampling:
• in a Poisson sampling scheme counts are observed
over a fixed time interval or spatial domain;
• the probability of the event is approximately
proportional to the length of time, for small
intervals of time;
• for small intervals of time the probability of more
than one occurrence is negligible compared to the
probability of one event;
• the numbers of events in non-overlapping time
intervals are independent.
In this scheme none of the marginal totals (or the
grand total) are known!
Then, the four cells as well as the grand total N are
random variables.

How to simulate a Poisson sampling scheme in R

In order to simulate a two-way contingency table
from a Poisson sampling scheme we only need to
specify the Poisson parameter λ and then put our
observations in matrix form:

data1<-rpois(4,lambda=5)
table.pois1<-matrix(data1,2,2)

data2<-rpois(4,lambda=1)
table.pois2<-matrix(data2,2,2)

table.pois1
table.pois2

The first argument of the rpois() function specifies
the number of observations we want to generate (4
in a 2 × 2 table).
Of course, the higher the value of the λ parameter,
the greater the table total size N will be.

42
2. Multinomial Sampling:
Consider now collecting data on a predetermined
number of individuals and classifying them according
to two binary variables (e.g. treatment and
response).
This scheme is related to the Poisson scheme, except
that the grand total N is a fixed quantity.
As in the Poisson scheme, each subject sampled
falls into one of the four cells of the table.
Numbers in the cells are from a multinomial
distribution.

How to simulate a Multinomial sampling scheme in R

In order to simulate a table from a multinomial
sampling scheme we can use the rmultinom() function:

N=30
data<-rmultinom(1,size=N,prob=c(.25,.25,.25,.25))
table.mult<-matrix(data,2,2)

The rmultinom() function requires three arguments
to be specified:
• n: the number of random vectors to generate
(in our example 1);
• size: the fixed grand total N (for example N=30);
• prob: a numeric non-negative vector of length K
(K=4 for 2 × 2 tables), specifying the probabilities
for the K classes; it is internally normalized to
sum to 1.

x<-rmultinom(n=1,size=30,prob=c(.125,.125,.375,.375))
tabella.multinom<-matrix(x,2,2,byrow=TRUE)
tabella.multinom
margin.table(tabella.multinom)

The total N will be equal to the fixed size.


3. Product multinomial sampling:
Here the marginal frequencies n_{i.} (or n_{.j}) are
fixed in advance: we pick fixed numbers for each
level of the explanatory factor and have an
independent multinomial (binomial for two response
levels) distribution in each row.
Hence one margin is fixed by design while the
other is free to vary. So the row totals are
fixed.
The resulting table can be thought of as the product
of the two column (or row) distributions, under the
assumption of independence of such distributions.
This type of sampling is called Independent
Multinomial Sampling. If the "response" variable has
only two levels, it is also called Independent
Binomial Sampling, which is a special case of
independent multinomial sampling.
Viewing the data as product-multinomial is
appropriate when the row totals truly are fixed by
design, as in:
(a) stratified random sampling (strata defined by Y)
(b) an experiment where Y = treatment group
It's also appropriate when the row totals are not
fixed, but we are interested in P(Z|Y) and not P(Y).
That is, when Z is the outcome of interest, and Y
is an explanatory variable that we do not wish to
model.

How to simulate a Product multinomial sampling scheme in R

Try to simulate data from a product binomial sampling
scheme where the column totals are fixed by
design:

n.1<-20
n.2<-10
col1<-rmultinom(n=1,size=n.1,prob=c(.5,.5))
col2<-rmultinom(n=1,size=n.2,prob=c(.5,.5))
tabella.prodmultinom<-cbind(col1,col2)
tabella.prodmultinom
margin.table(tabella.prodmultinom,2)
4. Hypergeometric sampling:
all the row and column totals are fixed!
The best-known example of this type of process
is Fisher's example of the "Lady Tasting Tea",
which we will discuss when we talk about Exact
Tests.
In a 2 × 2 table, the resulting sampling distribution
is hypergeometric.
Note that even when both the row and column
margins are fixed, it does not imply that all cell
frequencies are fixed! (You may create a small 2 × 2
table with row totals (5, 3) and column totals (4, 4)
and see for yourself what different tables are
possible.)
If all totals are fixed, just one cell will be random,
because once you know one of the joint frequencies
the remaining values in the table are uniquely
determined.
This cell will be a realization of a hypergeometric
distribution: it will be a random variable with
bounded support, as the cell frequency cannot be
greater than the frequencies of the respective row
and column margins.
How to simulate a hypergeometric sampling
scheme in R

In order to simulate a 2 × 2 table in R from a
hypergeometric sampling scheme we can use the rhyper()
function:

rhyper(nn,m,n,k)

where
• nn = the number of random numbers
• m = max(n_{1.}, n_{.1})
• n = n_{..} - max(n_{1.}, n_{.1})
• k = min(n_{1.}, n_{.1})

Example

rhyper(1,50,20,15)
12

The output 12 is the (1,1) cell; completing the table
with the fixed margins gives

12 38 | 50
 3 17 | 20
------
15 55 | 70

Generate a table with n = 30 and row and column
margins respectively equal to 10 and 15:

n11<-rhyper(1,15,15,10)
n12<-10-n11
n21<-15-n11
n22<-15-n12
X<-matrix(c(n11,n12,n21,n22),2,2,byrow=TRUE)

The probability of observing this table is

dhyper(n11,15,15,10)

or

dhyper(n11,max((n11+n12),(n11+n21)),
(n11+n12+n21+n22)-max((n11+n12),
(n11+n21)),min((n11+n12),(n11+n21)))
Types of study designs: real examples

1. Cross-sectional study: suppose that a sample of
1166 individuals has been extracted and, once in
the sample, they were asked whether they were in
favour of or against the legal prohibition of
intermarriage, and what their religious affiliation
was. Note that in this example the total sample
size is fixed! Therefore, a cross-sectional study
may be thought of as the result of a multinomial
sampling scheme. It is worth emphasizing that the
sample is not divided at the beginning of the study
into Catholics and Protestants; only after subjects
have become part of the sample is their religious
affiliation identified. Thus in a cross-sectional
study individuals are sampled and only afterwards
classified simultaneously on both variables.

Joint probability distribution

           Liberal  Conservative  Catholic  Atheist   TOT
In favour    103        182          80       16      381
Against      187        238         286       74      785
TOT          290        420         366       90     1166

           Liberal  Conservative  Catholic  Atheist   TOT
In favour   0.088      0.156       0.069     0.014   0.327
Against     0.160      0.204       0.245     0.063   0.673
TOT         0.248      0.360       0.314     0.078   1

The only statement we can make looking at the joint
distribution is: 16% of the sample is liberal and
against the legal prohibition of intermarriage.
However, if we ask: among the different religious
affiliations, how many people are in favour and how
many against intermarriage? To answer this question
it is necessary to calculate the conditional
distribution of racial prejudice given religious
affiliation.
43

Conditional distribution B|A

           Liberal  Conservative  Catholic  Atheist
In favour   0.355      0.433       0.219     0.178
Against     0.645      0.567       0.781     0.822
TOT         1.00       1.00        1.00      1.00
The conditional distribution B|A allows us to make
comparisons between different religious affiliations:
subjects not following any religion and Catholics
seem to be less conservative than Protestants
(liberal and conservative), because they have the
lowest percentages of people in favour of the law.
Among liberals there is a more open attitude than
among conservatives, because there are more people
who are against the law.
On the contrary, if the question were: among those
who are in favour of (or against) the law, what is
the percentage of Catholics, Protestants, etc.? In
this case we should calculate the conditional
distribution of religious affiliation given the
racial prejudice.

Conditional distribution A|B

           Liberal  Conservative  Catholic  Atheist   TOT
In favour   0.270      0.478       0.210     0.042    1.00
Against     0.238      0.304       0.344     0.094    1.00

These percentages have a different interpretation:
48% of the people in favour of the law on the legal
prohibition of intermarriage are conservative.
Moreover, looking at the two conditional distribution
tables we can see that the variables are not
independent, because the conditional distributions
differ from the marginal ones. (A prop.table() sketch
reproducing these tables appears at the end of this
section.)

2. Case-control study: A case-control study is a type
of study design used widely, often in epidemiology.
It is a type of observational study in which two
existing groups differing in outcome are identified
and compared on the basis of some supposed causal
attribute.
Case-control studies are often used to identify
factors that may contribute to a medical condition
by comparing subjects who have that
condition/disease (the "cases") with patients who do
not have the condition/disease but are otherwise
similar (the "controls").
An advantage of case-control studies is that you
acquire sufficient sample sizes for all levels of Y,
which is useful for rare conditions.
You must be careful to note that the sampling rates
for the levels of Y are fixed by the investigator.
Thus, you cannot estimate conditional probabilities
across the rows directly. For example, just because
you picked 200 cases and 200 controls does NOT
mean the population must have 50% cases.

Example:

                    Heart attack
Contraceptive use    YES    NO    TOT
Used                  23    34     57
Not Used              35   132    167
TOT                   58   166    224

In a case-control study, we might expect product
multinomial sampling along the columns under the
independence hypothesis.

We can't refer to heart attack as an outcome variable
and contraceptive use as an explanatory variable; we can
only extract results from the conditional distribution
of contraceptive use given the heart attack event.
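As a sketch (counts transcribed from the cross-sectional table above, with English labels), the joint and conditional distributions of that example can be reproduced with prop.table():

prej <- matrix(c(103, 182, 80, 16,
                 187, 238, 286, 74),
               nrow = 2, byrow = TRUE,
               dimnames = list(
                 Opinion = c("InFavour", "Against"),
                 Affiliation = c("Liberal", "Conservative",
                                 "Catholic", "Atheist")))
prop.table(prej)     # joint distribution (e.g. 0.160 = liberal & against)
prop.table(prej, 2)  # B|A: distributions conditional on affiliation
prop.table(prej, 1)  # A|B: distributions conditional on opinion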
Section 5: Association measures in contingency
tables

44
Inference for a single proportion
The function prop.test() will carry out tests of
hypotheses and produce confidence intervals in problems
involving one or several proportions. In the example
concerning opinion on abortion, there were 424 "yes"
responses out of 950 subjects. Here is one way to use
prop.test() to analyze these data:

prop.test(424,950)

Note that by default:

• the null hypothesis π = 0.5 is tested against the
two-sided alternative π ≠ 0.5;

• a 95% confidence interval for π is calculated;

• both the test and the CI incorporate a continuity
correction.

Any of these defaults can be changed.
The call above is equivalent to

prop.test(424,950,p=.5,alternative="two.sided",
conf.level=0.95,correct=TRUE)

Thus, for example, to test the null hypothesis that π =
0.4 versus the one-sided alternative π > 0.4, with a 99%
(one-sided) CI for π, all without continuity correction,
just type

prop.test(424,950,p=.4,alternative="greater",
conf.level=0.99,correct=FALSE)
45
Comparing Proportions in Two-by-Two Tables

As explained in the documentation for prop.test(), the
data may be represented in several different ways for use
in prop.test(). We will use the matrix representation in
examining the Physicians' Health Study example.

phs <- matrix(c(189,10845,104,10933),byrow=TRUE,ncol=2)
phs
dimnames(phs)<-list(Group=c("Placebo","Aspirin"),
MI=c("Yes","No"))
phs
prop.test(phs)

A continuity correction is used by default, but it makes
very little difference in this example:

prop.test(phs,correct=F)

You can also save the output of the test and manipulate
it in various ways:

phs.test <- prop.test(phs)
names(phs.test)
phs.test$estimate
phs.test$conf.int

We can obtain the relative risk as:

phs.test$estimate[1]/phs.test$estimate[2]
46
Relative risk and Odds ratio

Relative risk and the odds ratio are easy to calculate
(you can do it in lots of ways of course):

phs.test$estimate
odds <- phs.test$estimate/(1-phs.test$estimate)
odds[1]/odds[2]
# as cross-product ratio
(phs[1,1]*phs[2,2])/(phs[2,1]*phs[1,2])

Here's one way to calculate the CI for the odds ratio:

theta <- odds[1]/odds[2]
ASE <- sqrt(sum(1/phs))
ASE
logtheta.CI <- log(theta) + c(-1,1)*1.96*ASE
logtheta.CI
exp(logtheta.CI)

47
It is easy to write a quick and dirty function to do these
calculations for a 2 × 2 table.

odds.ratio <-
function(x, pad.zeros=FALSE, conf.level=0.95) {
if (pad.zeros) {
if (any(x==0)) x <- x + 0.5
}
theta <- x[1,1] * x[2,2] / ( x[2,1] * x[1,2] )
ASE <- sqrt(sum(1/x))
CI <- exp(log(theta)
+ c(-1,1) * qnorm(0.5*(1+conf.level)) *ASE )
list(estimator=theta,
ASE=ASE,
conf.interval=CI,
conf.level=conf.level)
}

For the example above:

odds.ratio(phs)
Relationship between RR and OR
Let R_E be the risk (incidence) among the exposed and
R_NE the incidence among the non-exposed; then the OR is

$$OR = \frac{R_E/(1-R_E)}{R_{NE}/(1-R_{NE})} = \frac{R_E}{R_{NE}} \cdot \frac{1-R_{NE}}{1-R_E} = RR \cdot \frac{1-R_{NE}}{1-R_E}$$

That is, the OR is equal to the RR multiplied by a factor
such that:

• if R_NE < R_E then this factor is greater than 1 and
the OR is greater than the RR;

• if R_NE > R_E then the factor will be < 1 and
the OR will be smaller than the RR.

The OR amplifies the measure of association with respect
to the relative risk.

48
The OR approximates the RR well when the disease
incidence is low, because in this case R_NE and R_E are
very small and the factor is very close to 1.

Examples:

• Disease with a high incidence

library(vcd)  # for oddsratio()
A<-matrix(c(10,50,2,50),2,2,byrow=TRUE,
dimnames = list(c("Exposed", "No-exposed"),
c("Sick", "Healthy")))
RE<-10/60
RNE<-2/52
RRA<-RE/RNE
ORA<-oddsratio(A,log=FALSE)
c(RRA,ORA)

• Disease with a low incidence

B<-matrix(c(10,10000,2,10000),2,2,byrow=TRUE,
dimnames = list(c("Exposed", "No-exposed"),
c("Sick", "Healthy")))
RE<-10/10010
RNE<-2/10002
RRB<-RE/RNE
ORB<-oddsratio(B,log=FALSE)
c(RRB,ORB)

49
Contingencies and Chi-Squared Tests of Independence
The chisq.test() function will compute Pearson's
chi-squared test statistic (X²) and the corresponding
P-value.
For the job-satisfaction example:

jobsatis <- c(2,4,13,3, 2,6,22,4, 0,1,15,8, 0,3,13,8)
jobsatis <- matrix(jobsatis,byrow=TRUE,nrow=4)
dimnames(jobsatis) <- list(
Income=c("<5","5-15","15-25",">25"),
Satisfac=c("VD","LS","MS","VS"))
jobsatis
chisq.test(jobsatis)

In case you are worried about the chi-squared
approximation to the sampling distribution of the
statistic, you can use simulation to compute an
approximate P-value (or use an exact test). The argument
B (default 2000) controls how many simulated tables are
used to compute this value. It is interesting to do it a
few times to see how stable the simulated P-value is
(does it change much from run to run?).

chisq.test(jobsatis,simulate.p.value=TRUE,B=10000)

In this case the simulated P-values agree closely with
the chi-squared approximation, suggesting that the
chi-squared approximation is good in this example.
50
Example:

A group of 219 students, 119 males (M) and 100 females
(F), is subjected to an aptitude test. The disciplines
examined are: A (arts), B (humanities), C (science).

Study the association between sex and aptitude by making
use of contingencies (differences between the absolute
frequencies effectively observed and the absolute
frequencies under the independence hypothesis).

gender<-c("M","F")
aptitude<-c("A","B","C")
data<-matrix(c(35,22,40,27,44,51),2,3,
dimnames=list(gender,aptitude))
tab<-as.table(data)
chi<-summary(tab)$statistic
chi
Exercise 1
In a retrospective study aimed at determining whether
coffee consumption affects the risk of myocardial
infarction, data were collected on patients living in a
small town. Coffee consumption has been classified as low
(<= 3 cups a day) or high (> 3 cups a day). The results
are shown in the table.

Coffee consumption   Infarction YES   Infarction NO   TOT
High                       78             1322        1400
Low                        48             1352        1400
TOT                       126             2674        2800

Evaluate the relative risk and the odds ratio (to assess
whether high consumption of coffee increases the risk of
myocardial infarction compared to low consumption).
We calculate separately the risk for those with low (LW)
and high (HG) coffee consumption:

RLW=48/1400
RHG=78/1400
RR=RHG/RLW
RR
[1] 1.625

Odds ratio:

library(vcd)
coffe<-matrix(c(78,48,1322,1352),2,2)
oddsratio(coffe,log=FALSE)
[1] 1.661876
51
The odds ratio amplifies the measure of association with
respect to the relative risk; it is further away from 1
than the relative risk.

The odds of infarction for people with a HG coffee
consumption are 1.66 times those of people with a LW
consumption.
Exercise 2

In two hypothetical cohort studies, study A and study B,
we evaluated the role of exposure to a chemical pollutant
in relation to the cumulative incidence of two diseases.
In both studies 1000 people were observed.
In study A, 500 people were exposed to the chemical
pollutant while 500 were not; people affected by the
disease in the two groups were respectively 5 and 1.
In study B, 400 people were exposed to the chemical
pollutant and 600 were not; people affected by the
disease in the two groups were respectively 100 and 30.

1. Build two 2 × 2 tables (one for each study);

2. calculate the relative risk in each study;

3. calculate the odds ratio for each study.

StudioA<-matrix(c(5,1,495,499),byrow=FALSE,2,2)
StudioB<-matrix(c(100,30,300,570),byrow=FALSE,2,2)
RRA<-(5/500)/(1/500)
RRB<-(100/400)/(30/600)
library(vcd)
OA<-oddsratio(StudioA,log=FALSE)
OB<-oddsratio(StudioB,log=FALSE)

Despite having the same relative risk, the odds ratios
are different!
52
Exercises in R

1. In a study of human blood types in nonhuman primates,
a sample of 71 orangutans were tested and 14 were
found to be blood type B. Use R to construct a 95%
confidence interval for the proportion (probability)
of blood type B in the orangutan population and
interpret the interval you just computed.

2. Experimental studies of cancer often use strains of
animals that have a naturally high incidence of
tumors. In one such experiment, tumor-prone mice were
kept in a sterile environment, with one group of mice
maintained entirely germ free and the other group
exposed to the intestinal bacterium Escherichia coli.
The accompanying table shows the incidence of liver
tumors. Let pS and pE.coli represent the probabilities
of liver tumors under the sterile and the E. coli
conditions, respectively.

               Sterile   E. coli
Liver Tumor       19        8
No Tumor          30        5

Use R to construct a 95% confidence interval for the
difference in proportions pS − pE.coli.

53
3. Each person in a sample of 276 healthy adult
volunteers was asked about the variety of social
networks that they were in (e.g., relationships with
parents, close neighbors, workmates, etc.). They were
then given nasal drops containing a rhinovirus and
were quarantined for 5 days. Numbers of subjects who
developed colds are recorded in the table below.

           Five or Fewer   Six or more
Cold             57             52
No Cold          66            101

Test for an association between the number of types
of social relationships and developing a cold at
α = 0.05.

4. Herpes simplex virus type 2 (HSV2) is a sexually
transmitted disease. As part of the third National
Health and Nutrition Examination Survey (NHANES III),
the prevalence of HSV2 was determined in four regions
of the United States. The data are given in the
following table.

           Northeast   Midwest   South   West
HSV2           323        381    1320     712
No HSV2       1165       1689    4003    1986

Use a chi-square test to compare the prevalence rates
at α = 0.01. (This is a question about the
distribution of HSV2 being different across the four
regions, but worded in terms of prevalence rates.)
Section 6: Models for 2×2 tables

54
Generalized Linear Models

The Generalized Linear Model (GLM) is a flexible
generalization of ordinary linear regression that allows
for response variables that have other than a normal
distribution.

The GLM generalizes linear regression by allowing the
linear predictor to be related to the response variable
via a link function, and by allowing the magnitude of the
variance of each measurement to be a function of its
predicted value.

Generalized linear models are just as easy to fit in R
as ordinary linear models. In fact, they require only an
additional argument to specify the variance and link
functions.

The basic tool for fitting generalized linear models is
the glm() function, which has the following general
structure:

glm(formula, family, data, weights, subset, ...)

where ... stands for more esoteric options. The argument
family is a simple way of specifying a choice of variance
and link functions.

55
Six possible families can be chosen (the standard R glm
families, with their variance functions and default
links):

Family             Variance       Default link
gaussian           constant       identity
binomial           µ(1 − µ)       logit
poisson            µ              log
Gamma              µ²             inverse
inverse.gaussian   µ³             1/µ²
quasi              user-defined   user-defined

As can be seen, each of the first five choices has an
associated variance function (for binomial, the binomial
variance µ(1 − µ)), and one or more choices of link
functions (for binomial: the logit, probit or
complementary log-log).

As long as you want the default link, all you have to
specify is the family name. If you want an alternative
link, you must add a link argument. For example, to fit a
probit model you use

glm(formula, family=binomial(link="probit"))

The last family on the list, quasi, is there to allow
fitting user-defined models by maximum quasi-likelihood.
The first argument of the function is a model formula,
which defines the response and the linear predictor.

With binomial data the response can be either a vector
or a matrix with two columns:

• If the response is a vector, it is treated as a binary
factor with the first level representing "success"
and all others representing "failure". In this case R
generates a vector of ones to represent the binomial
denominators.

• Alternatively, the response can be a matrix where
the first column shows the number of "successes"
and the second column shows the number of "failures".
In this case R adds the two columns together
to produce the correct binomial denominator. You
can use the function cbind() to create such a matrix by
binding the column vectors of frequencies, as in the
sketch below.
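A minimal sketch of the matrix-response form (the counts and group labels are invented for illustration):

succ <- c(10, 15)             # "successes" in each of two groups
fail <- c(20,  5)             # "failures" in each group
grp  <- factor(c("a", "b"))
fit <- glm(cbind(succ, fail) ~ grp, family = binomial)
summary(fit)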

Following the special symbol ~ that separates the
response from the predictors, we have a standard
Wilkinson-Rogers model formula.
How to set contrasts in R

By default, R uses treatment contrasts, as you can see
when you check the relevant option like this:

options('contrasts')

$contrasts
        unordered           ordered
"contr.treatment"      "contr.poly"

Here you see that R uses different contrasts for
unordered and ordered factors. These contrasts are
actually contrast functions.

They return a matrix with the contrast values for each
level of the factor. The default contrasts for a factor
with three levels look like this:

X <- factor(c('A','B','C'))
contr.treatment(X)

  B C
A 0 0
B 1 0
C 0 1

The two dummy variables B and C are called that way
because the variable B has a value of 1 if the factor
level is B; otherwise, it has a value of 0. The same goes
for C. Level A is represented by two zeros and is called
the reference level.

In a one-factor model, the intercept is the mean of A.
56
You can change these contrasts using the same options()
function, like this:

options(contrasts=c('contr.sum','contr.poly'))

The contrast function contr.sum() gives orthogonal
contrasts where you compare every level to the overall
mean. You can get more information about these
contrasts on the help page ?contr.sum.
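For comparison with the treatment contrasts shown above, a quick sketch of the sum-contrast matrix for the same three-level factor:

X <- factor(c('A', 'B', 'C'))
contr.sum(X)
#   [,1] [,2]
# A    1    0
# B    0    1
# C   -1   -1   (the last level gets -1 in every column)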
The plot() function applied to the model object will
produce the same plots as for a linear model, but adapted
to a generalized linear model; for example the residuals
plotted are deviance residuals (the square root of the
contribution of an observation to the deviance, with the
same sign as the raw residual).

The functions that can be used to extract results from
the fit include

• residuals or resid, for the deviance residuals

• fitted or fitted.values, for the fitted values
(estimated probabilities)

• predict, for the linear predictor (estimated logits)

• coef or coefficients, for the coefficients, and

• deviance, for the deviance.
57
GLM residuals in R
There are many different kinds of residuals for GLMs.
Coming to this from an OLS perspective, you would expect
the residual to be y_i − ŷ_i. In the GLM framework, you
are tempted to replace ŷ_i with µ̂_i. However, that is not
quite the same thing, because µ̂_i is a prediction of the
mean, while y_i is an observed value.
Think of a logit model, where the "predicted probability"
is µ̂_i. Is the residual the difference between the
observed 1 (or 0) and µ̂_i?
Several alternative residuals have been recommended.
Notice that if you fit a model in R and then want the
residuals, you can specify 5 kinds of residuals:

help(residuals.glm)

1. "response": response residuals are simply the
differences between the observed response and its
estimated expected value:

$$y_i - \hat{\mu}_i$$

These differences correspond to the ordinary
residuals in the linear model. Apart from the
Gaussian or normal case, however, the response
residuals are not used in diagnostics, because they
ignore the nonconstant variance that is part of a
GLM.

mod<-glm(y~x,family=poisson)
residuals(mod,"response")
y-fitted(mod)

58

2. "pearson": Pearson residuals are casewise components
of the Pearson goodness-of-fit statistic for the
model:

$$e^{pea}_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\widehat{Var}(y_i \mid x)/\hat{\phi}}}$$

These are a basic set of residuals for use with a
GLM because of their direct analogy to linear
models.

residuals(mod,type="pearson")
(y-fitted(mod))/sqrt(fitted(mod))

3. "deviance": deviance residuals are defined as the
square roots of the casewise components of the
residual deviance, with the sign of y_i − µ̂_i
attached. In the linear model, the deviance residuals
reduce to the Pearson residuals. The deviance
residuals are often the preferred form of residual
for GLMs, and are returned by default when applying
the residuals() function:

$$r^D_i = \mathrm{sgn}(y_i - \hat{\mu}_i)\,\sqrt{d_i}$$

The total deviance is then $D = \sum_{i=1}^{n} d_i$, where

$$d_i = 2\phi\,\{\log f(y_i; y_i, \phi) - \log f(y_i; \hat{\mu}_i, \phi)\}$$

and f(y_i; µ_i, φ) is the probability density function
of Y_i.

residuals(mod,type="deviance")

hat.mu<- exp(mod$linear)
poisson.dev <- function (y, mu)
sqrt(2*(y*log(ifelse(y == 0, 1, y/mu))-(y - mu)))
poisson.dev(y,hat.mu) * ifelse(y > hat.mu,1,-1)

Example

counts <- c(18,17,15,20,10,20,25,13,12)
outcome <- gl(3,1,9)
treatment <- gl(3,3)
glm.D93 <- glm(counts ~ outcome + treatment,
family=poisson())
resid(glm.D93,type="deviance")
fit <- exp(glm.D93$linear)
poisson.dev(counts,fit) * ifelse(counts > fit,1,-1)

In binary logistic regression it will be:

mod.bin<-glm(y~x,family=binomial)
residuals(mod.bin,type="deviance")
hat.mu<-exp(mod.bin$linear)
logistic.dev <- function (y, mu)
sqrt(2*(log(1+mu)-ifelse(y==0,0,log(mu))))
logistic.dev(y,hat.mu)*ifelse(y==1 ,1,-1)
Example

file = "http://ww2.coastal.edu/kingw/statistics/R-tutorials/text/gorilla.csv"
read.csv(file) -> gorilla
glm.out = glm(seen ~ W*C*CW,family=binomial(logit),
data=gorilla)
attach(gorilla)
hat.mu<-exp(glm.out$linear)
logistic.dev(seen,hat.mu) * ifelse(seen==1 ,1,-1)
residuals(glm.out,type="deviance")

4. "working": working residuals are "the residuals in
the final iteration of the IWLS fit" (the working
dependent variable in the IWLS algorithm minus the
linear predictor):

residuals(mod,type="working")
(y - fitted(mod))/exp(mod$linear)

In R the function to apply in order to recover the
fitted means from the linear predictor depends on the
link function used. In this example the exponential
is used because the canonical log link of the Poisson
distribution was assumed.

5. "partial": a matrix of working residuals formed by
omitting each term in the model.

mod.bin2<-glm(formula = seen~W+C,
family = binomial(logit),data = gorilla)
head(residuals(mod.bin2,type="partial"))
head(cbind(residuals(mod.bin2,type="response")
+(coef(mod.bin2)[2]*W),
residuals(mod.bin2,type="response")
+(coef(mod.bin2)[3]*C)))

Note: the output of

mod.bin2$residuals
residuals(mod.bin2)

is different, because in the first case R returns the
working residuals, while in the second case the
default is the deviance residuals.
Logistic regression model

Given a set x of explanatory variables, in a logistic
regression model, as the outcome variable Y is
dichotomous, the stochastic component Y|x is assumed
to follow a Bernoulli distribution with E(Y) = P(Y =
1) = π(x), where π(x) is a function of the regressors
x = (x_1, . . . , x_p).

The systematic component η is assumed to be linear,
that is

η = Xβ.

Finally, the link function, the function linking the
stochastic component to the systematic part, is assumed
to be the logit function, that is, the logarithm of the
ratio between success and failure probabilities given the
x vector.

So, the logistic regression function expression is:

$$\mathrm{logit}(\pi(x)) = \log\left(\frac{\pi(x)}{1-\pi(x)}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_j = X\beta.$$

In other words, the logistic regression model is a linear
regression model on the logit reparameterization of the
natural binomial parameter π(x); indeed:

$$\pi(x) = \frac{\exp(X\beta)}{1 + \exp(X\beta)}$$
59
From this expression it follows that

$$\frac{\pi(x)}{1-\pi(x)} = \exp(X\beta)$$

and so the relationship with the systematic component is
linear in terms of the logarithm of the odds:

$$\log\left(\frac{\pi(x)}{1-\pi(x)}\right) = X\beta$$

One of the main reasons to use the linear logistic
regression model is that model coefficients can be
interpreted as logs of odds ratios. This implies that
estimates of relative risk and their standard errors can
be easily derived from the fitted model.
Parameter interpretation in logistic regression

Let's suppose we consider only a single explanatory
variable; the logistic regression model will be

logit(π(x)) = β_0 + β_1 x

What is the interpretation of β_1?

• Its sign indicates the direction of the effect of the
explanatory variable x on the probability of
observing a success π(x);

• Its value is the log of the odds ratio for the
occurrence of the event at x + 1 relative to the
occurrence of the event at x.

• When β̂_1 = 0, the response variable is independent
of x.

• When β̂_1 > 0 the curve π(x) has the shape of a
logistic probability distribution function.
60
Parameter interpretation and contrast types:
comparison through examples

Suppose we are interested in the cross-classification
between low birth weight (Y = 0 or 1) and the mother's
smoking status (X = 0 or 1):

                      smoke
Low Birth weight     1     0   Total
1                   30    29      59
0                   44    86     130
Total               74   115     189

smoke indicates whether the mother smokes (1 = yes,
0 = no) and low birth weight whether the baby has low
birth weight; the cells contain the corresponding counts.

library(MASS)
attach(birthwt)
table.ese1<-table(low, smoke)
61
Let X = 0 or 1 be an independent dichotomous variable
and Y = 0 or 1 a dependent dichotomous variable. The
ratio of the odds for X = 1 to X = 0, called the odds
ratio, is

$$OR = \frac{\pi(1)/(1-\pi(1))}{\pi(0)/(1-\pi(0))}$$

So:

$$\widehat{OR} = \frac{(30/74)/(44/74)}{(29/115)/(86/115)} = 2.02$$

In this case, the odds ratio represents the risk of low
birth weight for a baby born to a smoking mother compared
to a mother who does not smoke. In other words, smoking
mothers are over twice as likely to give birth to low
weight babies compared to non-smoking mothers.

Next, we wish to cast the odds ratio in a logistic regression framework. For X = 1 and Y = 1, we have

log[P(Y = 1|X)/P(Y = 0|X)] = log[P(Y = 1|X)/(1 − P(Y = 1|X))]

The logistic regression model assumes that X and Y are related via the linear model

log[P(Y = 1|X)/(1 − P(Y = 1|X))] = β0 + β1X
where β0 is the intercept and β1 is the slope of the
regression.

When X = 0 we obtain

log[P(Y = 1|X = 0)/(1 − P(Y = 1|X = 0))] = β0

Therefore,

P(Y = 1|X = 0)/(1 − P(Y = 1|X = 0)) = exp(β0)

In other words, e^β0 is the odds of obtaining Y = 1 when X = 0.

When X = 1, we have

log[P(Y = 1|X = 1)/P(Y = 0|X = 1)] − log[P(Y = 1|X = 0)/P(Y = 0|X = 0)] = (β0 + β1) − β0 = β1

Therefore we interpret β1 as the log odds ratio.


In terms of the probability of Y = 1 it will be

P(Y = 1|X) = e^(β0 + β1X)/(1 + e^(β0 + β1X)) = 1/(1 + e^−(β0 + β1X))

For X = 0 we have

π(0) = 1/(1 + e^−β0)

This model can be fitted in R as follows:

dati.ese1 <- as.data.frame(table.ese1)


mod<-glm(low~smoke,weights=Freq,family=binomial,
data=dati.ese1)
summary(mod)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smoke1 0.7041 0.3196 2.203 0.0276 *
Null deviance: 234.67 on 3 degrees of freedom
Residual deviance: 229.80 on 2 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 5

The value of β̂1 = 0.704 is significant (p = 0.03). Therefore, the estimated odds ratio is

OR̂ = e^0.704 = 2.02

as expected.
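
A hedged check on the odds-ratio scale (confint() on a glm uses profile likelihood):

exp(coef(mod)["smoke1"])   # estimated odds ratio, about 2.02
exp(confint(mod))          # profile-likelihood CIs on the odds scale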

Exercise: Gastroesophageal reflux disease (GERD)
GERD.data <- matrix(c(251,131,4,33), nrow=2)
colnames(GERD.data)<- c("NO", "YES")
rownames(GERD.data)<-c("stressNO", "stressYES")
table <-as.table(GERD.data)
dft <- as.data.frame(table)
dft
fit.treat<-glm(Var2~Var1,weights=Freq,data=dft,
family=binomial)
summary(fit.treat)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.1392 0.5040 -8.213 < 2e-16 ***
Var1stressYES 2.7605 0.5403 5.109 3.23e-07 ***
Null deviance: 250.23 on 3 degrees of freedom
Residual deviance: 205.86 on 2 degrees of freedom
AIC: 209.86
The estimated model is

P̂(Y = 1|X = x) = e^(−4.14 + 2.76x)/(1 + e^(−4.14 + 2.76x))

So subjects without stress will have P(Y = 1) equal to

P̂(Y = 1|X = 0) = e^−4.14/(1 + e^−4.14) = 0.016

while the probability of reflux for stressed subjects is

P̂(Y = 1|X = 1) = e^(−4.14 + 2.76)/(1 + e^(−4.14 + 2.76)) = 0.20
The estimated odds will be

Q̂1 = P̂(Y = 1|X = 1)/(1 − P̂(Y = 1|X = 1)) = e^(−4.14 + 2.76) = 0.252

for stressed subjects, and

Q̂2 = P̂(Y = 1|X = 0)/(1 − P̂(Y = 1|X = 0)) = e^−4.14 = 0.0159

for non-stressed subjects.

This means that the odds for stressed individuals are 15.80 times the odds for non-stressed subjects. In other words, e^2.7605 = 15.80 is the estimated odds ratio.
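
The same quantities can be read off the fitted model directly; a minimal check:

predict(fit.treat,newdata=data.frame(Var1=c("stressNO","stressYES")),
        type="response")               # about 0.016 and 0.20
exp(coef(fit.treat)["Var1stressYES"])  # estimated odds ratio, about 15.80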

If we change the contrasts type

fit.sum<-glm(Var2~C(Var1,sum),weights=Freq,data=dft,
family=binomial)
summary(fit.sum)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7589 0.2701 -10.213 < 2e-16 ***
C(Var1, sum)1 -1.3802 0.2701 -5.109 3.23e-07 ***
Null deviance: 250.23 on 3 degrees of freedom
Residual deviance: 205.86 on 2 degrees of freedom
AIC: 209.86
In this case, since the contrasts have to sum to zero, it follows:

logit(π0 ) = β0 + β1
logit(π1 ) = β0 − β1
LOR = logit(π1 ) − logit(π0 ) = β0 − β1 − β0 − β1 = −2β1

sum(coef(fit.sum))
coef(fit.treat)[1]
-2*(coef(fit.sum))[2]
coef(fit.treat)[2]
Log-linear models for 2 × 2 tables

We start by considering the simplest possible contingency table: a two-by-two table. However, the concepts introduced apply equally well to more general two-way tables where we study the joint distribution of two categorical variables.

Belief in the afterlife

As part of the 1991 General Social Survey, the National Opinion Research Center asked participants whether they believed in an afterlife. The data, broken down by gender, are

Let's examine whether belief and gender are associated.

afterlife<-matrix(c(435, 147,375,134),nrow=2,byrow=TRUE)
afterlife
dimnames(afterlife)<-list(c("Women","Men"),c("Yes","No"))
afterlife
names(dimnames(afterlife))<-c("Gender","Belief")
afterlife

Question: is belief in the afterlife independent of gender?


• Model: Yij ∼ Poisson(λij )

• Mean function: logλij = log(λ) + log(αi ) + log(βj )

• H0 : independence, λij = λ ∗ αi ∗ βj

• H1 : λij is arbitrary.
In R:

freq<-c(435,147,375,134)
gender<-gl(2,2,labels=c("Female","Male"))
belief<-relevel(gl(2,1,4,labels=c("Yes","No")),ref="No")
data.afterlife<-data.frame(gender,belief,freq)
fit1.glm<-glm(freq~gender*belief,
family=poisson(link=log), data=data.afterlife)
summary(fit1.glm)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.99043 0.08248 60.506 <2e-16 ***
beliefYes 1.08491 0.09540 11.372 <2e-16 ***
genderMale -0.09259 0.11944 -0.775 0.438
beliefYes:genderMale -0.05583 0.13868 -0.403 0.687

anova(fit1.glm, test="Chisq")
Parameters interpretation in loglinear models

• Corner point parameterization

glm.treat<-glm(freq~(gender*belief), family=poisson,
data=data.afterlife)
glm.treat
logLik(glm.treat)
mu11<-exp(glm.treat$coeff[1])
#mu11<-exp(glm.treat$linear.pred[1])
mu12<-exp(glm.treat$coeff[1]+glm.treat$coeff[3])
#mu12<-exp(glm.treat$linear.pred[2])
mu21<-exp(glm.treat$coeff[1]+glm.treat$coeff[2])
#mu21<-exp(glm.treat$linear.pred[3])
mu22<-exp(glm.treat$coeff[1]+glm.treat$coeff[2]
+glm.treat$coeff[3]+glm.treat$coeff[4])
#mu22<-exp(glm.treat$linear.pred[4])

These coincide with the fitted values:

expected.treat<-c(mu11,mu12,mu21,mu22)
glm.treat$fitted
expected.treat

list(data.afterlife,matrix(expected.treat,2,2,byrow=TRUE))

theta1.treat<-log(glm.treat$fitted[1])
theta2.treat<-log(glm.treat$fitted[3]
/glm.treat$fitted[1])
theta3.treat<-log(glm.treat$fitted[2]
/glm.treat$fitted[1])
theta4.treat<-log((glm.treat$fitted[1]
*glm.treat$fitted[4])/
(glm.treat$fitted[2]*
glm.treat$fitted[3]))

theta.treat<-c(theta1.treat,theta2.treat,
theta3.treat,theta4.treat)

list(theta.treat,glm.treat$coeff)

The interaction parameter is equal to the logarithm of the odds ratio.
So, in a log-linear model the main interest is in the interaction parameter; the main effects are of little direct interest.

library(vcd)
c(oddsratio(afterlife,log=TRUE),theta4.treat)

In our example, there is little evidence for including the interaction, and thus belief in an afterlife and gender appear to be independent.

• ANOVA parameterization
We can define the ANOVA contrasts as:

glm.sum<-glm(freq~gender*belief,
family=poisson(link=log),
contrasts=list(gender="contr.sum",
belief="contr.sum"))
#
glm.sum1<-glm(freq~C(gender,sum)*C(belief,sum)
, family=poisson)
glm.sum1
#
GENDER<- C(gender,sum)
BELIEF<- C(belief,sum)
glm.sum2<-glm(freq~GENDER*BELIEF,
family=poisson)
glm.sum2

Exercise: verify the relationship between the ANOVA estimates and the cell frequencies using the theoretical results.
In terms of inferential results the two contrast types give the same results.

• Equal likelihoods:

logLik(glm.treat)
logLik(glm.sum)

• Standard errors:
1. from saturated model
X<-model.matrix(~ gender * belief,
contrasts.arg = list(gender = "contr.sum",
belief = "contr.sum"))
t(X)%*%diag(freq)%*%X
solve(t(X)%*%diag(freq)%*%X)
sqrt(diag(solve(t(X)%*%diag(freq)%*%X)))
2. from the additive model

glm.sum<-glm(freq~gender+belief,
family=poisson(link=log))
glm.sum
X<-model.matrix(~gender + belief,
contrasts.arg = list(gender = "contr.sum",
belief = "contr.sum"))
t(X)%*%diag(fitted(glm.sum))%*%X
solve(t(X)%*%diag(fitted(glm.sum))%*%X)
sqrt(diag(solve(t(X)%*%diag(fitted(glm.sum))%*%X)))
Section 7: 2 × 2 tables simulation

Let’s simulate 2 × 2 contingency tables from log-linear
models where coefficients are user defined.

First, it is necessary to define two dichotomous variables:

x1<-c(0,1,0,1)
x2<-c(0,0,1,1)

Once the coefficient values β0, β1, β2 and β3 are fixed, we can simulate the mean values as:

beta0<-1
beta1<-0.5
beta2<--0.2
beta3<-0
mu<-exp(beta0+beta1*x1+beta2*x2)
mu

Obviously, in this example we are simulating from an independence model (β3 = 0). Once the mean values have been computed, the table can be simulated by generating 4 observations from a Poisson distribution whose λ parameter is the vector µ from the specified log-linear model.

y<-rpois(4, mu)
y
cbind(x1,x2,muIND=mu, yIND=y)
TAB<-matrix(y,2,2)
n=sum(TAB)
n
Looking at the relationships between the coefficients and the expected frequencies, it is clear that the total size of the table depends on β0 (because it enters all 4 cells equally, on top of the main effects and the amount of association).

mu<-exp(3+.5*x1-.2*x2)
y<-rpois(4, mu)
cbind(x1,x2,muIND=mu, yIND=y)
TAB2<-matrix(y,2,2)
n=sum(TAB2)
n

Let’s try to simulate from a general model:

mu<-exp(3+.5*x1-.2*x2+5*x1*x2)
y<-rpois(4, mu)
cbind(x1,x2,muASS=mu, yASS=y)
TAB2<-matrix(y,2,2)
TAB2

Exercise: repeat the simulation varying the coefficient values and study how the cell frequencies change (see the sketch below)!
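
A possible starting point for the exercise (the grid of β3 values is only illustrative):

for(b3 in c(-1,0,1)){
  mu<-exp(3+.5*x1-.2*x2+b3*x1*x2)
  print(matrix(rpois(4,mu),2,2))   # association grows with b3
}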
In the same way, let's simulate 2 × 2 contingency tables from a logistic regression model with user-defined coefficients.

set.seed(20)
x1<-c(0,1)
eta<-1+2*x1
pigreco=exp(eta)/(1+exp(eta))
#plogis(eta)

Freq<-rbinom(2,c(100,100),prob=pigreco)
Freq2<-100-Freq
cbind(Freq,Freq2)

fit<-glm(cbind(Freq,Freq2)~x1, family = binomial(logit))


summary(fit)

From an independence model:

set.seed(20)
x1<-c(0,1)
eta<-1+0*x1
pigreco=exp(eta)/(1+exp(eta))
#plogis(eta)
Freq<-rbinom(2,c(100,100),prob=pigreco)
Freq2<-100-Freq
cbind(Freq,Freq2)
fit<-glm(cbind(Freq,Freq2)~x1, family = binomial(logit))
summary(fit)

Section 8: Fisher’s exact test

A brief theoretical review

Fisher’s exact test is a statistical significance test used


in the analysis of contingency in order to verify if data
in a 2 × 2 contingency table confirm the hypothesis (H0 )
which holds that the two categorical variables have no
association with each other.

Although in practice it is employed when sample sizes


are small, it is valid for all sample sizes. It is named
after its inventor, R. A. Fisher, and is one of a class
of “exact” tests, so called because the significance of
the deviation from a null hypothesis can be calculated
exactly, rather than relying on an approximation that
becomes exact in the limit as the sample size grows to
infinity, as with many statistical tests.

With large samples, a chi-squared test can be used. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted under the null hypothesis (the "expected values") being low.

When has Fisher’s exact test to be used?

The usual rule of thumb for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to be overly conservative). In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest. For contingency tables with a large sample size and well-balanced numbers in each cell of the table, Fisher's exact test is not accurate, and the chi-square test is preferred.

Exact computations are based on the statistical theory of exact conditional inference for contingency tables.

Fisher’s exact test is definitely appropriate when the row


totals and column totals are both fixed by design. once
margins have been fixed the joint frequencies distribu-
tion will not depend on nuisance parameters (as total
sample size,row and column margins) but only on the
strength of association between A and B through the
cross product ratio Ψ.

To understand how Fisher's exact test works, it is essential to understand what a contingency table is and how it is used. In the simplest example, there are only two variables to be compared in a contingency table. Usually, these are categorical variables.

To verify independence is equivalent to testing one of:

H0: Ψ = 1 vs H1: Ψ ≠ 1,   or   H0: Ψ = 1 vs H1: Ψ > 1,   or   H0: Ψ = 1 vs H1: Ψ < 1

Under the independence assumption, that is, if the null hypothesis H0: Ψ = 1 holds, with margins fixed by design, the table follows a standard hypergeometric distribution; so, in order to conduct the conditional test with significance level α it is necessary:

1. to order all possible tables that have the same margins as the observed one on the basis of their probability under the null hypothesis;

2. to evaluate αobs as the sum of the probabilities, under the null hypothesis, of the observed table and of any more extreme table;

3. the smaller the value of αobs, the greater the evidence for rejecting the null hypothesis.

Fisher’s exact test in R: fisher.test()
The fisher.test() function performs Fisher's exact test of the null of independence of rows and columns in a contingency table with fixed marginals.
As first argument you can pass either a matrix or two vectors of the same length.
If x is a matrix, it is taken as a two-dimensional contingency table, and hence its entries should be nonnegative integers.
Another important argument is the type of alternative hypothesis. The alternative for a one-sided test is based on the odds ratio, so alternative = "greater" is a test of the odds ratio being bigger than the value of the or argument. Two-sided tests are based on the probabilities of the tables, and take as "more extreme" all tables with probabilities less than or equal to that of the observed table, the p-value being the sum of such probabilities.
Examples:

1. x<-matrix(c(1, 9,11, 3),nrow=2,byrow=T,


dimnames=list(Diet=c("Yes","No"),
Gender=c("M","F")))
fisher.test(x)
fisher.test(x,alternative="less")

2. TeaTasting <-matrix(c(3, 1, 1, 3),nrow=2,


dimnames=list(Guess=c("Milk","Tea"),
Truth =c("Milk","Tea")))
fisher.test(TeaTasting, alternative = "greater")

Now, an example of evaluating αobs "by hand".

Suppose we observe the table

tabobs<-matrix(c(3,2,1,9),2,2,byrow=T)

p3<-dhyper(3,5,10,4)  # probability of the observed table under H0

The tables with the same margins are

tab1<-matrix(c(4,1,0,10),2,2,byrow=T)
tab2<-matrix(c(2,3,2,8),2,2,byrow=T)
tab3<-matrix(c(1,4,3,7),2,2,byrow=T)
tab4<-matrix(c(0,5,4,6),2,2,byrow=T)

The probability of observing each of the previous tables


under the independence hypothesis will be:

p4<-dhyper(4,5,10,4)
p2<-dhyper(2,5,10,4)
p1<-dhyper(1,5,10,4)
p0<-dhyper(0,5,10,4)

plot(0:4,c(p0,p1,p2,p3,p4),type="h")
points(0:4,c(p0,p1,p2,p3,p4),pch=20)

If the alternative hypothesis is two-sided, then:

αobs = p3 + p4

since the table with n11 = 4 is the only one whose probability is lower than that of the observed table (p3). So, in this particular example the observed p-value will be the same under the two-sided and the positive-association ("greater") alternatives:

p3+p4
fisher.test(tabobs,alt="greater")$p.value
fisher.test(tabobs,alt="two.sided")$p.value

On the contrary, if the alternative hypothesis is a negative association, then

αobs = p3 + p2 + p1 + p0

p3+p2+p1+p0
fisher.test(tabobs,alt="less")$p.value
Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level.
In order to demonstrate this, let us simulate tables from an independence model:
x1<-c(0,0,1,1)
x2<-c(0,1,0,1)
mu<-exp(5+.5*x1-.2*x2)
mu
fisher<-rep(NA,1000)
chisq<-rep(NA,1000)
for(i in 1:1000){
y<-rpois(4, mu)
tab<-matrix(y,nrow=2)
fisher[i]<-fisher.test(tab)$p
chisq[i]<-chisq.test(tab)$p.value
}
mean(fisher<=0.05) # actual size of the test
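For comparison, the actual size of the chi-squared test can be computed from the same simulated tables:

mean(chisq<=0.05)  # actual size of the chi-square test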
The p-value cumulative function emphasizes this aspect:

plot(sort(fisher),(1:length(fisher))/length(fisher),
bty="l",type="s", xlab="p-value",ylab="probability")
abline(0,1,lty=2) # the theoretical reference cdf
The empirical cumulative distribution function of the observed p-values lies below the theoretical one (it is known that under the null hypothesis the p-value follows a uniform distribution); this means that the actual rejection rate is below the nominal significance level (0.05): the true probability of incorrectly rejecting the null hypothesis is never greater than the nominal level.
Fisher’s exact test: exercises

• Exercise 1
The data record sex and whether or not the individual has perfect pitch for 99 conservatory of music students. Use Fisher's exact test of the null hypothesis that sex is independent of having perfect pitch.
The data can be tabulated as follows.

• Exercise 2
The following table shows the results of a retrospec-
tive study comparing radiation therapy with surgery
in treating cancer of the larynx. The response indi-
cates whether the cancer was controlled for at least
two years following treatment.

• Report and interpret the P-value for Fisher's exact test with H1: Ψ > 1 and H1: Ψ ≠ 1. Using the function fisher.test(), explain how the P-values were calculated.

• Using R (e.g. write a function that computes the hypergeometric probability for each possible value of the first cell of the 2 × 2 table), plot the (discrete) distribution of the P-value from Fisher's exact test.
Section 9: Polytomous Logistic Regression

Polytomous data

If the response of an individual or item in a study is restricted to one of a fixed set of possible values, we say that the response is polytomous.

The k possible values of Y are called the response categories.

Often the categories can be defined in a qualitative or non-numerical way.

We need to develop satisfactory models that distinguish several types of polytomous response. For instance, if the categories are ordered, there is no compelling reason to treat the extreme categories in the same way as the intermediate ones.

However, if the categories are simply an unstructured


collection of labels, there is no reason a priori to select
a subset of categories for special treatment.

Nominal response variable: baseline-category logit models
Let Y be a nominal response variable with J categories.
Logit models for nominal responses pair each respon-
se category with a baseline category. The choice of
baseline category is arbitrary.
Given a vector x of explanatory variables, let

πj(x) = P(Y = j|x),   with π1(x) + ... + πJ(x) = 1
If we have n independent observations based on these probabilities, the probability distribution for the number of outcomes of each of the J types is multinomial with probabilities (π1(x), ..., πJ(x)).
This model is basically just an extension of the binary logistic regression model. It gives a simultaneous representation of the odds of being in one category relative to being in another category, for all pairs of categories. Once the model specifies the logits for a certain J − 1 pairs of categories, the rest are redundant.
If the last category (J) is the baseline, the baseline-category logits model

log[πj(x)/πJ(x)] = αj + βj′x,   j = 1, ..., J − 1

will describe the effect of x on the J − 1 logits.
Notes

Parameters in the (J − 1) equations determine the parameters of the logits for all other pairs of response categories. For instance, for an arbitrary pair of categories a and b:

log(πa/πb) = log[(πa/πJ)/(πb/πJ)] = log(πa/πJ) − log(πb/πJ) =
= (αa + βa x) − (αb + βb x)
= (αa − αb) + (βa − βb)x
Alligator Food Choice Example
The data is taken from a study by the Florida Game and
Fresh Water Fish Commission of factors influencing the
primary food choice of alligators.
Primary food type has five categories: Fish, Invertebrate, Reptile, Bird and Other.
Explanatory variables are the Lake where alligators were
sampled and the Length of alligator.
food<-factor(c("fish","invert","rep","bird","other"),
levels=c("fish","invert","rep", "bird","other"))
size<-factor(c("<2.3",">2.3"),levels=c(">2.3","<2.3"))
gender<-factor(c("m","f"),levels=c("m","f"))
lake<-factor(c("hancock","oklawaha","trafford","george"),
levels=c("george","hancock", "oklawaha","trafford"))

table.7.1<-expand.grid(food=food,size=size,
gender=gender,lake=lake)

temp<-c(7,1,0,0,5,4,0,0,1,2,16,3,2,2,3,3,0,1,2,3,2,2,0,0,1,
13,7,6,0,0,3,9,1,0,2,0,1,0,1,0,3,7,1,0,1,8,6,6,3,5,2,4,1,1,
4,0,1,0,0,0,13,10,0,2,2,9,0,0,1,2,3,9,1,0,1,8,1,0,0,1)

table.7.1<-structure(.Data=table.7.1[rep(1:nrow(table.7.1),
temp),], row.names=1:219)
We fit several models
library(nnet)
fitS<-multinom(food~lake*size*gender,data=table.7.1)
fit0<-multinom(food~1,data=table.7.1) # null
fit1<-multinom(food~gender,data=table.7.1) # G
fit2<-multinom(food~size,data=table.7.1) # S
fit3<-multinom(food~lake,data=table.7.1) # L
fit4<-multinom(food~size+lake,data=table.7.1) # L+S
fit5<-multinom(food~size+lake+gender,data=table.7.1) #L+S+G
The likelihood ratio test for each model:
deviance(fit1)-deviance(fitS)
deviance(fit2)-deviance(fitS)
deviance(fit3)-deviance(fitS)
deviance(fit4)-deviance(fitS)
deviance(fit5)-deviance(fitS)
deviance(fit0)-deviance(fitS)
Collapsing over gender:

fitS<-multinom(food~lake*size,data=table.7.1) # saturated model


fit0<-multinom(food~1,data=table.7.1) # null
fit1<-multinom(food~size,data=table.7.1) # S
fit2<-multinom(food~lake,data=table.7.1) # L
fit3<-multinom(food~size+lake,data=table.7.1) # L + S

deviance(fit1)-deviance(fitS)
deviance(fit2)-deviance(fitS)
deviance(fit3)-deviance(fitS)
deviance(fit0)-deviance(fitS)
According to the AIC the best model is fit3:
summary(fit3)
In this example the baseline category is the one that crosses "fish", "> 2.3" and "george".

Results:

• In George Lake, for small alligators, the estimated odds of choosing an invertebrate rather than a fish are exp(1.46), i.e. 4.3 times the estimated odds for large alligators. So the length of the alligators plays an important role in determining their primary food choice.

• The estimated odds of choosing an invertebrate rather than a fish are higher in the Trafford and Oklawaha lakes and lower in the Hancock lake, all compared with George lake.

Starting from these results we can evaluate all the redundant odds ratios.

For example, we can evaluate the odds of choosing an "invertebrate" against "other" as:

log(πI/πO) = log[(πI/πF)/(πO/πF)] = log(πI/πF) − log(πO/πF) =

= (−1.55 + 1.465 Size − 1.66 ZH + 0.94 ZO + 1.12 ZT) −
− (−1.90 + 0.335 Size + 0.83 ZH + 0.01 ZO + 1.52 ZT) =
= 0.35 + 1.13 Size − 2.48 ZH + 0.93 ZO − 0.39 ZT
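
This difference can also be computed directly from the fitted coefficient matrix; a sketch, assuming the category labels used above:

cf<-coef(fit3)                  # one row per non-baseline food category
cf["invert",]-cf["other",]      # log odds of invertebrate vs other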

Ordinal response variables: Log-Linear Association
models

Many tables are formed by cross-classifying variables with ordered categories. These can be categorical but ordinal, such as Likert scales (for example, Strongly Disagree, Disagree, Neutral, Agree, Strongly Agree), or continuous variables that have been discretized, such as income formed into intervals.

Tables with ordered categories allow for models with


different types of association built in, since concepts of
direct and inverse relationships make sense.

This permits parsimonious representation of a lack of


independence.

Ordinal response variables: 1. Linear by Linear (Uniform) association

Consider a table in which BOTH the rows and the columns have ordinal categories, and assume that there exist known scores {ui} (for rows) and {vj} (for columns) that represent that ordering.

These scores could be:

• the actual values of a discrete underlying variable

• a score linked to an underlying continuous variable

• an equispaced representation of a non-numerical,


but ordinal scale (such as a Likert scale).

Most typically ui = i and vj = j.

The L-by-L association model is

log µij = λ + λ_i^X + λ_j^Y + θ ui vj

with constraints such as λ_I^X = λ_J^Y = 0.

This can be seen as a special case of the saturated model in which λ_ij^XY = θ ui vj.

The uniform association model adds only one parameter


θ to the independence model, focusing all possible lack
of independence on that one parameter.
Tables with ordered categories: 1. Linear by Linear
(Uniform) association

• If θ = 0 independence holds.

• If θ > 0 the model implies that a higher expected


cell count occurs when ui and vj either go up TO-
GETHER or go down TOGETHER, so there is a
direct association relationship.

• If θ < 0 the model implies that higher expected


cell counts occur when ui is high and vj is low,
or vice versa, so there is an inverse association
relationship.

The θ parameter has a simple interpretation in terms of odds ratios: the log odds ratio is directly proportional to the product of the distance between the rows and the distance between the columns.

So, for example, for the 2 × 2 table using the cells intersecting rows a and c with columns b and d:

log[(µab µcd)/(µad µcb)] = θ(uc − ua)(vd − vb)

This log odds ratio is stronger as |θ| increases and for pairs of categories that are farther apart. So, when ui = i and vj = j, the local odds ratios for adjacent rows and adjacent columns have the common value e^θ.
Tables with ordered categories: 2. Row and Column
Effects Models

The uniform association model assumes prespecified row and column scores. Sometimes either the rows or the columns (but not both) are not ordinal, so such scores don't exist for the nominal variable.

Another possibility is that equispaced scores are not appropriate for a set of rows or columns, and it is convenient to estimate appropriate scores from the observed data (for example, for the Likert-scaled rows and columns it might be that "Strongly disagree" is closer to "Disagree" than "Disagree" is to "Neutral").

Models that can fit tables of this type are the row effects and column effects models.

The row effects model R has the form

log µij = λ + λ_i^X + λ_j^Y + τi vj

Constraints are needed, such as λ_I^X = λ_J^Y = τI = 0. The {τi} are called row effects. This model has (I − 1) more parameters than the independence model.

Independence can be seen as the special case in which τ1 = τ2 = ... = τI.

The row effects model treats the columns as ordinal with known scores and the rows as nominal, since the τi can take any values that sum to zero.

For this class of models, for any pair of rows r < s and columns c < d, the log of the odds ratio formed from the 2 × 2 table of those rows and columns is

log[(µrc µsd)/(µrd µsc)] = (τs − τr)(vd − vc)

The log odds ratio is proportional to the distance between the columns, with constant of proportionality τs − τr.

The column effects model C takes the form

log µij = λ + λ_i^X + λ_j^Y + ρj ui

where the ρ parameters sum to zero.

This model treats the rows as ordinal with known scores


and columns as nominal. Here the quantity ρd − ρc is a
measure of the closeness of the columns c and d with
respect to the conditional distribution of the rows given
the column.

A generalization of the row and column effects models that allows for both row and column effects in the local odds ratio is the row + column effects model (R+C):

log µij = λ + λ_i^X + λ_j^Y + τi vj + ρj ui

The local log odds ratio for unit-spaced row and column
scores is
(τi+1 − τi ) + (ρj+1 − ρj )
incorporating row effects and column effects.

L×L model Example
library(gnm)
library(vcdExtra)
data(Mental) #or in the same way
dati<-expand.grid(mental=c("well","mild",
"moderate","impaired"),ses=1:6)
dati$Freq=c(64,94,58,46,57,94,54,40,57,105,65,60,
72,141,77,94,36,97,54,78,21,71,54,71)
Display the frequency table
Mental.tab <- xtabs(Freq ~ mental+ses, data=Mental)
Fit Independence model
indep <- glm(Freq ~ mental+ses,family = poisson, data = Mental)
deviance(indep) #or
o<-glm(Freq~factor(mental)+factor(ses), family=poisson, data=dati)
deviance(o)

Fit a Linear by Linear Model: use integer scores for rows


and cols
Cscore <- as.numeric(Mental$ses)
Rscore <- as.numeric(Mental$mental)

linlin <- glm(Freq ~ mental + ses + Rscore:Cscore,
family = poisson, data = Mental)

Or
linlin2<-glm(formula = Freq ~ factor(mental) + factor(ses) +
as.numeric(mental):as.numeric(ses),
family = poisson, data = dati)

Now compare models


anova(indep,linlin)
AIC(indep,linlin)
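
The estimated association parameter is the common local log odds ratio; on the odds scale:

exp(coef(linlin)["Rscore:Cscore"])  # common local odds ratio, e^theta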

Row effects model Example
roweff <- glm(Freq ~ mental + ses + mental:Cscore,
family = poisson, data = Mental)

roweff <- glm(Freq ~ factor(mental)+factor(ses) + mental:Cscore,


family = poisson, data = dati)

Column effects model Example
coleff <- glm(Freq ~ mental + ses + Rscore:ses,
family = poisson, data = Mental)

coleff <- glm(Freq ~ factor(mental)+factor(ses) + Rscore:ses,


family = poisson, data = dati)

Exercise: student perception of statistics class
assessment methods

Aim: to study the association between the method used for class assessment (Structured computer assignments, Open-ended assignments, Article analysis, Annotating output) and the amount students learned (Didn't learn anything, Learned a little bit, Learned enough to be comfortable with topic, Learned a great deal).
dati<-expand.grid(response=gl(4,1),
assignments=gl(4,1,labels = c("Structured", "Open","ArtAnaly",
"Annotoutput")))
dati$Freq<-c(0,3,8,3,0,1,7,6,1,6,4,2,0,4,8,2)
# display the frequency table
(assign.tab <- xtabs(Freq ~ response+assignments, data=dati))
chisq.test(assign.tab) #test for independence

In this specific case the L-by-L model, the R model and the R+C model cannot be applied because the method used for assessment is a NOMINAL variable.

The only model that makes sense is a column effects


model:
Rscore <- as.numeric(dati$response)

coleff <- glm(Freq ~ as.factor(response) + as.factor(assignments)


+ Rscore:assignments,family = poisson, data = dati)

Ordinal response variables: 1. Cumulative Logit Models
The logits of the first J − 1 cumulative probabilities are:

logit[P(Y ≤ j|x)] = log[P(Y ≤ j|x)/(1 − P(Y ≤ j|x))] =
= log[(π1(x) + π2(x) + ... + πj(x))/(πj+1(x) + ... + πJ(x))],   j = 1, ..., J − 1

A model for the j-th cumulative logit looks like an ordinary logit model for a binary response in which categories 1 to j combine to form a single category, and categories j + 1 to J form a second category.
It is possible to consider parsimonious models that handle all the J − 1 cumulative logits in a single model: the Proportional Odds Model.
A Proportional Odds Model assumes the following structure:
logit[P(Y ≤ j|x)] = αj + βᵀx,   j = 1, ..., J − 1
It considers:

• different intercepts for each cumulative logit; these intercepts are an increasing function of j;

• a single parameter β describing the effect of X on the log odds of response in category j or below; it assumes an identical effect of X for all J − 1 cumulative logits.

This means that when this model fits well, it requires a


single parameter rather than J −1 parameters to describe
the effect of X.

This class of models is called Proportional Odds Models because it satisfies

logit[P(Y ≤ j|x1)] − logit[P(Y ≤ j|x2)] =
= log{[P(Y ≤ j|x1)/P(Y > j|x1)] / [P(Y ≤ j|x2)/P(Y > j|x2)]} = βᵀ(x1 − x2)

in other words, the cumulative log odds ratio is proportional to the distance between x1 and x2; that is, the odds of a response in category ≤ j when X = x1 are exp[βᵀ(x1 − x2)] times the odds when X = x2, and this value is the same for all the logits.
Comments:

• When the model holds with β = 0, X and Y are


statistically independent;

• Explanatory variables in cumulative logit models


can be continuous, categorical or of both types.

• The ML fitting process uses an iterative algorithm


simultaneously for all j.

For simplicity, let's consider only one predictor:

logit[P(Y ≤ j)] = αj + βx

Then the cumulative probabilities are given by:

P(Y ≤ j) = exp(αj + βx)/(1 + exp(αj + βx))

and since β is constant, the curves of the cumulative probabilities plotted against x are parallel (see the sketch below).
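
A minimal sketch of these parallel curves, with made-up intercepts and slope:

alpha<-c(-2,0,2); beta<-1
curve(plogis(alpha[1]+beta*x),-6,6,ylab="P(Y<=j)")
for(j in 2:3) curve(plogis(alpha[j]+beta*x),add=TRUE,lty=j)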

Cheese-Tasting Example (McCullagh and Nelder,
1989)

In this example, subjects were randomly assigned to taste one of four different cheeses. Response categories range from 1 = strong dislike to 9 = excellent taste.

By inspection, we can see that D is the most preferred, followed by A, C and B.

Let’s try to model these data by a proportional-odds


cumulative-logit model with three dummy codes to di-
stinguish among the four chesses.

• How many logit equations? (J − 1) ∗ (K − 1), where J is the number of response categories and K the number of regressors in the model;

• The model will have 8 intercepts (one for each of


the logit equations) and 3 slopes, for a total of 11
free parameters.

• By comparison, the saturated model, which fits


a separate 9-category multinomial distribution to
each of the four cheeses, has 4 × (9 − 1) = 32 free
parameters.

• Therefore, the overall goodness-of-fit test will have


32-11 = 21 degrees of freedom.

The vglm() function

The VGAM library in R contains the vglm() function, useful for fitting several models. Possible models include the cumulative logit model (family function cumulative) with proportional odds, partial proportional odds or nonproportional odds, cumulative link models (family function cumulative) with or without common effects for each cutpoint, adjacent-categories logit models (family function acat), and continuation-ratio logit models (family functions cratio and sratio).

The vglm() function needs the response variable specified in its "cbinded" form (its full disjunctive coding).

The syntax of the vglm() function is very similar to that of the standard glm().
An important difference is that the weights argument, unlike in glm(), is not a vector of frequencies but a vector of a priori weights.

library(VGAM)
cheese <- read.table("cheese.dat.txt",
col.names=c("Cheese", "Response", "N"))
is.factor(cheese$Response)
cheese$Response<-factor(cheese$Response, ordered=T)
mod.sat<-vglm(Response~Cheese,cumulative,
weights=c(N+0.5),data=cheese)

mod.podds<-vglm(Response~Cheese,cumulative(parallel=TRUE),
weights=c(N+0.5),data=cheese)

summary(mod.sat)
summary(mod.podds)
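
A hedged check of the proportional-odds assumption, comparing the two fits by a likelihood ratio on the 21 df computed above:

G2<-deviance(mod.podds)-deviance(mod.sat)
pchisq(G2,df=21,lower.tail=FALSE)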
matplot(t(mod.podds@predictors[seq(1,36,by=9),]),type="l",
ylab="Cumulative logits",main="Proportional odds model")
#Add a legend will be surely useful!
matplot(t((exp(mod.podds@predictors)/(1+exp(mod.podds@predictors)))
[seq(1,36,by=9),]),type="l",ylab="Cumulative Probability Curves",
main="Proportional odds model")

In this case, a positive coefficient β means that in-
creasing the value of X tends to lower the response
categories (i.e. produce greater dislike).
summary(mod.podds)

Call:
vglm(formula = Response ~ Cheese, family = cumulative(parallel = TRUE),
data = cheese, weights = c(N + 0.5))

Coefficients:
Estimate Std. Error z value
(Intercept):1 -4.84428 0.45697 -10.60089
(Intercept):2 -3.84779 0.37446 -10.27564
(Intercept):3 -2.86231 0.32751 -8.73959
(Intercept):4 -1.91322 0.29232 -6.54497
(Intercept):5 -0.73965 0.25589 -2.89044
(Intercept):6 0.10951 0.24755 0.44237
(Intercept):7 1.44853 0.28180 5.14020
(Intercept):8 2.89229 0.36928 7.83216
CheeseB 2.82260 0.38300 7.36978
CheeseC 1.44005 0.34794 4.13883
CheeseD -1.39122 0.35218 -3.95026

Residual deviance: 817.3119 on 277 degrees of freedom

Log-likelihood: -408.656 on 277 degrees of freedom

CheeseB 2.82260 0.38300 7.36978
CheeseC 1.44005 0.34794 4.13883
CheeseD -1.39122 0.35218 -3.95026

The second part of the output gives the coefficient estimates for the three dummy variables. The estimated slope for the first dummy variable, labeled CheeseB, is 2.82260. This indicates that cheese B does not taste as good as cheese A. Looking at all three coefficients, and noting that cheese A is the reference category, so that β2 compares cheese C to A and β3 cheese D to A, we see that the implied ordering of the cheeses in terms of quality is D > A > C > B. Furthermore, D is significantly preferred to A, and A to C (both z-values in the output are large).

The first part of the output includes the estimated intercepts. The first intercept is the estimated log-odds of falling into category 1 (strong dislike) versus all other categories when all X-variables are zero. Because X1 = X2 = X3 = 0 when cheese = A, the estimated odds of falling into category 1 for cheese A are exp(−4.84428). From the above output, the first estimated logit equation then is

logit[P(Y ≤ 1)] = log[P(Y ≤ 1)/P(Y > 1)] =
= −4.84428 + 2.82260 X1 + 1.44005 X2 − 1.39122 X3

Ordinal response variables: 2. Adjacent-Category Logits models

Adjacent-category logit models can be defined as:

logit[P(Y = j | Y = j or j + 1)] = log(πj/πj+1),   j = 1, ..., J − 1

log(πj/πj+1) = αj + βx,   j = 1, ..., J − 1

with a common effect β.

Also in this case a set of logits is defined, and starting from them it is possible to derive the logits for all the J(J − 1)/2 pairs of response categories.

An adjacent-category logit model can be seen as a baseline-category logit model where the baseline changes for each category.

Job Satisfaction Example

The aim of this example is to study the relationship between job satisfaction (Very Dissatisfied, Little Satisfied, Moderately Satisfied, Very Satisfied) and income (< 5,000; 5,000-15,000; 15,000-25,000; > 25,000), stratified by gender (1 = female, 0 = male), for black Americans.

For simplicity, we use job satisfaction scores and income scores 1, 2, 3, 4.

The fitted model will be

log(πj/πj+1) = αj + β1 x + β2 g,   j = 1, 2, 3

It describes the odds of being very dissatisfied instead of a little satisfied, a little instead of moderately satisfied, and moderately instead of very satisfied. This model is equivalent to the baseline-category logit model with reference category 4, that is

log(πj/π4) = α*_j + β1(4 − j)x + β2(4 − j)g,   j = 1, 2, 3

In order to fit an adjacent-category logit model in R we have to specify the acat family, giving the link function applied to the ratios of adjacent-category probabilities (loge) and parallel=TRUE, a logical indicating whether some terms in the formula are assumed to have equal coefficients.

table.7.8<-read.table("jobsat.txt", header=TRUE)
table.7.8$jobsatf<-ordered(table.7.8$jobsat,
labels=c("very diss","little sat","mod sat",
"very sat"))

table.7.8a<- data.frame(expand.grid(income=1:4,
gender=c(1,0)),unstack(table.7.8,freq~jobsatf))

library(VGAM)

fit.vglm<-vglm(cbind(very.diss,little.sat,
mod.sat,very.sat)~gender+income,
family= acat(link="loge",parallel=T,reverse=T),
data=table.7.8a)

summary(fit.vglm)

summary(fit.vglm)

Coefficients:
Estimate Std. Error z value
(Intercept):1 -0.550668 0.67945 -0.81046
(Intercept):2 -0.655007 0.52527 -1.24700
(Intercept):3 2.025934 0.57581 3.51842
gender 0.044694 0.31444 0.14214
income -0.388757 0.15465 -2.51372

Number of linear predictors: 3

Names of linear predictors:


log(P[Y=1]/P[Y=2]), log(P[Y=2]/P[Y=3]), log(P[Y=3]/P[Y=4])

Dispersion Parameter for acat family: 1

Residual deviance: 12.55018 on 19 degrees of freedom

The ML fit gives β̂1 = −0.389 (SE = 0.155) and β̂2 = 0.045 (SE = 0.314). For this parameterization, β̂1 < 0 means that the odds of lower job satisfaction decrease as income increases. Given gender, the estimated odds of response in the lower of two adjacent categories multiply by exp(−0.389) = 0.68 for each category increase in income. The model describes 24 logits (three for each income × gender combination) with five parameters. Its deviance is G² = 12.6 with df = 19. This model, with a linear trend for the income effect and no interaction between income and gender, seems adequate.
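
A quick numeric check on the fitted object:

exp(coef(fit.vglm)["income"])  # about 0.68
deviance(fit.vglm)             # 12.55 on 19 df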

Ordinal response variables: 3. Continuation-Ratio Logits

Continuation-ratio logits can be defined as

log[πj/(πj+1 + πj+2 + ... + πJ)],   j = 1, ..., J − 1

or also

log[πj+1/(π1 + π2 + ... + πj)],   j = 1, ..., J − 1
They are useful when the response variable represents a sequential mechanism, such as survival as a function of age.

Let ωj = P(Y = j|Y ≥ j); given the vector of explanatory variables x,

ωj(x) = πj(x)/(πj(x) + ... + πJ(x)),   j = 1, ..., J − 1

and the continuation-ratios become ordinary logits log[ωj(x)/(1 − ωj(x))].

Example: Streptococcus and tonsil size

The aim of the study is to investigate the relationship between tonsil size (Not Enlarged, Enlarged, Greatly Enlarged) and the presence of Streptococcus (1 = yes, 0 = no). Let x be the indicator variable for the presence of Streptococcus pyogenes; then the continuation-ratio logit model will be

log[π1/(π2 + π3)] = α1 + βx

log(π2/π3) = α2 + βx

where in the first part a cumulative odds ratio is estimated, while in the second part a local odds ratio is estimated.

carrier<-c(1,0)
y1<-c(19,497)
y2<-c(29,560)
y3<-c(24,269)
tonsil<-cbind(carrier,y1,y2,y3)
tonsil<-as.data.frame(tonsil)
tonsil$carrier<-as.factor(tonsil$carrier)

library(VGAM)
fit.cratio<-vglm(cbind(y1,y2,y3)~carrier,
family=cratio(reverse=FALSE, parallel=TRUE),
data=tonsil)
summary(fit.cratio)
fitted(fit.cratio)

The goodness-of-fit statistics show that the fitted model is adequate (deviance 0.01, df = 1); β̂ = −0.528 (SE = 0.197).

For Streptococcus carriers the odds of having "Enlarged" tonsils vs "Greatly Enlarged" are 0.59 times (exp(−0.528)) the odds for non-carriers.
Section 10: Three-way tables and Simpson's paradox
Given three categorical variables X, Y and Z, we examine the set of all possible hierarchical log-linear models for πijk, where

πijk = P[X = xi, Y = yj, Z = zk],   µijk = nπijk

From now on we will use generating-class notation, in which a model is denoted by the highest-order interactions it includes. So, for example, the model

(XY, XZ)

denotes the model with highest-order interactions X ∗ Y and X ∗ Z and main effects X, Y and Z.

This notation recalls the way models are written in R. A model like the one above, for example, could be written in R as

...formula= freq ~ X*Y + X*Z,
Saturated model (XYZ)

log(πijk) = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ + λ_ijk^XYZ

In this case all the information about cell ijk is given by the cell frequency nijk.

The fit of this model satisfies the condition

π̂ijk = nijk/n,   µ̂ijk = nijk

The number of parameters in the model is

Parameter         # terms
λ                 1
λ_i^X             I − 1
λ_j^Y             J − 1
λ_k^Z             K − 1
λ_ij^XY           (I − 1)(J − 1)
λ_ik^XZ           (I − 1)(K − 1)
λ_jk^YZ           (J − 1)(K − 1)
λ_ijk^XYZ         (I − 1)(J − 1)(K − 1)
Total             IJK

Since the number of parameters in the model equals the number of cells in the three-way table, the degrees of freedom of this model are 0.

Model of no second-order interaction, or homogeneous association (XY, XZ, YZ)

log(πijk) = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_ik^XZ + λ_jk^YZ

For this model the maximum likelihood equations imply that

µ̂ij+ = nij+,   µ̂i+k = ni+k,   µ̂+jk = n+jk

The degrees of freedom are df = (I − 1)(J − 1)(K − 1), i.e. the number of λ parameters set equal to 0 in the saturated log-linear model.
Conditional independence models (XY, XZ), (XY, YZ) or (XZ, YZ)

Three different conditional independence models can be defined; they can be derived from the saturated model by dropping the second-order interaction and one of the first-order interactions. One of the possible forms (for the model (XY, YZ)) is

log(πijk) = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_ij^XY + λ_jk^YZ

in this case λ_ijk^XYZ and λ_ik^XZ are set equal to zero.

In this example X and Z turn out to be conditionally independent given the levels of the variable Y. The degrees of freedom of this model are df = (I − 1)(K − 1)J.

It can be shown that under the conditional independence hypothesis

P[X = xi, Z = zk|Y = yj] = P[X = xi|Y = yj] P[Z = zk|Y = yj]

for every level of Y.

We can therefore think of a conditional independence model as J different I × K tables which, fixing the levels of Y and cross-classifying X and Z, exhibit independence.

A consequence of this is that

P[X = xi, Y = yj, Z = zk] = P[X = xi|Y = yj] P[Z = zk|Y = yj] P[Y = yj]

Moreover, under the conditional independence hypothesis the following relations hold:

µ̂ij+ = nij+,   µ̂+jk = n+jk,   µ̂ijk = nij+ n+jk / n+j+
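
A minimal sketch of these fitted values, assuming tab is an I × J × K array of counts:

nij.<-apply(tab,c(1,2),sum)   # X-Y margin
n.jk<-apply(tab,c(2,3),sum)   # Y-Z margin
n.j.<-apply(tab,2,sum)        # Y margin
muhat<-array(0,dim(tab))
for(j in 1:dim(tab)[2])
  muhat[,j,]<-outer(nij.[,j],n.jk[j,])/n.j.[j]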
Joint independence models (XY, Z), (XZ, Y) or (YZ, X)

Joint independence models are obtained by dropping the second-order interaction term and two of the first-order interactions; equivalently, the model includes a single first-order interaction together with all the main effects. Three different models of this form can therefore be defined.

The formulation of the model (X, YZ), for example, is:

log(πijk) = λ + λ_i^X + λ_j^Y + λ_k^Z + λ_jk^YZ

This model has df = (I − 1)(JK − 1).

In these models one variable is assumed to be independent of the combination of the other two.

For this class of models it can be shown that

µ̂i++ = ni++,   µ̂+jk = n+jk,   µ̂ijk = ni++ n+jk / n+++
Mutual independence model (X, Y, Z)

This model includes only the main effects. The form of the log-linear model is:

log(πijk) = λ + λ_i^X + λ_j^Y + λ_k^Z

The degrees of freedom of a model of this type are:

df = IJK − [(I − 1) + (J − 1) + (K − 1) + 1] = IJK − I − J − K + 2

Under this model all three variables are mutually independent, which implies

µ̂i++ = ni++,   µ̂+j+ = n+j+,   µ̂++k = n++k

and

µ̂ijk = ni++ n+j+ n++k / (n+++)²
Simpson's paradox

Simpson's paradox is, in statistics, the situation in which a relationship between two phenomena appears to be modified, or even reversed, by the data at hand because of other phenomena not taken into account in the analysis.

It underlies frequent errors in statistical analyses in the social and medical sciences, among other fields.

George Udny Yule described it in the article "Notes on the theory of association of attributes in Statistics", which appeared in Biometrika in 1903, and E. H. Simpson in the article "The interpretation of interaction in contingency tables", in the Journal of the Royal Statistical Society (1951).
Example 1

The case studied here concerns the judicial outcome of 326 murder trials in Florida, a state with the death penalty. The study was commissioned by an association for the defense of the rights of black people, which claimed the presence of racial discrimination, naturally against black defendants, in the behavior of judges and juries.

Three dichotomous variables were considered: A = defendant's race, B = judicial outcome (death penalty), C = victim's race.

The aim of the study was to test the hypothesis that the state judicial system penalized black defendants.

First we study the marginal and partial associations within the table by means of odds ratios.

Looking at the simple (marginal) relationship between the defendant's race and the trial outcome, it appears that whites are sentenced to death more often than blacks. Indeed, for whites the estimated probability of a death sentence is 19/160 = 0.1187, while for blacks it is 17/166 = 0.1024. The same conclusion is reached by evaluating the association between the defendant's race and the trial outcome by means of the (marginal) odds ratio:

Φ_AB = (19 ∗ 149)/(17 ∗ 141) = 1.181

Indeed, the estimated odds of a death sentence are 1.181 times higher for white defendants than for black defendants. This flatly contradicts the claim put forward by the organization for the defense of blacks!

Still on the subject of marginal associations, let us now look at the relationship between the victim's race and the trial outcome. If the victim is white, the death penalty is handed down 30 times out of 214 (i.e. 14.01% of the time), while if the victim is black a death sentence occurs in only 5.36% of cases (6/112). Indeed, the marginal odds ratio tells us that

Φ_BC = 2.71

i.e. the death-penalty odds for white victims are 2.71 times those for black victims. The victim's race therefore seems to be very important in assessing the facts!

In fact, conditioning on the victim's race, we observe that if the victim is white (C = c1) the death penalty is imposed 12.58% of the time for white defendants and 17.46% of the time for black defendants, i.e. 4.9 percentage points more often for black defendants (an evident difference, but not as large as one might have thought at the beginning). If the victim is black, death sentences are in any case very few: none among the nine white defendants, and 6.18% for black defendants. This structure of the relationship between the variables, controlling for the victim's race, is also seen by computing the following conditional odds ratios:

Ψ_AB|C=c1 = (19/132)/(11/52) = (19 ∗ 52)/(11 ∗ 132) = 0.680

Ψ_AB|C=c2 = (0.5 ∗ 97.5)/(6.5 ∗ 9.5) = 0.789

i.e. if the victim is white the death-penalty odds for white defendants are 0.68 times those for black defendants, while if the victim is black the odds for whites are 0.79 times those for black defendants. These two values can essentially be considered equal, or at any rate very similar (log(0.680) − log(0.789) = −0.1486735); for a more precise assessment we can study the significance of the individual odds ratios, computing asymptotic confidence intervals for the odds ratios themselves or for their logarithms:

AB.C1<-matrix(c(19,132,11,52),2,2,byrow=T)
AB.C1

AB.C1<-as.table(AB.C1)
AB.C1
AB.C2<-matrix(c(0,6,9,97),2,2)
AB.C2
AB.C2<-as.table(AB.C2)
AB.C2
library(vcd)
summary(oddsratio(AB.C1,log=FALSE))
summary(oddsratio(AB.C2,log=FALSE))
summary(oddsratio(AB.C1))
summary(oddsratio(AB.C2))

Since the confidence intervals for the odds ratios both contain the value one (or, equivalently, the CIs for the log(OR) contain zero), there seems to be no evidence of a significant association between the defendant's race and the death penalty once we condition on the victim's race.

What emerges, then, is that it is the victim's race that makes the difference: the court is not in itself harsher toward black defendants than toward white defendants; the real racism lies in weighing the seriousness of the crime differently according to the victim's race. The murder of a white person is in fact punished more severely than that of a black person.

The (marginal) outcome less favorable to white defendants is therefore explained by the strong association between the defendant's race and the victim's race (whites tend mostly to kill whites, and blacks to kill blacks):

Φ_AC = (151 ∗ 103)/(63 ∗ 9) = 27.430

The association between the victim's race (C) and the defendant's race (A) is quite strong. The odds of killing a white person are 27.43 times higher for a white defendant than for a black one.
Let us now evaluate this complex association structure with a modelling approach. First of all, let us put the data into a vector format suitable for R:
R.impu<-gl(2,4)
R.vitt<-gl(2,2,8)
Pena<-gl(2,1,8)

levels(R.impu)<-c("BI","NE")
levels(R.vitt)<-c("BI","NE")
levels(Pena)<-c("SI","NO")

freq<-c(19,132,0,9,11,52,6,97)

Pena.morte<-data.frame(R.impu,R.vitt,Pena,freq)
xtabs(freq~Pena+R.impu+R.vitt, data=Pena.morte)
Purely for teaching purposes, the syntax of the various association models for 2 × 2 × 2 tables is listed below, even though from the preliminary odds-ratio analyses it should already be clear which class of models one can focus on.
NOTE: if we think of the model in its generating-class form, it is particularly easy to build the right syntax for the glm formula.
• Mutual independence model [A] [B] [C]

mod1.glm<-glm(freq~R.impu+R.vitt+Pena,
family=poisson)
mod1.glm
summary(mod1.glm)

• Joint independence model [B] [AC]

We are assessing the independence of the death penalty from the defendant's and the victim's race jointly (one could also fit [A] [BC] and [C] [AB], but here we only want to show the syntax for models of this class);

mod2.glm<-glm(freq~Pena+R.impu*R.vitt,
family=poisson)
summary(mod2.glm)

• Conditional independence model [BC] [AC]

mod3.glm<-glm(freq~Pena*R.vitt+R.impu*R.vitt,
family=poisson)
summary(mod3.glm)

• Model of no second-order interaction [AB] [BC] [AC]

The syntax of this model can be written in two ways. In the usual generating-class way:

mod4bis.glm<-glm(freq~Pena*R.vitt+R.impu*R.vitt+
Pena*R.impu,family=poisson)

or:

mod4.glm<-glm(freq~(Pena+R.vitt+R.impu)^2 ,
family=poisson)
summary(mod4.glm)

where the "exponent" indicates the number of factors involved in the highest-order interaction present in the model.

• Saturated model [ABC]

modsaturo.glm<-glm(freq~Pena*R.impu*R.vitt,
family=poisson)
modsaturo.glm<-glm(freq~(Pena+R.impu+R.vitt)^3,
family=poisson)
modsaturo.glm
summary(modsaturo.glm)
From the comparison table (rebuilt in the sketch below) we deduce that the worst models within each class are those excluding the association between the defendant's race (A) and the victim's race (C). The best model with respect to G² and parameter parsimony (AIC) is [BC][AC], i.e. the conditional independence model we had already identified by studying the conditional odds ratios!
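
A sketch to rebuild such a comparison table from the models fitted above:

mods<-list("[A][B][C]"=mod1.glm,"[B][AC]"=mod2.glm,
           "[BC][AC]"=mod3.glm,"[AB][BC][AC]"=mod4.glm,
           "[ABC]"=modsaturo.glm)
data.frame(G2=sapply(mods,deviance),
           df=sapply(mods,df.residual),
           AIC=sapply(mods,AIC))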

Example 2
The data record details about the Birth to Ten study (BTT), performed in the greater Johannesburg/Soweto metropolitan area of South Africa during 1990. In the study, all mothers of singleton births during a seven-week period between April and June, with permanent addresses in a defined area, were interviewed (a total of 4019 births). Five years later, 964 of these mothers were re-interviewed. If the mothers interviewed later are representative of the original population, the two groups should show similar characteristics. One of those characteristics is documented here: the proportion with and without medical aid.
There are eight observations on four variables.
Variables:

• Counts: The number of subjects in the given clas-


sification;

• Group: which group the mother belonged to (1


refers to the mothers not followed up after the five
years, 2 refers to the mothers followed-up five years
later);

• MedicalAid: whether or not the mother had medical aid;

• Race: the mother's race.


btt<-read.table("bttstudy.dat",header=TRUE)
btt$Group<-as.factor(btt$Group)
str(btt)
attach(btt)
# A = Group, B = MedicalAid, C = Race
AB<-xtabs(Counts~Group+MedicalAid)
AC<-xtabs(Counts~Group+Race)
BC<-xtabs(Counts~MedicalAid+Race)

AB.C1<-xtabs(Counts~Group+MedicalAid+
Race)[,,1] #Black
AB.C2<-xtabs(Counts~Group+MedicalAid+
Race)[,,2] #White

AC.B1<-xtabs(Counts~Group+MedicalAid+
Race)[,1,] # No Medical aid
AC.B2<-xtabs(Counts~Group+MedicalAid+
Race)[,2,] # Yes Medical Aid

BC.A1<-xtabs(Counts~Group+MedicalAid+
Race)[1,,] # Group 1
BC.A2<-xtabs(Counts~Group+MedicalAid+
Race)[2,,] # Group 2

library(vcd)
phiAB<-oddsratio(AB) # -0.4713
phiBC<-oddsratio(BC) # 3.903125
phiAC<-oddsratio(AC) # -1.398151
phiAB.C1<-oddsratio(AB.C1) # 0.02837989
phiAB.C2<-oddsratio(AB.C2) # 0.05608947
phiAC.B1<-oddsratio(AC.B1) #-1.442175
phiAC.B2<-oddsratio(AC.B2) #-1.414465
phiBC.A1<-oddsratio(BC.A1) # 3.906292
phiBC.A2<-oddsratio(BC.A2) # 3.934002

Let us now evaluate this complex association structure with a modelling approach.
Mutual independence model [A] [B] [C]
mod1.glm<-glm(Counts~Group+MedicalAid+Race,
family=poisson,data=btt)
summary(mod1.glm)
Joint independence model [A] [BC]
mod2.glm<-glm(Counts~Group+MedicalAid*Race,
family=poisson,data=btt)
summary(mod2.glm)
Joint independence model [B] [AC]
mod3.glm<-glm(Counts~Group*Race+MedicalAid,
family=poisson,data=btt)
summary(mod3.glm)
Joint independence model [C] [AB]
mod4.glm<-glm(Counts~Group*MedicalAid+Race,
family=poisson,data=btt)
summary(mod4.glm)

Comparison of the three residual deviances

c(mod2.glm$deviance,mod3.glm$deviance,
mod4.glm$deviance)
Section 12: Graphical displays for categorical data

If I can’t picture it, I can’t understand it.


Albert Einstein

Getting information from a table is like extracting sunlight from a cucumber.
Farquhar & Farquhar, 1891

Diagramma a barre
E’ la più semplice e immediata rappresentazione per una
variabile categoriale.
I diagrammi a barre sono costituiti da rettangoli (o bar-
re) aventi larghezza arbitraria, ma costante, e altezza
proporzionale alla caratteristica che si vuole rappresen-
tare.
Normalmente un diagramma a barre presenta sull’asse
orizzontale le etichette (le modalità) che identificano le
classi in cui è stata suddivisa la “popolazione” oggetto
di studio e sull’asse verticale viene riportata la frequenza
(assoluta o relativa) osservata per ciascuna classe.
Il diagramma a barre può essere ottenuto in R passando
alla generica funzione plot() una variabile categoriale (di
tipo factor).
SepLeng.categ<-cut(iris$Sepal.Length, breaks=c(4,5,6,7))
plot(SepLeng.categ)
Note that by default the labels assigned to the categories are those declared as the levels of the variable. In this case they are of the form (a, b] (or [a, b) if the option right=FALSE is passed to cut()). Note also that with breaks=c(4,5,6,7) the sepal lengths greater than 7 fall outside the intervals and are coded as NA. To assign a name to each category and make the plot more readable, the labels argument can be passed to cut():
mie.etichette<-c("corta","media","lunga", "molto lunga")
SepLeng.categ<-cut (iris$Sepal.Length,
breaks=c(4,5,6,7,8), labels=mie.etichette)
plot(SepLeng.categ)
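Before plotting, the class frequencies of the recoded factor can be inspected with table():

table(SepLeng.categ)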
There is, however, a dedicated function that produces more complete and customizable bar charts: barplot().

The height argument: it can be a matrix or a vector; if a vector, height is the vector of frequencies to plot, so the chart will have as many bars as the length of the vector; if it is a matrix and beside=FALSE (the default), each bar of the chart corresponds to a column of the matrix, with the values in the column drawn as stacked sub-bars; if height is a matrix and beside=TRUE, the values in each column are juxtaposed (side by side) rather than stacked.

Consider for example the VADeaths dataset, which contains death rates in Virginia in 1940 classified by age group (rows) and by sex and type of population (columns).

par(mfrow=c(1,2))
barplot(VADeaths)
barplot(VADeaths,beside = TRUE)

The width argument: it can be a number or a vector; if a number, it gives the constant width of every bar; if a vector, it gives the width of each individual bar.

The beside argument: a logical value, TRUE if we want the bars drawn side by side, FALSE (the default) if we want them stacked.

Obviously, if the object to be plotted is a vector, the beside argument is of no use.

d<-c(11,58,65,32,42,55,2,18,70,26)
par(mfrow=c(1,2))
barplot(d,width=3)
barplot(d,width=c(1:length(d)))

The space argument: an optional vector specifying the space between bars; if height is a matrix and beside=TRUE, two values can be passed to space: the first gives the space between bars within the same group, the second the space between groups. The default is c(0,1) when height is a matrix and beside=TRUE, and 0.2 otherwise.

par(mfrow=c(2,2))
barplot(VADeaths)
barplot(VADeaths,beside=TRUE)
barplot(VADeaths,beside =TRUE,space=c(1,3))
barplot(d,space=3)

The names.arg argument: a vector of names to be displayed under each bar; if not specified, the names are taken by default from the names of the height argument (whether vector or matrix).
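For example, arbitrary labels can be placed under the bars of the vector d defined above (a minimal sketch; the letters are purely illustrative):

barplot(d,names.arg=letters[1:length(d)])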

The legend.text argument: it is useful only when height is a matrix; in that case, with legend.text=TRUE the row names become the labels of the legend.
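For example, on the VADeaths matrix (a minimal sketch):

barplot(VADeaths,beside=TRUE,legend.text=TRUE)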
Mosaic plot

If instead the argument passed to plot() is an object of class table, a mosaic plot is produced.

The mosaic plot is useful for visualizing contingency tables.

colore.occhi<-c("C", "C", "C", "S", "S",
"S", "C", "C", "S", "S", "S")
colore.capelli<-c("Sc", "Sc", "Sc", "Ch",
"Ch", "Ch", "Ch", "Sc", "Sc", "Sc", "Sc")
TAB.colori<-table(colore.occhi,colore.capelli)
plot(TAB.colori)
plot(TAB.colori, color=TRUE)

The idea is to divide a unit square into sub-rectangles, one for each cell of the table, so that the area of each rectangle is proportional to the corresponding cell frequency.
A mosaic plot can also be produced with the mosaicplot() function (in the base graphics package); the vcd package provides the related mosaic() function.

library(vcd)
mosaicplot(TAB.colori,col=TRUE)
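The vcd package also offers mosaic(), which can shade the tiles according to the Pearson residuals (a minimal sketch; vcd is already loaded above):

mosaic(TAB.colori,shade=TRUE)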

Poissonness plot (Hoaglin, 1980): distplot()
Let x_0, x_1, . . . be the observed frequency distribution (counts), and let N = x_0 + x_1 + . . .
The Poisson probability mass function is
P_λ{X = k} = p_λ(k) = e^{−λ} λ^k / k!,   k = 0, 1, 2, . . .

To derive the plot, suppose that for a fixed value of λ each observed frequency x_k equals the expected frequency m_k, defined as
m_k = N p_λ(k) = N e^{−λ} λ^k / k!,   k = 0, 1, 2, . . .

Then, setting x_k = m_k and taking the natural logarithm of both sides,
log(x_k) = log(N) − λ + k log(λ) − log(k!)

Therefore, plotting log(x_k) + log(k!) against k yields a straight line with slope log(λ) and intercept log(N) − λ.
Hence a non-linear pattern indicates a poor fit!
If the plot suggests a satisfactory Poisson fit, λ can be estimated with the maximum likelihood estimator

λ̂ = (Σ_k k x_k) / N
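The whole construction can be reproduced by hand on simulated Poisson counts (a minimal sketch in base R; the sample size 500 and λ = 3 are arbitrary choices):

set.seed(1)
freq<-table(rpois(500,lambda=3)) # observed frequencies x_k
k<-as.numeric(names(freq))
xk<-as.numeric(freq)
phi<-log(xk)+lfactorial(k) # log(x_k)+log(k!)
plot(k,phi,xlab="k",ylab="log(x_k)+log(k!)")
abline(lm(phi~k)) # slope close to log(3), intercept close to log(500)-3
sum(k*xk)/sum(xk) # ML estimate of lambda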
Binomialness plots (Hoaglin and Tukey, 1985) and negative binomialness plots (Hoaglin and Tukey, 1985) can be derived in a similar way.
1. Example of good fit

library(vcd)
library(VGAM)
data(ruge)
ruge
distplot(ruge, type = "poisson")

2. Example of poor fit

data("Federalist")
sum(Federalist)
distplot(Federalist, type = "poisson")

Cohen–Friendly association plot: assocplot()

The Cohen–Friendly association plot highlights departures from the independence hypothesis in a two-way contingency table.

In a two-way table, the Pearson residuals d_ij are the signed contributions of cell (i, j) to Pearson's X² statistic.
In the Cohen–Friendly plot, each cell is represented by a rectangle whose height is proportional (with the corresponding sign) to d_ij and whose width is proportional to the square root of µ_ij; since d_ij = (n_ij − µ_ij)/√µ_ij, the area of the rectangle is proportional to the difference between observed and expected frequencies.
Each rectangle sits on a baseline representing the independence hypothesis (d_ij = 0).

If the observed frequency in cell (i, j) is larger than the expected one, the rectangle lies above the baseline and is coloured black; otherwise it lies below the line and is coloured red. The colours of the rectangles can of course be changed (see the help page).
It is therefore possible not only to assess at a glance the presence or absence of association between the variables, but also to say which cell(s) contribute most to the departure from this hypothesis.

assocplot(TAB.colori)
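The quantities displayed by assocplot() can be recovered from chisq.test() (a minimal check; on such a small toy table chisq.test() may warn that the chi-squared approximation is inaccurate):

chi<-chisq.test(TAB.colori)
chi$expected # expected frequencies mu_ij
chi$residuals # Pearson residuals d_ij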

Section 13: Worked exercises

Exercise 1
We model the probability of developing coronary heart disease (CHD) at each level of the explanatory variable (blood pressure, BP).
BP<-factor(c("<117","117-126","127-136","137-146",
"147-156","157-166","167-186",">186"))
CHD<-c(3,17,12,16,12,8,16,8)
n<-c(156,252,284,271,139,85,99,43)

structure(cbind(CHD,n),dimnames=list(as.character(BP),
c("Heart Disease","Sample Size")))

scores<-c(seq(from=111.5,to=161.5,by=10),176.5,191.5)

logit(π_i) = α + β x_i
mod1<-glm(CHD/n~scores,family=binomial,weights=n)
summary(mod1)
The predicted probabilities at each blood pressure level can be extracted from the mod1 object using the predict() function with the option type="response".
predict(mod1,type ="response")
A plot of the observed and fitted proportions is generated as follows:
p.hat<-predict(mod1, type ="response")
plot(scores,CHD/n,xlab="Blood pressure level",
ylab="Proportion")
lines(scores,p.hat)
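For a smoother fitted curve one can predict over a fine grid of blood-pressure values (a minimal sketch; the grid new.scores is introduced here only for illustration):

new.scores<-data.frame(scores=seq(110,195,by=1))
lines(new.scores$scores,
predict(mod1,newdata=new.scores,type="response"),lty=2)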

We extract the Pearson residuals:

res.pear<-residuals(mod1, type ="pearson")

The X² statistic is given by the sum of the ratios of (n_i − µ_i)² to the fitted frequencies µ_i. The observed frequencies n_i are:

obs<-cbind(yes=CHD,no=n-CHD)
obs

The fitted frequencies:

mod1.all<-cbind(yes=n*p.hat,no=n-n*p.hat)
mod1.all

The X² statistic is:

x.sq<-sum((obs-mod1.all)^2/mod1.all)
x.sq

The degrees of freedom equal the number of levels of the categorical variable minus the number of estimated parameters (8 − 2 = 6). The p-value is then:

1-pchisq(x.sq,6)

The G² and LR statistics:
G2<-mod1$dev
X2<-x.sq
1-pchisq(X2,6)
1-pchisq (G2,6)
LR=mod1$null.deviance-mod1$deviance
1-pchisq(LR,1) # LR test of beta=0: df = 7 - 6 = 1

With df = 6, the X² and G² statistics provide no evidence of poor fit.
Exercise 2

We now consider a logistic model with categorical explanatory variables, using a study of congenital malformations in newborns and maternal alcohol consumption.

Alcol<- factor(c("0","<1","1-2","3-5",">=6"),
levels=c ("0","<1","1-2","3-5",">=6"))

Alcol
malform<-c(48,38,5,1,1)
n<-c(17066,14464,788,126,37)

To constrain the first level to zero (baseline category):

options(contrasts=c("contr.treatment","contr.poly"))

We fit the saturated model:

mod.sat<-glm(malform/n~Alcol,family=binomial,weights =n)
mod.sat

To constrain the last level to zero instead:

revAlcol<- factor(c("0","<1","1-2","3-5",">=6"),
levels=rev(c("0","<1","1-2","3-5" ,">=6")))
mod.sat.rev<-glm(malform/n~revAlcol,
family = binomial,weights =n)
mod.sat.rev

Note that the fitted values are the same under both parameterizations and equal the observed proportions, because both models are saturated: the number of parameters equals the number of observations.

cbind(logit=predict(mod.sat),fitted.prop=
predict(mod.sat,type ="response"),malform/n)

cbind(logit=predict(mod.sat.rev),fitted.prop=
predict(mod.sat.rev, type ="response"),malform/n)

The second column can also be obtained as:

plogis(predict(mod.sat.rev))

The sample proportions tend to increase with alcohol consumption.

We now consider a model in which malformations are assumed to be independent of alcohol consumption:

mod.ind<-glm(malform/n~1,family=binomial,weights=n)
mod.ind

The X² statistic:

sum(residuals(mod.ind, type ="pearson")^2)

The likelihood-ratio statistic:

mod.ind$deviance
1-pchisq(12.34671,4)
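Equivalently, without hard-coding the deviance value (a minimal sketch):

1-pchisq(mod.ind$deviance,mod.ind$df.residual)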
The hypothesis of good fit is rejected.

The model just considered, and the goodness-of-fit statistics used to assess it, ignore the intrinsic ordering of the levels of the alcohol-consumption variable. To take this ordering into account we must assign suitable scores to the levels. For example:

scores<-c(0,0.5,1.5,4,7)
mod.score<-glm(malform/n~scores,family=binomial,weights=n)
summary (mod.score)

The X² statistic:

sum(residuals(mod.score,type="pearson")^2)

LR test

mod.score$null.deviance-mod.score$deviance
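This LR statistic compares the independence model with the linear-logit model and therefore has one degree of freedom; its p-value can be computed as follows (a minimal sketch; LR.score is introduced here only for illustration):

LR.score<-mod.score$null.deviance-mod.score$deviance
1-pchisq(LR.score,1)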

cbind(logit=predict(mod.score),fitted.prop=
predict(mod.score, type ="response"))
Exercise 3

In a retrospective case-control study, patients suffering from ulcer are matched with similar patients who do not. The ulcer patients are classified according to the site of the ulcer: gastric (G) or duodenal (D). The aim is to study the impact of aspirin use on ulcer (U = aspirin users, NU = non-users); in particular, we want to establish whether ulcer is associated with aspirin use and whether the effect of aspirin differs according to the ulcer site.

conteggi <- c(62,6, 39,25, 53,8, 49,8)
ulcera <- factor(c(rep("G",4), rep("D",4)))
stato <- factor(rep(c("cont","cont","caso","caso"),2))
aspirina <- factor(rep(c("NU","U"),4))
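Before modelling, it is useful to arrange the counts in a three-way table to inspect the structure of the data (a minimal sketch):

xtabs(conteggi~stato+aspirina+ulcera)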

To establish whether aspirin is a risk factor for ulcer, we must assess the significance of the interaction between the variables aspirina and stato after adjusting for the effects of the other variables. The models to compare are therefore the one containing the variables stato and ulcera, their interaction, and the variable aspirina, and the one that also contains the interaction between aspirina and stato:

mod<-glm(conteggi~ulcera*stato+aspirina,
family=poisson())
mod2<-glm(conteggi~ulcera*stato+aspirina*stato,
family=poisson())
The best way to assess the significance of the interaction is to compare the deviances of the two models:
anova(mod, mod2, test="Chisq")
The difference between the models is highly significant. We therefore conclude that aspirin can be considered a risk factor for ulcer. An alternative way to check the significance of the interaction term is to examine the Wald tests shown in the output of:
summary(mod2)
however, given its low power, the deviance test is preferable for assessing the significance of any coefficient. To establish whether aspirin is associated differently with the two ulcer sites, we fit the model that also includes the interaction between aspirina and ulcera:
mod3<-glm(conteggi~ulcera*stato*aspirina-
ulcera:stato:aspirina,family=poisson())

anova(mod2, mod3, test="Chisq")


To interpret this result we can examine the coefficients of the fitted model:
summary(mod3)
Since the second-to-last coefficient of the model is positive, we conclude that aspirin use is a stronger risk factor for gastric ulcer than for duodenal ulcer. As for the goodness of fit of the final model, the Pearson residuals are obtained with the call:

res.pear<-residuals(mod3,"pearson")

da cui la statistica X 2 :

X2<-sum(res.pear^2)

From the summary() output shown above we note that there is only one residual degree of freedom, so the p-value of the goodness-of-fit test is:

1 - pchisq(6.48795, 1)

from which we conclude that the model does not follow the data particularly well, despite using 7 parameters to describe 8 observations.
