The Analysis of Multivariate Binary Data

Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 21, No. 2 (1972), pp. 113-120
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2346482

The Analysis of Multivariate Binary Data†
By D. R. Cox
Imperial College, London

SUMMARY
A brief review is given of the main methods and models for the analysis of multivariate binary data. The relation with standard second-order techniques is discussed.
Keywords: BINARY DATA; QUANTAL RESPONSE; MULTIVARIATE ANALYSIS; MULTIDIMENSIONAL CONTINGENCY TABLE; CHI-SQUARED; LATENT STRUCTURE MODEL; PRINCIPAL COMPONENTS; CLUSTERING; DISCRIMINANT ANALYSIS; LOGISTIC MODEL; PROBIT ANALYSIS; PERMUTATIONS; MODES; BAHADUR REPRESENTATION

1. INTRODUCTION
It is fairly common to have multivariate data in which the individual variates are binary, i.e. take one of just two possible values which can be coded as 0 and 1. While there has been appreciable work on the analysis of such data there is no thoroughly developed body of methods and theory to correspond to so-called second order or normal theory methods. The object of the present paper is to review methods that have been proposed and to outline some new proposals, which do, however, need much further development.
The two main methods in common use are probably the following:
(a) to apply second-order methods just as if the 0's and 1's are quantitative observations;
(b) to use the theory of multidimensional contingency tables leading in one approach to a series of chi-squared tests and to the partition of a total chi-squared (Lancaster, 1969; Plackett, 1969).
Method (a) has the major advantage of simplicity and is effective when the dependencies are of a simple form; it can, however, take no account of effects which depend essentially on the interrelationships of variables taken three or more at a time. The use of chi-squared will not be considered further, mainly because of its rather strong emphasis on significance testing.
Except for a few remarks at the end on data with mixed binary and quantitative variates, we deal only with binary variates. It is, however, likely that most of the discussion can be extended to variates with, say, three or four levels of response.
2. STUDIES OF DEPENDENCE AND OF ASSOCIATION
A first important distinction is between binary variables representing responses and those representing explanatory variables or factors. In fact, if there is just one response variable, we have an essentially univariate situation analogous to analysis of variance or multiple regression. Cox (1970) has given a connected account and little more will be said here.
† Based on a talk given to the Multivariate Study Group, Royal Statistical Society, April 1971.
In fact, if the single binary dependent variable is denoted by $Y$, we are essentially concerned with the dependence of $E(Y) = P(Y = 1) = \theta$ on other variables in the problem. If we assume a linear representation for $E(Y)$ and apply the ordinary least squares formulae, we get simple and reasonable procedures provided that the probability lies between about 0.2 and 0.8; outside that range, however, there are difficulties arising partly from the need for weighting and more seriously from fitted values outside the range (0, 1). In that case the use of a linear logistic model will usually be best, i.e. we assume a linear model for $\log\{\theta/(1 - \theta)\}$ and fit, normally by maximum likelihood. Although computationally more complicated, the situation is in principle then exactly comparable to multiple regression.
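The contrast between the two fits can be sketched in a few lines. The example below is illustrative only: the grouped data are hypothetical, and the function names `fit_linear_logistic` and `fit_least_squares` are introduced here rather than taken from the paper. It fits $\log\{\theta/(1-\theta)\} = a + bx$ by Newton-Raphson maximum likelihood and compares the result with an unweighted least squares line through the observed proportions.

```python
import math

def fit_linear_logistic(x, successes, trials, iterations=25):
    """Maximum likelihood fit of log{theta/(1-theta)} = a + b*x by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Accumulate the score vector (s0, s1) and information matrix (i00, i01, i11).
        s0 = s1 = i00 = i01 = i11 = 0.0
        for xi, ri, ni in zip(x, successes, trials):
            theta = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            weight = ni * theta * (1.0 - theta)
            resid = ri - ni * theta
            s0 += resid
            s1 += resid * xi
            i00 += weight
            i01 += weight * xi
            i11 += weight * xi * xi
        det = i00 * i11 - i01 * i01
        # Newton step: solve the 2x2 system information * step = score.
        a += (i11 * s0 - i01 * s1) / det
        b += (i00 * s1 - i01 * s0) / det
    return a, b

def fit_least_squares(x, successes, trials):
    """Unweighted least squares line through the observed proportions."""
    props = [r / n for r, n in zip(successes, trials)]
    xbar = sum(x) / len(x)
    pbar = sum(props) / len(props)
    sxy = sum((xi - xbar) * (pi - pbar) for xi, pi in zip(x, props))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return pbar - slope * xbar, slope

# Hypothetical grouped data: level of x, number responding, number tested.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
trials = [50, 50, 50, 50, 50]
successes = [2, 9, 26, 41, 49]

a, b = fit_linear_logistic(x, successes, trials)
c, d = fit_least_squares(x, successes, trials)
for xi in (-3.0, 0.0, 3.0):
    logistic_fit = 1.0 / (1.0 + math.exp(-(a + b * xi)))
    linear_fit = c + d * xi
    print(f"x={xi:+.0f}  logistic={logistic_fit:.3f}  least squares={linear_fit:.3f}")
```

With these invented data the least squares line gives fitted values outside (0, 1) at the extreme values of x, whereas the logistic fit cannot.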
While the above may well be the most frequently arising situation, from now on we shall concentrate on the genuinely multivariate situation in which there are several, and indeed possibly many, binary response variates. We then have to study the association between these variables and not just the dependence of one variate on others.
The central problem is thus to describe the joint distribution of a set of binary variables. Once this has been done we can deal with such problems as
(a) concise comparison of two or more samples and the construction of discriminant functions;
(b) reductions of dimensionality analogous to principal components;
(c) clustering.
Second-order multivariate analysis owes some of its relative theoretical simplicity to the remarkable fact that all aspects of the multivariate normal distribution are determined by the means and covariance matrix, in all $p(p + 3)/2$ parameters. It is, however, an open empirical question how frequently the scientifically useful information is contained in means and covariances, even when the normal distribution is superficially a tolerable fit.
The oldest approach to multivariate binary data is to define indices of association following essentially Yule. Goodman and Kruskal have reviewed and extended this work (1954, 1959, 1963). We shall not consider it further, aiming to work more directly with the distribution itself.

3. MULTIVARIATE BINARY DISTRIBUTIONS


We now discuss the main ways of describing such distributions. This is a desirable preliminary to the study of methods of analysis. The data of Solomon (1961), reproduced as Table 1, serve as a fairly typical example, although the problems are of course different in emphasis if the number $p$ of variates is appreciably greater than the value four involved here. (Note, however, that there is some doubt as to how the illustrative data are best regarded. Following others we shall treat the data as samples from two four-variate populations. It may, however, be better considered as a single population sampled "retrospectively".)
There follow brief notes on eight kinds of model.
(i) Independent variables. The component binary random variables $Y_1, \ldots, Y_p$ may be treated as independent. Of course this greatly simplifies such things as the comparison of samples and the construction of discriminant functions. A central question is then how large the departures from independence have to be to make the procedures based on independence misleading. Independence gives a model with $p$ parameters.
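As a rough illustration of model (i), the sketch below estimates the $p = 4$ marginal probabilities for the Low I.Q. group of Table 1 and compares the observed cell counts with those fitted under independence. The counts are transcribed from Table 1 as printed here; the variable and function names are introduced for the example only.

```python
from itertools import product

# Low I.Q. group counts from Table 1, keyed by the response pattern (y1, y2, y3, y4).
counts = {
    (1, 1, 1, 1): 62,  (1, 1, 1, 0): 70,  (1, 1, 0, 1): 31,  (1, 1, 0, 0): 41,
    (1, 0, 1, 1): 283, (1, 0, 1, 0): 253, (1, 0, 0, 1): 200, (1, 0, 0, 0): 305,
    (0, 1, 1, 1): 14,  (0, 1, 1, 0): 11,  (0, 1, 0, 1): 11,  (0, 1, 0, 0): 14,
    (0, 0, 1, 1): 31,  (0, 0, 1, 0): 46,  (0, 0, 0, 1): 37,  (0, 0, 0, 0): 82,
}
p = 4
n = sum(counts.values())

# The p parameters of the independence model: estimated P(Y_i = 1).
theta = [sum(c for y, c in counts.items() if y[i] == 1) / n for i in range(p)]

def independence_prob(y):
    """Probability of pattern y when the components are independent."""
    prob = 1.0
    for yi, ti in zip(y, theta):
        prob *= ti if yi == 1 else 1.0 - ti
    return prob

fitted = {y: n * independence_prob(y) for y in product((0, 1), repeat=p)}

print("pattern  observed  fitted under independence")
for y in sorted(counts, reverse=True):
    print("".join(map(str, y)), f"{counts[y]:9d}", f"{fitted[y]:10.1f}")
```

Comparing the two columns gives an informal view of how far the data depart from independence; no formal test is attempted here.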
(ii) Arbitrary multinomial distributions. Another simple model, in a sense complementary to (i), is to treat the sample as a multinomial one with $2^p - 1$ independent parameters corresponding to the $2^p$ distinct observations that can be obtained. The disadvantages of this are that it is only applicable when $n$, the number of observations, is fairly large and $p$ is fairly small, there being a reasonable number of observations in each cell, and also that it gives little insight into the structure of the data.

TABLE 1
Distribution of four binary variates in two groups

Response pattern    Low I.Q. group    High I.Q. group

1111                      62               122
1110                      70                66
1101                      31                33
1100                      41                25
1011                     283               329
1010                     253               247
1001                     200               172
1000                     305               217
0111                      14                20
0110                      11                10
0101                      11                11
0100                      14                 9
0011                      31                56
0010                      46                55
0001                      37                64
0000                      82                53

Total                  1,491             1,491

Source: Solomon (1961).

(iii) Logistic models. The remaining models are intermediate between (i) and (ii), and allow the presence of special kinds of dependence. The simplest, most flexible, and in many ways the most important models are probably the logistic representations of the probabilities. Write $Z_j = 2Y_j - 1$, so that the $Z$'s take values $\pm 1$. Suppose that

$$\log P(Z_1 = z_1, \ldots, Z_p = z_p) = \alpha_1 z_1 + \cdots + \alpha_p z_p + \alpha_{12} z_1 z_2 + \cdots + \alpha_{p-1,p} z_{p-1} z_p + \cdots - \Delta, \qquad (1)$$

where $\Delta$ is a normalizing constant; $e^{\Delta}$ is a sum of exponentials chosen to make the probabilities sum to unity. If only the first degree terms are included, we have the independence model (i), whereas if all terms up to $z_1 \cdots z_p$ are taken we have in effect the general multinomial model (ii). What we hope for is that only a fairly limited number of terms need to be included, usually selected in the light of the data.
There are of course many special cases; for instance the analogue of the "equal correlation" case of normal theory is to take $\alpha_{ij} = \alpha$, $\alpha_{ijk} = \cdots = 0$. Note that if we put in all first and second degree terms there are $\tfrac{1}{2}p(p + 1)$ parameters ($p$ first degree and $\tfrac{1}{2}p(p - 1)$ second degree), nearly as many as the $\tfrac{1}{2}p(p + 3)$ of normal theory. This is rather disconcerting in that one might expect that binary data would support substantially fewer parameters than quantitative data and one suspects that multivariate normal theory models are often over-parameterized.
The interpretation of the parameters is best seen by considering conditional distributions. For example, conditionally on $Z_2 = z_2, \ldots, Z_p = z_p$,

$$\tfrac{1}{2} \log\left\{\frac{P_c(Z_1 = 1)}{P_c(Z_1 = -1)}\right\} = \alpha_1 + \alpha_{12} z_2 + \cdots + \alpha_{1p} z_p + \alpha_{123} z_2 z_3 + \cdots,$$

where the subscript $c$ indicates a conditional probability. Unfortunately, although such conditional probabilities have a simple form, marginal probabilities do not. In particular, $\log\{P(Z_1 = 1)/P(Z_1 = -1)\}$ and, for $p > 2$,

$$\log\{P(Z_1 = 1 \mid Z_2 = z_2)/P(Z_1 = -1 \mid Z_2 = z_2)\}$$

are not in general simply related to the $\alpha$'s.
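The conditional interpretation can be checked numerically. The following sketch, with hypothetical coefficients for $p = 3$ and a helper `joint_distribution` introduced only for this example, builds the probabilities of model (1), confirms that half the conditional log odds for $Z_1$ equals $\alpha_1 + \alpha_{12} z_2 + \alpha_{13} z_3 + \alpha_{123} z_2 z_3$, and prints the corresponding marginal quantity for comparison.

```python
import math
from itertools import product

def joint_distribution(alpha, p):
    """Cell probabilities under model (1); alpha maps 0-based index tuples to coefficients."""
    weights = {}
    for z in product((-1, 1), repeat=p):
        linear = sum(coef * math.prod(z[i] for i in idx) for idx, coef in alpha.items())
        weights[z] = math.exp(linear)
    total = sum(weights.values())  # plays the role of exp(Delta)
    return {z: w / total for z, w in weights.items()}

# Hypothetical coefficients, chosen only for illustration.
alpha = {
    (0,): 0.3, (1,): -0.2, (2,): 0.1,
    (0, 1): 0.4, (0, 2): -0.15, (1, 2): 0.25,
    (0, 1, 2): 0.05,
}
probs = joint_distribution(alpha, 3)

for z2, z3 in product((-1, 1), repeat=2):
    # The ratio of joint probabilities equals the ratio of conditional probabilities.
    half_log_odds = 0.5 * math.log(probs[(1, z2, z3)] / probs[(-1, z2, z3)])
    predicted = (alpha[(0,)] + alpha[(0, 1)] * z2 + alpha[(0, 2)] * z3
                 + alpha[(0, 1, 2)] * z2 * z3)
    print(f"z2={z2:+d} z3={z3:+d}  conditional={half_log_odds:.6f}  formula={predicted:.6f}")

# Marginal counterpart, which has no equally simple relation to the alphas.
p_plus = sum(v for z, v in probs.items() if z[0] == 1)
marginal = 0.5 * math.log(p_plus / (1.0 - p_plus))
print(f"marginal half log odds = {marginal:.6f}, alpha_1 = {alpha[(0,)]}")
```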


The fitting and testing of models like (i), for example by maximum likelihood, are fairly well developed. However, the main emphasis should be placed on the exploitation of such a model, for example to facilitate the comparison of sets of data, etc.
If all $2^p$ cells are occupied, the models can be analysed in terms of the log frequencies (Plackett, 1969) but otherwise maximum likelihood methods will normally be the best to use. A simple first step, when all or nearly all cells are occupied, is to compute ranked factorial contrasts from the log frequencies and to plot on a semi-normal scale, indicating on the plot the theoretical standard error $\sqrt{\{\sum (1/n_{ijk\ldots})\}}$.
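A minimal sketch of this first step, using the Low I.Q. group of Table 1 as transcribed here, is given below. It assumes that each contrast is taken as the unnormalized sum of $\pm \log n$ over the 16 cells, so that its large-sample standard error is $\sqrt{\sum (1/n)}$; that scaling is an assumption of the example, and the semi-normal plot itself is left out, only the ranked contrasts being printed.

```python
import math
from itertools import combinations, product

# Low I.Q. group counts from Table 1, keyed by the response pattern (y1, y2, y3, y4).
counts = {
    (1, 1, 1, 1): 62,  (1, 1, 1, 0): 70,  (1, 1, 0, 1): 31,  (1, 1, 0, 0): 41,
    (1, 0, 1, 1): 283, (1, 0, 1, 0): 253, (1, 0, 0, 1): 200, (1, 0, 0, 0): 305,
    (0, 1, 1, 1): 14,  (0, 1, 1, 0): 11,  (0, 1, 0, 1): 11,  (0, 1, 0, 0): 14,
    (0, 0, 1, 1): 31,  (0, 0, 1, 0): 46,  (0, 0, 0, 1): 37,  (0, 0, 0, 0): 82,
}
p = 4
cells = list(product((0, 1), repeat=p))

# One factorial contrast per non-empty subset of variates.
contrasts = {}
for order in range(1, p + 1):
    for subset in combinations(range(p), order):
        total = 0.0
        for y in cells:
            sign = math.prod(2 * y[i] - 1 for i in subset)  # +1 or -1
            total += sign * math.log(counts[y])
        contrasts[subset] = total

# Large-sample standard error, the same for every unnormalized contrast.
std_error = math.sqrt(sum(1.0 / c for c in counts.values()))

ranked = sorted(contrasts.items(), key=lambda item: abs(item[1]), reverse=True)
print(f"standard error = {std_error:.3f}")
for subset, value in ranked:
    label = "".join(str(i + 1) for i in subset)
    print(f"{label:>4}  contrast={value:8.3f}  ratio to s.e.={value / std_error:7.2f}")
```

The ordered absolute values in `ranked` are what would be plotted against semi-normal quantiles.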
The formal fitting of any but the simplest of these models is likely to be an effective approach only for fairly small values of $p$, because of the large number of parameters involved. Incidentally note that the likelihood ratio discriminant between two groups is a simple function of the variates one at a time if and only if the coefficients of all second and higher order terms are identical in the two groups. A similar result holds when more than two groups are compared. This suggests that if second-order techniques are applied to binary data, "product" variates should be included if major departures from the above homogeneity conditions are likely.
The logistic model is implicit or explicit in a good deal of work on multivariate binary data; for some historical comments and general discussion see Mantel (1966).
(iv) Additive model. It would be possible to set out a representation analogously to that of (iii) but directly in terms of probabilities rather than log probabilities, or more generally in terms of some other function than the logarithm. Which is better is really to be settled empirically, but the additive models have two disadvantages. There can be difficulties with values outside the range (0, 1) and "independence" is not achieved as the simplest special case. The second difficulty, but not the first, is overcome by the Bahadur (1961) representation. According to this any $p$-variate binary distribution can be written in a series as follows. First let $\theta_i = P(Y_i = 1)$ and introduce the standardized variables $U_i = (Y_i - \theta_i)/\sqrt{\{\theta_i(1 - \theta_i)\}}$. Call

$$\rho_{12 \ldots k} = E(U_1 \cdots U_k)$$

the $k$th order correlation between $Y_1, \ldots, Y_k$, with, of course, an analogous definition for any other subset of variates. Then the joint distribution of the $Y$'s can be written in the form

$$P(Y = y) = \prod_{i=1}^{p} P(Y_i = y_i) \times \Big(1 + \sum_{i<j} \rho_{ij} u_i u_j + \sum_{i<j<k} \rho_{ijk} u_i u_j u_k + \cdots + \rho_{12 \ldots p} u_1 \cdots u_p\Big).$$

The second factor gives the effect of departures from independence. This representation is similar in spirit to but probably less useful than (iii).
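The Bahadur representation is exact, which can be verified directly. The sketch below takes a hypothetical three-variate distribution, computes the marginal probabilities and the correlations of each order, and rebuilds every cell probability from the expansion; the function name `bahadur_reconstruct` and the numbers are illustrative assumptions.

```python
import math
from itertools import combinations, product

def bahadur_reconstruct(probs, p):
    """Return (theta, rho, rebuilt) for a joint distribution over {0,1}^p."""
    cells = list(product((0, 1), repeat=p))
    theta = [sum(probs[y] for y in cells if y[i] == 1) for i in range(p)]

    def u(y, i):
        # Standardized value of the i-th component at pattern y.
        return (y[i] - theta[i]) / math.sqrt(theta[i] * (1.0 - theta[i]))

    # Correlations of order 2, ..., p for every subset of variates.
    rho = {}
    for order in range(2, p + 1):
        for subset in combinations(range(p), order):
            rho[subset] = sum(probs[y] * math.prod(u(y, i) for i in subset) for y in cells)

    rebuilt = {}
    for y in cells:
        independent = math.prod(theta[i] if y[i] == 1 else 1.0 - theta[i] for i in range(p))
        correction = 1.0 + sum(r * math.prod(u(y, i) for i in subset)
                               for subset, r in rho.items())
        rebuilt[y] = independent * correction
    return theta, rho, rebuilt

# Hypothetical joint distribution for p = 3 (weights out of 24).
weights = [5, 1, 2, 3, 1, 4, 2, 6]
cells = list(product((0, 1), repeat=3))
probs = {y: w / sum(weights) for y, w in zip(cells, weights)}

theta, rho, rebuilt = bahadur_reconstruct(probs, 3)
for y in cells:
    print(y, round(probs[y], 6), round(rebuilt[y], 6))
```

Truncating `rho` to the second order terms would give an additive approximation, which need not stay inside (0, 1).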
(v) Modal clustering model. An entirely different kind of model has been discussed in an unpublished thesis by A. F. Ebbutt. In its simplest form this is developed as follows:
(a) There is first a subset of variables $C$ which take constant values.
(b) Variables not in $C$ are independently distributed.
(c) The variables in $C$ are, next, subject to independent fairly small chances of misclassification.
(d) Finally there may be two or more sets $C$, or within a given $C$ there may be two or more "modes".
In a large number of dimensions it is not easy to disentangle this situation, but Ebbutt has produced methods that will do it.
(vi) Latent class analysis. This is a special representation introduced by Lazarsfeld (1950). It amounts to assuming a mixture of, say, $k$ classes within each of which the variates are independently distributed. For a review of estimation of parameters, etc., see, for example, Madansky (1969). This is a very special model likely to be useful when there is strong prior expectation that the classes have a clear physical existence.
(vii) Transformations by permutation. A central role is played in second-order multivariate methods by linear transformations and especially by orthogonal transformations. If we apply the same techniques to binary data we are in effect assuming that linear functions of binary variates have a useful interpretation; it is not clear that this is always so, although with the inclusion of product variates the linear functions cover a much wider range of aspects. In some cases, however, a different kind of transformation may be more relevant, namely permutation of the defining component variates.

A simple example will clarify the idea. Consider two variates

$Y_1 = 1$ if husband votes Labour, $0$ otherwise; $Y_2 = 1$ if wife votes Labour, $0$ otherwise.

For some purposes we might get a simpler representation of the data in terms of two new binary variates $Y'_1$ and $Y'_2$ defined as follows:

$Y'_1 = 1$ if husband and wife discordant, $0$ otherwise; $Y'_2 = Y_1$.

A third possibility is to take $Y''_1 = Y'_1$, $Y''_2 = Y_2$. These three are the only essentially distinct ways of using the $2^2$ distinct responses to define two binary variables. The relationships are as follows:

Y_1   Y_2      Y'_1   Y'_2      Y''_1   Y''_2

 0     0         0      0         0       0
 0     1         1      0         1       1
 1     0         1      1         1       0
 1     1         0      1         0       1

The most obvious criterion for choosing between representations is to aim at independence of the defining variates.
In general, a given set of $2^p$ cells can be used to define $p$ binary variates in many different ways. In fact there are $(2^p - 1)!/p!$ essentially different sets, i.e. ones that cannot be obtained from one another by interchanging 0's and 1's and permuting variables. Dr P. Bloomfield has pointed out that by restricting attention to transformations directly related to the original set of variates, this number may be reduced by a substantial factor. The large number of possibilities for $p > 3$ is an advantage in that it means we have almost as rich a choice of possible transformations as in the continuous case; on the other hand computationally it will be an embarrassment! The best approach almost certainly depends on the magnitude of $p$. For small values of $p$, the recognition of simple structure in a full logistic fit may be the best approach, but for larger values of $p$ an iterative approach working on pairs of variates at a time is probably better. This remains to be explored.
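The count $(2^p - 1)!/p!$ can be checked by brute force for small $p$. The sketch below, whose function names are introduced only for illustration, treats a set of $p$ new variates as an assignment of the $2^p$ distinct codes to the $2^p$ cells and regards two assignments as equivalent when one is obtained from the other by interchanging 0's and 1's in any of the new variates and reordering them.

```python
import math
from itertools import permutations, product

def count_essentially_different(p):
    """Count labellings of the 2^p cells up to 0/1 interchange and reordering of variates."""
    codes = list(product((0, 1), repeat=p))
    canonical_forms = set()
    for labelling in permutations(codes):
        # Apply every combination of reordering and 0/1 interchange, keep the smallest image.
        images = []
        for order in permutations(range(p)):
            for flips in product((0, 1), repeat=p):
                images.append(tuple(
                    tuple(code[j] ^ f for j, f in zip(order, flips))
                    for code in labelling
                ))
        canonical_forms.add(min(images))
    return len(canonical_forms)

def formula(p):
    """The closed form (2^p - 1)!/p! quoted in the text."""
    return math.factorial(2 ** p - 1) // math.factorial(p)

# Enumeration is immediate for p = 2; for p = 3 it is feasible but noticeably slower.
print("p = 2: enumerated", count_essentially_different(2), "formula", formula(2))
for p in (3, 4, 5):
    print(f"p = {p}: formula gives {formula(p)}")
```

For $p = 2$ the enumeration agrees with the three representations of the voting example, and the rapid growth of `formula(p)` illustrates the computational embarrassment mentioned above.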
There are now many things that might be done. Transform if possible to independence. The marginal probabilities of the new variates correspond to the eigenvalues in principal component analysis. For two samples transform so that discrimination is achieved with just a few of the new variates; proceed similarly with more than two sets. It may be necessary to restrict the permutations to some meaningful subgroup. Until algorithms have been developed for implementing these ideas it is hardly possible to assess their usefulness.
(viii) Relation with underlying continuous distribution. One historically important way of obtaining binary distributions is to start with one or more continuous, possibly normal, distributions of unobserved variates $W_1, \ldots, W_p$ and to suppose that $Y_i = 1$ if and only if, say, $W_i > 0$. This is quite often a useful heuristic device, but seems otherwise unnecessary, unless the $W$'s are of intrinsic interest.

In this section a number of ways of describing multivariate binary distributions have been outlined. This is very probably not a complete list. While most of the models suggested above have associated with them schemes for formal estimation and significance testing, it must be put strongly that this is not the aspect of prime importance. The usefulness of the models lies in their application to describe complex data concisely, to facilitate comparisons between sets of data, etc.
Probably the most flexible model in general is the logistic one of (iii). If it is required to include dependence on a further set of variables, $x$, or other forms of structure, suitable terms can be added to the right-hand side. If there are $q$ $x$ variables the simplest thing would be to add $pq$ linear terms representing the effects $x_i z_j$; this leads also to a representation of mixed quantitative and binary variates. For we can combine, say, a multivariate normal distribution of the quantitative variates with a logistic representation for the conditional distribution of the binary variates given the quantitative ones. More generally every term in the model (1) may be allowed to depend on $x$.
4. DISCUSSION
The previous section has concentrated on the description of multivariate binary distributions, this seeming a necessary preliminary to a rational discussion of methods of analysis. In most cases the form of the analysis for a given model is fairly obvious; it may, however, help to draw some parallels with familiar second-order techniques and Table 2 sets out to do this concisely. A full discussion would require a much longer paper.
For a short bibliography of papers on these topics, see Cox (1970, Appendix B).
TABLE 2
Outline of some broad problems and appropriate second order and binary techniques

Problem                             Second order (normal theory)    Binary

Internal problems
Description of single sample        Calculation of means and        Calculation of marginal proportions
                                    covariance matrix.              and pairwise logistic differences.
                                    Search for special structure.   Plots of these.
                                    Transformations                 Search for special structure.
                                                                    Fitting of more elaborate logistic
                                                                    or other model
Reduction in dimensions             Principal components            Permutational principal components.
                                                                    Recognition of meaningful sets of
                                                                    independent variables in logistic
                                                                    representation
Clustering                          Various                         Modal clustering
Search for hypothesized underlying  Factor analysis                 Latent class analysis
structure

Univariate external problems
Dependence of univariate response   Multiple regression and         Multiple regression, often logistic
on complex explanatory variables    analysis of variance

Multivariate external problems
Comparison of two or more sample    Hotelling's $T^2$, etc.         Adaptations of $T^2$ to examine
means                                                               marginal proportions
Full comparison of two or more      Analysis also of covariance     Fitting and comparison of logistic
samples                             matrices                        or other models
Discriminant analysis               Linear discriminant function    Fitting of logistic or other models
                                                                    and estimation of likelihood ratio
Relation between set of variates    Canonical regression            Study of fitted logistic model with
and fixed explanatory vectors                                       added x dependence.
                                                                    Permutational analysis
Relation between two or more sets   Canonical correlation           If feasible reduce to the previous
of variates                                                         case
Note added in proof. Since the talk on which this paper was based was given and the paper itself accepted for publication, there have been a number of papers on this general topic, especially on the log linear model; see in particular the special multivariate issue of Biometrics, March 1972, Vol. 28, No. 1.

REFERENCES
BAHADUR, R. R. (1961). A representation of the joint distribution of responses to n dichotomous items. In Studies in Item Analysis and Prediction (H. Solomon, ed.), pp. 158-176. Stanford, Calif.: Stanford University Press.
COX, D. R. (1970). The Analysis of Binary Data. London: Methuen.
GOODMAN, L. A. and KRUSKAL, W. H. (1954, 1959, 1963). Measures of association for cross classifications. J. Amer. Statist. Ass., 49, 732-764; 54, 123-163; 58, 310-364.
LANCASTER, H. O. (1969). Contingency tables of higher dimensions. Bull. Int. Statist. Inst., 43, I, 143-151.
LAZARSFELD, P. F. (1950). Logical and mathematical foundation of latent structure analysis. In Measurement and Prediction (S. A. Stouffer et al., eds.), pp. 362-412. Princeton, N. J.: Princeton University Press.
MADANSKY, A. (1969). Latent structure. In Int. Encycl. of Social Sciences, Vol. 9, pp. 33-38. New York: Macmillan and Free Press.
MANTEL, N. (1966). Models for complex contingency tables and polychotomous dosage response curves. Biometrics, 22, 83-95.
PLACKETT, R. L. (1969). Multidimensional contingency tables. Bull. Int. Statist. Inst., 43, I, 133-142.
SOLOMON, H. (1961). Classification procedures based on dichotomous response vectors. In Studies in Item Analysis and Prediction (H. Solomon, ed.), pp. 177-186. Stanford, Calif.: Stanford University Press.
