(1972) The Analysis of Multivariate Binary Data
Author(s): D. R. Cox
Source: Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 21, No. 2
(1972), pp. 113-120
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2346482
The Analysis of Multivariate Binary Data†
By D. R. Cox
Imperial College, London
SUMMARY
A brief review is given of the main methods and models for the analysis of
multivariate binary data. The relation with standard second-order
techniques is discussed.
Keywords: BINARY DATA; QUANTAL RESPONSE; MULTIVARIATE ANALYSIS;
MULTIDIMENSIONAL CONTINGENCY TABLE; CHI-SQUARED; LATENT STRUCTURE
MODEL; PRINCIPAL COMPONENTS; CLUSTERING; DISCRIMINANT ANALYSIS;
LOGISTIC MODEL; PROBIT ANALYSIS; PERMUTATIONS; MODES; BAHADUR
REPRESENTATION
1. INTRODUCTION
It is fairly common to have multivariate data in which the individual variates are
binary, i.e. take one of just two possible values which can be coded as 0 and 1. While
there has been appreciable work on the analysis of such data there is no thoroughly
developed body of methods and theory to correspond to so-called second order or
normal theory methods. The object of the present paper is to review methods that
have been proposed and to outline some new proposals, which do, however, need
much further development.
The two main methods in common use are probably the following:
(a) to apply second-order methods just as if the 0's and 1's are quantitative
observations;
(b) to use the theory of multidimensional contingency tables leading in one
approach to a series of chi-squared tests and to the partition of a total
chi-squared (Lancaster, 1969; Plackett, 1969).
Method (a) has the major advantage of simplicity and is effective when the
dependencies are of a simple form; it can, however, take no account of effects which
depend essentially on the interrelationships of variables taken three or more at a
time. The use of chi-squared will not be considered further, mainly because of its
rather strong emphasis on significance testing.
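The limitation of method (a) can be illustrated with a small constructed case. The sketch below is not from the paper: it assumes two independent fair binary variates and a third defined as their exclusive-or, so that all pairwise covariances vanish while a third-order moment does not. The names `patterns`, `pairwise` and `third` are illustrative only.

```python
from itertools import product

# Constructed example (an assumption, not data from the paper):
# Z1 and Z2 are independent fair bits and Z3 = Z1 XOR Z2.
patterns = [(z1, z2, z1 ^ z2) for z1, z2 in product((0, 1), repeat=2)]
prob = 1.0 / len(patterns)  # each of the four possible patterns has probability 1/4


def mean(i):
    """Marginal mean of variate i."""
    return sum(prob * z[i] for z in patterns)


def cov(i, j):
    """Covariance of variates i and j, treating 0/1 as quantitative."""
    return sum(prob * z[i] * z[j] for z in patterns) - mean(i) * mean(j)


# All second-order information: the three pairwise covariances.
pairwise = {(i, j): cov(i, j) for i in range(3) for j in range(i + 1, 3)}

# Third-order central moment, which second-order methods never look at.
third = sum(
    prob * (z[0] - mean(0)) * (z[1] - mean(1)) * (z[2] - mean(2))
    for z in patterns
)

print(pairwise)  # every pairwise covariance is zero
print(third)     # the three-variable dependence shows up only here
```

In this construction the covariance matrix is that of three independent variates, although any two variates determine the third exactly.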
Except for a few remarks at the end on data with mixed binary and quantitative
variates, we deal only with binary variates. It is, however, likely that most of the
discussion can be extended to variates with, say, three or four levels of response.
2. STUDIES OF DEPENDENCE AND OF ASSOCIATION
A first important distinction is between binary variables representing responses
and those representing explanatory variables or factors. In fact, if there is just one
response variable, we have an essentially univariate situation analogous to analysis
† Based on a talk given to the Multivariate Study Group, Royal Statistical Society, April 1971.
114 APPLIED STATISTICS
TABLE 1
Distribution of four binary variates in two groups

Pattern    Group 1    Group 2
1111          62        122
1110          70         66
1101          31         33
1100          41         25
1011         283        329
1010         253        247
1001         200        172
1000         305        217
0111          14         20
0110          11         10
0101          11         11
0100          14          9
0011          31         56
0010          46         55
0001          37         64
0000          82         53
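As a first descriptive step, the marginal proportions of each variate can be computed from the counts in Table 1. The sketch below makes two assumptions: the two count columns are the two groups, in that order, and the garbled pattern on the fourth row is read as 1100. The names `table1`, `totals` and `marginal_proportions` are illustrative.

```python
# Counts transcribed from Table 1 as (first count column, second count column).
# Assumption: the fourth pattern, garbled in the extracted text, is 1100.
table1 = {
    "1111": (62, 122), "1110": (70, 66), "1101": (31, 33), "1100": (41, 25),
    "1011": (283, 329), "1010": (253, 247), "1001": (200, 172), "1000": (305, 217),
    "0111": (14, 20), "0110": (11, 10), "0101": (11, 11), "0100": (14, 9),
    "0011": (31, 56), "0010": (46, 55), "0001": (37, 64), "0000": (82, 53),
}

# Sample size in each group.
totals = tuple(sum(counts[g] for counts in table1.values()) for g in (0, 1))


def marginal_proportions(group):
    """Proportion of 1's for each of the four variates in the given group (0 or 1)."""
    return [
        sum(counts[group] for pattern, counts in table1.items() if pattern[k] == "1")
        / totals[group]
        for k in range(4)
    ]


for g in (0, 1):
    print(totals[g], [round(x, 3) for x in marginal_proportions(g)])
```

Comparing the two lists variate by variate gives a quick view of where the groups differ before any model is fitted.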
$$\cdots + \cdots - A, \tag{1}$$
where $A$ is a normalizing constant; $e^{A}$ is a sum of exponentials chosen to make the
probabilities sum to unity. If only the first degree terms are included, we have the
independence model (i), whereas if all terms up to $z_1 \cdots z_p$ are taken we have in
effect the general multinomial model (ii). What we hope for is that only a fairly
limited number of terms, usually selected in the light of the data, need to be included.
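The role of the normalizing constant can be sketched numerically. The code below assumes that (1) has the usual log linear form, $\log p(z) = \sum \alpha_i z_i + \sum \alpha_{ij} z_i z_j + \cdots - A$, with each variate coded $-1$ or $+1$; both the form and the coding are assumptions here, since the opening of the equation is not available in this text. The function names and coefficient values are hypothetical.

```python
import math
from itertools import product


def log_linear_probs(alpha, p):
    """Cell probabilities for a log linear model on p binary variates.

    alpha maps index tuples such as (0,) or (0, 1) to coefficients.
    Assumption of this sketch: variates are coded -1/+1.
    """
    patterns = list(product((-1, 1), repeat=p))
    scores = [
        sum(a * math.prod(z[i] for i in idx) for idx, a in alpha.items())
        for z in patterns
    ]
    # e^A is the sum of exponentials that makes the probabilities sum to one.
    A = math.log(sum(math.exp(s) for s in scores))
    return {z: math.exp(s - A) for z, s in zip(patterns, scores)}, A


def marginal(probs, i, value):
    """Marginal probability that variate i equals value."""
    return sum(pr for z, pr in probs.items() if z[i] == value)


def independence_gap(probs, p):
    """Largest difference between a cell probability and the product of its marginals."""
    return max(
        abs(pr - math.prod(marginal(probs, i, z[i]) for i in range(p)))
        for z, pr in probs.items()
    )


# Hypothetical coefficients: first degree terms only, then one second degree term added.
first_degree = {(0,): 0.3, (1,): -0.5, (2,): 0.1}
probs_indep, A_indep = log_linear_probs(first_degree, p=3)
probs_pair, A_pair = log_linear_probs({**first_degree, (0, 1): 0.4}, p=3)

print(sum(probs_indep.values()))         # probabilities sum to one
print(independence_gap(probs_indep, 3))  # numerically zero: independence model
print(independence_gap(probs_pair, 3))   # positive once a second degree term enters
```

With first degree terms only, the cell probabilities factor into the product of the marginals, which is the independence model; adding a single second degree coefficient breaks the factorization.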
There are of course many special cases; for instance the analogue of the "equal
correlation" case of normal theory is to take $\alpha_{ij} = \alpha$, $\alpha_{ijk} = \cdots = 0$. Note that if we
put in all first and second degree terms there are $\tfrac{1}{2}p(p + 1)$ parameters, nearly as many
as in normal theory. This is rather disconcerting in that one might expect that binary
data would support substantially fewer parameters than quantitative data and one
suspects that multivariate normal theory models are often over-parameterized.
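The comparison of parameter counts can be checked directly. Counting $p$ first degree coefficients and one second degree coefficient per pair of variates, and comparing with the $p$ means and the distinct elements of a $p \times p$ covariance matrix in normal theory:

```latex
\begin{align*}
\text{binary, first and second degree terms:}\quad
  & p + \binom{p}{2} = \tfrac{1}{2}p(p+1), \\
\text{multivariate normal:}\quad
  & p + \tfrac{1}{2}p(p+1) = \tfrac{1}{2}p(p+3), \\
p = 4:\quad
  & 10 \ \text{versus}\ 14 .
\end{align*}
```

The difference is always $p$, so for moderate $p$ the two counts are of the same order.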
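The interpretation of the parameters is best seen by considering conditional
distributions. For example, conditionally on $Z_2 = z_2, \ldots, Z_p = z_p$,
$$\tfrac{1}{2}\log\left\{\frac{p_c(Z_1 = 1)}{p_c(Z_1 = -1)}\right\} = \alpha_1 + \alpha_{12} z_2 + \cdots + \alpha_{1p} z_p + \alpha_{123} z_2 z_3 + \cdots,$$
where the subscript $c$ indicates a conditional probability.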
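The conditional form can be verified by enumeration for a small case. The sketch below assumes $p = 3$ variates coded $-1$ or $+1$ and a log linear model containing all terms up to the third degree; the coefficient values are hypothetical and chosen only for illustration.

```python
import math
from itertools import product

# Hypothetical coefficients, keyed by 0-based index tuples (an assumption of this sketch).
alpha = {
    (0,): 0.2, (1,): -0.1, (2,): 0.4,
    (0, 1): 0.3, (0, 2): -0.25, (1, 2): 0.15,
    (0, 1, 2): 0.05,
}


def score(z):
    """Unnormalized log probability of the pattern z, with z coded -1/+1."""
    return sum(a * math.prod(z[i] for i in idx) for idx, a in alpha.items())


patterns = list(product((-1, 1), repeat=3))
A = math.log(sum(math.exp(score(z)) for z in patterns))  # normalizing constant
prob = {z: math.exp(score(z) - A) for z in patterns}


def half_conditional_logit(z2, z3):
    """Half the log odds of Z1 = 1 against Z1 = -1, given Z2 = z2 and Z3 = z3.

    The conditioning denominator is common to both probabilities and cancels.
    """
    return 0.5 * math.log(prob[(1, z2, z3)] / prob[(-1, z2, z3)])


def linear_form(z2, z3):
    """Right-hand side: the terms of the model that involve the first variate."""
    return (
        alpha[(0,)]
        + alpha[(0, 1)] * z2
        + alpha[(0, 2)] * z3
        + alpha[(0, 1, 2)] * z2 * z3
    )


max_error = max(
    abs(half_conditional_logit(z2, z3) - linear_form(z2, z3))
    for z2, z3 in product((-1, 1), repeat=2)
)
print(max_error)  # agreement up to rounding error
```

Only coefficients whose index set contains the first variate survive in the conditional log odds; all other terms, and the constant $A$, cancel in the ratio.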
Unfortunately, although such conditional probabilities have a simple form, marginal
probabilities do not. In particular, $\log\{P(Z_1 = 1)/P(Z_1 = -1)\}$ [...] and for $p > 2$ [...]
A simple example will clarify the idea. Consider two variates
$Y_1$ = 1 if husband votes Labour, 0 otherwise; $Y_2$ = 1 if wife votes Labour, 0 otherwise.
For some purposes we might get a simpler representation of the data in terms of two
new binary variates $Y'_1$ and $Y'_2$ defined as follows:
$Y'_1$ = 1 if husband and wife discordant, 0 otherwise; $Y'_2 = Y_1$.
Y_1  Y_2     Y'_1  Y'_2     Y''_1  Y''_2
 0    0       0     0         0      0
 0    1       1     0         1      1
 1    0       1     1         1      0
 1    1       0     1         0      1
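The recoding in the voting example can be written out and checked to be a one-to-one relabelling of the four cells. In the sketch below the function name `recode` is illustrative; the rule itself (first new variate equal to 1 when the pair is discordant, second new variate equal to the husband's vote) is read off the table above.

```python
from itertools import product


def recode(y1, y2):
    """Map (Y1, Y2) to new variates: (discordance indicator, husband's vote)."""
    return (y1 ^ y2, y1)


cells = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)
recoded = [recode(*cell) for cell in cells]

# The transformation loses no information if it merely permutes the four cells.
is_permutation = sorted(recoded) == sorted(cells)

for cell, new in zip(cells, recoded):
    print(cell, "->", new)
print(is_permutation)
```

Because the map is a permutation of the cells, any distribution over the original pair corresponds to exactly one distribution over the new pair, so the choice between the two representations is one of interpretability only.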
                                  Second order (normal theory)    Binary

Internal problems
Description of single sample      Calculation of means and        Calculation of marginal
                                  covariance matrix.              proportions and pairwise
                                  Search for special structure.   logistic differences.
                                  Transformations                 Plots of these.
                                                                  Search for special structure.
                                                                  Fitting of more elaborate
                                                                  logistic or other model
Reduction in dimensions           Principal components            Permutational principal
                                                                  components. Recognition of
                                                                  meaningful sets of independent
                                                                  variables in logistic
                                                                  representation
Clustering                        Various                         Modal clustering
Search for hypothesized           Factor analysis                 Latent class analysis
underlying structure

Univariate external problems
Dependence of univariate          Multiple regression and         Multiple regression often
response on complex               analysis of variance            logistic
explanatory variables

Multivariate external problems
Comparison of two or more         Hotelling's T², etc.            Adaptations of T² to examine
sample means                                                      marginal proportions
Full comparison of two or         Analysis also of covariance     Fitting and comparison of
more samples                      matrices                        logistic or other models
Discriminant analysis             Linear discriminant function    Fitting of logistic or other
                                                                  models and estimation of
                                                                  likelihood ratio
Relation between set of           Canonical regression            Study of fitted logistic model
variates and fixed                                                with added x dependence.
explanatory vectors                                               Permutational analysis
Relation between two or more      Canonical correlation           If feasible reduce to the
sets of variates                                                  previous case
Note added in proof. Since the talk on which this paper was based was given and
the paper itself accepted for publication, there have been a number of papers on this
general topic, especially on the log linear model; see in particular the special
multivariate issue of Biometrics, March 1972, Vol. 28, No. 1.
REFERENCES
BAHADUR, R. R. (1961). A representation of the joint distribution of responses to n dichotomous
items. In Studies in Item Analysis and Prediction (H. Solomon, ed.), pp. 158-176. Stanford,
Calif.: Stanford University Press.
Cox, D. R. (1970). The Analysis of Binary Data. London: Methuen.
GOODMAN, L. A. and KRUSKAL, W. H. (1954, 1959, 1963). Measures of association for cross
classifications. J. Amer. Statist. Ass., 49, 732-764; 54, 123-163; 58, 310-364.
LANCASTER, H. O. (1969). Contingency tables of higher dimensions. Bull. Int. Statist. Inst., 43,
I, 143-151.
LAZARSFELD, P. F. (1950). Logical and mathematical foundation of latent structure analysis. In
Measurement and Prediction (S. A. Stouffer et al., eds.), pp. 362-412. Princeton, N. J.:
Princeton University Press.
MADANSKY, A. (1969). Latent structure. In Int. Encycl. of Social Sciences, Vol. 9, pp. 33-38. New
York: Macmillan and Free Press.
MANTEL, N. (1966). Models for complex contingency tables and polychotomous dosage response
curves. Biometrics, 22, 83-95.
PLACKETT, R. L. (1969). Multidimensional contingency tables. Bull. Int. Statist. Inst., 43, I,
133-142.
SOLOMON, H. (1961). Classification procedures based on dichotomous response vectors. In
Studies in Item Analysis and Prediction (H. Solomon, ed.), pp. 177-186. Stanford, Calif.:
Stanford University Press.