Monte-Carlo Simulation-Based Statistical Modeling

ICSA Book Series in Statistics

Editors
Ding-Geng (Din) Chen
University of North Carolina, Chapel Hill, NC, USA
and University of Pretoria, Pretoria, South Africa

John Dean Chen
Risk Management, Credit Suisse, New York, NY, USA
Over the last two decades, advancements in computer technology have enabled
accelerated research and development of Monte-Carlo computational methods. This
book is a compilation of invited papers from some of the most forward-thinking
statistical researchers. These authors present new developments in Monte-Carlo
simulation-based statistical modeling, thereby creating an opportunity for the
exchange of ideas among researchers and users of statistical computing.
Our aim in creating this book is to provide a venue for timely dissemination
of the research in Monte-Carlo simulation-based statistical modeling to promote
further research and collaborative work in this area. In the era of big data science,
this collection of innovative research not only has remarkable potential to have a
substantial impact on the development of advanced Monte-Carlo methods across
the spectrum of statistical data analyses but also has great promise for fostering new
research and collaborations addressing the ever-changing challenges and opportu-
nities of statistics and data science. The authors have made their data and computer
programs publicly available, making it possible for readers to replicate the model
development and data analysis presented in each chapter and readily apply these
new methods in their own research.
The 18 chapters are organized into three parts. Part I includes six chapters that
present and discuss general Monte-Carlo techniques. Part II comprises six chapters
with a common focus on Monte-Carlo methods used in missing data analyses,
which is an area of growing importance in public health and social sciences. Part III
is composed of six chapters that address Monte-Carlo statistical modeling and their
applications.
Chapter "Joint Generation of Binary, Ordinal, Count, and Normal Data with
Specified Marginal and Association Structures in Monte-Carlo Simulations" pre-
sents a unified framework for concurrently generating data that include the four
major types of distributions (i.e., binary, ordinal, count, and normal) with specified
marginal and association structures. In this discussion of an important supplement
to existing methods, Hakan Demirtas unifies the Monte-Carlo methods for specified
types of data and presents his systematic and comprehensive investigation for
mixed data generation. The proposed framework can then be readily used to sim-
ulate multivariate data of mixed types for the development of more sophisticated
simulation, computation, and data analysis techniques.
In Chapter "Improving the Efficiency of the Monte-Carlo Methods Using
Ranked Simulated Approach", Hani Samawi provides an overview of his devel-
opment of ranked simulated sampling, a key approach for improving the efficiency
of general Monte-Carlo methods. Samawi then demonstrates the capacity of this
approach to provide unbiased estimation.
In Chapter "Normal and Non-normal Data Simulations for the Evaluation of
Two-Sample Location Tests", Jessica Hoag and Chia-Ling Kuo discuss
Monte-Carlo simulation of normal and non-normal data to evaluate two-sample
location tests (i.e., statistical tests that compare means or medians of two inde-
pendent populations).
Chapter "Anatomy of Correlational Magnitude Transformations in Latency and
Discretization Contexts" examines correlational magnitude changes in the latency
and discretization contexts of Monte-Carlo studies. Further, authors Hakan Demirtas
and Ceren Vardar-Acar
provide a conceptual framework and computational algorithms for modeling the
correlation transitions under specified distributional assumptions within the realm
of discretization in the context of latency and the threshold concept. The authors
illustrate the proposed algorithms with several examples and include a simulation
study that demonstrates the feasibility and performance of the methods.
Chapter "Monte-Carlo Simulation of Correlated Binary Responses" discusses
methods for simulating correlated binary response data and compares these
methods using R software.
Chapter "Quantifying the Uncertainty in Optimal Experiment Schemes via
Monte-Carlo Simulations" provides a general framework for quantifying the sen-
sitivity and uncertainty that result from the misspecification of model parameters in
optimal experimental schemes. In designing life-testing experiments, it is widely
accepted that the optimal experimental scheme depends on unknown model
parameters, and that misspecified parameters can lead to substantial loss of effi-
ciency in the statistical analysis. To quantify this effect, Tony Ng, Yu-Jau Lin,
Tzong-Ru Tsai, Y.L. Li, and Nan Jiang use Monte-Carlo simulations to evaluate the
robustness of optimal experimental schemes.
Chapter "Markov Chain Monte-Carlo Methods for Missing Data Under Ignorability
Assumptions" presents a fully Bayesian method for using the Markov chain
Monte-Carlo technique for missing data to sample the full conditional distribution
of the missing data given observed data and the other parameters. In this chapter,
Haresh Rochani and Daniel Linder show how to apply these methods to real
datasets with missing responses as well as missing covariates. Additionally, the
authors provide simulation settings to illustrate this method.
In Chapter "A Multiple Imputation Framework for Massive Multivariate Data of
Different Variable Types: A Monte-Carlo Technique", Hakan Demirtas discusses
multiple imputation for massive multivariate data of different variable types from planned
missingness designs, with the purpose of building theoretical, algorithmic, and
implementation-based components of a unified, general-purpose multiple imputa-
tion framework. The planned missingness designs are highly useful and will likely
increase in popularity in the future. For this reason, the proposed multiple impu-
tation framework represents an important refinement of existing methods.
Chapter "Hybrid Monte-Carlo in Multiple Missing Data Imputations with
Application to a Bone Fracture Data" introduces the Hybrid Monte-Carlo method as
an approach for handling
longitudinal data that are incomplete due to missing at random dropout. In this
chapter, Ali Satty, Henry Mwambi, and Geert Molenberghs provide readers with
an overview of the issues and the different methodologies for handling missing data
in longitudinal datasets that result from dropout (e.g., study attrition, loss of
follow-up). The authors examine the potential strengths and weaknesses of the
various methods through two examples of applying these methods.
In Chapter "Applications of Simulation for Missing Data Issues in Longitudinal
Clinical Trials", Frank Liu and James Kost present simulation-based approaches for
addressing missing data issues in longitudinal clinical trials, such as control-based
imputation, tipping-point analysis, and a Bayesian Markov chain Monte-Carlo
method. Computation programs for these methods are implemented and available in
SAS.
In Chapter "Application of Markov Chain Monte-Carlo Multiple Imputation
Method to Deal with Missing Data from the Mechanism of MNAR in Sensitivity
Analysis for a Longitudinal Clinical Trial", Wei Sun discusses the application of
Markov chain Monte-Carlo multiple imputation for data that are missing not at
random in longitudinal datasets from clinical trials. This chapter compares the
patterns of missing data between study subjects who received treatment and study
subjects who received a placebo.
In another chapter, Kyle Irimata and Jeffrey Wilson discuss Monte-Carlo simulations for
hierarchical linear mixed-effects models to fit the hierarchical logistic regression
models with random intercepts (both random intercepts and random slopes) to
multilevel data.
Chapter "Monte-Carlo Methods in Financial Modeling" demonstrates the use of
Monte-Carlo methods in financial modeling. In this chapter, Chuanshu Ji, Tao
Wang, and Leicheng Yin discuss two areas of market microstructure modeling and
option pricing using Monte-Carlo dimension reduction techniques. This approach
uses Bayesian Markov chain Monte-Carlo inference based on the trade and quote
database from Wharton Research Data Services.
Chapter "Simulation Studies on the Effects of the Censoring Distribution
Assumption in the Analysis of Interval-Censored Failure Time Data" discusses the
effects of the censoring distribution assumption on the analysis of interval-censored
failure time data. Another chapter
uses Monte-Carlo simulation to demonstrate a robust Bayesian multilevel item
response model. In this chapter, Geng Chen uses data from patients with
Parkinson's disease, a chronic progressive disease with multidimensional impairments.
Using these data, Chen illustrates applying the multilevel item response
model to not only deal with the multidimensional nature of the disease but also
simultaneously estimate measurement-specific parameters, covariate effects, and
patient-specific characteristics of disease progression.
In Chapter "A Comparison of Bootstrap Confidence Intervals for Multi-level
Longitudinal Data Using Monte-Carlo Simulation", Mark Reiser, Lanlan Yao, and
their co-authors present a simulation study showing that, when compared with the
penalized regression approach, their variable selection procedure performs better.
As a general note, the references for each chapter are included immediately
following the chapter text. We have organized the chapters as self-contained units
so readers can more easily and readily refer to the cited sources for each chapter.
To facilitate readers' understanding of the methods presented in this book, the
corresponding data and computing programs can be requested from the first editor by
email at DrDG.Chen@gmail.com.
The editors are deeply grateful to many who have supported the creation of this
book. We thank the authors of each chapter for their contributions and their gen-
erous sharing of their knowledge, time, and expertise to this book. Second, our
sincere gratitude goes to Ms. Diane C. Wyant from the School of Social Work,
University of North Carolina at Chapel Hill for her expert editing and comments of
this book, which substantially uplifted the quality of the book. We gratefully
acknowledge the professional support of Hannah Qiu (Springer/ICSA Book Series
coordinator) and Wei Zhao (associate editor) from Springer Beijing, who made
publishing this book with Springer a reality.
This book brings together expert researchers engaged in Monte-Carlo
simulation-based statistical modeling, offering them a forum to present and dis-
cuss recent issues in methodological development as well as public health appli-
cations. It is divided into three parts, with the first providing an overview of
Monte-Carlo techniques, the second focusing on missing data Monte-Carlo
methods, and the third addressing Bayesian and general statistical modeling using
Monte-Carlo simulations. The data and computer programs used here will also be
made publicly available, allowing readers to replicate the model development and
data analysis presented in each chapter, and to readily apply them in their own
research. Featuring highly topical content, the book has the potential to impact
model development and data analyses across a wide spectrum of fields, and to spark
further research in this direction.
Contents
Part I Monte-Carlo Techniques
Part II Monte-Carlo Methods in Missing Data
Markov Chain Monte-Carlo Methods for Missing Data
Under Ignorability Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Haresh Rochani and Daniel F. Linder
A Multiple Imputation Framework for Massive Multivariate
Data of Different Variable Types: A Monte-Carlo Technique . . . . . . . . . 143
Hakan Demirtas
Part III Monte-Carlo in Statistical Modellings and Applications
Contributors
H.K.T. Ng Department of Statistical Science, Southern Methodist University,
Dallas, TX, USA
Levent Ozbek Department of Statistics, Ankara University, Ankara, Turkey
Mark Reiser School of Mathematical and Statistical Science, Arizona State
University, Tempe, AZ, USA
Haresh Rochani Department of Biostatistics, Jiann-Ping Hsu College of Public
Health, Georgia Southern University, Statesboro, GA, USA
Hani Michel Samawi Department of Biostatistics, Jiann-Ping Hsu College of
Public Health, Georgia Southern University, Statesboro, GA, USA
A. Satty School of Mathematics, Statistics and Computer Science, University of
KwaZulu-Natal, Pietermaritzburg, South Africa
Jianguo Sun University of Missouri, Columbia, MO, USA
Wei Sun Manager Biostatistician at Otsuka America, New York, USA
T.-R. Tsai Department of Statistics, Tamkang University, Tamsui District, New
Taipei City, Taiwan
Ceren Vardar-Acar Department of Statistics, Middle East Technical University,
Ankara, Turkey
Tao Wang Bank of America Merrill Lynch, New York, NY, USA
Xiao Wang Statistics and Data Corporation, Tempe, AZ, USA
Jeffrey R. Wilson W.P. Carey School of Business, Arizona State University,
Tempe, AZ, USA
Hui Xie Simon Fraser University, Burnaby, Canada; The University of Illinois at
Chicago, Chicago, USA
Lanlan Yao School of Mathematical and Statistical Science, Arizona State
University, Tempe, AZ, USA
Leicheng Yin Exelon Business Services Company, Enterprise Risk Management,
Chicago, IL, USA
Zhigang Zhang Memorial Sloan Kettering Cancer Center, New York, NY, USA
Part I
Monte-Carlo Techniques
Joint Generation of Binary, Ordinal, Count,
and Normal Data with Specified Marginal
and Association Structures in Monte-Carlo
Simulations
Hakan Demirtas, Rawan Allozi, Yiran Hu, Gul Inan and Levent Ozbek
Abstract This chapter is concerned with building a unified framework for con-
currently generating data sets that include all four major kinds of variables (i.e.,
binary, ordinal, count, and normal) when the marginal distributions and a feasible
association structure are specified for simulation purposes. The simulation paradigm
has been commonly employed in a wide spectrum of research fields including the
physical, medical, social, and managerial sciences. A central aspect of every simula-
tion study is the quantification of the model components and parameters that jointly
define a scientific process. When this quantification cannot be performed via deter-
ministic tools, researchers resort to random number generation (RNG) in finding
simulation-based answers to address the stochastic nature of the problem. Although
many RNG algorithms have appeared in the literature, a major limitation is that they
were not designed to concurrently accommodate all variable types mentioned above.
Thus, these algorithms provide only an incomplete solution, as real data sets include
variables of different kinds. This work represents an important augmentation of the
existing methods as it is a systematic attempt and comprehensive investigation for
mixed data generation. We provide an algorithm that is designed for generating data
of mixed marginals, illustrate its logistical, operational, and computational details;
and present ideas on how it can be extended to span more complicated distributional
settings in terms of a broader range of marginals and associational quantities.
1 Introduction
computational tools that offer promising potential for building enhanced computing
infrastructure for research and education.
We propose an RNG algorithm that encompasses all four major variable types,
building upon our previous work on the generation of multivariate ordinal data (Demirtas
2006) and the joint generation of binary and normal data (Demirtas and Doganay 2012).
2 Algorithm
The algorithm is designed for concurrently generating binary, ordinal, count, and
continuous data. The count and continuous parts are assumed to follow Poisson
and normal distributions, respectively. While binary is a special case of ordinal, for
the purpose of exposition, the steps are presented separately. Skipped patterns are
allowed for ordinal variables. The marginal characteristics (the proportions for the
binary and ordinal part, the rate parameters for the count part, and the means and
variances for the normal part) and a feasible Pearson correlation matrix need to be
specified by the users. The algorithmic skeleton establishes the basic foundation;
extensions to more general and complicated situations will be discussed in Sect. 4.
The operational engine of the algorithm hinges upon computing the correlation
matrix of underlying MVN data that serve as an intermediate tool in the sense that
binary and ordinal variables are obtained via dichotomization and ordinalization,
respectively, through the threshold concept, and count variables are retrieved by
correlation mapping using inverse cdf matching. The procedure entails modeling the
correlation transformations that result from discretization and mapping.
In what follows, let B , O , C , and N denote binary, ordinal, count, and normal
variables, respectively. Let Σ be the specified Pearson correlation matrix, which com-
prises ten submatrices that correspond to all possible variable-type combinations.
Required parameter values are p’s for binary and ordinal variables, λ’s for count
variables, (μ, σ²) pairs for normal variables, and the entries of the correlation matrix
Σ. These quantities are either specified or estimated from a real data set that is to be
mimicked.
1. Check if Σ is positive definite.
2. Find the upper and lower correlation bounds for all pairs by the sorting method of
Demirtas and Hedeker (2011). It is well-known that correlations are not bounded
between −1 and +1 in most bivariate settings, as different upper and/or lower
bounds may be imposed by the marginal distributions (Hoeffding 1940; Fréchet
1951). These restrictions apply to discrete variables as well as continuous ones.
Let Π(F, G) be the set of cdf's H on R² having marginal cdf's F and G.
Hoeffding (1940) and Fréchet (1951) proved that in Π(F, G), there exist cdf's
H_L and H_U, called the lower and upper bounds, having minimum and max-
imum correlation. For all (x, y) ∈ R², H_L(x, y) = max[F(x) + G(y) − 1, 0]
and H_U(x, y) = min[F(x), G(y)]. For any H ∈ Π(F, G) and all (x, y) ∈ R²,
H_L(x, y) ≤ H(x, y) ≤ H_U(x, y). If δ_L, δ_U, and δ denote the Pearson correlation
coefficients for H_L, H_U, and H, respectively, then δ_L ≤ δ ≤ δ_U. One can infer
that if V is uniform in [0, 1], then F⁻¹(V) and G⁻¹(V) are maximally correlated,
and F⁻¹(V) and G⁻¹(1 − V) are maximally anticorrelated. In practical terms,
generating X and Y independently with a large number of data points and then
sorting them in the same and opposite directions gives the approximate upper and
lower correlation bounds, respectively. Make sure all elements of Σ are within
the plausible range.
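As a small illustration of this sorting method (a sketch, not part of the original text), the following R code approximates the lower and upper bounds for an assumed Poisson(3) and Bernoulli(0.3) pair; the marginals, rate, proportion, and sample size are arbitrary choices.

set.seed(123)
n <- 100000
x <- rpois(n, lambda = 3)             # assumed count marginal
y <- rbinom(n, size = 1, prob = 0.3)  # assumed binary marginal
upper <- cor(sort(x), sort(y))                     # sorted in the same direction
lower <- cor(sort(x), sort(y, decreasing = TRUE))  # sorted in opposite directions
c(lower = lower, upper = upper)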
3. Perform logical checks, such as whether the binary proportions are between 0 and 1,
the probabilities add up to 1 for ordinal variables, the Poisson rates are positive for
count variables, the variances for normal variables are positive, the mean, variance,
proportion, and rate vectors are consistent with the number of variables, and Σ is
symmetric with diagonal entries equal to 1, to prevent obvious misspecification errors.
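A minimal R sketch of such logical checks follows; the function and argument names (check_specs, p.bin, probs.ord, and so on) are illustrative assumptions rather than part of the published algorithm.

check_specs <- function(Sigma, p.bin, probs.ord, lambda, mu, var.norm) {
  stopifnot(isSymmetric(Sigma), all(diag(Sigma) == 1))
  # positive definiteness of the specified correlation matrix
  stopifnot(all(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values > 0))
  stopifnot(all(p.bin > 0 & p.bin < 1))                   # binary proportions in (0, 1)
  stopifnot(all(abs(sapply(probs.ord, sum) - 1) < 1e-8))  # ordinal probabilities sum to 1
  stopifnot(all(lambda > 0), all(var.norm > 0))           # positive rates and variances
  stopifnot(length(mu) == length(var.norm))               # consistent normal mean/variance vectors
  invisible(TRUE)
}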
4. For B-B combinations, find the tetrachoric (pre-dichotomization) correlation
given the specified phi coefficient (post-dichotomization correlation). Let X_1 and X_2
represent binary variables such that E[X_j] = p_j and Cor(X_1, X_2) = δ_12, where
p_j (j = 1, 2) and δ_12 (the phi coefficient) are given. Let Φ[t_1, t_2, ρ_12] be the cdf for a
standard bivariate normal random variable with correlation coefficient ρ_12 (tetra-
choric correlation). Naturally,
Φ[t_1, t_2, ρ_12] = ∫_{−∞}^{t_1} ∫_{−∞}^{t_2} f(z_1, z_2, ρ_12) dz_1 dz_2,
where f(z_1, z_2, ρ_12) = [2π(1 − ρ_12²)^{1/2}]^{−1} × exp[−(z_1² − 2ρ_12 z_1 z_2 + z_2²) / (2(1 − ρ_12²))].
The connection between δ_12 and ρ_12 is reflected in the equation
Φ[z(p_1), z(p_2), ρ_12] = δ_12 (p_1 q_1 p_2 q_2)^{1/2} + p_1 p_2.
5. For O-O combinations, |δ_{X_1 X_2}| < |δ_{Z_1 Z_2}| in large samples, where δ_{Z_1 Z_2}
denotes the pre-discretization (polychoric) correlation. The relationship between
δ_{X_1 X_2} and δ_{Z_1 Z_2} can be established via the following algorithm (Ferrari and
Barbiero 2012):
a. Generate standard bivariate normal data with the correlation δ⁰_{Z_1 Z_2}, where
δ⁰_{Z_1 Z_2} = δ_{X_1 X_2} (here, δ⁰_{Z_1 Z_2} is the initial polychoric correlation).
b. Discretize Z_1 and Z_2, based on the cumulative probabilities of the marginal
distributions F_1 and F_2, to obtain X_1 and X_2, respectively.
c. Compute δ¹_{X_1 X_2} through X_1 and X_2 (here, δ¹_{X_1 X_2} is the ordinal phi coefficient
after the first iteration).
d. Execute the following loop as long as |δ^v_{X_1 X_2} − δ_{X_1 X_2}| > ε and 1 ≤ v ≤ v_max
(v_max and ε are the maximum number of iterations and the maximum toler-
ated absolute error, respectively; both quantities are set by the users):
(a) Update δ^v_{Z_1 Z_2} by δ^v_{Z_1 Z_2} = δ^{v−1}_{Z_1 Z_2} g(v), where g(v) = δ_{X_1 X_2} / δ^v_{X_1 X_2}. Here,
g(v) serves as a correction coefficient, which ultimately converges to 1.
(b) Generate bivariate normal data with δ^v_{Z_1 Z_2} and compute δ^{v+1}_{X_1 X_2} after dis-
cretization.
Again, one should repeat this process for each B-O (and O-O) pair.
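The iterative correction above can be sketched in R as follows; this is an illustrative implementation (not the authors' code), shown for a single binary pair with assumed proportions p1 = 0.4 and p2 = 0.6 and an assumed target phi coefficient of 0.3, using MASS::mvrnorm for the bivariate normal draws.

library(MASS)  # for mvrnorm
p1 <- 0.4; p2 <- 0.6; delta.target <- 0.3    # assumed inputs
n <- 100000; eps <- 0.001; vmax <- 20
rho <- delta.target                          # initial tetrachoric/polychoric correlation
for (v in 1:vmax) {
  z <- mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))
  x1 <- as.numeric(z[, 1] > qnorm(1 - p1))   # dichotomize via the threshold concept
  x2 <- as.numeric(z[, 2] > qnorm(1 - p2))
  delta.v <- cor(x1, x2)                     # phi coefficient at iteration v
  if (abs(delta.v - delta.target) <= eps) break
  rho <- rho * delta.target / delta.v        # correction g(v) = delta / delta^v
  rho <- max(min(rho, 0.999), -0.999)        # keep the update in a valid range
}
rho   # approximate pre-dichotomization correlation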
6. For C-C combinations, compute the corresponding normal-normal correlations
(pre-mapping) given the specified count-count correlations (post-mapping) via
the inverse cdf method of Yahav and Shmueli (2012), which was proposed in the
context of correlated count data generation. Their method utilizes a slightly
modified version of the NORTA (Normal to Anything) approach (Nelsen 2006),
which involves generation of MVN variates with given univariate marginals and
which involves
involves generation of MVN variate
variatess with give
given
n univariate marginals and
the correlation structure ( R N ), and then transforming it into any desired distri-
bution using the inverse cdf. In the Poisson case, NORTA can be implemented
by the following steps:
a. Generate a k-dimensional normal vector Z_N from an MVN distribution with
mean vector 0 and correlation matrix R_N.
b. Transform Z_N to a Poisson vector X_C as follows:
i. For each element z_i of Z_N, calculate the normal cdf, Φ(z_i).
ii. For each value of Φ(z_i), calculate the Poisson inverse cdf with the desired
corresponding marginal rate λ_i, Ψ_{λ_i}^{-1}(Φ(z_i)), where Ψ_{λ_i}(x) = Σ_{m=0}^{x} e^{−λ_i} λ_i^m / m!.
c. X_C = (Ψ_{λ_1}^{-1}(Φ(z_1)), ..., Ψ_{λ_k}^{-1}(Φ(z_k)))^T is a draw from the desired multi-
variate count data with correlation matrix R_POIS.
An exact theoretical connection between R_N and R_POIS has not been established
to date. However, it has been shown that the feasible range of correlation between
a pair of Poisson variables after the inverse cdf transformation is
[ρ_min, ρ_max] = [Cor(Ψ_{λ_i}^{-1}(U), Ψ_{λ_j}^{-1}(1 − U)), Cor(Ψ_{λ_i}^{-1}(U), Ψ_{λ_j}^{-1}(U))], where λ_i and λ_j
are the marginal rates, and U ∼ Uniform(0, 1). Yahav and Shmueli (2012)
proposed a conceptually simple method to approximate the relationship between
the two correlations. They demonstrated that R_POIS can be approximated
as an exponential function of R_N whose coefficients are functions of ρ_min
and ρ_max.
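The NORTA steps for the Poisson part can be illustrated with the brief R sketch below; the rates λ = (2, 5) and the normal-scale correlation 0.5 are assumed values, and MASS::mvrnorm supplies the multivariate normal draws.

library(MASS)  # for mvrnorm
set.seed(1)
lambda <- c(2, 5)                                # assumed marginal rates
R.N <- matrix(c(1, 0.5, 0.5, 1), 2, 2)           # assumed normal-scale correlation matrix
z <- mvrnorm(10000, mu = c(0, 0), Sigma = R.N)   # step (a)
u <- pnorm(z)                                    # step (b)-i: normal cdf
x.count <- cbind(qpois(u[, 1], lambda[1]),       # step (b)-ii: Poisson inverse cdf
                 qpois(u[, 2], lambda[2]))
cor(x.count)                                     # empirical count-scale correlation (R_POIS)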
7. For B-N/O-N combinations, find the biserial/polyserial correlation (before dis-
cretization of one of the variables) given the point-biserial/point-polyserial cor-
relation (after discretization) by the linearity and constancy arguments pro-
posed by Demirtas and Hedeker (2016). Suppose that X and Y follow a bivari-
ate normal distribution with a correlation of δ_XY. Without loss of generality,
we may assume that both X and Y are standardized to have a mean of 0
and a variance of 1. Let X_D be the binary variable resulting from a split on
X, X_D = I(X ≥ k). Thus, E[X_D] = p and V[X_D] = pq, where q = 1 − p.
The correlation between X_D and X, δ_{X_D X}, can be obtained in a simple way,
namely, δ_{X_D X} = Cov[X_D, X] / √(V[X_D] V[X]) = E[X_D X] / √(pq) = p E[X | X ≥ k] / √(pq). We can
also express the relationship between X and Y via the following linear regression
model:
Y = δ_XY X + ε.   (1)
Then
Cov[X_D, Y] = Cov[X_D, δ_XY X + ε]
            = Cov[X_D, δ_XY X] + Cov[X_D, ε]
            = δ_XY Cov[X_D, X] + Cov[X_D, ε].   (2)
Since ε is independent of X, it will also be independent of any deterministic
function of X such as X_D, and thus Cov[X_D, ε] will be 0. As E[X] = E[Y] = 0,
V[X] = V[Y] = 1, Cov[X_D, Y] = δ_{X_D Y} √(pq), and Cov[X, Y] = δ_XY, Eq. 2
reduces to
δ_{X_D Y} = δ_XY δ_{X_D X}.   (3)
Note that δ_{X_D X} = E[X_D X] / √(pq), where E[X_D X] equals the ordinate of the standard nor-
mal curve at the point of dichotomization. Equation 3 indicates that the linear
association between X_D and Y is assumed to be fully explained by their mutual
association with X (Demirtas and Hedeker 2016). The ratio δ_{X_D Y} / δ_XY is equal
to δ_{X_D X} = E[X_D X] / √(pq) = p E[X | X ≥ k] / √(pq); it is a constant given p and
the distribution of (X, Y). Because these correlations are invariant to location shifts and
scaling, X and Y do not have to be centered and scaled; their means and vari-
ances can take any finite values. Once the ratio (δ_{X_D X}) is found, one can compute
the biserial correlation when the point-biserial correlation is specified. When X
is ordinalized to obtain X_O, the fundamental ideas remain unchanged. If the
assumptions of Eqs. 1 and 3 are met, the method is equally applicable to the
ordinal case in the context of the relationship between the polyserial (before
ordinalization) and point-polyserial (after ordinalization) correlations.
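As a numerical check of Eq. 3 (a sketch, not part of the original chapter), the ratio δ_{X_D X} can be computed directly from the normal ordinate at the cut point; the values p = 0.3 and δ_XY = 0.6 are assumed purely for illustration.

p <- 0.3; q <- 1 - p
k <- qnorm(1 - p)                  # threshold so that P(X >= k) = p
ratio <- dnorm(k) / sqrt(p * q)    # delta_{X_D X}: normal ordinate at k over sqrt(pq)
delta.XY <- 0.6
delta.XDY <- delta.XY * ratio      # implied point-biserial correlation via Eq. 3

# Monte-Carlo verification under the bivariate normal model of Eq. 1
set.seed(7)
n <- 200000
x <- rnorm(n)
y <- delta.XY * x + rnorm(n, sd = sqrt(1 - delta.XY^2))  # Cor(X, Y) = delta.XY
xd <- as.numeric(x >= k)
c(theoretical = delta.XDY, empirical = cor(xd, y))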
3 Some Operational Details and an Illustrative Example

The intermediate correlation matrix Σ* (after validating the feasibility of the marginal
and correlational specifications and applying all the relevant correlation transition
steps) turns out to be as follows (rounded to three digits after the decimal):
Generating N = 10,000 rows of data based on this eight-variable system yields the
following empirical correlation matrix (rounded to five digits after the decimal):
1 0.69823 0.67277 0.24561 0.40985 0.63891 0.22537 0.50361
0.69823 1 0.59816 0.21041 0.36802 0.57839 0.21367 0.45772
0.67277 0.59816 1 0.20570 0.32448 0.55564 0.20343 0.42192
0.24561 0.21041 0.20570 1 0.12467 0.20304 0.06836 0.17047
0.40985 0.36802 0.32448 0.12467 1 0.32007 0.12397 0.26377
0.63891 0.57839 0.55564 0.20304 0.32007 1 0.17733 0.41562
0.22537 0.21367 0.20343 0.06836 0.12397 0.17733 1 0.15319
0.50361 0.45772 0.42192 0.17047 0.26377 0.41562 0.15319 1
4 Future Directions
The significance of the current study stems from three major reasons: First, data
analysts, practitioners, theoreticians, and methodologists across many different dis-
ciplines in medical, managerial, social, biobehavioral, and physical sciences will
be able to simulate multivariate data of mixed types with relative ease. Second, the
proposed work can serve as a milestone for the development of more sophisticated
simulation, computation, and data analysis techniques in the digital information,
massive data era. Capability of generating many variables of different distributional
types, nature, and dependence structures may be a contributing factor for better grasp-
ing the operational characteristics of today's intensive data trends (e.g., satellite data,
internet traffic data, genetics data, ecological momentary assessment data). Third,
these ideas can help to promote higher education and accordingly be instrumental in
training graduate students. Overall, it will provide a comprehensive and useful set
of computational tools whose generality and flexibility offer promising potential for
building enhanced statistical computing infrastructure for research and education.
While this work represents a decent step forward in mixed data generation, it may
not be sufficiently complex for real-life applications in the sense that real count and
continuous data are typically more complicated than what Poisson and normal distri-
butions accommodate, and it is likely that specification of parameters that control the
first two moments and the second order product moment is inadequate. To address
these concerns, we plan on building a more inclusive structural umbrella, whose
ingredients are as follows: First, the continuous part will be extended to encompass
nonnormal continuous variables by the operational utility of the third order power
polynomials. This approach is a moment-matching procedure where any given con-
tinuous variable in the system is expressed by the sum of linear combinations of pow-
ers of a standard normal variate (Fleishman 1978; Vale and Maurelli 1983; Demirtas
et al. 2012), which requires the specification of the first four moments. A more elab-
orate version in the form of the fifth order system will be implemented (Headrick
2010) in an attempt to control for higher order moments, to cover a larger area in the
skewness-elongation plane, and to provide a better approximation to the probability
density functions of the continuous variables; and the count data part will be aug-
mented through the generalized Poisson distribution (Demirtas 2017b), which allows
under- and over-dispersion, as usually encountered in most applications, via
an additional dispersion parameter. Second, although the Pearson correlation may
not be the best association quantity in every situation, all correlations mentioned in
this chapter are special cases of the Pearson correlation; it is the most widespread
measure of association; and generality of the methods proposed herein with different
kinds of variables requires the broadest possible framework. For further broadening
the scale, scope, and applicability of the ideas presented in this chapter, the proposed
RNG technique will be extended to allow the specification of Spearman's rho, which
is more popular for discrete and heavily skewed continuous distributions; this option
will be incorporated into the algorithm for concurrently generating all four major types of
variables. For the continuous-continuous pairs, the connection between the Pearson
and Spearman correlations is given in Headrick (2010) through the power coeffi-
cients, and these two correlations are known to be equal for the binary-binary pairs.
The relationship will be derived for all other variable type combinations. Inclusion
of Spearman's rho as an option will allow us to specify nonlinear associations whose
monotonic components are reflected in the rank correlation. Third, the expanded fifth
order polynomial system will be further augmented to accommodate L-moments and
L-correlations (Hosking 1990; Serfling and Xiao 2007) that are based on expecta-
tions of certain linear combinations of order statistics. The marginal and product
L-moments are known to be more robust to outliers than their conventional counter-
parts in the sense that they suffer less from the effects of sampling variability, and
they enable more secure inferences to be made from small samples about an under-
lying probability distribution. On a related note, further expansions can be designed
to handle more complex associations that involve higher order product moments.
The salient advantages of the proposed algorithm and its augmented versions are
as follows: (1) Individual components are well-established. (2) Given their compu-
tational simplicity, generality, and flexibility, these methods are likely to be widely
used by researchers, methodologists, and practitioners in a wide spectrum of sci-
entific disciplines, especially in the big data era. (3) They could be very useful
in graduate-level teaching of statistics courses that involve computation and sim-
ulation, and in training graduate students. (4) A specific set of moments for each
variable is fairly rare in practice, but a specific distribution that would lead to these
moments is very common; so having access to these methods is needed by potentially
a large group of people. (5) Simulated variables can be treated as outcomes or pre-
dictors in subsequent statistical analyses as the variables are being generated jointly.
(6) Required quantities can either be specified or estimated from a real data set. (7)
The final product after all these extensions will allow the specification of two promi-
nent types of correlations (Pearson and Spearman correlations) and one emerging
type (L-correlations) provided that they are within the limits imposed by marginal
distributions. This makes it feasible to generate linear and a broad range of nonlinear
associations. (8) The continuous part can include virtually any shape (skewness, low
or high peakedness, mode at the boundary, multimodality, etc.) that is spanned by
power polynomials; the count data part can be under- or over-dispersed. (9) Ability
to jointly generate different types of data may facilitate comparisons among existing
data analysis and computation methods in assessing the extent of conditions under
which available methods work properly, and foster the development of new tools,
especially in contexts where correlations play a significant role (e.g., longitudinal,
clustered, and other multilevel settings). (10) The approaches presented here can
be regarded as a variant of multivariate Gaussian copula-based methods as (a) the
binary and ordinal variables are assumed to have a latent normal distribution before
discretization; (b) the count variables go through a correlation mapping procedure via
the normal-to-anything approach; and (c) the continuous variables consist of poly-
nomial terms involving normals. To the best of our knowledge, existing multivariate
copulas are not designed to have the generality of encompassing all these variable
types simultaneously. (11) As the mixed data generation routine is involved with
latent variables that are subsequently discretized, it should be possible to see how
the correlation structure changes when some variables in a multivariate continuous
setting are dichotomized/ordinalized (Demirtas 2016; Demirtas and Hedeker 2016;
Demirtas et al. 2016a). An important by-product of this research will be a better
understanding of the nature of discretization, which may have significant implica-
tions in interpreting the coefficients in regression-type models when some predictors
are discretized. On a related note, this could be useful in meta-analysis when some
studies discretize variables and some do not. (12) Availability of a general mixed
data generation algorithm can markedly facilitate simulated power-sample size cal-
culations for a broad range of statistical models.
References
Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data with
Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85, 3129–
3139.
Barbiero, A., & Ferrari, P. A. (2015). Simulation of ordinal and discrete variables with given
correlation matrix and marginal distributions. R package GenOrd. https://cran.r-project.org/web/
packages/GenOrd
Bates D., & Maechler M. (2016). Sparse and dense matrix classes and methods. R package Matrix.
http://www.cran.r-project.org/web/packages/Matrix
Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets.
Statistica Neerlandica, 58, 466–482.
Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized esti-
mating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical
Statistics, 14, 1085–1098.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for
non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions
and correlations. Journal of Statistical Computation and Simulation, 76, 1017–1025.
Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate inter-
est centers on dichotomized outcomes through pre-specified thresholds. Communications in
Statistics-Simulation and Computation, 36, 871–889.
Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine,
26, 3818–3821.
Demirtas, H. (2008). On imputing continuous data when the eventual interest pertains to ordinalized
outcomes via threshold concept. Computational Statistics and Data Analysis, 52 , 2261–2271.
Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal,
51, 677–688.
Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal
of Applied Statistics, 37 , 489–500.
Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric
correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.
Demirtas, H. (2017a). Concurrent generation of binary and nonnormal continuous data through
fifth order power polynomials. Communications in Statistics–Simulation and Computation, 46,
489–357.
Demirtas, H. (2017b). On accurate and precise generation of generalized Poisson variates. Com-
munications in Statistics–Simulation and Computation, 46, 489–499.
Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric
correlations: Modeling the change in correlations before and after discretization. Computational
Statistics, 31, 1385–1401.
Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-
reduction techniques for variance estimation in approximate Bayesian bootstrap imputation.
Computational Statistics and Data Analysis, 51 , 4064–4068.
Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with
specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–
236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption
when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal
of Statistical Computation and Simulation , 78 , 69–84.
Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strate-
Hedeker,, D. (2007). Gaussianization-based quasi-imputation and expansion strate-
gies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.
Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communica-
tions in Statistics- Simulation and Computation, 37 , 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distrib-
utions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal
data. Statistics in Medicine, 27, 4086–4093.
Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper
Abstract This chapter explores the concept of using ranked simulated sampling
approach (RSIS) to improve the well-known Monte-Carlo methods, introduced by
Samawi (1999) and extended to steady-state ranked simulated sampling (SRSIS)
by Al-Saleh and Samawi (2000). Both simulation sampling approaches are then
extended to multivariate ranked simulated sampling (MVRSIS) and
multivariate steady-state ranked simulated sampling (MVSRSIS) by
Samawi and Al-Saleh (2007) and Samawi and Vogel (2013). These approaches have
been demonstrated as providing unbiased estimators and improving the performance
of some of the Monte-Carlo methods of single and multiple integrals approximation.
Additionally, the MVSRSIS approach has been shown to improve the performance
and efficiency of Gibbs sampling (Samawi et al. 2012). Samawi and colleagues
showed that their approach resulted in a large savings in cost and time needed to
attain a specified level of accuracy.
1 Introduction
The term Monte-Carlo refers to techniques that use random processes to approximate
a non-stochastic k-dimensional integral of the form
θ = ∫_{R^k} g(u) du,   (1.1)
to use; however, the advantages and disadvantages of each method are not the pri-
mary concern of this chapter. The focus of this chapter is the use of Monte-Carlo
methods in multiple integration approximation.
The motivation for this research is based on the concepts of ranked set sampling
(RSS), introduced by McIntyre (1952). The motivation is based on the fact that the
ith quantified unit of RSS is simply an observation from f_(i), where f_(i) is the density
function of the i th order statistic of a random sample of size n . When the underlying
density is the uniform distribution on (0, 1), f_(i) follows a beta distribution with
parameters (i, n − i + 1).
Samawi (1999) was the first to explore the idea of RSS (Beta sampler) for inte-
gral approximation. He demonstrated that the procedure can improve the simulation
efficiency based on the ratio of the variances. Samawi’s ranked simulated sampling
procedure (RSIS) generates an independent random sample U_(1), U_(2), ..., U_(n), which
is denoted by RSIS, where U_(i) ∼ β(i, n − i + 1), i = 1, 2, ..., n, and β(·, ·) denotes
the beta distribution. The RSIS procedure constitutes an RSS based on random sam-
ples from the uniform distribution U(0, 1). The idea is to use this RSIS to compute
(1.1) with k = 1, instead of using an SRS of size n from U(0, 1), when the range
of the integral in (1.1) is (0, 1). In the case of an arbitrary range (a, b) of the integral
in (1.1), Samawi (1999) used the sample X_(1), X_(2), ..., X_(n) and the importance
sampling technique to evaluate (1.1), where X_(i) = F_X^{-1}(U_(i)) and F_X(·) is the dis-
tribution function of a continuous random variable. He showed theoretically and
through simulation studies that using the RSIS sampler for evaluating (1.1) substan-
tially improved the efficiency when compared with the traditional uniform sampler
(USS).
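A minimal R sketch (with an assumed integrand g(u) = exp(u), true value e − 1) contrasts the uniform sampler (USS) with the RSIS Beta sampler described above; the sample size n = 10 and the number of replications are arbitrary choices used only to compare variances.

set.seed(42)
g <- function(u) exp(u)
n <- 10; reps <- 5000
uss <- replicate(reps, mean(g(runif(n))))      # SRS from U(0, 1)
rsis <- replicate(reps, {
  u <- rbeta(n, shape1 = 1:n, shape2 = n:1)    # U_(i) ~ Beta(i, n - i + 1)
  mean(g(u))
})
c(true = exp(1) - 1, mean.USS = mean(uss), mean.RSIS = mean(rsis))
c(var.USS = var(uss), var.RSIS = var(rsis))    # RSIS typically shows the smaller variance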
Al-Saleh and Zheng (2002) introduced the idea of bivariate ranked set sampling
bivariate simple random sample for estimating the population means. The BVRSS
is as follows:
Suppose ( X , Y ) is a bivariate random vector with the joint probability density func-
tion f X ,Y (x , y ). Then,
1. A random sample of size n⁴ is identified from the population and randomly
allocated into n² pools, each of size n², so that each pool is a square matrix with
n rows and n columns.
2. In the first pool, identify the minimum value by judgment with respect to the first
alue by judgment with respect tto
o the first
characteristic X , for each of the n rows.
3. For the n minima obtained in Step 2, the actual quantification is done on the pair
that corresponds to the minimum value of the second characteristic, Y , identified
by judgment. This pair, given the label (1, 1), is the first element of the BVRSS
sample.
4. Repeat Steps 2 and 3 for the second pool, but in Step 3, the pair corresponding to
corresponding g to
the second minimum value with respect to the second characteristic, Y , is chosen
for actual quantification. This pair is given the label (1, 2).
5. The process continues until the label (n, n) is ascertained from the n²th (last) pool.
The procedure described above produces a BVRSS of size n². Let (X_[i](j), Y_(i)[j]),
i = 1, 2, ..., n and j = 1, 2, ..., n, denote the BVRSS sample from f_{X,Y}(x, y),
where f_{X_[i](j), Y_(i)[j]}(x, y) is the joint probability density function of (X_[i](j), Y_(i)[j]).
From Al-Saleh and Zheng (2002),
f_{X_[i](j), Y_(i)[j]}(x, y) = f_{X_(j)}(x) f_{Y|X}(y | x) [f_{Y_(i)[j]}(y) / f_{Y_[j]}(y)],   (1.2)
where f X ( j ) is the density of the j th order statistic for an SRS sample of size n from
the marginal density of f_X, and f_{Y_[j]}(y) is the density of the corresponding Y value,
given by f_{Y_[j]}(y) = ∫_{−∞}^{∞} f_{X_(j)}(x) f_{Y|X}(y | x) dx, while f_{Y_(i)[j]}(y) is the density of the ith
order statistic of an iid sample from f_{Y_[j]}(y), i.e.
f_{Y_(i)[j]}(y) = c · (F_{Y_[j]}(y))^{i−1} (1 − F_{Y_[j]}(y))^{n−i} f_{Y_[j]}(y),
where F_{Y_[j]}(y) = ∫_{−∞}^{y} (∫_{−∞}^{∞} f_{X_(j)}(x) f_{Y|X}(w | x) dx) dw.
Combining these results, Eq. (1.2) can be written as
(1/n²) Σ_{j=1}^{n} Σ_{i=1}^{n} f_{X_[i](j), Y_(i)[j]}(x, y) = f(x, y).   (1.4)
For a variety of choices of f(u, v), one can have (U, V) bivariate uniform
with a probability density function f(u, v), 0 < u, v < 1, such that U ∼ U(0, 1) and
V ∼ U(0, 1) (see Johnson 1987). In that case, (U_[i](j), V_(i)[j]), i = 1, 2, ..., n and
j = 1, 2, ..., n, should have a bivariate probability density function given by
f_{(j),(i)}(u, v) = [n! / ((i − 1)!(n − i)!)] [n! / ((j − 1)!(n − j)!)] [F_{Y_[j]}(v)]^{i−1} [1 − F_{Y_[j]}(v)]^{n−i} [u]^{j−1} [1 − u]^{n−j} f(u, v).   (1.5)
2 Steady-State Ranked Simulated Sampling (SRSIS)
Al-Saleh and Al-Omari (1999) introduced the idea of multistage ranked set sampling
(MRSS). To promote the use of MRSS in simulation and Monte-Carlo methods, let
X_i^{(s)}, i = 1, 2, ..., n, be an MRSS of size n at stage s. Assume that X_i^{(s)} has
probability density function f_i^{(s)} and cumulative distribution function F_i^{(s)}. Al-
Saleh and Al-Omari demonstrated the following properties of MRSS:
1.
f^{(s)}(x) = (1/n) Σ_{i=1}^{n} f_i^{(s)}(x),   (2.1)
2.
If s → ∞, then F_i^{(s)}(x) → F_i^{(∞)}(x) = 0 if x < Q_{(i−1)/n};
nF(x) − (i − 1) if Q_{(i−1)/n} ≤ x < Q_{(i)/n}; and 1 if x ≥ Q_{(i)/n},   (2.2)
for i = 1, 2, . . . , n, where Q α is the 100 α th percentile of F (x ).
3. If X ∼ U(0, 1), then for i = 1, 2, ..., n, we have
F_i^{(∞)}(x) = 0 if x < (i − 1)/n; nx − (i − 1) if (i − 1)/n ≤ x < i/n; 1 if x ≥ i/n,   (2.3)
and
f_i^{(∞)}(x) = n if (i − 1)/n ≤ x < i/n, and 0 otherwise.   (2.4)
These properties imply X_i^{(∞)} ∼ U((i − 1)/n, i/n) when the underlying distribution
function is U(0, 1).
Samawi and Vogel (2013) provided a modification of the Al-Saleh and Samawi
(2000) steady-state ranked simulated samples procedure (SRSIS) to bivariate cases
(BVSRSIS) as follows:
1. For each (i, j), j = 1, 2, ..., n and i = 1, 2, ..., n, generate independently
a. U_{i(j)} from U((j − 1)/n, j/n) and independent W_{(i)j} from U((i − 1)/n, i/n),
i = 1, 2, ..., n.
2. Generate Y_{(i)j} = F_Y^{-1}(W_{(i)j}) and X_{i(j)} = F_X^{-1}(U_{i(j)}) from F_Y(y) and F_X(x),
respectively.
3. To generate (X_{[i](j)}, Y_{(i)[j]}) from f(x, y), generate U_{i(j)} from U((j − 1)/n, j/n) and
independent W_{(i)j} from U((i − 1)/n, i/n); then
X_{[i](j)} | Y_{(i)j} = F_{X|Y}^{-1}(U_{i(j)} | Y_{(i)j}) and Y_{(i)[j]} | X_{i(j)} = F_{Y|X}^{-1}(W_{(i)j} | X_{i(j)}).
These steps yield
f^{(∞)}_{X_[i](j), Y_(i)[j]}(x, y) = f^{(∞)}_{X_[i](j)}(x) f^{(∞)}_{Y_(i)|X_[i](j)}(y | X_[i](j)) = n² f_X(x) f_{Y|X_[i](j)}(y | X_[i](j)),
where Q_X(s) and Q_Y(v) are the 100sth percentile of F_X(x) and the 100vth percentile of
F_Y(y), respectively. However, for the first stage, both Stokes (1977) and David (1981)
showed that F_{Y|X_[i](j)}(y | x) = F_{Y|X}(y | x). Al-Saleh and Zheng (2003) demonstrated
that this joint density is valid for an arbitrary stage and, therefore, valid for the steady state.
Therefore,
f^{(∞)}_{X_[i](j), Y_(i)[j]}(x, y) = f^{(∞)}_{X_[i](j)}(x) f^{(∞)}_{Y_(i)|X_[i](j)}(y | X_[i](j)) = n² f_X(x) f_{Y|X}(y | x) = n² f_{X,Y}(x, y)
  I(Q_X((j − 1)/n) ≤ x < Q_X(j/n)) I(Q_Y((i − 1)/n) ≤ y < Q_Y(i/n)),   (2.5)
so that
(1/n²) Σ_{j=1}^{n} Σ_{i=1}^{n} f^{(∞)}_{X_[i](j), Y_(i)[j]}(x, y) = f(x, y),   (2.6)
where I is an indicator variable. Similarly, Eq. (2.5) can be extended by mathematical
induction to the multivariate case as follows:
f^{(∞)}(x_1, x_2, ..., x_k) = n^k f(x_1, x_2, ..., x_k), for Q_{X_i}((j − 1)/n) ≤ x_i < Q_{X_i}(j/n), i = 1,
..., k and j = 1, 2, ..., n. In addition, the above algorithm can be extended for k > 2
as follows:
2. Generate X_{i_l(i_s)} = F_{X_{i_l}}^(−1)(U_{i_l(i_s)}), l, s = 1, 2, ..., k and i_l, i_s = 1, 2, ..., n, from F_{X_{i_l}}(x), l = 1, 2, ..., k, respectively.
3. Then, generate the multivariate version of the steady-state simulated sample by using any technique for conditional random number generation.
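As an illustration of the bivariate algorithm above, the sketch below generates an n × n BVSRSIS sample for a standard bivariate normal with correlation ρ, using the exact normal conditional quantile functions in the conditional inversion step. The target distribution, function names, and interface are our own assumptions, not the chapter's code.

```python
import numpy as np
from scipy.stats import norm

def bvsrsis_bivariate_normal(n, rho, rng=None):
    """Generate an n x n bivariate steady-state ranked simulated sample
    (X_[i](j), Y_(i)[j]) for a standard bivariate normal with correlation rho,
    following the three BVSRSIS steps (a sketch under stated assumptions)."""
    rng = np.random.default_rng(rng)
    i = np.arange(n).reshape(-1, 1)        # row index (i - 1 = 0, ..., n - 1)
    j = np.arange(n).reshape(1, -1)        # column index (j - 1 = 0, ..., n - 1)
    # Step 1: stratified uniforms on ((j-1)/n, j/n) and ((i-1)/n, i/n)
    u1 = (j + rng.uniform(size=(n, n))) / n
    w1 = (i + rng.uniform(size=(n, n))) / n
    # Step 2: steady-state ranked marginals X_i(j) and Y_(i)j
    x_ij = norm.ppf(u1)                    # X_i(j) = F_X^{-1}(U_i(j))
    y_ij = norm.ppf(w1)                    # Y_(i)j = F_Y^{-1}(W_(i)j)
    # Step 3: conditional inversion with fresh stratified uniforms;
    # X | Y = y is N(rho*y, 1 - rho^2) for the standard bivariate normal.
    u2 = (j + rng.uniform(size=(n, n))) / n
    w2 = (i + rng.uniform(size=(n, n))) / n
    s = np.sqrt(1.0 - rho ** 2)
    x_br = rho * y_ij + s * norm.ppf(u2)   # X_[i](j) | Y_(i)j
    y_br = rho * x_ij + s * norm.ppf(w2)   # Y_(i)[j] | X_i(j)
    return x_br, y_br
```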
3 Monte-Carlo Methods for Multiple Integration Problems
Very good descriptions of the basics of the various Monte-Carlo methods have been provided by Hammersley and Handscomb (1964), Liu (2001), Morgan (1984), Robert and Casella (2004), and Shreider (1966). The Monte-Carlo methods described include crude, antithetic, importance, control variate, and stratified sampling approaches. However, when variables are related, Monte-Carlo methods cannot be used directly (i.e., in a manner similar to the way these methods are used in univariate integration problems), because using the bivariate uniform probability density function f(u, v) as a sampler to evaluate Eq. (1.1) with k = 2 is not consistent. However, in this context it is reasonable to use the importance sampling method, and therefore, it follows that other Monte-Carlo techniques can be used in conjunction with importance sampling. Thus, our primary concern is importance sampling.
In general, suppose that f is a density function on R^k such that the closure of the set of points where g(·) is non-zero is contained in the closure of the set of points where f(·) is non-zero. Let U_i, i = 1, 2, ..., n, be a sample from f(·). Then, because

θ = ∫ [g(u)/f(u)] f(u) du,

Equation (1.1) can be estimated by

θ̂ = (1/n) Σ_{i=1}^{n} g(u_i)/f(u_i).   (3.1)

Equation (3.1) is an unbiased estimator for (1.1), with variance given by

Var(θ̂) = (1/n) ( ∫_{R^k} [g(u)^2/f(u)] du − θ^2 ).

In addition, from the point of view of the strong law of large numbers, it is clear that θ̂ → θ almost surely as n → ∞.
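A minimal sketch of the importance-sampling estimator (3.1); the integrand g, the sampler density f_pdf, and the sampler f_sampler are user-supplied placeholders (our names, not the chapter's).

```python
import numpy as np

def importance_sampling_estimate(g, f_pdf, f_sampler, n, rng=None):
    """Crude importance-sampling estimator (3.1) of theta = integral of g:
    draw u_1, ..., u_n from the sampler density f and average g(u_i)/f(u_i)."""
    rng = np.random.default_rng(rng)
    u = f_sampler(n, rng)                    # sample from f
    w = g(u) / f_pdf(u)                      # importance weights g/f
    theta_hat = w.mean()                     # unbiased for theta
    se = w.std(ddof=1) / np.sqrt(n)          # Monte-Carlo standard error
    return theta_hat, se
```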
A limited number of distributional families exist in a multidimensional context and are commonly used as importance samplers. For example, the multivariate Student's t family is used extensively in the literature as an importance sampler. Evans and Swartz (1995) indicated a need for developing families of multivariate distributions that exhibit a wide variety of shapes. In addition, statisticians want distributional families to have efficient algorithms for random variable generation and the capacity to be easily fitted to a specific integrand.

This paper provides a new way of generating a bivariate sample based on bivariate steady-state sampling (BVSRSIS) that has the potential to extend the existing sampling methods. We also provide a means for introducing new samplers and for substantially improving the efficiency of the integration approximation based on those samplers.
Let

θ = ∫∫ g(x, y) dx dy.   (3.2)

To estimate θ, generate a bivariate sample of size n^2 from f(x, y), which mimics g(x, y) and has the same range, such as (X_ij, Y_ij), i = 1, 2, ..., n and j = 1, 2, ..., n. Then

θ̂ = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_ij, y_ij)/f(x_ij, y_ij).   (3.3)
Equation (3.3) is an unbiased estimate for (3.2) with variance

Var(θ̂) = (1/n^2) ( ∫∫ [g^2(x, y)/f(x, y)] dx dy − θ^2 ).   (3.4)
To estimate (3.2) using BVSRSIS, generate a bivariate sample of size n^2, as described above, say (X_[i](j), Y_(i)[j]), i = 1, 2, ..., n and j = 1, 2, ..., n. Then

θ̂_BVSRSIS = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} g(x_[i](j), y_(i)[j])/f(x_[i](j), y_(i)[j]).   (3.5)
Equation (3.5) is also an unbiased estimate for (3.2) using (2.5). Also, by using (2.5) the variance of (3.5) can be expressed as
Var(θ̂_BVSRSIS) = Var(θ̂) − (1/n^4) Σ_{i=1}^{n} Σ_{j=1}^{n} (θ^(i,j)_{g/f} − θ_{g/f})^2,   (3.6)

where θ^(i,j)_{g/f} = E[g(X_[i](j), Y_(i)[j])/f(X_[i](j), Y_(i)[j])] and θ_{g/f} = E[g(X, Y)/f(X, Y)] = θ. The variance of the estimator in (3.6) is less than the variance of the estimator in (3.4).
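The variance ordering implied by (3.6) can be checked empirically. The sketch below compares the BVUSS estimator (3.3) with the BVSRSIS estimator (3.5) for an integral over the unit square, taking the uniform density on the square as the sampler f (so g/f = g); with U(0, 1) marginals the BVSRSIS sample reduces to one uniform draw per cell ((j − 1)/n, j/n) × ((i − 1)/n, i/n). This simplified setup and the example integrand are ours, not the chapter's.

```python
import numpy as np

def bvuss_vs_bvsrsis(g, n, reps=2000, rng=None):
    """Relative efficiency Var(theta_hat_BVUSS) / Var(theta_hat_BVSRSIS) for
    theta = double integral of g over the unit square, with f = uniform sampler."""
    rng = np.random.default_rng(rng)
    i = np.arange(n).reshape(-1, 1)
    j = np.arange(n).reshape(1, -1)
    est_bvuss, est_bvsrsis = [], []
    for _ in range(reps):
        # BVUSS: n^2 i.i.d. points on the unit square
        x, y = rng.uniform(size=(2, n, n))
        est_bvuss.append(g(x, y).mean())
        # BVSRSIS (uniform marginals): one stratified point per cell
        xs = (j + rng.uniform(size=(n, n))) / n
        ys = (i + rng.uniform(size=(n, n))) / n
        est_bvsrsis.append(g(xs, ys).mean())
    return np.var(est_bvuss) / np.var(est_bvsrsis)

# Example integrand (ours): g(x, y) = exp(x + y), so theta = (e - 1)^2.
# print(bvuss_vs_bvsrsis(lambda x, y: np.exp(x + y), n=20))
```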
This section presents the results of a simulation study that compares the performance of the importance sampling method described above using the BVSRSIS scheme with the performance of the bivariate simple random sample (BVUSS) and the BVRSS scheme of Samawi and Al-Saleh (2007), as introduced by Samawi and Vogel (2013).
3.3.1 Illustration for Importance Sampling Method When Integral's Limits Are (0, 1) × (0, 1)
As in Samawi and Al-Saleh (2007), an illustration of the impact of BVSRSIS on importance sampling is provided by evaluating the following integral over (0, 1) × (0, 1):

θ = ∫_0^1 ∫_0^1 g(x, y) dx dy.   (3.7)
This example uses four bivariate sample sizes: n = 20, 30, 40 and 50. To estimate the variances using the simulation method, we use 2,000 simulated samples from BVUSS and BVSRSIS. Many choices of bivariate and multivariate distributions with uniform marginals on [0, 1] are available (Johnson 1987). However, for this simulation, we chose Plackett's uniform distribution (Plackett 1965), whose dependence parameter ψ satisfies: ψ → 0 gives U = 1 − V, ψ = 1 gives U and V independent, and ψ → ∞ gives U = V. Table 1 presents the relative efficiencies of our estimators using BVRSIS in comparison with using BVUSS, and BVSRSIS relative to BVUSS, for estimating (3.7). As illustrated in Table 1, BVSRSIS is clearly more efficient than either BVUSS or BVRSIS when used for estimation.
3.3.2 Illustration When the Integral's Limits Are an Arbitrary Subset of R^2
Recent work by Samawi and Al-Saleh (2007) and Samawi and Vogel (2013) used an identical example in which the range of the integral was not (0, 1), and the authors evaluated the bivariate normal distribution (e.g., g(x, y) is the N_2(0, 0, 1, 1, ρ) density). For integrations with high dimensions and a requirement of low relative error, the evaluation of the multivariate normal distribution function remains one of the unsolved problems in simulation (e.g., Evans and Swartz 1995). To demonstrate how BVSRSIS increases the precision of evaluating the multivariate normal distribution, we illustrate the method by evaluating the bivariate normal distribution as follows:

θ = ∫_{−∞}^{z_1} ∫_{−∞}^{z_2} g(x, y) dx dy,   (3.9)

where g(x, y) is the N_2(0, 0, 1, 1, ρ) density.
Given the similar shapes of the marginal of the normal and the marginal of the logistic probability density functions, it is natural to attempt to approximate the bivariate normal cumulative distribution function by the bivariate logistic cumulative distribution function. For the multivariate logistic distribution and its properties, see Johnson and Kotz (1972). The density of the bivariate logistic (Johnson and Kotz 1972) is chosen to be

f(x, y) = [ (2π^2/3) e^{−π(x+y)/√3} (1 + e^{−πz_1/√3} + e^{−πz_2/√3}) ] / (1 + e^{−πx/√3} + e^{−πy/√3})^3,   −∞ < x < z_1, −∞ < y < z_2.   (3.10)
It can be shown that the marginal of X is given by

f_X(x) = (π/√3) e^{−πx/√3} (1 + e^{−πz_1/√3} + e^{−πz_2/√3}) / (1 + e^{−πx/√3} + e^{−πz_2/√3})^2,   −∞ < x < z_1.   (3.11)

Now let W = Y + (√3/π) ln(1 + e^{−πX/√3} + e^{−πz_2/√3}). Then it can be shown that

f(w | x) = (2π/√3) e^{−πw/√3} / [ (1 + e^{−πx/√3})/(1 + e^{−πx/√3} + e^{−πz_2/√3}) + e^{−πw/√3} ]^3,   (3.12)

for −∞ < w < z_2 + (√3/π) ln(1 + e^{−πx/√3} + e^{−πz_2/√3}).
1. Generate X from (3.11).
2. Generate W independently from (3.12).
3. Let Y = W − (√3/π) ln(1 + e^{−πX/√3} + e^{−πz_2/√3}).
4. Then the resulting pair (X, Y) has the correct probability density function, as defined in (3.10).
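Steps 1–4 can be implemented by inversion. In the sketch below the inverse distribution functions are derived from the reconstructed forms of (3.11) and (3.12) above, so they should be read as an assumption rather than as the authors' code.

```python
import numpy as np

def sample_truncated_bivariate_logistic(n, z1, z2, rng=None):
    """Draw n pairs (X, Y) from the bivariate logistic density (3.10) restricted
    to (-inf, z1) x (-inf, z2), following steps 1-4 (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    a = np.pi / np.sqrt(3.0)
    K = 1.0 + np.exp(-a * z1) + np.exp(-a * z2)      # normalizer in (3.10)
    # Step 1: X from (3.11) by inversion of F_X(x) = K / (1 + e^{-ax} + e^{-a z2})
    u = 1.0 - rng.uniform(size=n)                    # u in (0, 1]
    x = -np.log(K / u - 1.0 - np.exp(-a * z2)) / a
    # Step 2: W from (3.12) by inversion of F(w | x) = (b + e^{-aw})^{-2}
    d = 1.0 + np.exp(-a * x) + np.exp(-a * z2)
    b = (1.0 + np.exp(-a * x)) / d
    v = 1.0 - rng.uniform(size=n)                    # v in (0, 1]
    w = -np.log(1.0 / np.sqrt(v) - b) / a
    # Step 3: transform back to Y; Step 4: (X, Y) has density (3.10)
    y = w - np.log(d) / a
    return x, y
```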
For this illustration, two bivariate sample sizes, n = 20 and 40, and different values of ρ and (z_1, z_2) are used. To estimate the variances using simulation, we use 2,000 simulated samples from BVUSS, BVRSIS, and BVSRSIS (Tables 2 and 3). Notably, when Samawi and Vogel (2013) used examples identical to those used by Samawi and Al-Saleh (2007), a comparison of the simulations showed that the Samawi and Vogel (2013) BVSRSIS approach improved the efficiency of estimating the multiple integrals by a factor ranging from 2 to 100.
As expected, the results of the simulation indicated that using BVSRSIS substantially improved the performance of the importance sampling method for integration.

Table 3 Relative efficiency of estimating Eq. (3.9) using BVRSIS as compared with using BVUSS

(z_1, z_2)    n = 20                                  n = 40
              ρ=0.20   ρ=0.50   ρ=0.80   ρ=0.95       ρ=0.20   ρ=0.50   ρ=0.80   ρ=0.95
(0, 0)        2.39     3.02     2.29     3.21         3.80     4.85     3.73     5.66
(−1, −1)      4.73     4.30     4.04     3.79         7.01     8.28     7.83     7.59
(−2, −2)      8.44     8.47     8.73     5.65         15.69    15.67    16.85    11.02

Source Extracted from Samawi and Al-Saleh (2007)
4 Steady-State Ranked Gibbs Sampler
Many approximation techniques are found in the literature, including Monte-Carlo methods, asymptotic methods, and Markov chain Monte-Carlo (MCMC) methods such as the Gibbs sampler (Evans and Swartz 1995). Recently, many statisticians have become interested in MCMC methods to simulate complex, nonstandard multivariate distributions. Of the MCMC methods, the Gibbs sampling algorithm is one of the best known and most frequently used. The impact of the Gibbs sampler method on Bayesian statistics has been detailed by many authors (e.g., Chib and Greenberg 1994; Tanner 1993) following the work of Tanner and Wong (1987) and Gelfand and Smith (1990).
To understand the MCMC process, suppose that we need to evaluate the Monte-Carlo integration E[f(X)], where f(·) is any user-defined function of a random variable X. The MCMC process is as follows: generate a sequence of random variables, {X_0, X_1, X_2, ...}, such that at each time t ≥ 0, the next state X_{t+1} is sampled from a distribution P(X_{t+1} | X_t) which depends only on the current state of the chain, X_t. This sequence is called a Markov chain, and P(· | ·) is called the transition kernel of the chain. The transition kernel is a conditional distribution function that represents the probability of moving from X_t to the next point X_{t+1} in the support of X. Assume that the chain is time homogeneous. Thus, after a sufficiently long burn-in of k iterations, {X_t; t = k + 1, ..., n} will be dependent samples from the stationary distribution. Burn-in samples are usually discarded for this calculation, giving the estimator

f̄ ≈ (1/(n − k)) Σ_{t=k+1}^{n} f(X_t).   (4.1)
Initialize X_0; set t = 0.
Repeat {
  generate a candidate Y from q(· | X_t) and a value u from a uniform (0, 1);
  if u ≤ α(X_t, Y), set X_{t+1} = Y; otherwise set X_{t+1} = X_t;
  increment t.
}
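For completeness, a small random-walk Metropolis–Hastings sketch in Python. The acceptance probability α(X_t, Y) is taken to be the standard Metropolis–Hastings ratio, which for a symmetric proposal reduces to min(1, π(Y)/π(X_t)); its definition is not reproduced in the extracted text, and the target in the usage comment is illustrative only.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_iter, proposal_sd=1.0, rng=None):
    """Random-walk Metropolis-Hastings sketch for a univariate target density.
    With a symmetric Gaussian proposal, accept Y with probability
    min(1, pi(Y)/pi(X_t)); log densities are used for numerical stability."""
    rng = np.random.default_rng(rng)
    chain = np.empty(n_iter)
    x = x0
    for t in range(n_iter):
        y = x + proposal_sd * rng.normal()            # candidate from q(.|x)
        if np.log(rng.uniform()) <= log_target(y) - log_target(x):
            x = y                                     # accept the candidate
        chain[t] = x                                  # otherwise keep current state
    return chain

# Example (illustrative): standard normal target; discard a burn-in of k draws.
# draws = metropolis_hastings(lambda v: -0.5 * v**2, x0=0.0, n_iter=5000)
```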
A special case of the Metropolis–Hastings algorithm is the Gibbs sampling method proposed by Geman and Geman (1984) and introduced by Gelfand and Smith (1990).
To date, most statistical applications of MCMC have used Gibbs sampling. In Gibbs sampling, variables are sampled one at a time from their full conditional distributions. Gibbs sampling uses an algorithm to generate random variables from a marginal distribution indirectly, without calculating the density. Similar to Casella and George (1992), we demonstrate the usefulness and the validity of the steady-state Gibbs sampling algorithm by exploring simple cases. This example shows that steady-state Gibbs sampling is based only on elementary properties of Markov chains and the properties of BVSRSIS.
Suppose that f(x, y_1, y_2, ..., y_g) is a joint density function on R^{g+1} and our purpose is to find the characteristics of the marginal density, such as the mean and the variance,

f_X(x) = ∫ ... ∫ f(x, y_1, y_2, ..., y_g) dy_1 dy_2 ... dy_g.   (4.3)

In cases where (4.3) is extremely difficult or not feasible to perform either analytically or numerically, Gibbs sampling enables the statistician to efficiently generate a sample X_1, ..., X_n ∼ f_X(x), without requiring f_X(x). If the sample size n is large enough, this method will provide a desirable degree of accuracy for estimating the mean and the variance of f_X(x).
The following discussion of the Gibbs sampling method uses a two-variable case to make the method simpler to follow. A case with more than two variables is illustrated later. The Gibbs sequence

X_0, Y_0, X_1, Y_1, ..., X_k, Y_k,   (4.4)

is obtained by starting from an initial value Y_0 = y_0 (which is a known or specified value) and obtaining the rest of the sequence (4.4) iteratively by alternately generating values from

X_j ∼ f_{X|Y}(x | Y_j = y_j),
Y_{j+1} ∼ f_{Y|X}(y | X_j = x_j).   (4.5)
For large k and under reasonable conditions (Gelfand and Smith 1990), the final observation in (4.5), namely X_j = x_j, is effectively a sample point from f_X(x). A natural way to obtain an independent and identically distributed (i.i.d.) sample from f_X(x) is to follow the suggestion of Gelfand and Smith (1990) to use Gibbs sampling to find the k-th, or final, value from n independent repetitions of the Gibbs sequence in (4.5). Alternatively, we can generate one long Gibbs sequence and use a systematic sampling technique to extract every r-th observation. For large enough r, this method will also yield an approximate i.i.d. sample from f_X(x). For the advantages and disadvantages of this alternate method see Gelman and Rubin (1991).
Next, we provide a brief explanation of why Gibbs sampling works under reasonable conditions. Suppose we know the conditional densities f_{X|Y}(x | y) and f_{Y|X}(y | x) of the two random variables X and Y, respectively. Then the marginal density of X, f_X(x), can be determined as follows:

f_X(x) = ∫ f(x, y) dy,
To guarantee an unbiased estimator for the mean, density, and the distribution function of f_X(x), Samawi et al. (2012) introduced two methods for performing steady-state Gibbs sampling. The first method is as follows:
In standard Gibbs sampling, the Gibbs sequence is obtained using the conditional distributions f_{X|Y}(x | y) and f_{Y|X}(y | x) to generate a sequence of random variables, starting from an initial, specified value Y_0 = y_0 and iteratively obtaining the rest of the sequence (4.7) by alternately generating values from

X_j ∼ f_{X|Y}(x | Y_j = y_j),
Y_{j+1} ∼ f_{Y|X}(y | X_j = x_j).   (4.8)
However, in steady-state Gibbs sampling (SSGS), the Gibbs sequence is obtained as follows. One step before the k-th step in the standard Gibbs sampling method, take the last step as

X_i(j) ∼ F_{X|Y}^(−1)(U_i(j) | Y_{k−1} = y_{k−1}),
Y_(i)j ∼ F_{Y|X}^(−1)(W_(i)j | X_{k−1} = x_{k−1}),   (4.9)
X_[i](j) ∼ F_{X|Y}^(−1)(U_i(j) | Y_(i)j = y_(i)j),

and, for the long-sequence (systematic sampling) version, with r in place of k,

X_i(j) ∼ F_{X|Y}^(−1)(U_i(j) | Y_{r−1} = y_{r−1}),
Y_(i)j ∼ F_{Y|X}^(−1)(W_(i)j | X_{r−1} = x_{r−1}),
X_[i](j) ∼ F_{X|Y}^(−1)(U_i(j) | Y_(i)j = y_(i)j),   (4.10)
Y_(i)[j] ∼ F_{Y|X}^(−1)(W_(i)j | X_i(j) = x_i(j)),
F_{Y|X}^(−1)(W_(i)j | X_i(j)) = Y_(i)[j] | X_i(j) ∼ f^(∞)_{Y_(i)j | X_i(j)}(y | X_i(j)).
f_{X_[i](j)}(x) = n^2 ∫_{Q_(i−1)/n}^{Q_(i)/n} f_Y(y) f_{X|Y}(x | y) dy
             = n ∫_{Q_(i−1)/n}^{Q_(i)/n} f_{X|Y}(x | y) [ n ∫_{Q_(j−1)/n}^{Q_(j)/n} f_X(t) f_{Y|X}(y | t) dt ] dy
             = ∫_{Q_(j−1)/n}^{Q_(j)/n} [ n ∫_{Q_(i−1)/n}^{Q_(i)/n} f_{Y|X}(y | t) f_{X|Y}(x | y) dy ] n f_X(t) dt   (4.11)
             = ∫_{Q_(j−1)/n}^{Q_(j)/n} [ n ∫_{Q_(i−1)/n}^{Q_(i)/n} f_{Y|X}(y | t) f_{X|Y}(x | y) dy ] f_{X_i(j)}(t) dt.
E(X̄_SSGS) = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} E(X_[i](j))
           = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫_{Q_Y(i−1)/n}^{Q_Y(i)/n} ∫_{Q_X(j−1)/n}^{Q_X(j)/n} x f^(∞)_{X_[i](j) Y_(i)[j]}(x, y) dx dy
           = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫_{Q_Y(i−1)/n}^{Q_Y(i)/n} ∫_{Q_X(j−1)/n}^{Q_X(j)/n} x n^2 f(x, y) dx dy
           = Σ_{j=1}^{n} ∫_{Q_X(j−1)/n}^{Q_X(j)/n} x f_X(x) [ Σ_{i=1}^{n} ∫_{Q_Y(i−1)/n}^{Q_Y(i)/n} f_{Y|X}(y | x) dy ] dx = E(X) = μ_x.
Similarly,

var(X̄_SSGS) = (1/n^4) Σ_{i=1}^{n} Σ_{j=1}^{n} Var(X_[i](j))
            = (1/n^4) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_{x[i](j)})^2 f^(∞)_{X_[i](j) Y_(i)[j]}(x, y) dx dy
            = (1/n^4) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_{x[i](j)} ± μ_x)^2 n^2 f(x, y) dx dy
            = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ [(x − μ_x) − (μ_{x[i](j)} − μ_x)]^2 f(x, y) dx dy,

and,

var(X̄_SSGS) = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ { (x − μ_x)^2 − 2(x − μ_x)(μ_{x[i](j)} − μ_x) + (μ_{x[i](j)} − μ_x)^2 } f(x, y) dx dy
            = (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_x)^2 f(x, y) dx dy − (2/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (x − μ_x)(μ_{x[i](j)} − μ_x) f(x, y) dx dy
              + (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (μ_{x[i](j)} − μ_x)^2 f(x, y) dx dy
            = σ_X^2/n^2 − (1/n^2) Σ_{i=1}^{n} Σ_{j=1}^{n} ∫∫ (μ_{x[i](j)} − μ_x)^2 f(x, y) dx dy ≤ V(X̄) = σ_X^2/n^2,

where each double integral is taken over Q_Y(i−1)/n ≤ y < Q_Y(i)/n and Q_X(j−1)/n ≤ x < Q_X(j)/n.
This section presents the results of a simulation study comparing the performance of the SSGS with the standard Gibbs sampling methods. To compare the performance of our proposed algorithm, we used the same illustrations as Casella and George (1992). For these examples, we used bivariate samples of sizes n = 10, 20, and 50, Gibbs sequence lengths k = 20, 50, and 100, and r = 20, 50, and 100 in the long-sequence Gibbs sampler. To estimate the variances of the estimators using the simulation method, we completed 5,000 replications. Using the 5,000 replications, we estimate the efficiency of our procedure relative to the traditional (i.e., standard) Gibbs sampling method by eff(θ̂, θ̂_SSGS) = Var(θ̂)/Var(θ̂_SSGS), where θ is the parameter of interest.
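The comparison can be mimicked with a short script. The sketch below uses the standard Beta-Binomial conditionals of Casella and George (1992), X | Y ∼ Binomial(m, Y) and Y | X ∼ Beta(X + α, m − X + β), and replaces the final conditional draws by conditional quantiles evaluated at stratified uniforms in the spirit of (4.9); the exact ordering of the final step is our reading of the algorithm, not the authors' code.

```python
import numpy as np
from scipy.stats import binom, beta

def ssgs_beta_binomial(n, k, m=5, a=2.0, b=4.0, rng=None):
    """Steady-state Gibbs sampling (SSGS) sketch for the Beta-Binomial example:
    n^2 parallel chains of length k; the last step uses stratified uniforms
    pushed through the conditional quantile functions."""
    rng = np.random.default_rng(rng)
    i = np.arange(n).reshape(-1, 1)
    j = np.arange(n).reshape(1, -1)
    # standard Gibbs burn-in of k - 1 update pairs, run for all n x n chains
    y = rng.uniform(size=(n, n))
    for _ in range(k - 1):
        x = rng.binomial(m, y)
        y = rng.beta(x + a, m - x + b)
    # final SSGS step: stratified uniforms on ((j-1)/n, j/n) and ((i-1)/n, i/n)
    u = (j + rng.uniform(size=(n, n))) / n
    w = (i + rng.uniform(size=(n, n))) / n
    x_ssgs = binom.ppf(u, m, y)                         # X_[i](j) = F^{-1}_{X|Y}(U | Y)
    y_ssgs = beta.ppf(w, x_ssgs + a, m - x_ssgs + b)    # Y_(i)[j] = F^{-1}_{Y|X}(W | X)
    return x_ssgs, y_ssgs

# Marginal mean estimates; the exact values are E(X) = m*a/(a+b) = 5/3 and E(Y) = 1/3.
# x, y = ssgs_beta_binomial(n=10, k=20); print(x.mean(), y.mean())
```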
Tables 4 and 5 show that, relative to the standard Gibbs sampling method, SSGS
improves the efficiency of estimating the marginal means. The amount of improve-
ment depends on two factors: (1) which parameters we intend to estimate, and (2)
the conditional distributions used in the process. Moreover, using the short or long
Gibbs sampling sequence has only a slight effect on the relative efficiency.
Table 4 Standard Gibbs sampling method compared with the Steady-State Gibbs Sampling (SSGS) method (Beta-Binomial distribution)

m = 5, α = 2, and β = 4
n^2     k     Sample mean      Sample mean   Relative       Sample mean      Sample mean   Relative
              Gibbs sampling   SSGS of X     efficiency     Gibbs sampling   SSGS of Y     efficiency
              of X                           of X           of Y                           of Y
100     20    1.672            1.668         3.443          0.340            0.334         3.787
        50    1.667            1.666         3.404          0.333            0.333         3.750
        100   1.667            1.666         3.328          0.333            0.333         3.679
400     20    1.666            1.666         3.642          0.333            0.333         3.861
        50    1.669            1.667         3.495          0.333            0.333         3.955
        100   1.668            1.667         3.605          0.333            0.333         4.002
2500    20    1.666            1.666         3.760          0.333            0.333         4.063
        50    1.668            1.667         3.786          0.333            0.333         3.991
        100   1.667            1.667         3.774          0.333            0.333         4.007

m = 16, α = 2, and β = 4
100     20    5.321            5.324         1.776          0.333            0.333         1.766
        50    5.334            5.334         1.771          0.333            0.333         1.766
        100   5.340            5.337         1.771          0.334            0.334         1.769
400     20    5.324            5.327         1.805          0.333            0.333         1.811
        50    5.333            5.333         1.816          0.333            0.333         1.809
        100   5.330            5.331         1.803          0.333            0.333         1.806
2500    20    5.322            5.325         1.820          0.333            0.333         1.828
        50    5.334            5.334         1.812          0.333            0.333         1.827
        100   5.334            5.333         1.798          0.333            0.333         1.820

Note The exact mean of x is equal to 5/3 and the exact mean of y is equal to 1/3 for the first case
Table 5 Comparison of the Long Gibbs sampling method and the Steady-State Gibbs Sampling (SSGS) method (Beta-Binomial distribution)

m = 5, α = 2, and β = 4
n^2     r     Sample mean      Sample mean   Relative       Sample mean      Sample mean   Relative
              Gibbs sampling   SSGS of X     efficiency     Gibbs sampling   SSGS of Y     efficiency
              of X                           of X           of Y                           of Y
100     20    1.667            1.666         3.404          0.333            0.333         3.655
        50    1.665            1.665         3.506          0.333            0.333         3.670
        100   1.668            1.667         3.432          0.334            0.333         3.705
400     20    1.667            1.667         3.623          0.333            0.333         4.014
        50    1.666            1.666         3.606          0.333            0.333         3.945
        100   1.667            1.667         3.677          0.333            0.333         3.997
2500    20    1.667            1.666         3.814          0.333            0.333         4.011
        50    1.667            1.667         3.760          0.333            0.333         4.125
        100   1.667            1.667         3.786          0.333            0.333         4.114

m = 16, α = 2, and β = 4
100     20    5.338            5.338         1.770          0.334            0.334         1.785
        50    5.335            5.334         1.767          0.334            0.334         1.791
        100   5.335            5.334         1.744          0.334            0.333         1.763
400     20    5.332            5.332         1.788          0.333            0.333         1.820
        50    5.337            5.336         1.798          0.333            0.333         1.815
        100   5.332            5.333         1.821          0.333            0.333         1.820
2500    20    5.332            5.332         1.809          0.333            0.333         1.821
        50    5.335            5.335         1.832          0.333            0.333         1.827
        100   5.333            5.333         1.825          0.333            0.333         1.806

Note The exact mean of x is equal to 5/3 and the exact mean of y is equal to 1/3
f(m | x, y) = e^{−(1−y)λ} [(1 − y)λ]^{m−x} / (m − x)!,   m = x, x + 1, ..., ∞. For this example, we used the following parameters: m = 5, α = 2, and β = 4.
Similarly, Table 7 illustrates the improved efficiency of using SSGS for marginal means estimation, relative to standard Gibbs sampling. Again, using a short or long Gibbs sampling sequence has only a slight effect on the relative efficiency. Note that this example is a three-dimensional problem, which shows that the improved efficiency depends on the parameters under consideration.

We show that SSGS converges in the same manner as in the standard Gibbs sampling method. However, Sects. 3 and 4 indicate that SSGS is more efficient than standard Gibbs sampling for estimating the means of the marginal distributions using the same sample size. In the examples provided above, the SSGS efficiency (versus standard Gibbs) ranged from 1.77 to 6.6, depending on whether Gibbs sampling used the long or short sequence method and the type of conditional distributions used in the process. Using SSGS yielded a reduced sample size, and thus, reduces the required computation.
[Table 7 (presented sideways in the original): Standard and Long Gibbs sampling methods compared with the Steady-State Gibbs Sampling (SSGS) method, Beta-Binomial and Poisson distributions; the table entries are not reproduced here.]
SSGS performs at least as well as standard Gibbs sampling and offers greater accuracy. Thus, we recommend using SSGS whenever a Gibbs sampling procedure is needed. Further investigation is needed to explore additional applications and more options for using the SSGS approach.
References
Al-Saleh, M. F., & Al-Omari, A. I. (1999). Multistage ranked set sampling. Journal of Statistical Planning and Inference, 102(2), 273–286.
Al-Saleh, M. F., & Samawi, H. M. (2000). On the efficiency of Monte-Carlo methods using steady state ranked simulated samples. Communication in Statistics - Simulation and Computation, 29(3), 941–954. doi:10.1080/03610910008813647.
Al-Saleh, M. F., & Zheng, G. (2002). Estimation of multiple characteristics using ranked set sampling. Australian & New Zealand Journal of Statistics, 44, 221–232. doi:10.1111/1467-842X.00224.
Al-Saleh, M. F., & Zheng, G. (2003). Controlled sampling using ranked set sampling. Journal of Nonparametric Statistics, 15, 505–516. doi:10.1080/10485250310001604640.
Casella, G., & George, E. I. (1992). Explaining the Gibbs sampler. American Statistician, 46(3), 167–174.
Chib, S., & Greenberg, E. (1994). Bayes inference for regression models with ARMA(p, q) errors. Journal of Econometrics, 64, 183–206. doi:10.1016/0304-4076(94)90063-9.
David, H. A. (1981). Order statistics (2nd ed.). New York, NY: Wiley.
Evans, M., & Swartz, T. (1995). Methods for approximating integrals in statistics with special emphasis on Bayesian integration problems. Statistical Science, 10(2), 254–272. doi:10.1214/ss/1177009938.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. doi:10.1080/01621459.1990.10476213.
Gelman, A., & Rubin, D. (1991). An overview and approach to inference from iterative simulation [Technical report]. Berkeley, CA: University of California-Berkeley, Department of Statistics.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741. doi:10.1109/TPAMI.1984.4767596.
Hammersley, J. M., & Handscomb, D. C. (1964). Monte-Carlo methods. London, UK: Chapman & Hall. doi:10.1007/978-94-009-5819-7.
Hastings, W. K. (1970). Monte-Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109. doi:10.1093/biomet/57.1.97.
Johnson, M. E. (1987). Multivariate statistical simulation. New York, NY: Wiley. doi:10.1002/9781118150740.
Johnson, N. L., & Kotz, S. (1972). Distribution in statistics: Continuous multivariate distributions. New York, NY: Wiley.
Liu, J. S. (2001). Monte-Carlo strategies in scientific computing. New York, NY: Springer.
McIntyre, G. A. (1952). A method of unbiased selective sampling, using ranked sets. Australian Journal of Agricultural Research, 3, 385–390. doi:10.1071/AR9520385.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equations of state calculations by fast computing machine. Journal of Chemical Physics, 21, 1087–1091. doi:10.1063/1.1699114.
Morgan, B. J. T. (1984). Elements of simulation. London, UK: Chapman & Hall. doi:10.1007/978-1-4899-3282-2.
Plackett, R. L. (1965). A class of bivariate distributions. Journal of the American Statistical Association, 60, 516–522. doi:10.1080/01621459.1965.10480807.
Robert, C., & Casella, G. (2004). Monte-Carlo statistical methods (2nd ed.). New York, NY: Springer. doi:10.1007/978-1-4757-4145-2.
Roberts, G. O. (1995). Markov chain concepts related to sampling algorithms. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte-Carlo in practice (pp. 45–57). London, UK: Chapman & Hall.
Samawi, H. M. (1999). More efficient Monte-Carlo methods obtained by using ranked set simulated samples. Communication in Statistics - Simulation and Computation, 28, 699–713. doi:10.1080/03610919908813573.
Samawi, H. M., & Al-Saleh, M. F. (2007). On the approximation of multiple integrals using multivariate ranked simulated sampling. Applied Mathematics and Computation, 188, 345–352. doi:10.1016/j.amc.2006.09.121.
Samawi, H. M., Dunbar, M., & Chen, D. (2012). Steady state ranked Gibbs sampler. Journal of Statistical Simulation and Computation, 82, 1223–1238. doi:10.1080/00949655.2011.575378.
Samawi, H. M., & Vogel, R. (2013). More efficient approximation of multiple integration using steady state ranked simulated sampling. Communications in Statistics - Simulation and Computation, 42, 370–381. doi:10.1080/03610918.2011.636856.
Shreider, Y. A. (1966). The Monte-Carlo method. Oxford, UK: Pergamon Press.
Stokes, S. L. (1977). Ranked set sampling with concomitant variables. Communications in Statistics - Theory and Methods, 6, 1207–1211. doi:10.1080/03610927708827563.
Tanner, M. A. (1993). Tools for statistical inference (2nd ed.). New York, NY: Wiley. doi:10.1007/978-1-4684-0192-9.
Tanner, M. A., & Wong, W. (1987). The calculation of posterior distribution by data augmentation (with discussion). Journal of the American Statistical Association, 82, 528–550. doi:10.1080/01621459.1987.10478458.
Tierney, L. (1995). Introduction to general state-space Markov chain theory. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov Chain Monte-Carlo in practice (pp. 59–74). London, UK: Chapman & Hall.
Normal and Non-normal Data Simulations
for the Evaluation of Two-Sample Location
Tests
Abstract Two-sample location tests refer to the family of statistical tests that compare two independent distributions via measures of central tendency, most commonly means or medians. The t-test is the most recognized parametric option for two-sample mean comparisons. The pooled t-test assumes the two population variances are equal. Under circumstances where the two population variances are unequal, Welch's t-test is a more appropriate test. Both of these t-tests require data to be normally distributed. If the normality assumption is violated, a non-parametric alternative such as the Wilcoxon rank-sum test has potential to maintain adequate type I error and appreciable power. While sometimes considered controversial, pretesting for normality followed by the F-test for equality of variances may be applied before selecting a two-sample location test. This option results in multi-stage tests as another alternative for two-sample location comparisons, starting with a normality test, followed by either Welch's t-test or the Wilcoxon rank-sum test. Less commonly utilized alternatives for two-sample location comparisons include permutation tests, which evaluate statistical significance based on empirical distributions of test statistics. Overall, a variety of statistical tests are available for two-sample location comparisons. Which tests demonstrate the best performance in terms of type I error and power depends on variations in data distribution, population variance, and sample size. One way to evaluate these tests is to simulate data that mimic what might be encountered in practice. In this chapter, the use of Monte Carlo techniques is demonstrated to simulate normal and non-normal data for the evaluation of two-sample location tests.
1 Introduction
Statistical tests for two-sample location comparison include t-tests, the Wilcoxon rank-sum test, permutation tests, and multi-stage tests. The t-test is the most recognized parametric test for two-sample mean comparison. It assumes independent identically distributed (i.i.d.) samples from two normally distributed populations. The pooled t-test (also called Student's t-test) additionally assumes the equality of variances, while Welch's t-test does not. Other tests such as the Wilcoxon rank-sum test (also called the Mann-Whitney test) and Fisher-Pitman permutation tests make no such distributional assumptions and thus are theoretically robust against non-normal distributions. Multi-stage tests that comprise preliminary tests for normality and/or equality of variance before a two-sample location comparison test is attempted may also be used as alternatives.
When the normal assumption is met, t-tests are optimal. The pooled t-test is also robust to the equal variance assumption, but only when the sample sizes are equal (Zimmerman 2004). Welch's t-test does not alter the type I error rate regardless of unequal sample sizes and performs as well as the pooled t-test when the equality of variances holds (Zimmerman 2004; Kohr and Games 1974), but it can result in substantial type II error under heavily non-normal distributions and for small sample sizes, e.g. Beasley et al. (2009). Under circumstances of equal variance with either equal or unequal sample sizes less than or equal to 5, De Winter recommended the pooled t-test over Welch's t-test, provided an adequately large population effect size, due to a significant loss in statistical power associated with Welch's t-test (de Winter 2013). The power loss associated with Welch's t-test in this scenario is attributed to its lower degrees of freedom compared to the pooled t-test (de Winter 2013).
In samples with known non-normal distributions, the Wilcoxon rank-sum test is far superior to parametric tests, with power advantages actually increasing with increasing sample size (Sawilowsky 2005). The Wilcoxon rank-sum test, however, is not a solution to heterogeneous variance. Simply, heterogeneous variance is mitigated but does not disappear when actual data are converted to ranks (Zimmerman 1996). In addition to the Wilcoxon rank-sum test, permutation tests have been recommended as a supplement to t-tests across a range of conditions, with emphasis on samples with non-normal distributions. Boik (1987), however, found that the Fisher-Pitman permutation test (Ernst et al. 2004) is no more robust than the t-test in terms of type I error rate under circumstances of heterogeneous variance.
A pretest for equal variance followed by the pooled t-test or Welch's t-test, however appealing, fails to protect the type I error rate (Zimmerman 2004; Rasch et al. 2011; Schucany and Tony Ng 2006). With the application of a pre-test for normality (e.g. Shapiro-Wilk, Royston 1982), however, it remains difficult to fully assess the normality assumption: acceptance of the null hypothesis is predicated only on insufficient evidence to reject it (Schucany and Tony Ng 2006). A three-stage procedure which also includes a test for equal variance (e.g. F-test or Levene's test) following a normality test is also commonly applied in practice, but as Rasch and colleagues have pointed out, pretesting biases type I and type II conditional error rates. They suggest that the pooled t-test should be replaced with Welch's t-test in general practice because it is robust against heterogeneous variance (Rasch et al. 2011; Welch 1938).
In summary, the choice of a test for two-sample location comparison depends on variations in data distribution, population variance, and sample size. Previous studies suggested Welch's t-test for normal data and the Wilcoxon rank-sum test for non-normal data without heterogeneous variance (Ruxton 2006). Some recommended Welch's t-test for general use (Rasch et al. 2011; Zimmerman 1998). However, Welch's t-test is not a powerful test when the data is extremely non-normal or the sample size is small (Beasley et al. 2009). Previous Monte Carlo simulations tuned the parameters ad hoc and compared a limited selection of two-sample location tests. The simulation settings were simplified by studying either non-normal data only, unequal variances only, small sample sizes only, or unequal sample sizes only.
In this chapter, simulation experiments are designed to mimic what is often encountered in practice. Sample sizes are calculated for a broad range of power based on the pooled t-test, i.e. assuming normally distributed data with equal variance. A medium effect size is assumed (Cohen 2013) as well as multiple sample size ratios. Although the sample sizes are determined assuming normal data with equal variance, the simulations consider normal and moderately non-normal data and allow for heterogeneous variance. The simulated data are used to compare two-sample location tests including parametric tests, non-parametric tests, and permutation-based tests. The goal of this chapter is to provide insight into these tests and how Monte Carlo simulation techniques can be applied to demonstrate their evaluation.
2 Statistical Tests
Capital letters are used to represent random variables and lowercase letters represent realized values. Additionally, vectors are presented in bold. Assume x = x_1, x_2, ..., x_{n_1} and y = y_1, y_2, ..., y_{n_2} are two i.i.d. samples from two independent populations with means μ_1 and μ_2 and variances σ_1^2 and σ_2^2. Denote by x̄ and ȳ the sample means and s_1^2 and s_2^2 the sample variances. Let z = z_1, ..., z_{n_1}, z_{n_1+1}, ..., z_{(n_1+n_2)} represent a vector of group labels, with z_i = 1 associated with x_i for i = 1, 2, ..., n_1 and z_{(n_1+j)} = 2 associated with y_j for j = 1, 2, ..., n_2. The combined x's and y's are ranked from largest to smallest. For tied observations, the average rank is assigned and each is still treated uniquely. Denote by r_i the rank associated with x_i and s_i the rank associated with y_i. The sum of ranks in x's is given by R = Σ_{i=1}^{n_1} r_i and by S = Σ_{i=1}^{n_2} s_i for the sum of ranks in y's.

Next, the two-sample location tests considered in the simulations are reviewed. The performance of these tests is evaluated by type I error and power.
2.1 t-Test
The pooled t-test and Welch's t-test assume that each sample is randomly sampled from a population that is approximately normally distributed. The pooled t-test further assumes that the two variances are equal (σ_1^2 = σ_2^2). The test statistic for the null hypothesis that the two population means are equal (H_0: μ_1 = μ_2) is given by

t_pooled = (x̄ − ȳ) / ( s_p √(1/n_1 + 1/n_2) ),   (1)

t_welch = (x̄ − ȳ) / √( s_1^2/n_1 + s_2^2/n_2 ).   (2)

The F statistic

F = s_1^2 / s_2^2   (4)

can be applied to test the equality of variances. If the null hypothesis is rejected, Welch's t-test is preferred; otherwise, the pooled t-test is used.
The Wilcoxon rank-sum test is essentially a rank-based test. It uses the rank data to compare two distributions. The test statistic to test against the null hypothesis that the two distributions are the same is given by

U = n_1 n_2 + [n_2 (n_2 + 1)]/2 − R.   (5)
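A compact sketch computing the statistics in (1), (2), (4), and (5). The pooled standard deviation s_p is taken to be the usual pooled estimate (its definition is not reproduced in this extraction), and ranks are assigned in the conventional ascending order rather than from largest to smallest.

```python
import numpy as np

def two_sample_statistics(x, y):
    """Compute t_pooled (1), t_welch (2), F (4), and U (5) for two samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n1, n2 = len(x), len(y)
    # pooled variance (assumed definition of s_p)
    sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / (n1 + n2 - 2)
    t_pooled = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))      # (1)
    t_welch = (x.mean() - y.mean()) / np.sqrt(x.var(ddof=1) / n1
                                              + y.var(ddof=1) / n2)          # (2)
    f_stat = x.var(ddof=1) / y.var(ddof=1)                                   # (4)
    z = np.concatenate([x, y])
    order = z.argsort()
    ranks = np.empty_like(z)
    ranks[order] = np.arange(1, n1 + n2 + 1)   # simple ranks; tie-averaging omitted
    r_sum = ranks[:n1].sum()                    # R = sum of ranks of the x's
    u_stat = n1 * n2 + n2 * (n2 + 1) / 2 - r_sum                             # (5)
    return t_pooled, t_welch, f_stat, u_stat
```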
2.3 Two-Stage Test
The choice between the t-test and the Wilcoxon rank-sum test can be based on whether the normality assumption is met for both samples. Prior to performing the mean comparison test, the Shapiro-Wilk test (Royston 1982) can be used to evaluate normality. If both p-values are greater than the significance level α, the t-test is used; otherwise, the Wilcoxon rank-sum test is used. The chosen t-test here is Welch's t-test for its robustness against heterogeneous variance. Since the two normality tests for each sample are independent, the overall type I error rate (i.e. the family-wise error rate) that at least one hypothesis is incorrectly rejected is thus controlled at 2α. When α = 2.5%, it results in the typical value 5% for the family-wise error rate.
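A minimal sketch of the two-stage procedure, using scipy's Shapiro-Wilk, Welch, and Mann-Whitney implementations as stand-ins for the tests described above.

```python
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def two_stage_test(x, y, alpha=0.025):
    """Two-stage procedure: Shapiro-Wilk on each sample at level alpha; if both
    samples look normal, use Welch's t-test, otherwise the Wilcoxon rank-sum
    (Mann-Whitney) test. Returns the chosen test and its p-value."""
    normal = shapiro(x).pvalue > alpha and shapiro(y).pvalue > alpha
    if normal:
        return "welch", ttest_ind(x, y, equal_var=False).pvalue
    return "wilcoxon", mannwhitneyu(x, y, alternative="two-sided").pvalue
```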
and

p_p.wilcox = (2/B) min( Σ_{k=1}^{B} I(s_k^w ≤ s_o^w), Σ_{k=1}^{B} I(s_k^w > s_o^w) ),   (7)

where I(·) = 1 if the condition in the parentheses is true; otherwise, I(·) = 0. Similarly, the p-value associated with the minimum p-value is given by

p_minp = (1/B) Σ_{k=1}^{B} I( m_k(p_k^t, p_k^w) ≤ m_o(p_o^t, p_o^w) ).   (8)
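Equation (6) and the surrounding permutation setup are not reproduced in this extraction, so the sketch below is only one plausible reading of the minimum-p-value procedure in (8): the observed minimum of the Welch and Wilcoxon p-values is referred to its permutation distribution, obtained by relabelling the pooled sample B times.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

def min_p_permutation(x, y, B=1000, rng=None):
    """Permutation p-value for the minimum of the Welch and Wilcoxon p-values,
    in the spirit of (8); details of the permuted statistics are an assumption."""
    rng = np.random.default_rng(rng)
    z = np.concatenate([x, y])
    n1 = len(x)

    def min_p(a, b):
        pt = ttest_ind(a, b, equal_var=False).pvalue
        pw = mannwhitneyu(a, b, alternative="two-sided").pvalue
        return min(pt, pw)

    m_obs = min_p(x, y)
    count = 0
    for _ in range(B):
        perm = rng.permutation(z)              # relabel the pooled sample
        if min_p(perm[:n1], perm[n1:]) <= m_obs:
            count += 1
    return count / B
```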
3 Simulations
Simulated data were used to test the null hypothesis H_0: μ_1 = μ_2 versus the alternative hypothesis H_1: μ_1 ≠ μ_2. Assume x = {x_1, ..., x_{n_1}} was simulated from N(μ_1, σ_1^2) and y = {y_1, ..., y_{n_2}} was simulated from N(μ_2, σ_2^2). Without losing generality, let σ_1 and σ_2 be set equal to 1. Let μ_1 be zero, and let μ_2 be 0 under the null hypothesis and 0.5 under the alternative hypothesis, the conventional value suggested by Cohen (2013) for a medium effect size. n_1 is set equal to or double the size of n_2. n_1 and n_2 were chosen to detect the difference between μ_1 and μ_2 for 20, 40, 60, or 80% power at the 5% significance level. When n_1 = n_2, the sample size required per group was 11, 25, 41, and 64, respectively. When n_1 = 2 n_2, n_1 = 18, 38, 62, 96 and n_2 was half of n_1 in each setting. The same sample sizes were used for the null simulations and for the non-normal and heteroscedastic settings. Power analysis was conducted using G*Power software (Faul et al. 2007).
Fleishman’s power method to simulate normal and non-normal data is based on
a polynomial function given by
x = f (ω) = a + b ω + c ω2 + d ω3 , (9)
where ω is a random value from the standard normal with mean 0 and standard
deviation 1. The coefficients a , b , c , and d are determined by the first four moments
of X with the first two moments set to 0 and 1. For distributions with mean and
standard
after beingdeviation different
simulated. from 0 the
Let γ3 denote andskewness
1, the data andcan
γ 4 be shifted
denote the and/or
kurtosis.rescaled
γ 3 and
γ4 are both set to 0 if X is normally distributed. The distribution is left-skewed if
γ3 < 0 an
andd ri
righ
ght-
t-sk
skeewe
wed d if γ3 > 0. γ4 is sm
smalalle
lerr th
than
an 0 fo
forr a pl
plat
atok
okurt
urtot
otic
ic di
dist
stri
ribu
butition
on
and
an d gre
great
ater
er tha
than
n 0 fo
forr a le
lept
ptok
okur
urto
toti
ticc dist
distri
ribu
buti
tion.
on. By ththee 12 mo
momement
ntss of the
the st
stan
andadard
rd
normal distribution, we can derive the equations below and solve a , b, c, and d via
the Newton-Raphson method or any other non-linear root-finding method,
a = −c (10)
b2 + 6bd + 2 c2 + 15d 2 − 1 = 0 (11)
2 2
2c b + 24bd + 105d + 2 − γ3 = 0
(12)
24 bd + c 2
1 + b + 28bd + d 2 12 + 48bd + 141c 2 + 225d 2
2
− γ4 = 0 (13)
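A sketch of solving (10)–(13) for the Fleishman coefficients and then simulating data via (9); scipy's fsolve stands in for the Newton-Raphson step (the chapter allows any non-linear root finder), and all function names are our own.

```python
import numpy as np
from scipy.optimize import fsolve

def fleishman_coefficients(skew, kurt):
    """Solve (10)-(13) for the Fleishman coefficients (a, b, c, d)."""
    def equations(p):
        b, c, d = p
        eq1 = b**2 + 6*b*d + 2*c**2 + 15*d**2 - 1                       # (11)
        eq2 = 2*c*(b**2 + 24*b*d + 105*d**2 + 2) - skew                 # (12)
        eq3 = 24*(b*d + c**2*(1 + b**2 + 28*b*d)
                  + d**2*(12 + 48*b*d + 141*c**2 + 225*d**2)) - kurt    # (13)
        return eq1, eq2, eq3
    b, c, d = fsolve(equations, x0=(1.0, 0.0, 0.0))
    return -c, b, c, d                                                   # a = -c by (10)

def fleishman_sample(n, skew=0.0, kurt=0.0, mean=0.0, sd=1.0, rng=None):
    """Simulate n values with the requested skewness and kurtosis via (9),
    then shift and rescale to the requested mean and standard deviation."""
    rng = np.random.default_rng(rng)
    a, b, c, d = fleishman_coefficients(skew, kurt)
    w = rng.normal(size=n)
    return mean + sd * (a + b*w + c*w**2 + d*w**3)
```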
One limitation of Fleishman's power method is that it does not cover the entire domain of skewness and kurtosis. Given γ_3, the relationship between γ_3 and γ_4 is described by the inequality γ_4 ≥ γ_3^2 − 2 (Devroye 1986). Precisely, the empirical lower bound of γ_4 given γ_3 = 0 is −1.151320 (Headrick and Sawilowsky 2000).

By Fleishman's power method, three conditions were investigated: (1) heterogeneous variance, (2) skewness, and (3) kurtosis. Each was investigated using equal and unequal sample sizes to achieve a broad range of power. μ_1 = μ_2 = 0 under the null hypothesis. μ_1 = 0 and μ_2 = 0.5 when the alternative hypothesis is true. Other parameters were manipulated differently. For (1), normal data was simulated with equal and unequal variances by letting σ_1 = 1 and σ_2 = 0.5, 1, 1.5. For (2), skewed data was simulated assuming equal variance at 1, equal kurtosis at 0, and equal skewness at 0, 0.4, 0.8. Similarly, for (3), kurtotic data was simulated assuming equal variance at 1, equal skewness at 0, and equal kurtosis at 0, 5, 10. To visualize the distributions from which the data were simulated, 10^6 data points were simulated from the null distributions and used to create density plots (see Fig. 1). To best visualize the distributions, the data range was truncated at 5 and −5. The left panel corresponds to the heteroscedastic condition, the middle panel to the skewness condition, and the right panel to the kurtosis condition.
Fig. 1 Distributions to simulate heteroscedastic, skewed, and kurtotic data when the null hypothesis of equal population means is true (left: normal distributions with means at 0 and standard deviations at 0.5, 1, and 1.5; middle: distributions with means at 0, standard deviations at 1, skewness at 0, 0.4, and 0.8, and kurtosis at 0; right: distributions with means at 0, standard deviations at 1, skewness at 0, and kurtosis at 0, 5, and 10)
4 Results
The simulation results are presented in figures. In each figure, the type I error or power is presented on the y-axis and the effective sample size per group to achieve 20%, 40%, 60%, and 80% power is presented on the x-axis, assuming the pooled t-test is appropriate. The effective sample size per group for two-sample mean comparison is defined as

n_e = 2 n_1 n_2 / (n_1 + n_2).   (14)
Each test is represented by a colored symbol. Given an effective sample size, the results of
1. t-tests (pooled t-test, theoretical and permutation Welch's t-tests, robust t-test)
2. Wilcoxon rank-sum tests (theoretical and permutation Wilcoxon rank-sum tests) and the minimum p-value (minimum p-value of permutation Welch's t-test and Wilcoxon rank-sum test)
3. two-stage tests (normality test with the α level at 0.5, 2.5, and 5% for both samples, followed by Welch's t-test or Wilcoxon rank-sum test)
are aligned in three columns from left to right. The results for n_1 = n_2 are in the left panels and those for n_1 = 2 n_2 are in the right panels. In the figures that present type I error results, two horizontal lines y = 0.0457 and y = 0.0543 are added to judge whether the type I error is correct. The type I error is considered correct if it falls within the 95% confidence interval of the 5% significance level (0.0457, 0.0543). We use valid to describe tests that maintain a correct type I error, and liberal and conservative for tests that result in a type I error above and under the nominal level, respectively.
1.5. All the valid tests have similar power. Tests with an inflated type I error tend to
be slightly more powerful. In contrast, tests that have a conservative type I error tend
to be slightly less powerful.
4.2 Skewness
The two sample size ratios result in similar patterns of type I error and power. All tests maintain a correct type I error across the spectrum of sample size and skewness except 2stage0.5, which has an inflated type I error when the sample size is insufficiently large and the skewness is 0.8 (Fig. 4). A similar pattern for power is observed within groups (three groups in three columns). With an increase in skewness, the power of t-tests (pooled t-test, theoretical and permutation Welch's t-test, robust t-test) remains at the level as planned and is slightly lower than that of the other tests when the skewness reaches 0.8 (Fig. 5).
4.3 Kurtosis
Based on Fig. 6, all tests maintain a correct type I error when the effective sample size is equal to or greater than 25. When the effective sample size is small, e.g. n_e = 11, and the kurtosis is 10, t-tests (with the exception of permutation Welch's t-test) exhibit a conservative type I error; the type I error of permutation Welch's t-test is protected, with consistently similar power to other t-tests. Figure 7 displays tests within groups (three groups in three columns) with similar power. While not apparent when the sample size is small, t-tests are not as compelling as other tests when the kurtosis is 5 or greater. Wilcoxon rank-sum tests are the best performing tests in this setting, but the power gain is not significant when compared to the minimum p-value and two-stage tests.
5 Discussion
The goal of this chapter was to conduct an evaluation of the variety of tests for two-sample location comparison using Monte Carlo simulation techniques. In conclusion, heterogeneous variance is not a problem for any of the tests when the sample sizes are equal. For unequal sample sizes, Welch's t-test maintains robustness against heterogeneous variance. The interaction between heterogeneous variance and sample size is consistent with the statement of Kohr and Games (1974) for general two-sample location comparison that a test is conservative when the larger sample has the larger population variance and is liberal when the smaller sample has the larger population variance. When the normality assumption is violated by moderately skewed or kurtotic data, Welch's t-test and other t-tests maintain a correct type I error so long as the sample size is sufficiently large. The t-tests, however, are not as powerful as the alternatives. The minimum p-value is a robust option but computationally more expensive. Otherwise, heterogeneous variance and deviation from a normal distribution must be weighed in selecting between Welch's t-test and the Wilcoxon rank-sum test. Alternatively, Zimmerman and Zumbo recommended that Welch's t-test performs better than the Wilcoxon rank-sum test under conditions of both equal and unequal variance on data that has been pre-ranked (Zimmerman and Zumbo 1993).
Although moderately non-normal data has been the focus of this chapter, the con-
clusions are still useful as a general guideline. When data is extremely non-normal,
none of the tests will be appropriate. Data transformation may be applied to meet
the normality assumption (Osborne 2005, 2010). Not every distribution, however,
can be transformed to normality (e.g. L-shaped distributions). In this scenario, data
dichotomization may be applied to simplify statistical analysis if information loss is
not a concern (Altman and Royston 2006). Additional concerns include multimodal distributions which may be a result of subgroups or data contamination (Marrero 1985)—a few outliers can distort the distribution completely. All of these consid-
erations serve as a reminder that prior to selecting a two-sample location test, the
first key to success is full practical and/or clinical understanding of the data being
assessed.
References
Marrero, O. (1985). Robustness of statistical tests in the two-sample location problem. Biometrical
follow nearly any distribution in a multivariate setting is dichotomized or ordinalized. Statisticians generally regard discretization as a bad idea on the grounds of power, information, and effect size loss. Despite this undeniable disadvantage and legitimate
criticism, its widespread use in social, behavioral, and medical sciences stems from
the fact that discretization could yield simpler, more interpretable, and understand-
able conclusions, especially when large audiences are targeted for the dissemination
of the research outcomes. We do not intend to attach any negative or positive con-
notations to discretization, nor do we take a position of advocacy for or against it.
The purpose of the current chapter is to provide a conceptual framework and computational algorithms for modeling the correlation transitions under specified distributional assumptions within the realm of discretization in the context of the latency and
threshold concepts. Both directions (identification of the pre-discretization correla-
tion value in order to attain a specified post-discretization magnitude, and the other
way around) are discussed. The ideas are developed for bivariate settings; a natural
extension to the multivariate case is straightforward by assembling the individual
correlation entries. The paradigm under consideration has important implications
and broad applicability in the stochastic simulation and random number generation
worlds. The proposed algorithms are illustrated by several examples; feasibility and
performance of the methods are demonstrated by a simulation study.
H. Demirtas (B)
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: [email protected]
C. Vardar-Acar
Department of Statistics, Middle East Technical University, Ankara, Turkey
e-mail: [email protected]
1 Introduction
Unlike natural (true) dichotomies such as male versus female, conductor versus insu-
lator, vertebrate versus invertebrate, and in-patient versus out-patient, some binary
variables are derived through dichotomization of underlying continuous measure-
ments. Such artificial dichotomies often arise across many scientific disciplines.
Examples include obesity status (obese versus non-obese) based on body mass index, preterm versus term babies given the gestation period, high versus low need of social
interaction, small versus large tumor size, early versus late response time in sur-
veys, young versus old age, among many others. In the ordinal case, discretization
is equally commonly encountered in practice. Derived polytomous variables such
as young-middle-old age, low-medium-high income, cold-cool-average-hot tem-
perature, no-mild-moderate-severe depression are obtained based on nominal age,
income, temperature, and depression score, respectively. While binary is a special
case of ordinal, for the purpose of illustration, integrity, and clarity, separate argu-
ments are presented throughout the chapter. On a terminological note, we use the
words binary/dichotomous and ordinal/polytomous interchangeably to simultane-
ously reflect the preferences of statisticians/psychometricians. Obviously, polyto-
mous variables can normally be ordered (ordinal) or unordered (nominal). For the
remainder of the chapter, the term "polytomous" is assumed to correspond to ordered
variables.
Discretization is typically shunned by statisticians for valid reasons, the most
prominent of which is the power and information loss. In most cases, it leads to a
diminished effect size as well as reduced reliability and strength of association. How-
ever, simplicity, better interpretability and comprehension of the effects of interest,
and superiority of some categorical data measures such as odds ratio have been argued
by proponents of discretization. Those who are against it assert that the regression
paradigm is general enough to account for interactive effects, outliers, skewed distri-
butions, and nonlinear relationships. In practice, especially substantive researchers
and practitioners employ discretization in their works. For conflicting views on rel-
ative perils and merits of discretization, see MacCallum et al. (2002) and Farrington and Loeber (2000), respectively. We take a neutral position; although it is not a
recommended approach from the statistical theory standpoint, it frequently occurs
in practice, mostly driven by improved understandability-based arguments. Instead
of engaging in fruitless philosophical discussions, we feel that a more productive
effort can be directed towards finding answers when the discretization is performed,
which motivates the formation of this chapter’s major goals: (1) The determina-
tion of correlational magnitude changes when some of the continuous variables that
may marginally or jointly follow almost any distribution in a multivariate setting
are dichotomized or ordinalized. (2) The presentation of a conceptual and compu-
tational framework for modeling the correlational transformations before and after
discretization.
the power polynomials (Fleishman 1978; Vale and Maurelli 1983). In the ordinal case, modeling the correlation transition can only be performed iteratively under the normality assumption (Ferrari and Barbiero 2012). Demirtas et al. (2016a) aug-
mented the idea of computing the correlation before or after discretization when the
other one is specified and vice versa, to an ordinal setting. The primary purpose of
this chapter is providing several algorithms that are designed to connect pre- and
post-discretization correlations under specified distributional assumptions in simu-
lated environments. More specifically, the following relationships are established:
(a) tetrachoric correlation/phi coefficient, (b) biserial/point-biserial correlations,
(c) polychoric correlation/ordinal phi coefficient, and (d) polyserial/point-polyserial
correlations, where (a)–(b) and (c)–(d) are relevant to binary and ordinal data, respec-
tively; (b)–(d) and (a)–(c) pertain to situations where only one or both variables
is/are discretized, respectively. In all of these cases, the marginal distributions that
are needed for finding skewness (symmetry) and elongation (peakedness) values for
the underlying continuous variables, proportions for binary and ordinal variables,
and associational quantities in the form of the Pearson correlation are assumed to be
specified.
This work is important and of interest for the following reasons: (1) The link
between these types of correlations has been studied only under the normality
assumption; however, the presented level of generality that encompasses a com-
ables are discretized in some studies and remained continuous in some others.
The organization of the chapter is as follows. In Sect. 2, necessary background
information is given for the development of the proposed algorithms. In particular,
how the correlation transformation works for discretized data through numerical inte-
gration and an iterative scheme for the binary and ordinal cases, respectively, under
the normality assumption, is outlined; a general nonnormal continuous data gener-
ation technique that forms a basis for the proposed approaches is described (when
both variables are discretized), and an identity that connects correlations before and
after discretization via their mutual associations is elaborated upon (when only one
variable is discretized). In Sect. 3, several algorithms for finding one quantity given
the other are provided under various combinations of cases (binary versus ordinal,
directionality in terms of specified versus computed correlation with respect to pre-
versus post-discretization, and whether discretization is applied on one versus both
variables) and some illustrative examples, representing a broad range of distributional
shapes that can be encountered in real applications, are presented for the purpose of
exposition. In Sect. 4, a simulation study for evaluating the method’s performance
in a multivariate setup by commonly accepted accuracy (unbiasedness) measures in
both directions is discussed. Section 5 includes concluding remarks, future research
directions, limitations, and extensions.
2 Building Blocks
This section gives necessary background information for the development of the
proposed algorithms in modeling correlation transitions. In what follows, correlation type and related notation depend on the three factors: (a) before or after discretization, (b) only one or both variables is/are discretized, and (c) the discretized variable is
dichotomous or polytomous. To establish the notational convention, for the remain-
der of the chapter, let Y1 and Y2 be the continuous variables where either Y1 only
or both are discretized to yield X 1 and X 2 depending on the correlation type under
consideration (When Y’s are normal, they are denoted as Z , which is relevant for the
normal-based results and for the nonnormal extension via power polynomials). To
distinguish between binary and ordinal variables, the symbols B and O appear in the
subscripts. Furthermore, for avoiding any confusion, the symbols $BS$, $TET$, $PS$, and $POLY$ are made a part of $\delta_{Y_1Y_2}$ and $\delta_{Z_1Z_2}$ to differentiate among the biserial, tetra-
choric, polyserial, and polychoric correlations, respectively. For easier readability,
we include Table 1 that shows the specific correlation types and associated notational
symbols based on the three above-mentioned factors.
Let $\Phi[z_1, z_2, \delta_{Z_1Z_2}^{TET}]$ denote the cumulative distribution function for a standard bivariate normal random variable with correlation coefficient $\delta_{Z_1Z_2}^{TET}$. Obviously, $\Phi[z_1, z_2, \delta_{Z_1Z_2}^{TET}] = \int_{-\infty}^{z_1}\int_{-\infty}^{z_2} f(z_1, z_2, \delta_{Z_1Z_2}^{TET})\, dz_1\, dz_2$, where $f(z_1, z_2, \delta_{Z_1Z_2}^{TET}) = [2\pi(1 - \delta_{Z_1Z_2}^{TET\,2})^{1/2}]^{-1} \exp\{-(z_1^2 - 2\delta_{Z_1Z_2}^{TET} z_1 z_2 + z_2^2)/(2(1 - \delta_{Z_1Z_2}^{TET\,2}))\}$. The phi coefficient ($\delta_{X_{1B}X_{2B}}$) and the tetrachoric correlation ($\delta_{Z_1Z_2}^{TET}$) are linked via the equation

$$\Phi[z(p_1), z(p_2), \delta_{Z_1Z_2}^{TET}] = \delta_{X_{1B}X_{2B}}\,(p_1 q_1 p_2 q_2)^{1/2} + p_1 p_2 \qquad (1)$$
generated, and the binary variables are derived by setting $X_{jB} = 1$ if $Z_j \ge z(1 - p_j)$ and 0 otherwise for $j = 1, 2$. While RNG is not our ultimate interest in this work, we can use Eq. 1 for bridging the phi coefficient and the tetrachoric correlation.
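Under normality, Eq. 1 can be inverted numerically. The following is a minimal R sketch (not the chapter's own code) that recovers the tetrachoric correlation from a specified phi coefficient using the pmvnorm function of the mvtnorm package mentioned later in this chapter; the helper name and the bracketing interval are our own assumptions.

```r
# Minimal sketch: invert Eq. 1 numerically under normality.
library(mvtnorm)

phi_to_tetrachoric <- function(phi, p1, p2) {
  q1 <- 1 - p1; q2 <- 1 - p2
  lhs <- function(delta) {
    # Phi[z(p1), z(p2), delta]: bivariate normal cdf at the marginal quantiles
    pmvnorm(upper = c(qnorm(p1), qnorm(p2)),
            corr  = matrix(c(1, delta, delta, 1), 2))
  }
  rhs <- phi * sqrt(p1 * q1 * p2 * q2) + p1 * p2   # right-hand side of Eq. 1
  uniroot(function(d) lhs(d) - rhs, interval = c(-0.999, 0.999))$root
}

phi_to_tetrachoric(phi = 0.3, p1 = 0.55, p2 = 0.30)
```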
When only one of the normal variables ($Z_1$) is dichotomized, i.e., $X_{1B} = I(Z_1 \ge z(1 - p_1))$, it is relatively easy to show that $\delta_{X_{1B}Z_2}/\delta_{Z_1Z_2}^{BS} = \delta_{X_{1B}Z_1} = h/\sqrt{p_1 q_1}$, where $h$ is the ordinate of the normal curve at the point of dichotomization (Demirtas and Hedeker 2016).
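A one-line numerical check of this ratio is straightforward; the sketch below is our own illustration, not the chapter's code, and evaluates $h/\sqrt{p_1 q_1}$ for a given proportion.

```r
# Minimal sketch: the normal-theory ratio delta_{X1B,Z1} = h / sqrt(p1*q1),
# where h is the normal ordinate at the point of dichotomization.
biserial_ratio_normal <- function(p1) {
  h <- dnorm(qnorm(1 - p1))        # ordinate at the cut point z(1 - p1)
  h / sqrt(p1 * (1 - p1))
}
biserial_ratio_normal(0.5)         # about 0.798, i.e., sqrt(2/pi) for a median split
```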
Real data often do not conform to the assumption of normality; hence most sim-
ulation studies should take nonnormality into consideration. The next section inves-
tigates the situation where one or both continuous variables is/are nonnormal.
Extending the limited and restrictive normality-based results to a broad range of dis-
tributional setups requires the employment of the two frameworks (an RNG routine
for multivariate continuous data and a derived linear relationship in the presence of
discretization), which we outline below.
We first tackle the relationship between the tetrachoric correlation δY1 Y2T E T and
the phi coefficient δ X 1 B X 2 B under nonnormality via the use of the power polynomials
(Fleishman 1978), which is a moment-matching procedure that simulates nonnormal distributions often used in Monte-Carlo studies, based on the premise that real-life
distributions of variables are typically characterized by their first four moments. It
hinges upon the polynomial transformation, $Y = a + bZ + cZ^2 + dZ^3$, where $Z$ follows a standard normal distribution, and $Y$ is standardized (zero mean and unit variance).¹ The distribution of $Y$ depends on the constants $a$, $b$, $c$, and $d$, which can be computed for specified or estimated values of skewness ($\nu_1 = E[Y^3]$) and excess kurtosis ($\nu_2 = E[Y^4] - 3$). The procedure of expressing any given variable by the
sum of linear combinations of powers of a standard normal variate is capable of
¹ We drop the subscript in $Y$ as we start with the univariate case.
covering a wide area in the skewness-elongation plane whose bounds are given by
the general expression $\nu_2 \ge \nu_1^2 - 2$.²
Assuming that $E[Y] = 0$ and $E[Y^2] = 1$, by utilizing the moments of the standard normal distribution, the following set of equations can be derived:

$$a = -c \qquad (2)$$
$$b^2 + 6bd + 2c^2 + 15d^2 - 1 = 0 \qquad (3)$$
$$2c(b^2 + 24bd + 105d^2 + 2) - \nu_1 = 0 \qquad (4)$$
$$24[bd + c^2(1 + b^2 + 28bd) + d^2(12 + 48bd + 141c^2 + 225d^2)] - \nu_2 = 0 \qquad (5)$$
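Equations 2–5 form a small nonlinear system in $(b, c, d)$ with $a = -c$. A minimal R sketch using base R only is given below (a dedicated solver such as nleqslv could be substituted); the function name and the least-squares formulation are our own choices, not the chapter's implementation.

```r
# Minimal sketch: solve Eqs. 3-5 for (b, c, d) with a = -c (Eq. 2) by minimizing
# the sum of squared residuals.
fleishman_coef <- function(skew, exkurt, start = c(1, 0, 0)) {
  obj <- function(par) {
    b <- par[1]; c <- par[2]; d <- par[3]
    r1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1                      # Eq. 3
    r2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - skew               # Eq. 4
    r3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
              d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - exkurt   # Eq. 5
    r1^2 + r2^2 + r3^2
  }
  est <- optim(start, obj, method = "BFGS")$par
  c(a = -est[2], b = est[1], c = est[2], d = est[3])
}

fleishman_coef(skew = 0, exkurt = 0)   # recovers (a, b, c, d) = (0, 1, 0, 0)
```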
$$R = E(\mathbf{z}_1 \mathbf{z}_2^{T}) = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & \delta_{Z_1Z_2} & 0 & 3\delta_{Z_1Z_2} \\ 1 & 0 & 2\delta_{Z_1Z_2}^2 + 1 & 0 \\ 0 & 3\delta_{Z_1Z_2} & 0 & 6\delta_{Z_1Z_2}^3 + 9\delta_{Z_1Z_2} \end{bmatrix},$$

where $\mathbf{z}_j = (1, Z_j, Z_j^2, Z_j^3)^{T}$.
² In fact, equality is not possible for continuous distributions.
³ $\delta_{Y_1Y_2}$ is the same as $\delta_{Y_1Y_2}^{TET}$ or $\delta_{Y_1Y_2}^{POLY}$, depending on whether the discretized variables are binary or ordinal, respectively. For the general presentation of the power polynomials, we do not make that distinction.

$$\delta_{Y_1Y_2} = \delta_{Z_1Z_2}(b_1 b_2 + 3 b_1 d_2 + 3 d_1 b_2 + 9 d_1 d_2) + \delta_{Z_1Z_2}^{2}(2 c_1 c_2) + \delta_{Z_1Z_2}^{3}(6 d_1 d_2) \qquad (6)$$
Solving this cubic equation for δ Z 1 Z 2 gives the intermediate correlation between the
two standard normal variables that is required for the desired post-transformation
correlation δY1 Y2 . Clearly,
Clearly, correlations for each pair of variables should be assembled
into
into a mat
matrix
rix of interc
intercorr
orrela
elatio
tions
ns in the multi
multiva
varia
riate
te ca
case.
se. Fo
Forr a compre
comprehen hensi
sive
ve source
source
and detailed account on the power polynomials, see Headrick (2010 ( 2010).
).
In the dichotomization context, the connection between the underlying nonnormal
(δY1 Y2 ) and normal correlations (δ Z 1 Z 2 ) in Eq. 6, along with the relationship between
the tetrachoric correlation ( δ Z 1 Z 2T E T ) and the phi coefficient (δ X 1 B X 2 B ) conveyed in
Eq. 1, is instrumental in Algorithms-1a and -1b in Sect. 3.
To address the situation where only one variable ( Y1 ) is dichotomized, we now
move to the relationship of biserial ( δY1 Y2B S ) and point-biserial (δ X 1 B Y2 ) correlations
in the absence of the normality assumption, which merely functions as a starting
point below. Suppose that Y1 and Y2 jointly follow a bivariate normal distribu-
tion with a correlation of δY1 Y2B S . Without loss of generality, we may assume that
both Y1 and Y2 are standardized to have a mean of 0 and a variance of 1. Let
$X_{1B}$ be the binary variable resulting from a split on $Y_1$, $X_{1B} = I(Y_1 \ge k)$, where $k$ is the point of dichotomization. Thus, $E[X_{1B}] = p_1$ and $V[X_{1B}] = p_1 q_1$, where $q_1 = 1 - p_1$. The correlation between $X_{1B}$ and $Y_1$, $\delta_{X_{1B}Y_1}$, can be obtained in a simple way, namely, $\delta_{X_{1B}Y_1} = \frac{Cov[X_{1B}, Y_1]}{\sqrt{V[X_{1B}]\,V[Y_1]}} = E[X_{1B}Y_1]/\sqrt{p_1 q_1} = p_1 E[Y_1 \mid Y_1 \ge k]/\sqrt{p_1 q_1}$. Suppose the relationship between $Y_1$ and $Y_2$ can be expressed as

$$Y_2 = \delta_{Y_1Y_2}^{BS}\, Y_1 + \varepsilon, \qquad (7)$$
where $\varepsilon$ is independent of $Y_1$ and $Y_2$, and follows $N(0, 1 - \delta_{Y_1Y_2}^{BS\,2})$. When we
generalize this to nonnormal Y1 and/or Y2 (both centered and scaled), the same
relationship can be assumed to hold with the exception that the distribution of ε
follows a nonnormal distribution. As long as Eq. 7 is valid,
$$Cov[X_{1B}, Y_2] = Cov[X_{1B}, \delta_{Y_1Y_2}^{BS} Y_1 + \varepsilon] = Cov[X_{1B}, \delta_{Y_1Y_2}^{BS} Y_1] + Cov[X_{1B}, \varepsilon] = \delta_{Y_1Y_2}^{BS}\, Cov[X_{1B}, Y_1] + Cov[X_{1B}, \varepsilon]. \qquad (8)$$
Since $\varepsilon$ is independent of $Y_1$ (and hence of $X_{1B}$), $Cov[X_{1B}, \varepsilon] = 0$, and Eq. 8 reduces to

$$\delta_{X_{1B}Y_2} = \delta_{Y_1Y_2}^{BS}\, \delta_{X_{1B}Y_1}. \qquad (9)$$

In the bivariate normal case, $\delta_{X_{1B}Y_1} = h/\sqrt{p_1 q_1}$, where $h$ is the ordinate of the normal curve at the point of dichotomization. Equation 9 indicates that the linear association
between X 1 B and Y2 is assumed to be fully explained by their mutual association
with $Y_1$ (Demirtas and Hedeker 2016). The ratio $\delta_{X_{1B}Y_2}/\delta_{Y_1Y_2}^{BS}$ is equal to $\delta_{X_{1B}Y_1} = E[X_{1B}Y_1]/\sqrt{p_1 q_1} = p_1 E[Y_1 \mid Y_1 \ge k]/\sqrt{p_1 q_1}$, which is a constant given $p_1$ and the
distribution of $(Y_1, Y_2)$. These correlations are invariant to location shifts and scaling; $Y_1$ and $Y_2$ do not have to be centered and scaled, and their means and variances can take any
finite values. Once the ratio (δ X 1 B Y1 ) is found (it could simply be done by generating
Y1 and dichotomizing it to yield X 1 B ), one can compute the point-biserial ( δ X 1 B Y2 ) or
biserial δ Y1 Y2B S correlation when the other one is specified. This linearity-constancy
argument that jointly emanates from Eqs. 7 and 9, will be the crux of Algorithm-2
given in Sect. 3.
In the ordinal case, although the relationship between the polychoric correlation
(δY1 Y2P O L Y ) and the ordinal phi correlation ( δ X 1 O X 2 O ) can be written in closed form,
as explained below, the solution needs to be obtained iteratively even under the nor-
mality assumption since no nice recipe such as Eq. 1 is available. In the context of
correlated ordinal data generation, Ferrari and Barbiero (2012
(2012)) proposed an iterative
procedure based on a Gaussian copula, in which point-scale ordinal data are gener-
ated when the marginal proportions and correlations are specified. For the purposes
of this chapter, one can utilize their method to find the corresponding polychoric
correlation or the ordinal phi coefficient when one of them is given under normal-
ity. The algorithm in Ferrari and Barbiero (2012 ( 2012)) serves as an intermediate step in
formulating the connection between the two correlations under any distributional
assumption on the underlying continuous variables.
Concentrating on the bivariate case, suppose $Z = (Z_1, Z_2) \sim N(0, \Delta_{Z_1Z_2})$, where $Z$ denotes the bivariate standard normal distribution with correlation matrix $\Delta_{Z_1Z_2}$ whose off-diagonal entry is $\delta_{Z_1Z_2}^{POLY}$. Let $X = (X_{1O}, X_{2O})$ be the bivariate ordi-
nal data where underlying Z is discretized based on corresponding normal quan-
tiles given the marginal proportions, with a correlation matrix ∆ Z 1 Z 2 . If we need to
sample from a random vector ( X 1 O , X 2 O ) whose marginal cumulative distribution
functions (cdfs) are F1 , F2 tied together via a Gaussian copula, we generate a sam-
ple $(z_1, z_2)$ from $Z \sim N(0, \Delta_{Z_1Z_2})$, then set $x = (x_{1O}, x_{2O}) = (F_1^{-1}(u_1), F_2^{-1}(u_2))$, where $u = (u_1, u_2) = (\Phi(z_1), \Phi(z_2))$ and $\Phi$ is the cdf of the standard normal distribution. The correlation matrix of $X$, denoted by $\Delta_{X_{1O}X_{2O}}$ (with an off-diagonal
entry δ X 1 O X 2 O ) obviously differs from ∆ Z 1 Z 2 due to discretization. More specifically,
$|\delta_{X_{1O}X_{2O}}| < |\delta_{Z_1Z_2}^{POLY}|$ in large samples. The relationship between $\Delta_{X_{1O}X_{2O}}$ and $\Delta_{Z_1Z_2}$ is established by resorting to the following formula (Cario and Nelson 1997):
$$E[X_{1O}X_{2O}] = E[F_1^{-1}(\Phi(Z_1))\, F_2^{-1}(\Phi(Z_2))] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} F_1^{-1}(\Phi(z_1))\, F_2^{-1}(\Phi(z_2))\, f(z_1, z_2)\, dz_1\, dz_2 \qquad (10)$$
When both variables are ordinalized, the connection between the polychoric correlation ($\delta_{Y_1Y_2}^{POLY}$) and the ordinal phi coefficient ($\delta_{X_{1O}X_{2O}}$) can be established by a two-
stage scheme, in which we compute the normal, intermediate correlation ( δ Z 1 Z 2P O L Y )
from the ordinal phi coefficient by the method in Ferrari and Barbiero (2012) (presented in Sect. 2.3) before we find the nonnormal polychoric correlation via the
power polynomials (Eq. 6). The other direction (computing δ X 1 O X 2 O from δY1 Y2P O L Y )
can be implemented by executing the same steps in the reverse order. The associated
computational routines are presented in Sect. 3 ( Algorithms-3a and -3b ).
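The two-stage scheme therefore only requires a way to move between the intermediate normal correlation and the ordinal phi coefficient. The chapter relies on the iterative ordcont/contord routines of the GenOrd package for this step; the sketch below is a simple Monte-Carlo stand-in for the contord direction, with all names and the simulation shortcut being our own assumptions.

```r
# Minimal sketch: approximate the ordinal phi coefficient implied by an
# intermediate normal correlation, by simulating and discretizing at the
# normal quantiles of the cumulative marginal proportions.
library(MASS)  # for mvrnorm()

ordinal_phi_from_normal <- function(delta_z, p1, p2, n = 1e5) {
  z  <- mvrnorm(n, mu = c(0, 0),
                Sigma = matrix(c(1, delta_z, delta_z, 1), 2))
  x1 <- findInterval(z[, 1], qnorm(cumsum(p1)[-length(p1)])) + 1
  x2 <- findInterval(z[, 2], qnorm(cumsum(p2)[-length(p2)])) + 1
  cor(x1, x2)   # Pearson correlation of the ordinal scores
}

ordinal_phi_from_normal(0.5, p1 = c(0.4, 0.3, 0.2, 0.1), p2 = c(0.2, 0.2, 0.6))
```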
The correlational identity given in Eq. 9 holds for ordinalized data as well when
only one variable is ordinalized (Demirtas and Hedeker 2016); the ordinal version
of the equation can be written as δ X 1O Y2 = δY1 Y2P S δ X 1 O Y1 . The same linearity and
constancy of ratio arguments equally apply in terms of the connection between the
polyserial (δY1 Y2P S ) and point-polyserial (δ X 1 O Y1 ) correlations; the fundamental utility
and operational characteristics are parallel to the binary case. Once the ratio ( δ X 1 O Y1 )
is found by generating $Y_1$ and discretizing it to obtain $X_{1O}$, one can easily compute either of these quantities given the other. This will be pertinent in Algorithm-4 below.
The next section puts all these concepts together from an algorithmic point of view with numerical illustrations.

3 Algorithms and Illustrative Examples
We work with eight distributions to reflect some common shapes that can be encountered in real-life applications. The illustrative examples come from bivariate data with Weibull and Normal mixture marginals. In what follows, W and NM stand for Weibull and Normal mixture, respectively. The W density is $f(y \mid \gamma, \delta) = \frac{\delta}{\gamma^{\delta}}\, y^{\delta - 1} \exp[-(y/\gamma)^{\delta}]$ for $y > 0$, where $\gamma > 0$ and $\delta > 0$ are the scale and shape parameters, respectively. The NM density is $f(y \mid \pi, \mu_1, \sigma_1, \mu_2, \sigma_2) = \frac{\pi}{\sigma_1\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}\left(\tfrac{y-\mu_1}{\sigma_1}\right)^{2}\right] + \frac{1-\pi}{\sigma_2\sqrt{2\pi}} \exp\!\left[-\tfrac{1}{2}\left(\tfrac{y-\mu_2}{\sigma_2}\right)^{2}\right]$, where $0 < \pi < 1$ is the mixing parameter. Since it is a mixture, it
can be unimodal or bimodal. Depending on the choice of parameters, both distribu-
tions can take a variety of shapes. We use four sets of parameter specifications for each of these distributions: For the W distribution, $(\gamma, \delta)$ pairs are chosen to be (1, 1), (1, 1.2),
(1, 3.6), and (1, 25), corresponding to mode at the boundary, positively skewed,
nearly symmetric, and negatively skewed shapes, respectively. For NM distribu-
tion, the parameter set (π , µ1 , σ1 , µ2 , σ2 ) is set to (0.5, 0, 1, 3, 1), (0.6, 0, 1, 3, 1),
(0.3, 0, 1, 2, 1), and (0.5, 0, 1, 2, 1), whose shapes are bimodal-symmetric, bimodal-
asymmetric, unimodal-negatively skewed, and unimodal-symmetric, respectively.
These four variations of the W and N M densities are plotted in Fig. 1 (W/NM:
the first/second columns) in the above order of parameter values, moving from
top to bottom. Finally, as before, p1 and p2 represent the binary/ordinal propor-
tions. In the binary case, they are single numbers. In the ordinal case, the marginal
proportions are denoted as $P(X_i = j) = p_{ij}$ for $i = 1, 2$ and $j = 1, 2, \ldots, k_i$, and $p_i = (p_{i1}, p_{i2}, \ldots, p_{ik_i})$, in which skip patterns are allowed. Furthermore, if the users wish to start the ordinal categories from 0 or any integer other than 1, the associational implications remain unchanged as correlations are invariant to the location shifts. Of
note, the number of significant digits reported throughout the chapter varies by the
computational sensitivity of the quantities.
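For readers who want to reproduce the marginal shapes, the W density corresponds to base R's rweibull with scale $\gamma$ and shape $\delta$, and the NM density can be evaluated and sampled with a few lines of code; this is a minimal sketch with function names of our own choosing.

```r
# Minimal sketch: density and random draws for NM(pi, mu1, sigma1, mu2, sigma2).
dnm <- function(y, p, mu1, s1, mu2, s2) {
  p * dnorm(y, mu1, s1) + (1 - p) * dnorm(y, mu2, s2)
}
rnm <- function(n, p, mu1, s1, mu2, s2) {
  comp <- rbinom(n, 1, p)                     # mixture component indicator
  ifelse(comp == 1, rnorm(n, mu1, s1), rnorm(n, mu2, s2))
}
y_w  <- rweibull(1e4, shape = 1.2, scale = 1) # W(1, 1.2): positively skewed
y_nm <- rnm(1e4, 0.5, 0, 1, 3, 1)             # NM(0.5, 0, 1, 3, 1): bimodal-symmetric
```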
Algorithm-1a: Computing the tetrachoric correlation ($\delta_{Y_1Y_2}^{TET}$) from the phi coefficient ($\delta_{X_{1B}X_{2B}}$): The algorithm for computing $\delta_{Y_1Y_2}^{TET}$ when $\delta_{X_{1B}X_{2B}}$, $p_1$, $p_2$, and the key distributional characteristics of $Y_1$ and $Y_2$ ($\nu_1$ and $\nu_2$) are specified, is as follows:
Algorithm-2: Computing the biserial correlation ($\delta_{Y_1Y_2}^{BS}$) from the point-biserial correlation ($\delta_{X_{1B}Y_2}$) and the other way around: One only needs to specify the distributional form of $Y_1$ (the variable that is to be dichotomized) and the proportion ($p_1$) for this algorithm (see Sect. 2.2). The steps are as follows:
1. Generate $Y_1$ with a large number of data points (e.g., $N = 100{,}000$).
2. Dichotomize $Y_1$ to obtain $X_{1B}$ through the specified value of $p_1$, and compute the sample correlation, $\delta_{X_{1B}Y_1} = \hat{c}_1$.
3. Find $\delta_{X_{1B}Y_2}$ or $\delta_{Y_1Y_2}^{BS}$ by $\delta_{X_{1B}Y_2}/\delta_{Y_1Y_2}^{BS} = \hat{c}_1$ by Eq. 9 (see the R sketch below).
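A minimal R sketch of these three steps, under the assumptions of the illustration that follows ($Y_1 \sim W(1, 1.2)$, $p_1 = 0.55$); the exact numerical results will vary slightly with the random seed.

```r
# Minimal sketch of Algorithm-2 with Y1 ~ W(1, 1.2) dichotomized at p1 = 0.55.
set.seed(1)
n   <- 1e5
p1  <- 0.55
y1  <- rweibull(n, shape = 1.2, scale = 1)        # Step 1: generate Y1
x1b <- as.integer(y1 >= quantile(y1, 1 - p1))     # Step 2: dichotomize so E(X1B) = p1
c1  <- cor(x1b, y1)                               # sample estimate of delta_{X1B,Y1}
c1 * 0.60    # Step 3: point-biserial correlation implied by delta^BS_{Y1Y2} = 0.60
0.25 / c1    # Step 3: biserial correlation implied by delta_{X1B,Y2} = 0.25
```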
In this illustration, we assume that $Y_1 \sim W(1, 1.2)$, $Y_2 \sim NM(0.6, 0, 1, 3, 1)$, and $\delta_{Y_1Y_2}^{BS} = 0.60$. $Y_1$ is dichotomized to obtain $X_{1B}$ where $E(X_{1B}) = p_1 = 0.55$. After following Steps 1 and 2, $\hat{c}_1$ turns out to be 0.710, and accordingly $\delta_{X_{1B}Y_2} = \hat{c}_1 \delta_{Y_1Y_2}^{BS} = 0.426$. Similarly, if the specified value of $\delta_{X_{1B}Y_2}$ is 0.25, then $\delta_{Y_1Y_2}^{BS} = \delta_{X_{1B}Y_2}/\hat{c}_1 = 0.352$. The fundamental ideas remain the same if $Y_2$ is dichotomized (with a proportion $p_2$) rather than $Y_1$. In that case, with a slight notational difference, the new equations would be $\delta_{X_{2B}Y_2} = \hat{c}_2$ and $\delta_{X_{2B}Y_1}/\delta_{Y_1Y_2}^{BS} = \hat{c}_2$. Table 3 shows $\hat{c}_1$ and $\hat{c}_2$ values when $p_1$ or $p_2$ ranges between 0.05 and 0.95 with an increment of 0.10. We further generated bivariate continuous data with the above marginals and the biserial correlations between $-0.85$ and $0.90$ with an increment of 0.05. We then dichotomized $Y_1$ where $p_1$ is 0.15 and 0.95, and computed the empirical point-biserial correlation. The lower- and upper-right graphs in Fig. 3 show the plot of the algorithmic value of $\hat{c}_1$ in Step 2 and $\delta_{Y_1Y_2}^{BS}$ versus $\delta_{X_{1B}Y_2}$, respectively, where the former is a theoretical and $\delta_{X_{1B}Y_2}$ in the latter is an empirical quantity. As expected, the two $\hat{c}_1$ values are the same as the slopes of the linear lines of $\delta_{Y_1Y_2}^{BS}$ versus $\delta_{X_{1B}Y_2}$, lending support for how plausibly Algorithm-2 is working. The procedure is repeated under the assumption that $Y_2$ is dichotomized rather than $Y_1$ (lower graphs in Fig. 3).
Algorithm-3a: Computing the polychoric correlation (δY1 Y2P O L Y ) from the ordinal
phi coefficient ($\delta_{X_{1O}X_{2O}}$): The algorithm for computing $\delta_{Y_1Y_2}^{POLY}$ when $\delta_{X_{1O}X_{2O}}$, $p_1$,
Fig. 3 Plots of $\hat{c}_1$ (upper-left), $\hat{c}_2$ (lower-left), empirical $\delta_{X_{1B}Y_2}$ versus $\delta_{Y_1Y_2}^{BS}$ (upper-right), and $\delta_{X_{2B}Y_1}$ versus $\delta_{Y_1Y_2}^{BS}$ (lower-right) for $p_1$ or $p_2$ equal to 0.15 (shown by o) or 0.95 (shown by *), where $Y_1 \sim W(1, 1.2)$ and $Y_2 \sim NM(0.6, 0, 1, 3, 1)$
Table 4 Computed values of the polychoric correlation ($\delta_{Y_1Y_2}^{POLY}$) or the ordinal phi coefficient ($\delta_{X_{1O}X_{2O}}$) given the other, with two sets of proportions for $Y_1 \sim W(1, 3.6)$ and $Y_2 \sim NM(0.3, 0, 1, 2, 1)$
$p_1$ | $p_2$ | $\delta_{X_{1O}X_{2O}}$ | $\delta_{Z_1Z_2}^{POLY}$ | $\delta_{Y_1Y_2}^{POLY}$
1. Compute the power coefficients ($a$, $b$, $c$, $d$) for $Y_1$ and $Y_2$ by Eqs. 2–5.
2. Solve Eq. 6 for $\delta_{Z_1Z_2}^{TET}$.
3. Solve for $\delta_{X_{1O}X_{2O}}$ given $\delta_{Z_1Z_2}^{TET}$ by the method in Ferrari and Barbiero (2012).
With the same set of specifications, namely, $Y_1 \sim W(1, 3.6)$, $Y_2 \sim NM(0.3, 0, 1, 2, 1)$, and $(p_1, p_2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6))$, suppose $\delta_{Y_1Y_2}^{POLY} = 0.5$. After solving for the power coefficients (Step 1), Steps 2 and 3 yield $\delta_{Z_1Z_2}^{POLY} = 0.502$ and $\delta_{X_{1O}X_{2O}} = 0.368$. Similarly, when $(p_1, p_2) = ((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1))$ and $\delta_{Y_1Y_2}^{POLY} = 0.7$, $\delta_{Z_1Z_2}^{POLY} = 0.702$ and $\delta_{X_{1O}X_{2O}} = 0.263$. The lower half
of Table 4 includes a few more combinations. A more inclusive set of results is
given in Fig. 4, which shows the relative trajectories of δ X 1O X 2 O and δY1 Y2P O L Y when
the proportion sets take three different values, with the addition of $(p_1, p_2) = ((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9))$ to the two sets above.
Algorithm-4: Computing the polyserial correlation ($\delta_{Y_1Y_2}^{PS}$) from the point-polyserial correlation ($\delta_{X_{1O}Y_2}$) and the other way around: The following steps enable us to calculate either one of these correlations when the distribution of $Y_1$ (the variable that is subsequently ordinalized) and the ordinal proportions ($p_1$) are specified (see Sect. 2.4):
Fig. 4 $\delta_{X_{1O}X_{2O}}$ versus $\delta_{Y_1Y_2}^{POLY}$ for $Y_1 \sim W(1, 3.6)$ and $Y_2 \sim NM(0.3, 0, 1, 2, 1)$, where solid, dashed, and dotted curves represent $(p_1, p_2) = ((0.4, 0.3, 0.2, 0.1), (0.2, 0.2, 0.6))$, $((0.1, 0.1, 0.1, 0.7), (0.8, 0.1, 0.1))$, and $((0.25, 0.25, 0.25, 0.25), (0.05, 0.05, 0.9))$, respectively; the range differences are due to the Fréchet-Hoeffding bounds
1. Generate $Y_1$ with a large number of data points (e.g., $N = 100{,}000$).
2. Ordinalize $Y_1$ to obtain $X_{1O}$ through the specified value of $p_1$, and compute the sample correlation, $\delta_{X_{1O}Y_1} = \hat{c}_1$.
3. Find $\delta_{X_{1O}Y_2}$ or $\delta_{Y_1Y_2}^{PS}$ by $\delta_{X_{1O}Y_2}/\delta_{Y_1Y_2}^{PS} = \hat{c}_1$ by Eq. 9 (see the R sketch below).
Table 5 Computed values of $\hat{c}_1 = \delta_{X_{1O}Y_1}$ and $\hat{c}_2 = \delta_{X_{2O}Y_2}$ that connect the polyserial ($\delta_{Y_1Y_2}^{PS}$) and point-polyserial correlations ($\delta_{X_{1O}Y_2}$ or $\delta_{X_{2O}Y_1}$), where $Y_1 \sim W(1, 25)$ and $Y_2 \sim NM(0.5, 0, 1, 2, 1)$
$p_1$ | $p_2$ | $\hat{c}_1$ | $\hat{c}_2$
(0.4, 0.3, 0.2, 0.1) – 0.837 –
(0.1, 0.2, 0.3, 0.4) – 0.927 –
(0.1, 0.4, 0.4, 0.1) – 0.907 –
(0.4, 0.1, 0.1, 0.4) – 0.828 –
(0.7, 0.1, 0.1, 0.1) – 0.678 –
– (0.3, 0.4, 0.3) – 0.914
– (0.6, 0.2, 0.2) – 0.847
– (0.1, 0.8, 0.1) – 0.759
– (0.1, 0.1, 0.8) – 0.700
– (0.4, 0.4, 0.2) – 0.906
2016c) can also be employed. The root of the third order polynomial in Eq. 6 was found by the polyroot function in the base package. The tetrachoric correlation and the phi coefficient in Eq. 1 were computed by the phi2tetra function in the psych package (Revelle 2016) and the pmvnorm function in the mvtnorm package (Genz et al. 2016), respectively. Finding the polychoric correlation given the ordinal phi coefficient and the opposite direction were performed by the ordcont and contord functions in the GenOrd package (Barbiero and Ferrari 2015), respectively.

4 Simulations in a Multivariate Setting
By the problem definition and design, all development has been presented in bivariate settings. For assessing how the algorithms work in a broader multivariate context and
for highlighting the generality of our approach, we present two simulation studies
that involve the specification of either pre- or post-discretization correlations.
Simulation work is devised around five continuous variables, and four of these are subsequently dichotomized or ordinalized. Referring to the Weibull and Normal mixture densities in the illustrative examples, the distributional forms are as follows: $Y_1 \sim W(1, 3.6)$, $Y_2 \sim W(1, 1.2)$, $Y_3 \sim NM(0.3, 0, 1, 2, 1)$, $Y_4 \sim NM(0.5, 0, 1, 2, 1)$, and $Y_5 \sim NM(0.5, 0, 1, 3, 1)$. $Y_1, \ldots, Y_4$ are to be discretized with proportions $p_1 = 0.6$, $p_2 = 0.3$, $p_3 = (0.4, 0.2, 0.2, 0.2)$, and $p_4 = (0.1, 0.6, 0.3)$, respectively.
Two dichotomized (Y1 and Y2 ), two ordinalized (Y3 and Y4 ), and one continuous (Y5 )
variables form a sufficient environment,
environment, in which all types of correlations mentioned
in this work are covered. For simplicity, we only indicate if the correlations are pre-
or post-discretization quantities without distinguishing between different types in
terms of naming and notation in this section. We investigate both directions: (1) The
pre-discretization correlation matrix is specified; the theoretical (algorithmic) post-
discretization quantities were computed; data were generated, discretized with the
prescription guided by the proportions, and empirical correlations were found via
$n = 1000$ simulation replicates to see how closely the algorithmic and empirical values are aligned on average. (2) The post-discretization matrix is specified; correlations
among latent variables were computed via the algorithms; data were generated with
this correlation matrix; then the data were dichotomized or ordinalized to gauge if
we obtain the specified post-discretization correlations on average. In Simulation 1,
the pre-discretization correlation matrix ($\Sigma_{pre}$) representing the correlation structure
among continuous variables, is defined as
$$\Sigma_{pre} = \begin{bmatrix} 1.00 & 0.14 & -0.32 & 0.56 & 0.54 \\ 0.14 & 1.00 & -0.10 & 0.17 & 0.17 \\ -0.32 & -0.10 & 1.00 & -0.40 & -0.38 \\ 0.56 & 0.17 & -0.40 & 1.00 & 0.67 \\ 0.54 & 0.17 & -0.38 & 0.67 & 1.00 \end{bmatrix},$$
discretization yield values (under the assumption that the algorithms function properly) and the true values were computed. More specifically, $\Sigma_{post}[1, 2]$ was found by Algorithm-1b, $\Sigma_{post}[1, 5]$ and $\Sigma_{post}[2, 5]$ by Algorithm-2, $\Sigma_{post}[1, 3]$, $\Sigma_{post}[1, 4]$, $\Sigma_{post}[2, 3]$, $\Sigma_{post}[2, 4]$, and $\Sigma_{post}[3, 4]$ by Algorithm-3b, and $\Sigma_{post}[3, 5]$ and $\Sigma_{post}[4, 5]$ by Algorithm-4. These values collectively form a post-discretization correlation matrix ($\Sigma_{post}$), which serves as the True Value (TV). The empirical post-discretization correlation estimates were calculated after generating $N = 1{,}000$ rows
$\Sigma_{post}[4, 5]$ | 0.67 | 0.58528 | 0.58773 | 0.00245 | 0.42 | 1.28
The corresponding pre-discretization matrix was found via the algorithms. The theoretical $\Sigma_{pre}[1, 2]$ was computed by Algorithm-1a, $\Sigma_{pre}[1, 5]$ and $\Sigma_{pre}[2, 5]$ by Algorithm-2, $\Sigma_{pre}[1, 3]$, $\Sigma_{pre}[1, 4]$, $\Sigma_{pre}[2, 3]$, $\Sigma_{pre}[2, 4]$, and $\Sigma_{pre}[3, 4]$ by Algorithm-3a, and $\Sigma_{pre}[3, 5]$ and $\Sigma_{pre}[4, 5]$ by Algorithm-4. These values jointly form a pre-discretization correlation matrix ($\Sigma_{pre}$). The empirical post-discretization correlation estimates were calculated after generating $N = 1{,}000$ rows of multivariate latent, continuous data $(Y_1, \ldots, Y_5)$ by the computed $\Sigma_{pre}$ before discretization of $(Y_1, \ldots, Y_4)$. As before, this process is repeated for $n = 1{,}000$ times. In Table 7, we tabulate $TV$ ($\Sigma_{post}$), $\Sigma_{pre}$, $AE$, $RB$, $PB$, and $SB$. Again, the discrepancies between
the expected and empirical quantities are minimal by the three accuracy criteria,
providing substantial support for the proposed method.
These results indicate compelling and promising evidence in favor of the algo-
rithms herein. Our evaluation is based on accuracy (unbiasedness) measures. Pre-
cision is another important criterion in terms of the quality and performance of the estimates (Demirtas et al. 2008; Demirtas and Hedeker 2008b; Yucel and Demirtas 2010). We address the precision issues by plotting the correlation estimates across all simulation replicates in both scenarios (Fig. 6). The estimates closely match the true values shown in Tables 6 and 7, with a healthy amount of variation that is within the limits of Monte-Carlo simulation error. On a cautious note, however, there seems
to be slightly more variation in Simulation 2, which is natural since there are two
layers of randomness (additional source of variability).
5 Discussion
setting. Nonnormality is handled by the power polynomials that map the normal and
nonnormal correlations. The approach works as long as the marginal characteristics
(skewness and elongation parameters for continuous data and proportion values for
binary/ordinal data) and the degree of linear association between the two variables
are legitimately defined, regardless of the shape of the underlying bivariate contin-
uous density. When the above-mentioned quantities are specified, one can connect
correlations before and after discretization in a relatively simple manner.
One potential limitation is that power polynomials cover most of the feasible symmetry-peakedness plane ($\nu_2 \ge \nu_1^2 - 2$), but not entirely. In an attempt to span a larger space, one can utilize the fifth order polynomial systems (Demirtas 2017; Headrick 2002), although it may not constitute an ultimate solution. In addition, a minor con-
cern could be that unlike binary data, the marginal proportions and the second order
product moment (correlation) do not fully define the joint distribution for ordinal
data. In other words, odds ratios and correlations do not uniquely determine each
other. However, in overwhelming majority of applications, the specification of the
first and second order moments suffices for practical purposes; and given the scope
of this work, which is modeling the transition between different pairs of correlations,
this complication is largely irrelevant. Finally, the reasons we base our algorithms
on the Pearson correlation (rather than the Spearman correlation) are that it is much
more common in RNG context and in practice; and the differences between the two
are negligibly small in most cases. Extending this method for encompassing the
Spearman correlation will be taken up in future work resorting to a variation of the
sorting idea that appeared in Demirtas and Hedeker (2011), allowing us to capture
any monotonic relationship in addition to the linear relationships. On a related note,
further expansions can be imagined to accommodate more complex associations that
involve higher order moments.
The positive characteristics and salient advantages of these algorithms are as
follows:
• They work for an extensive class of underlying bivariate latent distributions
whose components are allowed to be non-identically distributed. Nearly all contin-
uous shapes and skip patterns for ordinal variables are permissible.
• The required software tools for the implementation are rather basic; users merely
need a computational platform with numerical double integration solver for the
binary-binary case, univariate RNG capabilities for the binary/ordinal-continuous
case, an iterative scheme that connects the polychoric correlations and the ordi-
nal phi coefficients under the normality assumption for the ordinal ordinal case, a
polynomial root-finder and a nonlinear equations set solver to handle nonnormal
continuous variables.
• The description of the connection between the two correlations is naturally given for the bivariate case. The multivariate extension is easily manageable by assembling
the individual correlation entries. The way the techniques work is independent of the
number of variables; the curse of dimensionality is not an issue.
•The algorithms could be conveniently used in meta-analysis domains where
some studies discretize variables and some others do not.
•Assessing the magnitude of change in correlations before and after ordinaliza-
tion is likely to be contributory in simulation studies where we replicate the speci-
fied trends especially when simultaneous access to the latent data and the eventual
binary/ordinal data is desirable.
•One can more rigorously fathom the nature of discretization in the sense of
knowing how the correlation structure is transformed after dichotomization or ordi-
nalization.
•The proposed procedures can be regarded as a part of sensible RNG mecha-
nisms to generate multivariate latent variables as well as subsequent binary/ordinal
variables given their marginal shape characteristics and associational structure in
simulated environments, potentially expediting the development of novel mixed
data generation routines, especially when an RNG routine is structurally involved
with generating multivariate continuous data as an intermediate step. In conjunc-
tion with the published works on joint binary/normal (Demirtas and Doganay 2012 2012),),
binary/nonnormal continuous (Demirtas et al. al. 2012),
2012), ordinal/normal (Demirtas and
Yavuz 2015),
2015), count/normal (Amatya and Demirtas 2015), 2015), and multivariate ordinal
2006),
data generation (Demirtas 2006 ), the ideas presented in this chapter might serve as
a milestone for concurrent mixed data generation schemes that span binary, ordinal,
count, and nonnormal continuous data.
• These algorithms may be instrumental in developing multiple imputation strate-
gies for mixed longitudinal or clustered data as a generalization of the incomplete
References
Allozi, R., & Demirtas, H. (2016). Modeling correlational magnitude transformations in discretization contexts, R package CorrToolBox. https://cran.r-project.org/web/packages/CorrToolBox.
Amatya, A., & Demirtas, H. (2015). Simultaneous generation of multivariate mixed data with Poisson and normal marginals. Journal of Statistical Computation and Simulation, 85, 3129–3139.
Barbiero, A., & Ferrari, P.A. (2015). Simulation of Ordinal and Discrete Variables with Given
Correlation Matrix and Marginal Distributions, R package GenOrd. https://cran.r-project.org/web/packages/GenOrd.
Cario, M. C., & Nelson, B. R. (1997). Modeling and generating random vectors with arbitrary
marginal distributions and correlation matrix (Technical Report). Department of Industrial Engineering and Management Services: Northwestern University, Evanston, IL, USA.
Demirtas, H. (2004a). Simulation-driven inferences for multiply imputed longitudinal datasets.
Statistica Neerlandica, 58, 466–482.
Demirtas, H. (2004b). Assessment of relative improvement due to weights within generalized esti-
mating equations framework for incomplete clinical trials data. Journal of Biopharmaceutical
Statistics, 14, 1085–1098.
Demirtas, H. (2005). Multiple imputation under Bayesianly smoothed pattern-mixture models for
non-ignorable drop-out. Statistics in Medicine, 24, 2345–2363.
Demirtas, H. (2006). A method for multivariate ordinal data generation given marginal distributions
and correlations. Journal of Statistical Computation and Simulation, 76 , 1017–1025.
Demirtas, H. (2007a). Practical advice on how to impute continuous data when the ultimate inter-
est centers on dichotomized outcomes through pre-specified thresholds. Communications in
Statistics-Simulation and Computation, 36 , 871–889.
Demirtas, H. (2007b). The design of simulation studies in medical statistics. Statistics in Medicine,
26, 3818–3821.
Demirtas, H. (2008). On imputing continuous data when the eventual interest pertains to ordinalized outcomes via threshold concept. Computational Statistics and Data Analysis, 52, 2261–2271.
Demirtas, H. (2009). Rounding strategies for multiply imputed binary data. Biometrical Journal,
51, 677–688.
Demirtas, H. (2010). A distance-based rounding strategy for post-imputation ordinal data. Journal
of Applied Statistics, 37 , 489–500.
Demirtas, H. (2016). A note on the relationship between the phi coefficient and the tetrachoric
correlation under nonnormal underlying distributions. American Statistician, 70, 143–148.
Demirtas, H. (2017). Concurrent generation of binary and nonnormal continuous data through
fifth order power polynomials, Communications in Statistics- Simulation and Computation. 46 ,
344–357.
Demirtas, H., Ahmadian, R., Atis, S., Can, F. E., & Ercan, I. (2016a). A nonnormal look at polychoric correlations: Modeling the change in correlations before and after discretization. Computational Statistics, 31, 1385–1401.
Demirtas, H., Arguelles, L. M., Chung, H., & Hedeker, D. (2007). On the performance of bias-
reduction techniques for variance estimation in approximate Bayesian bootstrap imputation.
Computational Statistics and Data Analysis, 51 , 4064–4068.
Demirtas, H., & Doganay, B. (2012). Simultaneous generation of binary and normal data with
specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–
236.
Demirtas, H., Freels, S. A., & Yucel, R. M. (2008). Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: A simulation assessment. Journal of Statistical Computation and Simulation, 78, 69–84.
Demirtas, H., & Hedeker, D. (2007). Gaussianization-based quasi-imputation and expansion strate-
gies for incomplete correlated binary responses. Statistics in Medicine, 26, 782–799.
Demirtas, H., & Hedeker, D. (2008a). Multiple imputation under power polynomials. Communications in Statistics-Simulation and Computation, 37, 1682–1695.
Demirtas, H., & Hedeker, D. (2008b). Imputing continuous data under some non-Gaussian distrib-
utions. Statistica Neerlandica, 62, 193–205.
Demirtas, H., & Hedeker, D. (2008c). An imputation strategy for incomplete longitudinal ordinal
data. Statistics in Medicine, 27, 4086–4093.
Demirtas, H., & Hedeker, D. (2011). A practical way for computing approximate lower and upper
correlation bounds. The American Statistician, 65, 104–109.
Demirtas, H., & Hedeker, D. (2016). Computing the point-biserial correlation under any underlying continuous distribution. Communications in Statistics-Simulation and Computation, 45, 2744–2751.
Demirtas, H., Hedeker, D., & Mermelstein, J. M. (2012). Simulation of massive public health data by power polynomials. Statistics in Medicine, 31, 3337–3346.
Demirtas, H., & Schafer, J. L. (2003). On the performance of random-coefficient pattern-mixture
models for non-ignorable drop-out. Statistics in Medicine , 22 , 2553–2575.
Demirtas, H., Shi, Y., & Allozi, R. (2016b). Simultaneous generation of count and continuous data,
R package PoisNonNor. https://cran.r-project.org/web/packages/PoisNonNor.
Demirtas, H., Wang, Y., & Allozi, R. (2016c) Concurrent generation of binary, ordinal and contin-
uous data, R package BinOrdNonNor. https://cran.r-project.org/web/packages/BinOrdNonNor.
Demirtas, H., & Yavuz, Y. (2015). Concurrent generation of ordinal and normal data. Journal of
Biopharmaceutical Statistics, 25, 635–650.
Emrich, J. L., & Piedmonte, M. R. (1991). A method for generating high-dimensional multivariate
binary variates. The American Statistician, 45 , 302–304.
Farrington, D. P., & Loeber, R. (2000). Some benefits of dichotomization in psychiatric and crimi-
nological research. Criminal Behaviour and Mental Health , 10 , 100–122.
Ferrari, P. A., & Barbiero, A. (2012). Simulating ordinal data. Multivariate Behavioral Research,
47, 566–589.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Fréchet, M. (1951). Sur les tableaux de corrélation dont les marges sont données. Annales de
l’Université de Lyon Section A, 14 , 53–77.
Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Bornkamp, B., Maechler, M., &
Hothorn, T. (2016). Multivariate normal and t distributions, R package mvtnorm. https://cran.r-
project.org/web/packages/mvtnorm.
Headrick, T. C. (2002). Fast fifth-order polynomial transforms for generating univariate and multi-
variate nonnormal distributions. Computational Statistics and Data Analysis, 40 , 685–711.
Headrick, T. C. (2010). Statistical Simulation: Power Method Polynomials and Other Transforma-
tions. Boca Raton, FL: Chapman and Hall/CRC.
Hoeffding, W. (1994). Scale-invariant correlation theory. In N. I. Fisher & P. K. Sen (Eds.), The
Collected Works of Wassily Hoeffding (the original publication year is 1940) (pp. 57–107). New
York: Springer.
Inan, G., & Demirtas, H. (2016). Data generation with binary and continuous non-normal compo-
nents, R package BinNonNor. https://cran.r-project.org/web/packages/BinNonNor.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of
dichotomization of quantitative variables. Psychological Methods, 7 , 19–40.
R Development Core Team. (2016). R: A Language and Environment for Statistical Computing.
http://www.cran.r-project.org.
Revelle, W. (2016). Procedures for psychological, psychometric, and personality research, R package psych. https://cran.r-project.org/web/packages/psych.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychome-
trika, 48 , 465–471.
Yucel, R. M., & Demirtas, H. (2010). Impact of non-normal random effects on inference by multiple
imputation: A simulation assessment. Computational Statistics and Data Analysis, 54, 790–801.
Monte-Carlo Simulation of Correlated
Binary Responses
Trent L. Lalonde
1 Introduction
Correlated binary data occur frequently in practice, across disciplines such as health
policy analysis, clinical biostatistics, econometric analyses, and education research.
For example, health policy researchers may record whether or not members of a
household have health insurance; econometricians may be interested in whether small
businesses within various urban districts have applied for financial assistance; higher education researchers might study the probabilities of college attendance for students
from a number of high schools. In all of these cases the response of interest can be
represented as a binary outcome, with a reasonable expectation of autocorrelation
among those responses. Correspondingly, analysis of correlated binary outcomes
has received considerable and long-lasting attention in the literature (Stiratelli et al.
1984;; Zeger and Liang 1986
1984 1986;; Prentice 1988;
1988; Lee and Nelder 1996; 1996; Molenberghs and
Verbeke 2006).
2006). The most common models fall under the class of correlated binary
logistic regression modeling.
While appropriately developed logistic regression models include asymptotic esti-
mator properties, Monte-Carlo simulation can be used to augment the theoretical
resultss of suc
result such
h large-
large-sam
sample
ple distri
distribu
buti
tiona
onall pro
proper
pertie
ties.
s. Mon
Monte-
te-Car
Carlo
lo simula
simulatio
tion
n can be
used to confirm such properties, and perhaps more importantly, simulation methods
can be used to complement large-sample distributi
distributional
onal properties with small-sample
results. Therefore it is important to be able to simulate binary responses with speci-
fied
fie d au
auto
toco
corr
rrel
elat
atio
ionn so that
that corr
correl
elat
ated
ed bina
binary
ry da
datata mode
models ls can
can bene
benefit
fit fr
from
om simu
simula
lati
tion
on
studies.
Throughout the chapter, the interest will be in simulating correlated binary out-
comes, Yi j , where i indicates a cluster of correlated responses and j enumerates the
responses. It will be assumed that the simulated outcomes have specified marginal
probabilities, πij, and pairwise autocorrelation, ρij,ik. The term “cluster” will be used to refer to a homogeneous group of responses known to have autocorrelation, such as an individual in a longitudinal study or a group in a correlated study. For many of the algorithms presented in this discussion, a single cluster will be considered for simplicity, in which case the first subscript of Yij will be omitted and Yi, πi, and
ρi j will be used instead. Some methods will additionally require the specification of
joint probabilities, higher-order correlations, or predictors.
1.1 Binary Data Issues
max(0, πi + πj − 1) ≤ πi,j ≤ min(πi, πj),

l = max{ −√[(πi πj)/((1 − πi)(1 − πj))], −√[((1 − πi)(1 − πj))/(πi πj)] },

u = min{ √[(πi(1 − πj))/(πj(1 − πi))], √[(πj(1 − πi))/(πi(1 − πj))] },    (1)
to satisfy the requirements for the joint distribution. This implies that, depending
on the desired marginal probabilities, using a fully specified joint probability distri-
bution can lead to simulated values with restricted ranges of pairwise correlation.
The restrictions of Eq. 1 can be counterintuitive to researchers who are used to the
unconstrained correlation values of normal variables.
Some methods of simulating correlated binary outcomes struggle to control changes in probabilities across clusters of correlated data. Due to the typically nonlinear nature of the relationships between response probabilities and predictors, many methods also fail to incorporate predictors into data simulation. These issues, among
others, must be considered when developing and selecting a method for simulating
correlated binary data.
This chapter presents a thorough discussion of existing methods for simulating
correlated binary data. The methods are broadly categorized into four groups: cor-
related binary outcomes produced directly from a fully specified joint probability
distribution, from mixtures of discrete or continuous variables, from dichotomized
continuous or count variables, and from conditional probability distributions. The
oldest literature is available for fully specified joint probability distributions, but often these methods are computationally intensive and require specification of higher-order probabilities or correlations at the outset. Mixture approaches have been used to combine products of binary variables to induce the desired autocorrelation, while other mixtures involving continuous variables require dichotomization of the resulting values. Dichotomizing normal or uniform variables is a widely implemented approach to producing binary data, although less well-known approaches have also been pursued, such as dichotomizing counts. Conditional specification of a binary distribution
typically makes use of “prior” binary outcomes or predictors of interest, and tends
to lead to the greatest range of simulated autocorrelation. Each of these methods for
generating correlated binary data will be discussed through a chronological perspec-
tive, with some detail provided and some detail left to the original publications. The
chapter concludes with general recommendations for the most effective binary data
simulation methods.
2 Fully Specified Joint Probability Distributions
The method for simulating correlated binary outcomes with the longest-tenured presence in the literature is full specification of a joint probability distribution for correlated binary variates. The joint pdf can either be written explicitly in closed form, or derived by exhaustive listing of possible outcomes and associated probabilities. In all cases the generation of data relies on the method of Devroye (1986).
2.1 Simulating Binary Data with a Joint PDF
z0 = 0,
zj = zj−1 + p(j),
z2^N−1 = 1.

3. Generate a standard uniform variate U on (0, 1).
4. Select j such that zj ≤ U < zj+1.
5. The random sequence is the binary representation of the integer j.
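To make the table-lookup idea concrete, the following is a minimal R sketch of generating one binary sequence from a fully specified joint pmf using cumulative probabilities and a single uniform draw; the function name rbinseq and the vector joint.prob are illustrative, and the pmf is assumed to be supplied in the order of the integers whose binary representations define the sequences.

```r
# Minimal sketch of the table-lookup method of Devroye (1986) for generating a
# binary sequence of length N from a fully specified joint pmf.  'joint.prob'
# (a hypothetical name) holds the probabilities of the 2^N possible sequences.
rbinseq <- function(joint.prob) {
  N <- log2(length(joint.prob))
  z <- c(0, cumsum(joint.prob))        # cumulative probabilities
  u <- runif(1)                        # standard uniform variate on (0, 1)
  j <- findInterval(u, z) - 1          # select j such that z_j <= u < z_{j+1}
  as.integer(intToBits(j))[1:N]        # binary representation of the integer j
}

# Example: two exchangeable binary outcomes with marginal probability 0.5 and
# positive dependence (P(0,0) = P(1,1) = 0.35, P(0,1) = P(1,0) = 0.15).
set.seed(1)
rbinseq(c(0.35, 0.15, 0.15, 0.35))
```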
provided, and the method is relied on extensively in situations in which a joint pdf can be constructed. In fact, relying on this algorithm, many authors have pursued methods
of constructing a joint pdf as a means to generate vectors of binary responses.
Simulation of correlated binary outcomes is generally thought to begin with Bahadur (1961). Bahadur (1961) presented a joint pdf for N correlated binary random variables as follows. Let yi indicate realizations of binary variables, π = P(Yi = 1) represent the constant expectation of the binary variables, all equivalent, and ρij be the autocorrelation between two variables Yi and Yj. Then the joint pdf can be written,

f(y1, …, yN) = 1 + Σ_{i≠j} [ρij (−1)^(yi + yj) π^(2 − yi − yj) (1 − π)^(yi + yj)] / [π(1 − π)].

While Bahadur (1961) expressed the joint pdf in terms of lagged correlations for an autoregressive time series, the idea expands generally to clustered binary data. In practice it is necessary to estimate range restrictions for the autocorrelation to ensure f(y1, …, yN) ∈ (0, 1). These restrictions depend on values of π, and are typically determined empirically (Farrell and Sutradhar 2006). Using the algorithm of Devroye (1986), the model of Bahadur (1961) can be used to simulate values for a single cluster or longitudinal subject, then repeated for additional clusters. This allows the probability π to vary across clusters; however, π is assumed constant within clusters, reducing the ability to incorporate effects of covariates. Most crucially, the pdf given by Bahadur (1961) will become computationally burdensome for high-dimensional data simulations (Lunn and Davies 1998; Farrell and Sutradhar 2006).
Instead of relying on the joint pdf of Bahadur (1961), authors have pursued the construction of joint probability distributions not according to a single pdf formula, but instead by using desired properties of the simulated data to directly calculate all possible probabilities associated with N-dimensional sequences of binary outcomes.
Often these methods involve iterative processes, solutions to linear or nonlinear
systems of equations, and matrix and function inversions. They are computationally
complex but provide complete information about the distribution of probabilities
without using a closed-form pdf.
Lee (1993) introduced a method that relies on a specific copula distribution, with binary variables defined according to a relationship between the copula parameters and the desired probabilities and correlation for the simulated binary responses. A copula (Genest and MacKay 1986a, b) is a multivariate distribution for random variables (X1, …, XN) on the N-dimensional unit space, such that each marginal distribution for Xi is uniform on the domain (0, 1). The variables X1, …, XN can be used to generate binary variables Y1, …, YN by dichotomizing according to Yi = I(Xi > πi), where πi is the desired expectation of the binary variable Yi and I() takes the value of 1 for a true argument and 0 otherwise. However, Lee (1993)
did not use a simple dichotomization of continuous variables.
In order to extend this idea to allow for autocorrelation, Lee (1993) proposed using the copula with exchangeable correlation, the Archimedean copula proposed by Genest and MacKay (1986a). Use of such a copula will induce Pearson correlation between any two random binary variables Yi and Yj given by,

ρij = (π0,0 − πi πj) / √[πi πj (1 − πi)(1 − πj)],    (2)
where π0,0 is the joint probability that both variables Yi and Y j take the value 0.
Because of the restriction that π0,0 ≤
mi n (πi , π j ), it follows that π0,0 > πi π j and
therefore the induced Pearson correlation will always be positive when this method
is applied. The method of Lee (1993) requires the correlation of Eq. 2 to be constant
within clusters.
1. For a single cluster, specify the desired marginal probabilities πi, pairwise probabilities πi,j, and up to k-order joint probabilities πi1,…,ik.
2. Construct a log-linear model with interactions up to order k corresponding to the constraints specified by the probabilities up to order k.
3. Fit the log-linear model to obtain estimated probabilities corresponding to the joint marginal pdf.
4. Simulate binary values according to the algorithm of Devroye (1986).
5. Repeat for additional clusters.
exemplified by Gange (1995). Further, this method requires an iterative procedure to
solve a system of nonlinear equations, which can become computationally burden-
some with increased dimension of each cluster.
3 Specification by Mixture Distributions
Yi = Ui (Yi−1 ⊕ Wi) + (1 − Ui) Wi,

where ⊕ indicates addition modulo 2, Ui can be taken to be Bernoulli with probability πU, and Wi can be taken to be Bernoulli with probability πW = (π(1 − πU))/(1 − 2π πU). Assuming Yi−1, Ui, and Wi to be independent, it can be shown that all Yi have expectation π and that the autocorrelation between any two outcomes is

Corr(Yi, Yj) = [πU (1 − 2π) / (1 − 2π πU)]^(|i−j|).
The method of Kanter (1975) requires πU ∈ (0, min((1 − π)/π, 1)), and includes restrictions on the autocorrelation in the simulated data based on the probabilities used. In particular, Farrell and Sutradhar (2006) showed that no negative autocorrelation can be generated with any probability π chosen to be less than or equal to 0.50. In addition, this method does not allow for easy variation of probabilities within clusters or series.
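As an illustration, the following is a minimal R sketch of the mixture representation above for a single series with constant probability π; the function name rkanter and its arguments are illustrative, and πU is assumed to lie in the allowable range (0, min((1 − π)/π, 1)).

```r
# Minimal sketch of the binary mixture series attributed to Kanter (1975),
# assuming a constant success probability 'pi0' and mixing probability 'piU'.
rkanter <- function(n, pi0, piU) {
  piW <- pi0 * (1 - piU) / (1 - 2 * pi0 * piU)      # Bernoulli probability for W
  y <- numeric(n)
  y[1] <- rbinom(1, 1, pi0)                         # start the series at its marginal
  for (i in 2:n) {
    u <- rbinom(1, 1, piU)
    w <- rbinom(1, 1, piW)
    y[i] <- u * ((y[i - 1] + w) %% 2) + (1 - u) * w # U*(Y_{i-1} xor W) + (1 - U)*W
  }
  y
}

# Example: a series of length 20 with pi = 0.3 and piU = 0.5.
set.seed(1)
rkanter(20, pi0 = 0.3, piU = 0.5)
```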
Use of a mixture of binary random variables was updated by Lunn and Davies
(1998), who proposed a simple method for generating binary data for multiple clus-
ters simultaneously,
simultaneously, and for various types of autocorrelation structure within groups.
Suppose the intention is to simulate random binary variables Yi j such that the expec-
tation of each is cluster-dependent, πi , and the autocorrelation can be specified. First
assume a positive, constant correlation ρ i is desired within clusters. Then simulate
binary values according to the equation,
Yij = (1 − Uij) Wij + Uij Zi,    (4)

Yij = (1 − Uij) Wij + Uij Zi    (Exchangeable),
Yij = (1 − Uij) Wij + Uij Yi,j−1    (Autoregressive),
Yij = (1 − Uij) Wij + Uij Wi,j−1    (M-Dependent).

Ỹij = Aij Yij,
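A minimal R sketch of the exchangeable case of Eq. 4 follows; it assumes, as in the usual reading of the Lunn and Davies construction, that Uij is Bernoulli with probability √ρi and that Wij and Zi are independent Bernoulli with probability πi, so that the shared Zi induces correlation ρi within cluster i. The function name and arguments are illustrative.

```r
# Minimal sketch of the exchangeable case of Eq. (4): a cluster-level component
# Z_i shared with probability sqrt(rho_i) induces correlation rho_i.
rlunn_davies <- function(nclust, clustsize, pi.vec, rho.vec) {
  out <- matrix(NA_integer_, nclust, clustsize)
  for (i in seq_len(nclust)) {
    z <- rbinom(1, 1, pi.vec[i])                    # cluster-level component
    u <- rbinom(clustsize, 1, sqrt(rho.vec[i]))     # mixing indicators
    w <- rbinom(clustsize, 1, pi.vec[i])            # independent components
    out[i, ] <- (1 - u) * w + u * z                 # Eq. (4)
  }
  out
}

# Example: 5 clusters of size 4 with cluster probabilities 0.2-0.6 and rho = 0.3.
set.seed(2)
rlunn_davies(5, 4, pi.vec = seq(0.2, 0.6, length.out = 5), rho.vec = rep(0.3, 5))
```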
The issue of allowing within-cluster variation of success probabilities was addressed by Oman and Zucker (2001) in a method that is truly a combination of mixture variable simulation and dichotomization of continuous values. Oman and Zucker (2001) argued that the cause of further restricted ranges of autocorrelation in simulated binary data, beyond the limits from Eq. 1 given by Prentice (1988), is the combination of varying probabilities within clusters and the inherent mean-variance relationship of binary data. Assume the interest is in simulating binary variables Yij with probabilities πij, varying both between and within clusters. To define the joint probability distribution of any two binary outcomes Yij and Yik, define

P(Yij × Yik = 1) = (1 − νjk) πij πik + νjk min(πij, πik),    (5)

where νjk is chosen to reflect the desired correlation structure within groups, as follows. The joint distribution specified by Eq. 5 allows the correlation between any two responses within a cluster, denoted ρjk, to be written ρjk = νjk × max(ρst), where the maximum is taken over all values of correlation within cluster i. The process of generating binary values is constructed to accommodate this joint distribution as follows. Define responses according to

Yij = I(Zij ≤ θij),

where θij = F⁻¹(πij) for any continuous CDF F, and Zij is defined as an appropriate mixture. Similarly to the method of Lunn and Davies (1998), Oman and Zucker (2001) provide mixtures for common correlation structures,
4. Generate Uij as Bernoulli, Xij according to continuous F, and calculate Zij and θij.
5. Define each outcome as Yij = I(Zij ≤ θij).
6. Repeat for additional clusters.
Oman and Zucker (2001) noted that covariates can be incorporated by defining θij = xijᵀβ as the systematic component of a generalized linear model and taking F⁻¹ to be the associated link function. It is an interesting idea to use the inverse link function from a generalized linear model to help connect predictors to the determina-
tion of whether the binary realization will be 0 or 1. However, the method described
continues to suffer from restricted ranges of autocorrelation, most notably that the
correlations between binary responses must all be positive.
Y = Σ_{i=1} ci Z^(i−1),
Secondly, the resulting mixture random variable is a power function of standard nor-
mal variables, which generally will not reflect the true mean-variance relationship
necessary for binary data. While the values of mean and variance using the Fleishman
method can in some cases reflect those of an appropriate sample, the dynamic rela-
tionship between changing mean and variation will not be captured in general. Thus
as the mean changes, the variance generally will not show a corresponding change.
Finally, the method does not readily account for the effects of predictors in simulating
responses. In short, such methods are poorly equipped to handle independent binary
data, let alone correlated binary outcomes.
4 Simulation by Dichotomizing Variates
Perhaps the most commonly implemented methods for simulating correlated binary
outcomes are those that involve dichotomization of other types of variables. The most
frequent choice is to dichotomize normal variables, although defining thresholds for
uniform variables is also prevalent.
Many methods of dichotomizing normal variables have been proposed. The method
of Emrich and Piedmonte (1991), one of the most popular, controls the probabilities and pairwise correlations of resulting binary variates. Assume it is of interest to simulate binary variables Yi with associated probabilities πi and pairwise correlations given by Corr(Yi, Yj) = ρij. Begin by solving the following equation for the normal pairwise correlation, δij, using the bivariate normal CDF Φ,

Φ(z(πi), z(πj), δij) = ρij √[πi(1 − πi)πj(1 − πj)] + πi πj,

where z(·) indicates the standard normal quantile function. Next generate one N-dimensional multivariate normal variable Z with mean 0 and correlation matrix with components δij. Define the correlated binary realizations using

Yi = I(Zi ≤ z(πi)).
This method is straightforward and allows probabilities to vary both within and
between clusters. A notable disadvantage of this method is the necessity of solving
a system of nonlinear equations involving the normal CDF, which increases compu-
tational burden with large-dimensional data generation.
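For a pair of binary variables, the solution of the equation above can be carried out numerically; the sketch below is a minimal R illustration that assumes the mvtnorm package for the bivariate normal CDF and multivariate normal generation, with the function name remrich and its arguments chosen for illustration.

```r
# Minimal sketch of the Emrich and Piedmonte (1991) approach for two binary
# variables with probabilities p1, p2 and target correlation rho.
library(mvtnorm)

remrich <- function(n, p1, p2, rho) {
  target <- rho * sqrt(p1 * (1 - p1) * p2 * (1 - p2)) + p1 * p2
  # Solve Phi(z(p1), z(p2), delta) = target for the latent normal correlation.
  f <- function(delta) {
    as.numeric(pmvnorm(upper = c(qnorm(p1), qnorm(p2)),
                       corr = matrix(c(1, delta, delta, 1), 2, 2))) - target
  }
  delta <- uniroot(f, interval = c(-0.999, 0.999))$root
  z <- rmvnorm(n, sigma = matrix(c(1, delta, delta, 1), 2, 2))
  cbind(Y1 = as.integer(z[, 1] <= qnorm(p1)),
        Y2 = as.integer(z[, 2] <= qnorm(p2)))
}

# Example: 1000 pairs with probabilities 0.3 and 0.6 and correlation 0.25.
set.seed(3)
y <- remrich(1000, 0.3, 0.6, 0.25)
cor(y)  # should be near 0.25
```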
1. Determine the autocorrelation desired within and between N clusters of binary responses, and select first-stage probabilities π1, …, πN.
2. Simulate X1 as I(U1 < π1), where U1 is a random uniform realization on (0, 1).
3. Generate the remaining first-stage binary outcomes Xi according to X1 and corresponding random uniform variables Ui.
4. Solve for the second-stage thresholds, πij, given the desired within and between correlation values, ρ_{Yij,Yik} and ρ_{Yij,Ykl}, respectively.
5. Generate the second-stage binary outcomes Yij according to Xi and corresponding random uniform variables Uij.
The method of Park et al. (1996) simulates correlated binary values using a dichotomization of counts, and in the process avoids the necessity of solving any system of nonlinear equations. Assume an interest in generating N correlated binary variables Y1, …, YN, with probabilities π1, …, πN and associated pairwise correla-
tions ρi j . Begin by generating N counts Z 1 , . . . , Z N using a collection of M Poisson
random variables X 1 (λ1 ) , . . . , X M (λ M ), as linear combinations,
Z1 = Σ_{i∈S1} Xi(λi),
Z2 = Σ_{i∈S2} Xi(λi),
⋮
ZN = Σ_{i∈SN} Xi(λi).
Notice that each count Z i is a combination of a specific set of the Poisson ran-
dom variables, denoted by Si . The number of Poisson variables, M , the associated
means, λi , and the sets used in the sums, Si , are all determined algorithmically based
on the desired probabilities and correlations. Each binary value is then defined by
dichotomizing, Yi = I(Zi = 0).
Park et al. (1996) describe the determination of M, λi, and Si as follows. The Poisson means λi can be constructed as linear combinations of parameters αij, 1 ≤ i, j ≤ N. The αij can be calculated based on the desired probabilities and pairwise correlations,

αij = ln[1 + ρij √((1 − πi)(1 − πj)/(πi πj))].
5 Conditionally Specified Distributions
Recent attention has been paid to conditionally specifying the distribution of corre-
lated binary variables for the purposes of simulation. While the mixture distributions can be viewed as conditional specifications, in such cases discussed in Sect. 3 the mix-
tures were defined so that the marginal distributions of the resulting binary variables
were completely specified. In this section the discussion focuses on situations with-
out full specification of the marginal outcome distribution. Instead, the distributions
are defined using predictor values or prior outcome values.
Qaqish (2003) introduced a method of simulating binary variates using
autoregressive-type relationships to simulate autocorrelation. Each outcome value
is conditioned on prior outcomes, a relationship referred to as the conditional lin-
ear family. The conditional linear family is defined by parameter values that are
so-called reproducible in the following algorithm, or those that result in conditional
means within the allowable range ( 0, 1).
Suppose the interest is in simulating correlated binary variables Yi with associ-
ated probabilities πi , and variance-covariance structure defined for each response by
its covariation with all previous responses, si = Cov([Y1, …, Yi−1]ᵀ, Yi). Qaqish (2003) argued that the expectation of the conditional distribution of any response Yi, given all previous responses, can be expressed in the form,

E[Yi | Y1, …, Yi−1] = πi + κiᵀ([Y1, …, Yi−1]ᵀ − [π1, …, πi−1]ᵀ)
                   = πi + Σ_{j=1}^{i−1} κij (Yj − πj),    (6)

κi = [Cov([Y1, …, Yi−1]ᵀ)]⁻¹ si.
The correlated binary variables are then generated such that Y1 is Bernoulli with
probability π1 , and all subsequent variables are random Bernoulli with probability
given by the conditional mean in Eq. 6. It is straightforward to show that such a
sequence will have the desired expectation [π1, …, πN]ᵀ and autocorrelation defined by the variance-covariance si. Qaqish (2003) provides simple expressions for κij to
produce exchangeable, auto-regressive, and moving average correlation structures,
as follows,
κij = [ρ / (1 + (i − 1)ρ)] (Vii/Vjj)^(1/2)    (Exchangeable),

λi = πi + ρ (Vii/Vi−1,i−1)^(1/2) (yi−1 − πi−1)    (Autoregressive),

κij = [(β^j − β^(−j)) / (β^(−i) − β^i)] (Vii/Vjj)^(1/2)    (Moving Average),

where Vii represents diagonal elements of the response variance-covariance, λi = E[Yi | Y1, …, Yi−1] represents the conditional expectation, and β = [(1 − 4ρ²)^(1/2) − 1]/(2ρ) with ρ the decaying correlation for autoregressive models and the
single time-lag correlation for moving average models.
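As an illustration of the autoregressive case, the sketch below assumes a common marginal probability within each cluster so that the variance ratio in the conditional mean equals one; the function name and arguments are illustrative.

```r
# Minimal sketch of the autoregressive case of Qaqish's (2003) conditional
# linear family with constant marginal probability 'pi0' in each cluster, so
# that the conditional mean reduces to lambda_i = pi0 + rho * (y_{i-1} - pi0).
rqaqish_ar1 <- function(nclust, clustsize, pi0, rho) {
  out <- matrix(NA_integer_, nclust, clustsize)
  for (i in seq_len(nclust)) {
    y <- numeric(clustsize)
    y[1] <- rbinom(1, 1, pi0)
    for (j in 2:clustsize) {
      lambda <- pi0 + rho * (y[j - 1] - pi0)   # conditional mean, Eq. (6)
      y[j] <- rbinom(1, 1, lambda)
    }
    out[i, ] <- y
  }
  out
}

# Example: 200 clusters of size 6 with pi = 0.4 and lag-one correlation 0.3.
set.seed(5)
y <- rqaqish_ar1(200, 6, pi0 = 0.4, rho = 0.3)
cor(y[, 1], y[, 2])  # should be near 0.3
```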
An interesting property of the method presented by Qaqish (2003) is the nature of including prior binary outcomes. The terms κij(Yj − πj) show that a binary response variable is included relative to its expectation and transformed according to a constant related to the desired autocorrelation. While this method does not explicitly include predictors in the simulation algorithm, predictors could be included as part of each expected value πi. The method clearly allows for both positive and negative values of autocorrelation, unlike many other proposed methods, but restrictions on the values of the autocorrelation remain as discussed by Qaqish (2003).
The most general method in this discussion is based on the work of Farrell and
Sutradhar (2006), in which a nonlinear version of the linear conditional probability model proposed by Qaqish (2003) is constructed. The model of Farrell and Sutradhar (2006) is conditioned not only on prior binary outcomes in an autoregressive-type
of sequence, but also on possible predictors to be considered in data generation.
This approach allows for the inclusion of covariates in the conditional mean, allows
for the probabilities to vary both between and within clusters, and allows for the
greatest range of both positive and negative values of autocorrelation. However, the
nonlinear conditional probability approach of Farrell and Sutradhar (2006 ( 2006)) does not
explicitly provide methods for controlling the probabilities and correlation structure
at the outset of data simulation.
Assume an interest in simulating correlated binary variables, Yi , where each out-
come is to be associated with a vector of predictors, xi, through a vector of parameters, β. Farrell and Sutradhar (2006) proposed using the non-linear conditional model,

E[Yi | Y1, …, Yi−1, xi] = exp(xiᵀβ + Σ_{k=1}^{i−1} γk Yk) / [1 + exp(xiᵀβ + Σ_{k=1}^{i−1} γk Yk)].    (7)
µi = E[Yi] = P(Yi = 1 | Yi−1 = 0) + E[Yi−1][P(Yi = 1 | Yi−1 = 1) − P(Yi = 1 | Yi−1 = 0)],

Corr(Yi, Yj) = √[µi(1 − µi)/(µj(1 − µj))] ∏_{k∈(i,j]} [P(Yk = 1 | Yk−1 = 1) − P(Yk = 1 | Yk−1 = 0)].
1. For a single cluster of correlated binary data, select predictors xi, coefficients β, and autoregressive coefficients γk.
2. Simulate Y1 as Bernoulli with probability π1 = exp(x1ᵀβ)/(1 + exp(x1ᵀβ)).
3. Simulate subsequent Yi according to the conditional probability E[Yi | Y1, …, Yi−1, xi].
4. Repeat for additional clusters.
The intuition behind such an approach is that the predictors xi and also the previous
outcome variables Y 1 , . . . , Yi −1 are combined linearly but related to the conditional
mean through the inverse logit function, as in Eq. 7. The inverse logit function will
map any real values to the range (0, 1), thus avoiding the concern of reproducibility
discussed by Qaqish (2003).
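A minimal R sketch of this generation scheme for a single cluster follows; the design matrix X, coefficient vector beta, and autoregressive coefficients gamma are illustrative names, and the inverse logit is computed with plogis.

```r
# Minimal sketch of the non-linear conditional approach of Eq. (7) for one
# cluster: each outcome depends on its predictors and on all prior outcomes.
rfarrell_sutradhar <- function(X, beta, gamma) {
  n <- nrow(X)
  y <- numeric(n)
  y[1] <- rbinom(1, 1, plogis(sum(X[1, ] * beta)))      # inverse logit of x_1' beta
  for (i in 2:n) {
    eta <- sum(X[i, ] * beta) + sum(gamma[1:(i - 1)] * y[1:(i - 1)])
    y[i] <- rbinom(1, 1, plogis(eta))                   # conditional mean of Eq. (7)
  }
  y
}

# Example: cluster of size 5 with an intercept and one covariate.
set.seed(6)
X <- cbind(1, rnorm(5))
rfarrell_sutradhar(X, beta = c(-0.5, 1), gamma = rep(0.4, 4))
```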
6 Software Discussion
Few of the methods discussed are readily available in software. The R packages
bindata and BinNor utilize discretizations of normal random variables, but not

P(Yij = 1) = F(xijᵀβ),
Table 1 Advantages and disadvantages of methods of correlated binary outcome simulation

Fully specified joint distribution (prominent example: Lee (1993), using the Archimedean copula). Advantages: control of probabilities, correlation, and higher-order moments. Disadvantages: computationally expensive; nonlinear systems; function/matrix inversions.

Mixture distributions (prominent example: Oman and Zucker (2001), using a mixture of binary and continuous variables). Advantages: simple algorithms; controlled correlation structures. Disadvantages: constant probabilities within clusters; no predictors.

Dichotomizing variables (prominent example: Emrich and Piedmonte (1991), using dichotomized multivariate normals). Advantages: short algorithms; probabilities vary within clusters; correlation between clusters. Disadvantages: nonlinear systems; computationally expensive.

Conditional distributions (prominent example: Qaqish (2003), using the linear conditional probability model). Advantages: widest range of correlations; controlled correlation structures; predictors. Disadvantages: complicated algorithms, requiring conditional means and covariances.
recent, and allow for the greatest range of correlations to be simulated. The algorithm of Qaqish (2003) is an ideal example, with few disadvantages other than a slightly
limited range of correlation values, but allowing the inclusion of predictors, prior
References
Farrell, P. J., & Sutradhar, B. C. (2006). A non-linear conditional probability model for generating
correlated binary data. Statistics & Probability Letters, 76, 353–361.
Fleishman, A. I. (1978). A method for simulating non-normal distributions. Psychometrika, 43,
521–532.
Gange, S. J. (1995). Generating multivariate categorical variates using the iterative proportional
fitting algorithm. The American Statistician, 49 (2), 134–138.
Genest, C., & MacKay, R. J. (1986a). Copules archimediennes et familles de lois bidimenionnelles
dont les marges sont donnees. Canadian Journal of Statistics , 14, 280–283.
Genest, C., & MacKay, R. J. (1986b). The joy of copulas: Bivariate distributions with uniform
marginals. The American Statistician, 40, 549–556.
Headrick, T. C. (2002a). Fast fifth-order polynomial transforms for generating univariate and multivariate nonnormal distributions. Computational Statistics & Data Analysis, 40, 685–711.
Headrick, T. C. (2002b). Jmasm3: A method for simulating systems of correlated binary data. Journal of Modern Applied Statistical Methods, 1, 195–201.
Headrick, T. C. (2010). Statistical simulation: Power method polynomials and other transformations (1st ed.). Chapman & Hall/CRC, New York.
Headrick, T. C. (2011). A characterization of power method transformations through L-moments. Journal of Probability and Statistics, 2011.
Kang, S. H., & Jung, S. H. (2001). Generating correlated binary variables with complete specification of the joint distribution. Biometrical Journal, 43(3), 263–269.
Kanter, M. (1975). Autoregression for discrete processes mod 2. Journal of Applied Probability, 12, 371–375.
Karian, Z. A., & Dudewicz, E. J. (1999). Fitting the generalized lambda distribution to data: A
method based on percentiles. Communications in Statistics: Simulation and Computation, 28,
793–819.
Koran, J., Headrick, T. C., & Kuo, T. C. (2015). Simulating univariate and multivariate nonnormal distributions through the method of percentiles. Multivariate Behavioral Research, 50, 216–232.
Lee, A. J. (1993). Generating random binary deviates having fixed marginal distributions and spec-
ified degrees of association. The American Statistician: Statistical Computing, 47 (3), 209–215.
Lee, Y., & Nelder, J. A. (1996). Hierarchical generalized linear models. Journal of the Royal
Statistical Society, Series B (Methodological), 58 (4), 619–678.
Lunn, A. D., & Davies, S. J. (1998). A note on generating correlated binary variables. Biometrika,
85(2), 487–490.
Molenberghs, G., & Verbeke, G. (2006). Models for discrete longitudinal data (1st ed.). Springer.
Oman, S. D., & Zucker, D. M. (2001). Modelling and generating correlated binary variables. Bio-
metrika, 88 (1), 287–290.
Park, C. G., Park, T., & Shin, D. W. (1996). A simple method for generating correlated binary
variates. The American Statistician, 50 (4), 306–310.
Prentice, R. L. (1988). Correlated binary regression with covariates specific to each binary obser-
vation. Biometrics, 44, 1033–1048.
Qaqish, B. F. (2003). A family of multivariate binary distributions for simulating correlated binary
variables with specified marginal means and correlations. Biometrika, 90 (2), 455–463.
Stiratelli, R., Laird, N., & Ware, J. H. (1984). Random-effects models for serial observations with
binary response. Biometrics, 40, 961–971.
Touloumis, A. (2016). Simulating correlated binary and multinomial responses with SimCorMultRes. The Comprehensive R Archive Network, 1–5.
Vale, C. D., & Maurelli, V. A. (1983). Simulating multivariate nonnormal distributions. Psychome-
trika, 48, 465–471.
Zeger, S. L., & Liang, K. Y. (1986). Longitudinal data analysis for discrete and continuous outcomes.
Biometrics, 42, 121–130.
H.K.T. Ng (B)
Department of Statistical Science, Southern Methodist University,
Dallas, TX 75275, USA
e-mail: [email protected]
Y.-J. Lin
Department of Applied Mathematics, Chung Yuan Christian University, Chung-Li District,
Taoyuan city 32023, Taiwan
e-mail: [email protected]
[email protected]
T.-R. Tsai
Department of Statistics, Tamkang University, Tamsui District, New Taipei City, Taiwan
e-mail: [email protected]
Y.L. Lio · N. Jiang
Department of Mathematical Sciences, University of South Dakota,
Vermillion, SD 57069, USA
e-mail: [email protected]
N. Jiang
e-mail: [email protected]
1 Introduction
sensitivity and uncertainty of the optimal experiment scheme due to model misspecification. It will be useful to see whether a proposed experiment scheme is robust
to model misspecification. If a design is indeed robust, it would then assure the
practitioners that misspecification in the model would not result in an unacceptable
change in the precision of the estimates of model parameters. In this chapter, we
discuss the analytical and Monte-Carlo methods for quantifying the sensitivity and
uncertainty of the optimal experiment scheme and evaluate the robustness of the
optimal experiment scheme.
Let θ be the parameter vector of lifetime distribution of test items. The com-
monly used procedures for the determination of the optimal experiment scheme
are described as follows: A-optimality, that minimizes the trace of the variance-
covariance matrix of the maximum likelihood estimators (MLEs) of elements of θ ,
provides an overall measure of variability from the marginal variabilities. It is par-
ticularly useful when the correlation between the MLEs of the parameters is low. It
is also pertinent for the construction of marginal confidence intervals for the para-
meters in θ. D-optimality, that minimizes the determinant of the variance-covariance
matrix of the MLEs of components of θ , provides an overall measure of variability
by taking into account the correlation between the estimates. It is particularly useful
when the estimates are highly correlated. It is also pertinent for the construction of
joint confidence regions for the parameters in θ . V -optimality, that minimizes the
variance of the estimator of lifetime distribution percentile.
In Sect. 2, we present the notation and general methods for quantifying the
uncertainty in the optimal experiment scheme with respect to changes in model.
2 Quantifying the Uncertainty in the Optimal Experiment Scheme
In a life-testing experiment, let the lifetimes of test items follow a family of statistical
model M . We are interested in determining the optimal experiment scheme that
optimizes the objective function Q (S , M0 ), where S denotes an experiment scheme
and M0 denotes the true model. In many situations, the determination of the optimal
experiment scheme requires a specification of the unknown statistical model M
and hence the optimal experiment scheme depends on the specified model M. For
instance, in experimental design of multi-level stress testing, Ka et al. (2011) and Chan et al. (2016) considered the extreme value regression model and derived the expected Fisher information matrix. Consequently the optimal experiment schemes obtained in Ka et al. (2011) and Chan et al. (2016) are specifically for the extreme
value regression model, which may not be optimal for other regression models.
Here, we denote the optimal experiment scheme based on a specified model M
as S ∗ (M ). In the ideal situation, the model specified for the optimal experimental
scheme is the true model, i.e., Q(S∗(M0), M0) = inf_S Q(S(M0), M0) and S∗(M0) = arg inf_S Q(S(M0), M0).
On the other hand, in determining the optimal experiment scheme, experimenter
always relies on asymptotic results which are derived as the sample size goes to infinity. Nevertheless, in practice, the number of experimental units that can be used in an experiment is finite and thus the use of the asymptotic theory may not
be appropriate. For instance, for A -optimality, the aim is to minimize the variances
of the estimated model parameters. This is always attained through minimizing the
trace of the inverse of the Fisher information matrix or equivalently, the trace of the
asymptotic variance-covariance
variance-covariance matrix of MLEs. However
However,, the asymptotic variance-
covariance matrix may not correctly reflect the true variations of the estimators
when the sample size is finite, and hence the optimal experiment scheme may not
be optimal or as efficient as expected in finite sample situations. Therefore, large-
scale Monte-Carlo simulations can be used to estimate the objective functions and
evaluate the performance of the optimal experiment scheme. For quantifying the
sensitivity and uncertainty of the optimal experiment scheme S ∗ (M ), we describe
two possible approaches by comparing experimental schemes and objective functions
in the following subsections.
Let the specified model for obtaining the optimal experiment scheme be M ∗ , then
the optimal experiment scheme is
S∗(M∗) = arg inf_S Q(S(M∗), M∗).

S∗(M) = arg inf_S Q(S(M), M).
To quantify the sensitivity of the optimal experiment scheme S∗(M∗), another
approach is to compare the objective function of the optimal experiment scheme
S ∗ (M ∗ ) under the model (M ) which is believed to be the true model. Specifi-
cally, we can compute the objective function when the experiment scheme S ∗ (M ∗ )
is adopted but the model is M , i.e., to compute Q(S ∗ (M ∗ ), M ). If the optimal
experiment scheme S∗(M∗) is insensitive to the change in the model M, then Q(S∗(M∗), M∗) will be similar to Q(S∗(M∗), M) or Q(S∗(M), M). When applying
this approach, evaluation of the objective function Q (S ∗ (M ∗ ), M ) is needed.
3 Progressive Censoring with Location-Scale Family of Distributions
In this section, we illustrate the proposed methodology through the optimal progres-
sive Type-II censoring schemes (see, for example, Balakrishnan and Aggarwala 2000; Balakrishnan 2007; Balakrishnan and Cramer 2014). We consider that the underlying
statistical model, M , used for this purpose is a member of the log-location-scale
family of distributions. Specifically, the log-lifetimes of the units on test have a p.d.f. of the form

fX(x; µ, σ) = (1/σ) g((x − µ)/σ),    (1)
where g(·) is the standard form of the p.d.f. fX(x; µ, σ) and G(·) is the standard form of the c.d.f. FX(x; µ, σ) when µ = 0 and σ = 1. The functional forms g and G are completely specified and they are parameter-free, but the location and scale parameters, −∞ < µ < ∞ and σ > 0, of fX(x; µ, σ) and FX(x; µ, σ) are unknown. Many
well-known properties for the location-scale family of distributions have been established in the literature (e.g., Johnson et al. 1994). This is a rich family of distributions that includes the normal, extreme value, and logistic models as special cases. The functional forms of g(·) and G(·) for the extreme value, logistic, and normal distributions
are summarized in Table 1.
A progressively Type-II censored life-testing experiment is described in order. Let n independent units be placed on a life-test with corresponding lifetimes T1, T2, …, Tn that are independent and identically distributed (i.i.d.) with p.d.f. fT(t; θ) and c.d.f. FT(t; θ), where θ denotes the vector of unknown parameters. Prior to the experiment, the number of completely observed failures m < n and the censoring scheme (R1, R2, …, Rm), where Rj ≥ 0 and Σ_{j=1}^{m} Rj + m = n, are pre-fixed. During the experiment, Rj functioning items are removed (or censored) randomly from the test when the j-th failure is observed. Note that in the analysis of lifetime data, instead of working with the parametric model for Ti, it is often more convenient to work with the equivalent model for the log-lifetimes Xi = log Ti, for i = 1, 2, …, n. The random variables Xi, i = 1, 2, …, n, are i.i.d. with p.d.f. fX(x; µ, σ) and c.d.f. FX(x; µ, σ).
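As an illustration of the sampling mechanism just described, the following is a minimal R sketch that simulates one progressively Type-II censored sample directly, by repeatedly observing the smallest remaining lifetime and withdrawing Rj surviving units at random; the Weibull lifetime generator and the function name rprogressive are illustrative assumptions.

```r
# Direct simulation of one progressively Type-II censored sample: observe the
# smallest remaining lifetime, then remove R_j surviving items at random.
rprogressive <- function(n, R, rlife = function(n) rweibull(n, shape = 2, scale = 1)) {
  m <- length(R)
  stopifnot(sum(R) + m == n)
  remaining <- rlife(n)                 # latent lifetimes of all n units
  obs <- numeric(m)
  for (j in seq_len(m)) {
    k <- which.min(remaining)
    obs[j] <- remaining[k]              # j-th observed failure time
    remaining <- remaining[-k]
    if (R[j] > 0) {                     # randomly censor R_j surviving units
      drop <- sample(length(remaining), R[j])
      remaining <- remaining[-drop]
    }
  }
  obs
}

# Example: n = 10, m = 5 with censoring scheme (1, 1, 1, 1, 1).
set.seed(7)
rprogressive(10, R = c(1, 1, 1, 1, 1))
```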
c = n(n − R1 − 1) ⋯ (n − R1 − R2 − ⋯ − Rm−1 − m + 1).
The MLEs of µ and σ are the values of µ and σ which maximize (3). For the location-scale family of distributions described in Eqs. (1) and (2), the log-likelihood function
can be expressed as
ℓ(µ, σ) = ln L(µ, σ)
        = ln c − m ln σ + Σ_{i=1}^{m} ln g((xi:m:n − µ)/σ) + Σ_{i=1}^{m} Ri ln[1 − G((xi:m:n − µ)/σ)].
We denote the MLEs of the parameters µ and σ by µ̂ and σ̂, respectively. Computa-
tional algorithms for obtaining the MLEs of the parameters of some commonly used
location-scale distributions are available in many statistical software packages such
as R (R Core Team 2016), SAS, and JMP.
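As a small illustration of such a computation, the sketch below maximizes the progressively Type-II censored log-likelihood for the extreme value (log-Weibull) location-scale model with optim in R; the function names, the starting values, and the artificial data are illustrative assumptions.

```r
# Minimal sketch of maximizing the progressively Type-II censored log-likelihood
# for the extreme value model: g(z) = exp(z - exp(z)), G(z) = 1 - exp(-exp(z)).
# 'x' holds the observed ordered log-lifetimes and 'R' the censoring scheme.
loglik.ev <- function(par, x, R) {
  mu <- par[1]; sigma <- par[2]
  if (sigma <= 0) return(-1e10)                       # keep the search in sigma > 0
  z <- (x - mu) / sigma
  sum(-log(sigma) + z - exp(z)) + sum(R * (-exp(z)))  # sum log g(z) + sum R*log[1 - G(z)]
}

mle.ev <- function(x, R) {
  fit <- optim(c(mean(x), sd(x)), loglik.ev, x = x, R = R,
               control = list(fnscale = -1))          # maximize
  setNames(fit$par, c("mu.hat", "sigma.hat"))
}

# Example with a small artificial progressively censored sample.
x <- log(c(0.19, 0.78, 0.96, 1.31, 2.78))
mle.ev(x, R = c(0, 0, 3, 0, 2))
```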
The expected Fisher information matrix of the MLEs can be obtained as
I(µ, σ) = − ( E[∂²ℓ(µ,σ)/∂µ∂µ]   E[∂²ℓ(µ,σ)/∂µ∂σ]
              E[∂²ℓ(µ,σ)/∂µ∂σ]   E[∂²ℓ(µ,σ)/∂σ∂σ] )
        = ( Iµµ   Iµσ
            Iµσ   Iσσ ).    (4)

QD(S, M) = Iµµ Iσσ − I²µσ.    (6)
We denote the optimal experiment scheme for D-optimality with a specific model M as S∗_D(M).
[2] A-optimality
For A-optimality, we aim to minimize the variances of the estimators of the model parameters. This can be achieved by designing an experiment that minimizes the trace of the asymptotic variance-covariance matrix, tr[V(µ, σ)]. For a given experiment scheme S = (R1, R2, …, Rm) with a specific model M, the objective function is

QA(S, M) = V11 + V22.    (7)

QVδ(S, M) = V11 + [G⁻¹(δ)]² V22 + 2 G⁻¹(δ) V12,    (8)
search.
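As a small illustration, the sketch below computes the D-, A-, and V-optimality objective functions of Eqs. (6) to (8) from a given 2 × 2 expected Fisher information matrix; the function name objectives is illustrative, and the standard extreme value quantile function is used for G⁻¹(δ) as an assumed example.

```r
# Minimal sketch of the optimality criteria computed from a 2x2 expected Fisher
# information matrix 'I.mat' for (mu, sigma).
objectives <- function(I.mat, delta, Ginv = function(d) log(-log(1 - d))) {
  V <- solve(I.mat)                                   # asymptotic variance-covariance
  c(Q.D = I.mat[1, 1] * I.mat[2, 2] - I.mat[1, 2]^2,  # Eq. (6)
    Q.A = V[1, 1] + V[2, 2],                          # Eq. (7)
    Q.V = V[1, 1] + Ginv(delta)^2 * V[2, 2] +         # Eq. (8)
          2 * Ginv(delta) * V[1, 2])
}

# Example with an arbitrary information matrix and delta = 0.95.
objectives(matrix(c(8.2, 1.1, 1.1, 5.6), 2, 2), delta = 0.95)
```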
For illustrative purposes, we consider the true underlying lifetime model of the test units to be Weibull (i.e., the log-lifetimes follow an extreme value distribution, M0 = EV) and we are interested in investigating the effect of misspecification of the underlying lifetime model as log-logistic (i.e., the log-lifetimes follow a logistic distribution, M∗ = LOGIS). We also consider the case that the true underlying lifetime model for the test units is lognormal (i.e., the log-lifetimes follow a normal distribution, M0 = NOR) and we are interested in investigating the effect of misspecification of the underlying lifetime model as Weibull (M∗ = EV).
3.3.1 Analytical Approach
In this subsection, we evaluate the sensitivities of the optimal progressive Type-II censoring scheme analytically based on the expected Fisher information matrix and the asymptotic variance-covariance matrix of the MLEs. For the specific model M∗, we determine the optimal progressive Type-II censoring schemes under different optimal criteria, S∗_D(M∗), S∗_A(M∗), S∗_V.95(M∗) and S∗_V.05(M∗), from Eqs. (4) to (8). For the true model M0, we also determine the optimal progressive Type-II censoring schemes under different optimal criteria, S∗_D(M0), S∗_A(M0), S∗_V.95(M0) and S∗_V.05(M0). Then, we can compare these experimental schemes S∗(M∗) and S∗(M0). In addition, we compute the objective functions based on the optimal censoring scheme under the specified model M∗ while the true underlying model is M0, i.e., we compute QD(S∗_D(M∗), M0), QA(S∗_A(M∗), M0), and QVδ(S∗_Vδ(M∗), M0) for δ = 0.05 and 0.95. The results for n = 10, m = 5(1)9 with (M∗ = LOGIS, M0 = EV) and (M∗ = EV, M0 = NOR) are presented in Tables 2 and 3, respectively.
3.3.2 Simulation Approach
3.4 Discussions
3.4.1 Comparing Experimental Schemes
Table 2 Optimal progressive censoring schemes for n = 10, m = 5(1)9 with M∗ = LOGIS and M0 = EV, together with the corresponding values of the objective functions and the asymptotic variances Var(q̂0.95) and Var(q̂0.05) (tabulated values not recoverable from the extracted text).

Table 3 Optimal progressive censoring schemes for n = 10, m = 5(1)9 with M∗ = EV and M0 = NOR, together with the corresponding values of the objective functions and the asymptotic variances Var(q̂0.95) and Var(q̂0.05) (tabulated values not recoverable from the extracted text).

Table 4 Simulated values of the objective functions based on the optimal censoring schemes for n = 10, m = 5(1)9 with M∗ = LOGIS and M0 = EV (tabulated values not recoverable from the extracted text).
3.4.2 Comparing Values of Objective Functions
By comparing the values of the objective functions Q(S∗(M∗), M0) with Q(S∗(M∗), M∗) and Q(S∗(M0), M0), we can observe that the model misspecification has a more substantial effect for V-optimality and a relatively minor effect for A-optimality. For instance, we compare Q(S∗(M∗), M0) and Q(S∗(M0), M0) by considering n = 10 and m = 5, where the optimal censoring scheme for V-optimality with δ = 0.95 under M∗ = LOGIS is (4, 0, 0, 0, 1) with QV.95(S∗(LOGIS), EV) = 0.3211 (Table 2), while the optimal censoring scheme under M0 = EV is (0, 0, 0, 0, 5) with QV.95(S∗(EV), EV) = 0.2585, which gives a 19.50% loss of efficiency. In contrast, consider n = 10 and m = 5, where the optimal censoring scheme for A-optimality under M∗ = LOGIS is (0, 3, 0, 0, 2) with QA(S∗(LOGIS), EV) = 0.3071 (Table 2), while the optimal censoring scheme under M0 = EV is (0, 5, 0, 0, 0) with QA(S∗(EV), EV) = 0.2918 (Table 3), which gives a 4.98% loss of efficiency. We have a similar observation when we compare the objective functions Q(S∗(M∗), M∗) and Q(S∗(M∗), M0). For example, in Table 3, when n = 10, m = 6, the optimal censoring scheme for V-optimality with δ = 0.95 under M∗ = EV is (0, 0, 0, 0, 4) with QV.95(S∗(EV), EV) = 0.2546. If the censoring scheme (0, 0, 0, 0, 4) is applied when the true model is normal (M0 = NOR), the asymptotic variance of the estimator of the 95-th percentile is Var(q̂0.95) = 0.4633, which is clearly not the minimum variance that can be obtained, because the censoring scheme (4, 0, 0, 0, 0) yields Var(q̂0.95) = 0.3684. Based on the results from our simulation studies, one should be cautious when the quantity of interest is one of the extreme percentiles (e.g., 1-st, 5-th, 95-th, 99-th percentiles) because the optimal censoring schemes could be sensitive to the change of the model.
Based on the simulation approach, we observe that the optimal censoring schemes determined based on the asymptotic theory of the MLEs may not be optimal even when the underlying model is correctly specified. Since the Monte-Carlo simulation numerically mimics the real data analysis procedure in practice, the results show that when the analytical value of the objective function of the optimal censoring scheme and the values of the objective functions of other censoring schemes are close, it is likely that those non-optimal censoring schemes will perform better than the optimal censoring scheme. We would suggest that practitioners use Monte-Carlo simulation to compare other progressive censoring schemes and choose
the optimal one. However, since the number of possible censoring schemes can be
numerous when n and m are large, it will not be feasible to use Monte-Carlo simu-
lation to compare all the possible censoring schemes. Therefore, in practice, we can
use the analytical approach to identify the optimal censoring scheme and some near
optimal censoring schemes, then Monte-Carlo simulation can be used to choose the
best censoring schemes among those candidates. This approach will be illustrated in
the example which will be presented in the next section.
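A minimal sketch of this Monte-Carlo comparison step (not the authors' code; the function names are illustrative) is as follows. It assumes the standard extreme value distribution (location 0, scale 1) as the true model, generates progressively Type-II censored samples by the uniform-spacings algorithm of Balakrishnan and Sandhu (1995), fits the location-scale MLE with optim(), and records the variance of the MLE of the 95th percentile.

## Progressively Type-II censored sample of size mm (scheme ir, total sample size nn)
## from the standard extreme value distribution; nn is implied by mm and sum(ir).
rprogev <- function(mm, nn, ir) {
  ee <- (1:mm) + cumsum(rev(ir))          # exponents i + R_m + ... + R_{m-i+1}
  vv <- runif(mm)^(1/ee)
  uu <- 1 - cumprod(rev(vv))              # progressively censored uniform order statistics
  log(-log(1 - uu))                       # transform to the extreme value scale
}
## Negative log-likelihood of (mu, log sigma) for a progressively censored EV sample
negllev <- function(par, x, ir) {
  mu <- par[1]; sig <- exp(par[2])
  z <- (x - mu)/sig
  -sum(-log(sig) + z - (1 + ir)*exp(z))
}
## Monte-Carlo estimate of the variance of the MLE of the 95th percentile for scheme ir
mc.vq95 <- function(mm, nn, ir, nsim = 5000) {
  q95 <- numeric(nsim)
  for (b in 1:nsim) {
    x <- rprogev(mm, nn, ir)
    fit <- optim(c(0, 0), negllev, x = x, ir = ir)
    q95[b] <- fit$par[1] + exp(fit$par[2])*log(-log(0.05))
  }
  var(q95)
}
## e.g., compare two candidate schemes for n = 10, m = 5:
## mc.vq95(5, 10, c(4, 0, 0, 0, 1)); mc.vq95(5, 10, c(0, 0, 0, 0, 5))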
4 Illustrative Example
R Core Team (2016) presented a progressively Type-II censored sample based on the breakdown data on insulating fluids tested at 34 kV from Nelson (1982). The progressively censored data presented in R Core Team (2016) has n = 19 and m = 8 with censoring scheme (0, 0, 3, 0, 3, 0, 0, 5). Suppose that we want to re-run the same experiment with n = 19 and m = 8, and we are interested in using the optimal censoring scheme that minimizes the variances of the parameter estimators (i.e., A-optimality) or minimizes the variance of the estimator of the 95th percentile of the lifetime distribution (i.e., V-optimality with δ = 0.95). We can first identify the top k optimal censoring schemes based on the asymptotic variances and then use Monte-Carlo simulation to evaluate the performances of those censoring schemes. Since R Core Team (2016) discussed the linear inference under progressive Type-II censoring when the lifetime distribution is Weibull and used the breakdown data on insulating fluids as a numerical example, we assume here that the underlying lifetime distribution is Weibull and determine the top ten censoring schemes subject to A-optimality and V-optimality with δ = 0.95. To study the effect of model misspecification on the optimal censoring schemes, we compute the objective functions for these censoring schemes when the true underlying lifetime distribution is lognormal. Then, we also use Monte-Carlo simulation to evaluate the performances of these censoring schemes.
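For reference, the objective-function values of the originally used scheme can be computed with the objpcs() function listed at the end of this section (a usage sketch; the numerical output is not reproduced here):

## Original scheme (0,0,3,0,3,0,0,5) with n = 19, m = 8
objpcs(8, 19, c(0, 0, 3, 0, 3, 0, 0, 5))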
Table 5  Top ten censoring schemes for the A-optimality with n = 19 and m = 8: asymptotic and simulated values of the objective functions under M∗ = EV and M0 = NOR
Table 6  Top ten censoring schemes for the V-optimality with δ = 0.95, n = 19 and m = 8: asymptotic and simulated values of the objective functions under M∗ = EV and M0 = NOR
#######################################
# Input values:                       #
#  nn: Sample size                    #
#  mm: Effective sample size          #
#  ir: Censoring scheme (length = mm) #
#######################################
###################################################
# Output values:                                  #
#  dfi:  Determinant of Fisher information matrix #
#  tvar: Trace of variance-covariance matrix      #
#  vq95: Variance of the MLE of 95-th percentile  #
#  vq05: Variance of the MLE of 5-th percentile   #
###################################################
objpcs <- function(mm, nn, ir)
{
  rr <- numeric(mm)
  cc <- numeric(mm)
  aa <- matrix(0, mm, mm)
  epcos <- numeric(mm)
  epcossq <- numeric(mm)
  ##Compute rr (gamma_j = m - j + 1 + R_j + ... + R_m)##
  for (jj in 1:mm)
  {rr[jj] <- mm - jj + 1 + sum(ir[jj:mm])}
  ##Compute cc (coefficients c_{j-1} = rr[1]*...*rr[jj])##
  for (jj in 1:mm)
  {cc[jj] <- prod(rr[1:jj])}
  ##Compute aa##
  for (jj in 1:mm)
  {for (ii in 1:jj)
    {aa[ii,jj] <- 1
     for (kk in 1:jj)
     {if (kk != ii) {aa[ii,jj] <- aa[ii,jj]/(rr[kk] - rr[ii])}}
    }}
  ##Compute E(Z_i:m:n) and E(Z_i:m:n^2)##
  for (ii in 1:mm)
  {psum <- 0
   psumsq <- 0
   for (ll in 1:ii) {psum <- psum + aa[ll,ii]*(digamma(1) - log(rr[ll]))/(rr[ll])}
   for (ll in 1:ii) {psumsq <- psumsq + aa[ll,ii]*((digamma(1)^2) - 2*digamma(1)*log(rr[ll])
                       + log(rr[ll])*log(rr[ll]) + pi*pi/6)/rr[ll]}
   epcos[ii] <- cc[ii]*psum
   epcossq[ii] <- cc[ii]*psumsq}
  ##Elements of Fisher Information Matrix##
  i22 <- sum(1 + 2*epcos + epcossq)
  i12 <- -sum(1 + epcos)
  i11 <- mm
  dfi <- det(matrix(c(i11, i12, i12, i22), 2, 2))
  vcov <- solve(matrix(c(i11, i12, i12, i22), 2, 2))
  inv05 <- log(-log(0.95))
  inv95 <- log(-log(0.05))
  tvar <- vcov[1,1] + vcov[2,2]
  vq95 <- vcov[1,1] + inv95*inv95*vcov[2,2] + 2*inv95*vcov[1,2]
  vq05 <- vcov[1,1] + inv05*inv05*vcov[2,2] + 2*inv05*vcov[1,2]
  out <- c(dfi, tvar, vq95, vq05)
  names(out) <- c("dfi", "tvar", "vq95", "vq05")
  return(out)}

## Example: n = 10, m = 5, censoring scheme = (5,0,0,0,0)
objpcs(5, 10, c(5,0,0,0,0))
       dfi       tvar       vq95       vq05
54.2144581  0.2947966  0.3473800  0.9247358
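A small sketch (not the authors' code; function names are illustrative) of the analytical screening step described earlier: enumerate all censoring schemes for given n and m, evaluate objpcs() on each, and keep the top k schemes under a chosen criterion (e.g., vq95 for V-optimality with δ = 0.95). There are choose(n − 1, m − 1) possible schemes; for a criterion to be maximized (such as dfi), the ordering would be reversed.

## All nonnegative integer vectors of length `parts` summing to `total`
compositions <- function(total, parts) {
  if (parts == 1) return(matrix(total, 1, 1))
  out <- NULL
  for (first in 0:total) {
    out <- rbind(out, cbind(first, compositions(total - first, parts - 1)))
  }
  unname(out)
}
## Rank all schemes by the selected objective (smaller is better for tvar, vq95, vq05)
top_schemes <- function(nn, mm, k = 10, crit = "vq95") {
  schemes <- compositions(nn - mm, mm)
  vals <- apply(schemes, 1, function(ir) objpcs(mm, nn, ir)[crit])
  ord <- order(vals)[1:min(k, nrow(schemes))]
  data.frame(scheme = apply(schemes[ord, , drop = FALSE], 1, paste, collapse = ","),
             value = vals[ord])
}
## e.g. top_schemes(10, 5, k = 5, crit = "vq95") ranks the 126 possible schemes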
References
Balakrishnan, N., & Aggarwala, R. (2000). Progressive censoring: Theory, methods and applica-
tions. Boston: Birkhäuser.
Abstract Missing observations are a common occurrence in public health, clinical studies, and social science research. Consequences of discarding missing observations, sometimes called complete case analysis, are low statistical power and potentially biased estimates. Fully Bayesian methods using Markov Chain Monte-Carlo (MCMC) provide an alternative model-based solution to complete case analysis by treating missing values as unknown parameters. Fully Bayesian paradigms are naturally equipped to handle this situation by augmenting MCMC routines with additional layers and sampling from the full conditional distributions of the missing data, in the case of Gibbs sampling. Here we detail ideas behind the Bayesian treatment of missing data and conduct simulations to illustrate the methodology. We consider specifically Bayesian multivariate regression with missing responses and the missing covariate setting under an ignorability assumption. Applications to real datasets are provided.
1 Introduction
Complete data are rarely available in epidemiological, clinical, and social research, especially when a requirement of the study is to collect information on a large number of individuals or on a large number of variables. Analyses that improperly treat missing data can lead to bias and loss of efficiency, which may limit generalizability of results to a wider population and can diminish our ability to understand true underlying phenomena. In applied research, linear regression models are an important tool to characterize relationships among variables. The four common approaches for inference in regression models with missing data are: Maximum like-
H. Rochani (B)
Department of Biostatistics, Jiann-Ping Hsu College of Public Health,
Georgia Southern University, Statesboro, GA, Georgia
e-mail: [email protected]
D.F. Linder
Department of Biostatistics and Epidemiology, Medical College of Georgia,
Augusta University, Augusta, GA, Georgia
lihood (ML), Multiple imputation (MI), Weighted Estimating Equations (WEE), and Fully Bayesian (FB) (Little and Rubin 2014). This chapter focuses on FB methods for regression models with missing multivariate responses and models with missing covariates. The Bayesian approach provides a natural framework for making inferences about regression coefficients with incomplete data, where certain other methods may be viewed as special cases or related. For instance, the maximum a posteriori (MAP) estimate from a FB approach under uniform improper priors leads to ML estimates; therefore ML can be viewed as a special case of Bayesian inference.
Moreover, in MI the "imputation" step is based on sampling from a posterior predictive distribution. Overall, FB methods are general and can be powerful tools for dealing with incomplete data, since they easily accommodate missing data without having extra modeling assumptions.
2 Missing Data Mechanisms
For researchers, it is crucial to have some understanding of the underlying missing mechanism for the variables under investigation so that parameter estimates are accurate and precise. Rubin (1976) defined the taxonomy of missing data mechanisms based on how the probability of a missing value relates to the data itself. This taxonomy has been widely adopted in the statistical literature. There are mainly three types of missing data mechanisms:
Missing Completely At Random (MCAR):-
MCAR mechanisms assume that the probability of missingness in the variable of interest does not depend on the values of that variable that are either missing or observed. We begin by denoting the data as D and the missing indicator matrix as M, which has values 1 if the variable is observed and 0 if the variable is not observed. For MCAR, missingness in D is independent of the data being observed or missing, or equivalently p(M | D, θ) = p(M | θ), where θ are the unknown parameters.
Missing At Random (MAR):-
A MAR mechanism assumes that the probability of missingness in the variable of interest is associated only with components of observed variables and not with the components that are missing. In mathematical terms, it can be written as p(M | D, θ) = p(M | Dobs, θ).
Missing Not At Random (MNAR):-
An MNAR mechanism allows the probability of missingness to depend on the unobserved values themselves, so that p(M | D, θ) does not reduce to p(M | Dobs, θ).
3 Data Augmentation
Data augmentation (DA) within MCMC algorithms is a widely used routine that handles incomplete data by treating missing values as unknown parameters, which are sampled during iterations and then marginalized away in the final computation of various functionals of interest, like for instance when computing posterior means (Tanner and Wong 1987). To illustrate how inferences from data with missing values can be obtained, consider the following two steps at iteration t:
1. D_miss^(t) ∼ p(D_miss | D_obs, θ^(t−1))
2. θ^(t) ∼ p(θ | D_obs, D_miss^(t))
where θ may be sampled by using one of the various MCMC algorithms discussed in the previous chapters of this book. A stationary distribution of the above transition kernel is p(θ, D_miss | D_obs), where upon marginalization over the imputed missing values one arrives at the desired target density p(θ | D_obs). In the following section, we detail how these ideas may be implemented with missing multivariate responses in the context of Gibbs sampling.
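A generic skeleton of this scheme (an illustrative sketch; the arguments draw_miss and draw_theta stand for problem-specific samplers) looks as follows:

## Generic data-augmentation loop: alternate the imputation and posterior steps and
## keep only the theta draws (the imputations are marginalized out by discarding them).
da_run <- function(n_iter, theta0, draw_miss, draw_theta) {
  theta <- theta0
  draws <- vector("list", n_iter)
  for (t in 1:n_iter) {
    d_miss <- draw_miss(theta)      # step 1: D_miss^(t) ~ p(D_miss | D_obs, theta^(t-1))
    theta  <- draw_theta(d_miss)    # step 2: theta^(t) ~ p(theta | D_obs, D_miss^(t))
    draws[[t]] <- theta
  }
  draws
}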
4 Missing Response
Our multivariate regression model for response vector Y_i = (Y_i1, Y_i2, ..., Y_id)' with covariate vector X_i = (X_i1, X_i2, ..., X_ip)' can be written as Y_i | X_i ∼ N(µ_i, Σ), where µ_i is a d × 1 vector and Σ is a d × d matrix. Furthermore, µ_ij = E(Y_ij | X_i) = X_i'β, where β is a p × 1 vector of regression coefficients. The key assumption for this particular model is that the measured covariate terms X_ik are the same for each component of the observations Y_ij, where 1 ≤ j ≤ d. Additionally, since we are assuming that β ∈ R^p, the mean response for observation i is the same for each 1 ≤ j ≤ d. Both the assumption of common baseline covariates and a vector β can easily be extended to time-varying covariates and a matrix of coefficients β ∈ R^(d×p) with only minor changes to notation; however, we use the current scenario for illustration purposes. Since we are only focusing on the ignorable missing mechanism, we will consider the scenario where missingness in Y depends on non-missing predictors.
We define Y_i(obs) = (Y_i1, Y_i2, ..., Y_id*)' and Y_i(miss) = (Y_i(d*+1), ..., Y_id)', where d* < d. The variance-covariance matrix Σ can be partitioned as
Σ = ( Σ_(obs,obs)   Σ_(obs,miss)
      Σ_(miss,obs)  Σ_(miss,miss) )
For the Bayesian solution to the missing data problem, after we specify the complete data model and the noninformative independence Jeffreys' prior for the parameters, p(β, Σ) ∝ |Σ|^(−(d+1)/2), we would then like to draw inferences based on the observed data posterior p(β, Σ | Y_obs, X). However, as discussed previously, the complete data posterior p(β, Σ | Y, X) is easier to sample from than the observed data posterior. In this particular situation, full conditional distributions for the complete data and parameters of interest are easy to derive. For posterior sampling using data augmentation, at each iteration t, samples from (Y_(miss)^(t), β^(t), Σ^(t)) | Y_obs, X can be obtained by first sampling Y_i(miss) | · · · ∼ p(Y_i(miss) | Y_i(obs), β^(t−1), Σ^(t−1), X) for 1 ≤ i ≤ n and
then sampling (β^(t), Σ^(t)) | · · · ∼ p(β, Σ | Y_(miss)^(t), Y_(obs), X). In the data augmentation step, p(Y_i(miss) | Y_i(obs), β^(t−1), Σ^(t−1), X) is a normal density with mean µ_i* and variance-covariance matrix Σ_i*, where
µ_i* = vec(X_i'β)_(d−d*) + Σ_(miss,obs) Σ_(obs,obs)^(−1) (Y_i(obs) − vec(X_i'β)_(d*)),
Σ_i* = Σ_(miss,miss) − Σ_(miss,obs) Σ_(obs,obs)^(−1) Σ_(obs,miss),
and vec(x)_d denotes the scalar value x stacked in a vector of length d. Samples from p(β, Σ | Y_(miss)^(t), Y_(obs), X) can be obtained by Gibbs sampling since the full conditional posteriors for each parameter are analytic and can be written as p(β | ...) ∼ N(µ_β, V_β) and p(Σ | ...) ∼ W^(−1)(nd + d, ψ), where
µ_β = (Σ_{i=1}^{n} X_i' Σ^(−1) X_i)^(−1) (Σ_{i=1}^{n} X_i' Σ^(−1) Y_i),
V_β = (Σ_{i=1}^{n} X_i' Σ^(−1) X_i)^(−1),
ψ = Σ_{i=1}^{n} (Y_i − X_i β)(Y_i − X_i β)'.
In the above notation, the bold symbol X_i represents the d × p matrix with rows X_i', N denotes a multivariate normal density, and W^(−1) the inverse Wishart density.
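The following is a minimal sketch (not the authors' code) of one data-augmentation Gibbs cycle for this model. It assumes Y is an n × d response matrix with NAs marking the missing entries, X is an n × p covariate matrix, and the common-baseline-covariate structure described above; the inverse-Wishart draw uses the degrees-of-freedom convention of R's rWishart under the Jeffreys-type prior, which may differ from the notation used in the text.

## One DA-Gibbs cycle: impute Y_miss from its conditional normal, then draw beta and
## Sigma from their full conditionals (function and argument names are illustrative).
da_gibbs_step <- function(Y, X, beta, Sigma) {
  n <- nrow(Y); d <- ncol(Y); p <- ncol(X)
  ## Step 1: imputation -- Y_i(miss) | Y_i(obs), beta, Sigma is multivariate normal
  for (i in 1:n) {
    mis <- which(is.na(Y[i, ]))
    if (length(mis) > 0) {
      obs <- setdiff(1:d, mis)
      mu_i <- rep(sum(X[i, ] * beta), d)                 # vec(X_i' beta)_d
      S_oo <- Sigma[obs, obs, drop = FALSE]
      S_mo <- Sigma[mis, obs, drop = FALSE]
      m_star <- mu_i[mis] + S_mo %*% solve(S_oo, Y[i, obs] - mu_i[obs])
      V_star <- Sigma[mis, mis, drop = FALSE] - S_mo %*% solve(S_oo, t(S_mo))
      Y[i, mis] <- m_star + t(chol(V_star)) %*% rnorm(length(mis))
    }
  }
  ## Step 2: posterior draws of beta and Sigma given the completed data
  Sinv <- solve(Sigma)
  XtSX <- matrix(0, p, p); XtSY <- numeric(p)
  for (i in 1:n) {
    Xi <- matrix(X[i, ], d, p, byrow = TRUE)             # bold X_i: d x p, identical rows
    XtSX <- XtSX + t(Xi) %*% Sinv %*% Xi
    XtSY <- XtSY + t(Xi) %*% Sinv %*% Y[i, ]
  }
  Vb <- solve(XtSX)
  beta <- as.vector(Vb %*% XtSY + t(chol(Vb)) %*% rnorm(p))
  resid <- Y - matrix(X %*% beta, n, d)                  # every column has mean X_i' beta
  psi <- t(resid) %*% resid
  Sigma <- solve(rWishart(1, df = n, Sigma = solve(psi))[, , 1])   # inverse-Wishart-type draw
  list(Y = Y, beta = beta, Sigma = Sigma)
}

Iterating da_gibbs_step and retaining the beta draws gives the posterior sample used for inference.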
4.2 Simulation
A simulation study was conducted to compare the bias and root mean squared error (RMSE) of regression coefficients (β) for complete case analysis and the FB approach under a MAR assumption for our multivariate normal model. Data augmentation using Gibbs sampling was performed to compare the performance of the estimators by using various proportions of missing values in the multivariate response variable, as shown in Tables 1 and 2. In the simulation, at each iteration, three covariates (X_1, X_2, X_3) were generated, in which X_1 was binary, and X_2, X_3 were continuous. X_1 was sampled from the binomial distribution with success probability of 0.4, while X_2 and X_3 were sampled from the normal distribution with mean = 0 and variance = 1. Furthermore, the multivariate response variable, Y, was generated from N(β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3, Σ), where
Σ = ( 4 2 2
      2 4 2
      2 2 4 )
β_0 = 0 and β_1 = β_2 = β_3 = 1. The sample size for each iteration was 100. Various proportions of missing values were created in Y_1, Y_2, and Y_3 that depend only on the non-missing X = (X_1, X_2, X_3) in order to simulate a MAR missing mechanism. To model the missing probability for the variable Y_1, Pr(Y_i1 = missing | X), we consider the logistic model as follows:
Pr(Y_i1 = missing | X) = exp(γ_0 + X_i1 + X_i2 + X_i3) / (1 + exp(γ_0 + X_i1 + X_i2 + X_i3)) = p_i.
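A short sketch of this data-generating step (with Σ as reconstructed above and an illustrative value of γ_0, which would be tuned to reach a target missingness proportion):

## Simulate one MAR dataset as described: covariates, trivariate normal response,
## and missingness in Y1 driven by the fully observed covariates.
set.seed(1)
n  <- 100
X1 <- rbinom(n, 1, 0.4); X2 <- rnorm(n); X3 <- rnorm(n)
Sigma <- matrix(c(4, 2, 2,  2, 4, 2,  2, 2, 4), 3, 3)
mu <- X1 + X2 + X3                                   # beta0 = 0, beta1 = beta2 = beta3 = 1
Y  <- matrix(mu, n, 3) + matrix(rnorm(n * 3), n, 3) %*% chol(Sigma)
gamma0 <- -1                                         # illustrative intercept
p1 <- plogis(gamma0 + X1 + X2 + X3)                  # Pr(Y_i1 missing | X)
Y[runif(n) < p1, 1] <- NA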
proportions of missing values of the response variable. However, the RMSEs are smaller for DA as compared to the complete case analysis under different proportions of missingness in the response variable (Table 2).
This section focuses on a real data application of the fully Bayesian approach for analyzing the multivariate normal model as discussed in the previous section. We will illustrate the application by using the prostate specific antigen (PSA) data, which were published by Etzioni et al. (1999). This was a substudy of the beta-carotene and retinol trial (CARET) with 71 cases of prostate cancer patients and 70 controls (i.e., subjects not diagnosed with prostate cancer by the time of analysis, matched to cases on date of birth). In the PSA dataset, in addition to baseline age, there were two biomarkers measured over time (9 occasions): free PSA (fpsa) and total PSA (tpsa). For illustrative purposes, we investigate the effect of baseline age on the first three fpsa measurements in patients. There were missing values in the fpsa variable at occasions 2 and 3 for 14 patients in the study. Under the assumption of missingness being dependent only on the fully observed baseline age, we can obtain an estimate of the regression coefficient for age of β_age = 0.0029 with SE = 0.00026 using the fully Bayesian modeling approach. A similar model fit with complete-case analysis gives an estimate for baseline age of 0.0061 with SE = 0.00033.
5 Missing Covariates
In addition to missing outcomes being a common occurrence in experimental studies, missing covariates are frequently encountered as well. In this section, we focus on missing covariates in which the missingness depends on the fully observed response. We specialize our analysis to the normal regression model. We direct the reader to the multiple methods proposed in the literature for handling missing covariates (Ibrahim et al. 1999a, 2002; Lipsitz and Ibrahim 1996; Satten and Carroll 2000; Xie and Paik 1997).
5.1 Method
µ_i = Σ_{j ∈ R_i} X_ij β_j + Σ_{j ∈ R_i^c} M_ij β_j    (1)
where Y_i ∼ N(µ_i, σ²). Our goal is to estimate the regression coefficients in the presence of missing data. For notation purposes, we can write the collection of missing data as the parameter M = (M_i1, M_i2, ..., M_ip), with corresponding prior p(M). Each of the missing parameters, M_ij, will be assigned a prior distribution p(M_ij). The complete data posterior p(β, σ², M | Y, X) can be determined by Bayes rule as follows:
p(β, σ², M | Y, X) = p(β, σ², M) L(Y | β, σ², M, X) / ∫ p(β, σ², M) L(Y | β, σ², M, X) dβ dσ² dM    (2)
The posterior in Eq. 2 depends on the missing covariate parameters M. However, our main interest is in posterior inference about β and σ², and the desired posterior distribution p(β, σ² | Y, X) can be obtained by
p(β, σ² | Y, X) = ∫ p(β, σ², M | Y, X) dM    (3)
In general, Eq. 3 involves multi-dimensional integrals which do not have closed forms and will be high-dimensional even for low fractions of missing covariate values. In fact, the dimension of the integration problem is the same as the number of missing values. Posterior sampling can be performed based on various MCMC methods, such as via the Gibbs sampler after specifying full conditionals or using a random walk Metropolis-Hastings algorithm.
An important issue in our current setting is the appropriate prior specification for missing covariates. There are many ways to assign missing covariate distributions; however, some modeling strategies, especially in the presence of large fractions of missing data, can lead to a vast number of nuisance parameters. Hence, MCMC methods to sample posteriors can then become computationally intensive and inefficient even if the parameters are identifiable. Thus, it is essential to reduce the number of nuisance parameters by specifying effective joint covariate distributions. Ibrahim et al. (1999b) and Lipsitz and Ibrahim (1996) proposed the strategy of modeling the joint distribution of missing covariates as a sequence of one-dimensional distributions, which can be given by
p(M_i1, ..., M_ip | α) = p(M_ip | M_i1, ..., M_ip−1, α_p) × p(M_ip−1 | M_i1, ..., M_ip−2, α_p−1) × · · · × p(M_i2 | M_i1, α_2) p(M_i1 | α_1)    (4)
where α = (α_1, α_2, ..., α_p)'. In the specification of the one-dimensional conditional distributions in Eq. 4, suppose we know that X = (X_i1, X_i2, ..., X_ip) contains all continuous predictors; then a sequence of univariate normal distributions can be assigned to p(M_ip | M_i1, ..., M_ip−1, α_p). If X contains all categorical variables, say binary, then the guideline is to specify a sequence of logistic regressions for each p(M_ip | M_i1, ..., M_ip−1, α_p). One can consider a sequence of multinomial distributions in the case where all variables in X have more than two levels. Similarly, for all count variables, one can assign Poisson distributions. With missing categorical and continuous covariates, it is recommended that the joint covariate distribution be assigned by first specifying the distribution of the categorical covariates conditional on the continuous covariates. In some special circumstances, when X_ip is strictly positive and continuous, a normal distribution can be specified on the transformed variable log(X_ip). If log(X_ip) is not approximately normal, then other specifications such as an exponential or gamma distribution are recommended.
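As a small, hypothetical illustration of the sequential specification in Eq. 4 for one binary and one continuous covariate (categorical conditional on continuous, as recommended above; the parameter names are illustrative):

## p(M1, M2 | alpha) = p(M1 | M2, alpha1) * p(M2 | alpha2), M1 binary and M2 continuous
draw_covariates <- function(alpha) {
  m2 <- rnorm(1, mean = alpha$mu2, sd = alpha$sd2)          # p(M2 | alpha2): normal
  m1 <- rbinom(1, 1, plogis(alpha$a0 + alpha$a1 * m2))      # p(M1 | M2, alpha1): logistic
  c(M1 = m1, M2 = m2)
}
## e.g. draw_covariates(list(mu2 = 0, sd2 = 1, a0 = -0.5, a1 = 1))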
5.2 Simulation
A simulation study was conducted to determine the bias and mean squared error (MSE) of estimating β with various proportions of missing covariates under a MAR assumption. The data augmentation procedure for estimating β and MSEs was compared to a complete case analysis. In the simulation, at each iteration, three covariates (X_1, X_2, X_3), in which X_1 is binary and X_2, X_3 are continuous, were generated. X_1 was generated from the binomial distribution with a success probability of 0.3, X_2 was simulated from a normal distribution with mean = 0 and variance = 1, and X_3 was generated from a normal distribution with mean = 1 and variance = 2. Furthermore, the response variable Y was generated from N(β_0 + β_1 x_1 + β_2 x_2 + β_3 x_3, σ² = 1) with β_0 = 0 and β_1 = β_2 = β_3 = 1. The sample size for each iteration was 100.
Pr(X_i1 = missing | X_i3) = exp(γ + X_i3) / (1 + exp(γ + X_i3)) = p_i.
covariate distributions:
about the bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values, see (White and Carlin 2010; Chen et al. 2008).
posterior sampling and use posterior means to estimate the regression coefficients
for the covariates, reported height and participated in any physical activity in past
months. The results are reported in Table 5.
6 Discussion
In this chapter, we have illustrated how a fully Bayesian regression modeling approach can be applied to incomplete data under an ignorability assumption. It is well known that the observed data are not sufficient to identify the underlying missing mechanism. Therefore, sensitivity analyses should be performed over various plausible models for the nonresponse mechanism (Little and Rubin 2014). In general, the stability of conclusions (inferences) over the plausible models gives an indication of their robustness to unverifiable assumptions about the mechanism underlying missingness. In linear regression, when missingness on Y depends on the fully observed Xs (MAR), DA has negligible bias and smaller MSEs compared to the complete-case analysis. When missingness in Xs depends on other Xs which are fully observed, then the CC analysis has negligible bias and very similar MSEs compared to DA for missing covariates. Furthermore, when missingness in covariates depends on the response, DA will perform better than CC. Because of these biases, the choice of the method should come from a substantive basis. In summary, a FB modeling approach enables coherent model estimation because missing data values are treated as parameters, which are easily sampled within MCMC simulations. The FB approach takes into account uncertainty about missing observations and offers a very flexible way to handle missing data. In conclusion, we remark that for missing covariates we have used a default class of priors to make inferences. However, for some studies, historical data may be available, allowing for construction of informative priors that may further improve inference. For more details, Ibrahim et al. (2002) proposed a class of informative priors for generalized linear models with missing covariates.
References
Behavioral risk factor surveillance system. Retrieved July 5, 2015, from http://www.cdc.gov/brfss.
Chen, Q., Ibrahim, J. G., Chen, M.-H., & Senchaudhuri, P. (2008). Theory and inference for regression models with missing responses and covariates. Journal of Multivariate Analysis, 99(6), 1302–1331.
Daniels, M. J., & Hogan, J. W. (2008). Missing data in longitudinal studies: Strategies for Bayesian modeling and sensitivity analysis. CRC Press.
Etzioni, R., Pepe, M., Longton, G., Hu, C., & Goodman, G. (1999). Incorporating the time dimension in receiver operating characteristic curves: A case study of prostate cancer. Medical Decision Making, 19(3), 242–251.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721–741.
Hastings, W. K. (1970). Monte-Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.
Ibrahim, J. G., Chen, M.-H., & Lipsitz, S. R. (1999a). Monte-Carlo EM for missing covariates in parametric regression models. Biometrics, 55(2), 591–596.
Ibrahim, J. G., Lipsitz, S. R., & Chen, M.-H. (1999b). Missing covariates in generalized linear models when the missing data mechanism is non-ignorable. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(1), 173–190.
Ibrahim, J. G., Chen, M.-H., & Lipsitz, S. R. (2002). Bayesian methods for generalized linear models with covariates missing at random. Canadian Journal of Statistics, 30(1), 55–78.
Lipsitz, S. R., & Ibrahim, J. G. (1996). A conditional model for incomplete covariates in parametric regression models. Biometrika, 83(4), 916–922.
Little, R. J. A., & Rubin, D. B. (2014). Statistical analysis with missing data. Wiley.
Metropolis, N., & Ulam, S. (1949). The Monte-Carlo method. Journal of the American Statistical Association, 44(247), 335–341.
Pepe, M. S. (2003). The statistical evaluation of medical tests for classification and prediction. USA: Oxford University Press.
Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6), 1087–1092.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63(3), 581–592.
Satten, G. A., & Carroll, R. J. (2000). Conditional and unconditional categorical regression models with missing covariates. Biometrics, 56(2), 384–388.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. CRC Press.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540.
White, I. R., & Carlin, J. B. (2010). Bias and efficiency of multiple imputation compared with complete-case analysis for missing covariate values. Statistics in Medicine, 29(28), 2920–2931.
Xie, F., & Paik, M. C. (1997). Multiple imputation methods for the missing covariates in generalized estimating equation. Biometrics, 1538–1546.
A Multiple Imputation Framework for Massive Multivariate Data of Different Variable Types: A Monte-Carlo Technique
Hakan Demirtas
1 Introduction
Missing data are a commonly occurring phenomenon in many contexts. Determining a suitable analytical approach in the presence of incomplete observations is a major focus of scientific inquiry due to the additional complexity that arises through missing data. Incompleteness generally complicates the statistical analysis in terms of biased
H. Demirtas (B)
Division of Epidemiology and Biostatistics (MC923), University of Illinois at Chicago,
1603 West Taylor Street, Chicago, IL 60612, USA
e-mail: [email protected]
continuous data, multinomial model for discrete data, general location model for a
mix of normal and categorical data), commonly known as joint MI models (Schafer 1997). Some methods can handle all data types with relaxed assumptions, but lack theoretical justification (Van Buuren 2012; Raghunathan et al. 2001). For massive data as collected in RTDC studies, a novel imputation framework that can accommodate all four major types of variables is needed with a minimal set of assumptions. In addition, no statistical power (probability of correctly detecting an effect) and sample size (number of subjects, measurements, waves) procedures are available for RTDC data. The lack of these tools severely limits our ability to capitalize on the full potential of incomplete intensive data, and a unified framework for simultaneously imputing all types of variables is necessary to adequately capture a broad set of substantive messages that massive data are designed to convey. The proposed MI framework represents an important and refined addition to the existing methods, and has potential to advance scientific knowledge and research in a meaningful way; it offers promising potential for building enhanced statistical computing infrastructure for education and research in the sense of providing a principled, useful, general, and flexible set of computational tools for handling incomplete data.
Combining our previous random number generation (RNG) work for multivariate ordinal data (Demirtas 2006), joint binary and normal data (Demirtas and Doganay 2012), ordinal and normal data (Demirtas and Yavuz 2015), count and normal data (Amatya and Demirtas 2015a), and binary and nonnormal continuous data (Demirtas et al. 2012; Demirtas 2014, 2016) with the specification of marginal and associational parameters; our published work on MI (Demirtas 2004, 2005, 2007, 2008, 2009, 2010, 2017a; Demirtas et al. 2007, 2008; Demirtas and Hedeker 2007, 2008a, b, c; Demirtas and Schafer 2003; Yucel and Demirtas 2010); along with some related work (Demirtas et al. 2016a; Demirtas and Hedeker 2011, 2016; Emrich and Piedmonte 1991; Fleishman 1978; Ferrari and Barbiero 2012; Headrick 2010; Yahav and Shmueli 2012), a broad mixed data imputation framework that spans all possible combinations of binary, ordinal, count, and continuous variables is proposed. This system is capable of handling the overwhelming majority of continuous shapes; it can be extended to control for higher order moments for continuous variables, and to allow over- and under-dispersion for count variables, as well as the specification of Spearman's rank correlations as the measure of association. Procedural, conceptual, operational, and algorithmic details of the published, current, and future work will be given throughout the chapter.
The organization of the chapter is as follows: In Sect. 2, background information is provided on the generation of multivariate binary, ordinal, and count data with an emphasis on underlying multivariate normal data that form a basis for the subsequent discretization in the binary and ordinal cases, and correlation mapping using inverse cumulative distribution functions (cdfs) in the count data case. Then, different correlation types that are relevant to the work are described; a linear relationship between correlations before and after discretization is discussed; and multivariate power polynomials in the context of generating continuous data are elaborated on. In Sect. 3, operational characteristics of MI under the normality assumption are articulated. In Sect. 4, an MI algorithm for multivariate data with all four major variable
types is outlined under ignorable nonresponse by merging the available RNG and
MI routines. Finally, in Sect. 5, concluding remarks, future research directions, and
extensions are given.
2 Background on RNG
The proposed MI algorithm has strong impetus and computational roots derived from some ideas that appeared in the RNG literature. This section provides salient features of multivariate normal (MVN), multivariate count (MVC), multivariate binary (MVB), and multivariate ordinal (MVO) data generation. Relevant correlation structures involving these different types of variables are discussed; a connection between correlations before and after discretization is established; and the use of power polynomials, which will be employed at later stages to accommodate nonnormal continuous data, is explained.
MVN Data Generation: Sampling from the MVN distribution is straightforward. Suppose Z ∼ N_d(µ, Σ), where µ is the mean vector and Σ is the symmetric, positive definite, d × d variance-covariance matrix. A random draw from a MVN distribution can be obtained using the Cholesky decomposition of Σ and a vector of univariate normal draws. The Cholesky decomposition of Σ produces a lower-triangular matrix A for which AA^T = Σ. If z = (z_1, ..., z_d) are d independent standard normal variables, then Z = µ + Az is a random draw from the MVN distribution with mean vector µ and covariance matrix Σ.
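A small R sketch of this Cholesky-based draw (not from the chapter; the function name is illustrative):

## mu is the mean vector and Sigma the positive definite covariance matrix
rmvn_chol <- function(n, mu, Sigma) {
  d <- length(mu)
  A <- t(chol(Sigma))                       # lower-triangular A with A %*% t(A) = Sigma
  Z <- matrix(rnorm(n * d), d, n)           # d x n independent standard normals
  t(mu + A %*% Z)                           # each row is one draw from N_d(mu, Sigma)
}
## e.g. rmvn_chol(5, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))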
MVC Data Generation: Count data have been traditionally modeled by the Poisson distribution. Although a few multivariate Poisson (MVP) data generation techniques have been published, the method in Yahav and Shmueli (2012) is the only one that reasonably works (allowing negative correlations) when the number of components is greater than two. Their method utilizes a slightly modified version of the NORTA (Normal to Anything) approach (Nelsen 2006), which involves generation of MVN variates with given univariate marginals and the correlation structure (R_N), and then transforming it into any desired distribution using the inverse cdf. In the Poisson case, NORTA can be implemented by the following steps:
1. Generate a k-dimensional normal vector Z_N from the MVN distribution with mean vector 0 and a correlation matrix R_N.
2. Transform Z_N to a Poisson vector X_POIS as follows:
   (a) For each element z_i of Z_N, calculate the Normal cdf, Φ(z_i).
   (b) For each value of Φ(z_i), calculate the Poisson inverse cdf with a desired corresponding marginal rate θ_i, F_{θ_i}^(−1)(Φ(z_i)), where F_{θ_i}(x) = Σ_{j=0}^{x} e^{−θ_i} θ_i^j / j! is the Poisson cdf.
3. X_POIS = (F_{θ_1}^(−1)(Φ(z_1)), ..., F_{θ_k}^(−1)(Φ(z_k)))^T is a draw from the desired MVP